Item: Apache Cassandra
Rating: 8
Author: yixiang Shan

Overall Satisfaction with Cassandra

Use Cases and Deployment Scope

We used Cassandra to build a fully functional POC, fed continuously with production-level data volumes, for a shipment cloud concept for FedEx's EMEA region. The solution has two parts: an IMDG (in-memory data grid) product holds the latest status of every shipment, while Cassandra serves as long-term storage for all historical shipment status update events. On top of these in-memory and NoSQL stores we built one unified RESTful service which, depending on the user's query, reads the latest status from the IMDG, the shipment history from Cassandra, or both. Cassandra also acts as the "backup" for the IMDG in the worst case, where the IMDG crashes completely. Because the data is persisted in Cassandra as a time series, we can still extract the "latest" status of a shipment from its full transaction history with reasonable performance (slower than the IMDG, but much quicker than a traditional relational database).
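The read path described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual service: `InMemoryGrid` and `HistoryStore` are hypothetical in-process stand-ins for the IMDG and for Cassandra, and all class and method names are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, str]  # (event time in epoch seconds, status text)


class InMemoryGrid:
    """Hypothetical stand-in for the IMDG: latest status per shipment only."""

    def __init__(self) -> None:
        self.latest: Dict[str, Event] = {}
        self.available = True  # flip to False to simulate a full IMDG crash

    def get_latest(self, shipment_id: str) -> Optional[Event]:
        if not self.available:
            raise ConnectionError("IMDG unavailable")
        return self.latest.get(shipment_id)


class HistoryStore:
    """Hypothetical stand-in for Cassandra: all events, newest first."""

    def __init__(self) -> None:
        self.events: Dict[str, List[Event]] = {}

    def append(self, shipment_id: str, event: Event) -> None:
        rows = self.events.setdefault(shipment_id, [])
        rows.append(event)
        rows.sort(reverse=True)  # mimic a newest-first time-series layout

    def history(self, shipment_id: str) -> List[Event]:
        return list(self.events.get(shipment_id, []))

    def latest(self, shipment_id: str) -> Optional[Event]:
        rows = self.events.get(shipment_id, [])
        return rows[0] if rows else None


class ShipmentService:
    """Unified facade: latest status from the grid, history from the store."""

    def __init__(self, grid: InMemoryGrid, store: HistoryStore) -> None:
        self.grid = grid
        self.store = store

    def record(self, shipment_id: str, event: Event) -> None:
        # Every event goes to long-term storage; the grid keeps the newest.
        self.store.append(shipment_id, event)
        current = self.grid.latest.get(shipment_id)
        if current is None or event >= current:
            self.grid.latest[shipment_id] = event

    def latest_status(self, shipment_id: str) -> Optional[Event]:
        try:
            return self.grid.get_latest(shipment_id)
        except ConnectionError:
            # Worst case: the grid is down, so fall back to the history.
            return self.store.latest(shipment_id)

    def history(self, shipment_id: str) -> List[Event]:
        return self.store.history(shipment_id)
```

The fallback in `latest_status` is the "backup" role: it is slower than the grid lookup but still only reads the first row of the shipment's history.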

Pros and Cons

Pros

Cassandra is very strong at storing time-series transaction data. Simply by reversing the time-series order (newest first) when creating the table, we can fetch the "latest" records very quickly, even among millions of associated transactions, because the latest record is always the first one read. Combined with Cassandra's TTL feature, it is easy to delete old data automatically.
Cassandra combines the key-value store design from Amazon's Dynamo with the column family data model from Google's Bigtable, which makes it easy to manage both structured and unstructured data efficiently.
The Solr integration provided by DataStax Enterprise can even cover ad-hoc query needs that were not fully anticipated when the tables were created at the start of the project. This gives a large enterprise or project considerably more room to maneuver when it needs flexibility in practice.
Cassandra's linear scalability allows us to scale the cluster up or down simply by adding or removing servers.
Throughput is quite good for both reads and writes.
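As a rough illustration of the first point, the helpers below build the CQL for a newest-first time-series table with a table-level default TTL, plus the query that reads the latest event. This is a minimal sketch: the keyspace, table, and column names (`shipment_id`, `event_time`, `status`) are assumptions made for the example, not our real schema, and the functions only produce statement text rather than talking to a cluster.

```python
def create_events_table(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Return CQL for a time-series table clustered newest-first.

    The column names are hypothetical. Rows for one shipment share a
    partition and are stored in descending event_time order.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be >= 0 (0 disables expiry)")
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n"
        "  shipment_id text,\n"
        "  event_time timestamp,\n"
        "  status text,\n"
        "  PRIMARY KEY ((shipment_id), event_time)\n"
        ") WITH CLUSTERING ORDER BY (event_time DESC)\n"
        f"  AND default_time_to_live = {ttl_seconds};"
    )


def select_latest(keyspace: str, table: str) -> str:
    """Return CQL that reads only the newest event of one shipment.

    Because of the descending clustering order, LIMIT 1 returns the
    latest row without scanning the rest of the partition. The '?' is a
    bind marker for a prepared statement.
    """
    return (
        f"SELECT event_time, status FROM {keyspace}.{table} "
        "WHERE shipment_id = ? LIMIT 1;"
    )
```

For example, `create_events_table("shipping", "shipment_events", 90 * 24 * 3600)` yields a table whose rows expire roughly 90 days after they are written.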

Cons

Managing a big Cassandra cluster, even with DataStax Enterprise, is still quite challenging for a maintenance team, given the frequent version upgrades (even when done in a rolling fashion) and the even more frequent repairs. In my view, a powerful tool should be provided to "automate" this process as much as possible.
The TTL design is good, but the pain point is that the TTL on data already inserted cannot simply be updated; the data has to be reinserted. This causes a lot of trouble when a change in business strategy requires the purge strategy to change as well.
Because Cassandra is Java based, GC sometimes costs performance. If Cassandra allowed more use of off-heap memory, it would reduce the GC effort and free up more of the hardware's power.
DataStax's Solr integration does not provide a default indexing strategy for JSON-formatted data. At the moment we have to implement our own to support the JSON text we store. We extract the key fields from our data that might need to be searched ad hoc, convert them into JSON (a single-level map), and save them in a Cassandra column. On top of that, we want Solr to index the key of each token.
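To make the TTL limitation concrete, here is a minimal sketch of what "reinserting" looks like when the retention period changes. It assumes a hypothetical table with columns `shipment_id`, `event_time`, and `status`; the helper names are invented for the example, and the functions only build CQL text. A real migration would also need paging, throttling, and a decision about rows that have already outlived the new retention period.

```python
from typing import Optional


def remaining_ttl(retention_seconds: int, event_age_seconds: int) -> Optional[int]:
    """TTL to apply on rewrite so the row expires relative to the event time.

    Returns None when the row is already older than the new retention
    period, in which case it should be deleted rather than rewritten.
    """
    remaining = retention_seconds - event_age_seconds
    return remaining if remaining > 0 else None


def select_rows_for_rewrite(keyspace: str, table: str) -> str:
    """CQL that reads each row together with the seconds its TTL has left."""
    return (
        "SELECT shipment_id, event_time, status, TTL(status) "
        f"FROM {keyspace}.{table};"
    )


def reinsert_with_ttl(keyspace: str, table: str, ttl_seconds: int) -> str:
    """CQL that writes the same values again with a new TTL.

    The TTL is attached to the written cells, so the only way to change
    it is to write the values again; '?' marks are bind markers.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return (
        f"INSERT INTO {keyspace}.{table} (shipment_id, event_time, status) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds};"
    )
```

Note that a rewrite with `USING TTL` restarts the countdown at write time, which is why the sketch subtracts the event's age first.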

Return on Investment

The open source version of Cassandra is only recommended for learning the basic concepts and playing with its core features. Unless you really want to invest heavily in having your developers and architects know every detail of Cassandra, I prefer the DataStax Enterprise version. Although the license cost is relatively high, I think it is worth it for the support, the OpsCenter monitoring tool, and the integration of Solr and Spark (for data analysis).
Cassandra didn't fully replace Oracle, our old, traditional relational database. Instead, it opens another door for us to handle some special business use cases that a NoSQL database can address more feasibly and efficiently.

Alternatives Considered

MongoDB and HBase

We also evaluated MongoDB, but didn't like the possibility of a single point of failure. HBase coupled us too tightly to the Hadoop world, while we prefer more technical flexibility. Also, HBase is designed for "cold" historical data lake use cases and is not typically used for web and mobile applications because of performance concerns. Cassandra, by contrast, offers the availability and performance necessary for developing highly available applications. Furthermore, the Hadoop technology stack is typically deployed in a single location, while in a big international enterprise context we need to be able to deploy across countries and continents. Hence, in the end, we favored Cassandra.

Likelihood to Renew

In our POC, Cassandra satisfied all our needs and expectations. We would like to do an additional POC to test its cross-continent, cluster-level replication features, measuring performance and data consistency to help us finally decide how to move to production.

Likelihood to Recommend

For scenarios that need ACID support, Cassandra may not be the best choice, but for insert-only (time-series) transaction workloads that must also cope with unpredictable future changes to the data model or structure, Cassandra is one of the best options. If you only use the open source version of Cassandra, without the Solr integration, you need to know your search queries before you create the table. If that's not possible, then Cassandra or another NoSQL DB might not be the right choice.


Cassandra, put into the real business context