Item: Apache Cassandra
Rating: 10
Author: David Prinzing

Overall Satisfaction with Cassandra

Use Cases and Deployment Scope

Cassandra is the only database used by Algorithmic Ads. We use it for both real-time transactions and analytics. The primary application accessing Cassandra is a lightweight Java application that provides a RESTful web services API for all our other applications. The API is the focal point for integration and encapsulates both business logic and data access. The same API is used both internally and by our customers. We rely on Cassandra for its amazing performance, linear scalability, and continuous availability.

Pros and Cons

Pros

Continuous availability: because Cassandra is a fully distributed database (no master nodes), we can update nodes with rolling restarts and ride out minor outages without impacting our customer services.
Linear scalability: for every unit of compute that you add, you get an equivalent unit of capacity. The same application can scale from a single developer's laptop to a web-scale service with billions of rows in a table.
Amazing performance: if you design your data model correctly, bearing in mind the queries you need to answer, you can get answers in milliseconds.
Time-series data: Cassandra excels at recording, processing, and retrieving time-series data. It's a simple matter to version everything and simply record what happens, rather than going back and editing things. Then, you can compute things from the recorded history.
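
To make the "record what happens, then compute from history" idea concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and not our production model. It only mimics the shape of a typical time-series table, where a partition key groups related rows and a clustering column keeps them sorted by time. The names EventLog, record, history, and balance are hypothetical, chosen for illustration.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Event:
    recorded_at: int                     # plays the role of a clustering column
    amount: int = field(compare=False)   # payload; ignored when sorting


class EventLog:
    """Append-only log: one time-sorted list of events per partition key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def record(self, partition_key, recorded_at, amount):
        # Always add a new row; never go back and edit or delete an old one.
        insort(self._partitions[partition_key], Event(recorded_at, amount))

    def history(self, partition_key, start, end):
        # Time-range read within a single partition (start inclusive, end exclusive).
        return [e for e in self._partitions[partition_key]
                if start <= e.recorded_at < end]

    def balance(self, partition_key, as_of):
        # Derive current state from the recorded history instead of storing it.
        return sum(e.amount for e in self._partitions[partition_key]
                   if e.recorded_at <= as_of)


log = EventLog()
log.record("account-1", 100, 5)
log.record("account-1", 300, -2)   # a correction is a new event, not an edit
log.record("account-1", 200, 10)   # late arrival still lands in time order
```

Because nothing is overwritten, any past state can be recomputed: log.balance("account-1", 250) sums only the events recorded up to time 250, while log.balance("account-1", 300) also includes the later correction.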

Cons

Cassandra is a poor choice for implementing application queues.
NoSQL requires thinking differently, and can be challenging for people with strong relational database backgrounds to understand. The CQL language helps with this, but it pays to understand how the engine works under the hood. That said, the benefits outweigh the challenge of the learning curve!
Database compactions and anti-entropy repair can be burdensome on a busy cluster. Significant improvements have been made in recent versions, but they remain an operational challenge.

Return on Investment

Open source Apache Cassandra is free and the infrastructure to run it is cheap, but the expertise to use it is not. You'll be investing in your developers and devops team members, and they're worth it! Cassandra is incredibly cost-effective and it positions your applications to grow to web-scale.
DataStax Enterprise merits serious consideration. There are licensing fees, but they're worth it for (1) production support (especially if your own team is new to Cassandra), (2) stable releases, (3) sophisticated operational tools like OpsCenter, (4) integration with Apache Solr for geospatial, faceted, and full-text search, and (5) integration with Apache Spark for machine learning and streaming analytics.

Alternatives Considered

Amazon DynamoDB, HBase, MongoDB, PostgreSQL, Riak and VoltDB

Four years ago, I needed to choose a web-scale database. Having used relational databases for years (PostgreSQL is my favorite), I needed something that could perform well at scale with no downtime. I considered VoltDB for its in-memory speed, but it's limited in scale. I considered MongoDB as a popular NoSQL alternative, but preferred Cassandra's performance and peer-to-peer architecture. Riak was attractive because it was also derived from Amazon's pioneering 2007 Dynamo paper, but it's a key-value store, and I preferred Cassandra's columnar data model. I considered HBase as a prominent member of the Hadoop family of projects, but read in various blog posts that Cassandra outperformed it, especially when handling lots of small transactions. I also considered Amazon's DynamoDB (especially since we run our applications on AWS), but decided against it because of the vendor lock-in. I can run Cassandra anywhere. I love the performance, scalability, availability, and ease of operation, and I've never looked back.

Likelihood to Renew

I've used Cassandra for 4 years now, on 3 major projects (one of them truly web-scale), and I'm deeply satisfied. These days, it's my go-to database. That said, technology moves quickly, and it's good to keep abreast of new developments...

Likelihood to Recommend

Cassandra excels in a broad range of applications, especially if you understand its data model and write your applications accordingly. It's an excellent choice for time-series data, and a poor choice for application queues. It performs best when you can simply record history and compute from it, rather than frequently going back to edit or delete things.

Cassandra, hands-on review, after 4 years of serious use
