Step 1: Set Up Cassandra
First, you need to download, build, and install Cassandra (C*). This step is required whether you are on Windows or Linux. See here for more details.
If you are on Windows, make sure to read about MinGW too. You will also need to know how to access your cluster with SSH (Linux) or Remote Desktop Connection (Windows).
Make sure that your system has enough RAM and CPU power available, especially when using YCSB (Hint: check out the c_measurements_suite tool).
Also, do not forget to have your client libraries ready. There is an excellent tutorial by DataStax on installing Cassandra with their Java driver. Once Cassandra is installed, open a new terminal window or prompt in your C* installation directory, run cqlsh to confirm that the shell starts and connects, and then press CTRL+D to exit.
Step 2: Basic Key/Value Pairs
Whether you’re a seasoned developer or just getting your feet wet, making sense of Cassandra can be hard.
Luckily, Cassandra uses a simple, flexible data model that makes it easy to take advantage of its full power. With everything else in place, you need only learn two basic concepts: keys and columns.
A key can hold a very large number of columns, but not an unlimited one: Cassandra caps a single partition at roughly 2 billion cells, and in practice both very wide keys and a very large number of column families in one cluster become hard to operate.
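The two concepts can be sketched in a few lines of Python. This is a toy in-memory model meant only to show how a key maps to its own set of named columns; the helper names put and get and the sample keys are made up for illustration, and nothing here reflects how Cassandra stores data on disk.

```python
from collections import defaultdict

# Toy in-memory model: each key maps to its own set of named columns.
# Illustration only, not Cassandra's storage format.
store = defaultdict(dict)


def put(key, column, value):
    """Set one column for the given key."""
    store[key][column] = value


def get(key, column=None):
    """Return one column value, or a copy of all columns for the key."""
    columns = store.get(key, {})
    if column is None:
        return dict(columns)
    return columns.get(column)


put("user:1", "name", "Ada")
put("user:1", "email", "ada@example.com")
put("user:2", "name", "Grace")  # a different key may carry fewer columns

print(get("user:1"))
print(get("user:2", "email"))  # None: this key has no such column
```

Note that the two keys do not share a fixed set of columns, which is the flexibility the data model is offering.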
Step 3: Store Data in Structs
There are four main things to consider when working with structs:
- The fields (what goes in your struct)
- How these fields get their values
- Serialization
- Performance considerations
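The four points above can be seen together in a small sketch. It uses a plain Python dataclass as a stand-in for a struct (in Cassandra itself the closest CQL feature is a user-defined type); the Address type, its fields, and the choice of JSON as the wire format are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Address:
    # Fields: what goes in the struct (these names are hypothetical).
    street: str
    city: str
    # How a field gets its value: a default factory fills it when omitted.
    phones: List[str] = field(default_factory=list)


def serialize(addr: Address) -> bytes:
    """Serialization: turn the struct into bytes (JSON chosen for readability)."""
    return json.dumps(asdict(addr), sort_keys=True).encode("utf-8")


def deserialize(raw: bytes) -> Address:
    """Rebuild the struct from its serialized form."""
    return Address(**json.loads(raw.decode("utf-8")))


home = Address(street="1 Main St", city="Springfield")
raw = serialize(home)
assert deserialize(raw) == home  # the round trip preserves every field

# Performance: the serialized size grows with every field you add.
print(len(raw), "bytes")
```

A text format like JSON is easy to inspect but larger than a binary encoding, which is the kind of trade-off the performance bullet is about.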
Step 4: Add Secondary Indexes
A secondary index adds an additional way to query a row, based on a column that’s not part of your primary key.
Indexes are not free, though: every write to an indexed column also has to update the index, and a query that relies only on an index may have to consult many nodes. Create them deliberately rather than by default.
With an index in place, you can filter on the indexed column, alone or together with the primary key, and get back only the rows that match, instead of reading everything and filtering in your application.
A simple example would be creating an email address index for searching by email. If you generate your keys as UUIDs (as we recommend), the keys stay unique no matter which columns you index, so two rows that share an indexed value never collide.
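To make the email example concrete, here is a minimal in-memory sketch of what an index does conceptually: it keeps a second map from the indexed value back to the primary keys. The names rows, email_index, insert, and find_by_email are invented for the example, and this is not how Cassandra implements its indexes internally.

```python
import uuid
from collections import defaultdict

rows = {}                        # primary key (UUID) -> row
email_index = defaultdict(set)   # indexed column value -> primary keys


def insert(name, email):
    """Write the row and its index entry; the index costs an extra write."""
    key = uuid.uuid4()
    rows[key] = {"name": name, "email": email}
    email_index[email].add(key)
    return key


def find_by_email(email):
    """Look rows up through the index instead of scanning every row."""
    return [rows[k] for k in email_index.get(email, ())]


ada = insert("Ada", "ada@example.com")
insert("Grace", "grace@example.com")
print(find_by_email("ada@example.com"))
```

Every insert touches two structures instead of one, which is where the write cost of an index comes from.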
Step 5: Create Custom Compaction Strategies
Compaction is one of Apache Cassandra’s most important features, but it can also be tricky to get right.
As your data grows and takes up more space on disk, you may find that your compaction strategy no longer keeps up. Compaction itself needs headroom: while it merges files, the old and new copies exist side by side, so on a nearly full disk it can temporarily make matters worse by using up the remaining space.
Both are great ways to configure and monitor your database in real-time.
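The disk-space concern around compaction is easier to see in a toy model. The sketch below treats each on-disk table as a dictionary of key to (timestamp, value) and merges them by keeping the newest write per key; the compact and size helpers are hypothetical, and counting entries is only a stand-in for bytes on disk.

```python
def compact(sstables):
    """Merge several tables into one, keeping the newest write per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


def size(table):
    """Count entries as a stand-in for bytes on disk."""
    return len(table)


old = {"a": (1, "x"), "b": (1, "y")}
new = {"a": (2, "z"), "c": (2, "w")}

result = compact([old, new])

# The inputs can only be deleted after the merged table is fully written,
# so peak usage is inputs plus output.
peak = size(old) + size(new) + size(result)
print(result, peak)
```

The merged table is smaller than the inputs combined, but until the inputs are removed the disk has to hold both.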
Step 6: Put Cassandra in Production with Docker
The name Docker has been on everyone’s lips over these past few years. But why? And why is it one of your best options for deploying Cassandra in production?
This step will dive into Docker and look at ways you can use it to deploy a production-ready Cassandra cluster.
Step 7: Connect to Your Datastore from Client Applications
Now that your cluster is up and you can reach it with cqlsh, you need to teach your client applications how to connect.
While there are many ways to do so, for beginners I recommend a CQL driver that speaks Cassandra’s native protocol, such as the DataStax Java driver mentioned in Step 1. Thrift, the older interface, was deprecated and then removed in Cassandra 4.0, and JDBC is a Java API, so it only helps if your application runs on the JVM and you add a JDBC wrapper for Cassandra. Whichever you pick, prefer its asynchronous API for high-throughput workloads.
Scripting languages such as Python or Ruby do not go through JDBC; they connect with their own CQL drivers.
Step 8: Use Query Batching to Improve Performance
Have you noticed that your queries run slowly sometimes?
Query batching is one of many techniques for optimizing performance in distributed systems and it’s very simple to enable.
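If you aren’t familiar with query batching, I suggest reading CASSANDRA-9352 as a quick introduction. Batches are assembled by the client (in CQL, between BEGIN BATCH and APPLY BATCH); the server only carries limits in its configuration, such as batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb in cassandra.yaml (the exact names vary by version), which warn about or reject oversized batches.
We generally suggest a value of 100 statements per batch as a starting point, but there are no hard rules here, so experiment with different values to find what works best for your use case. Batches tend to help most when every statement targets the same partition; batches that span many partitions put extra work on the coordinator node.
As always, measure after changing any setting like this! If you change the server-side thresholds, apply the same values to all servers in your cluster.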
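One way to experiment with batch sizes on the client side is sketched below. It groups statements by the key they target and caps each batch at a tunable size, defaulting to the value of 100 suggested above. The make_batches helper and the (partition_key, statement) tuple shape are assumptions for illustration; the sketch only builds the groups and does not talk to a cluster.

```python
from collections import defaultdict


def make_batches(statements, max_size=100):
    """Group (partition_key, statement) pairs into per-partition batches.

    max_size caps the number of statements per batch and is meant to be tuned.
    """
    by_partition = defaultdict(list)
    for partition_key, statement in statements:
        by_partition[partition_key].append(statement)

    batches = []
    for partition_key, group in by_partition.items():
        # Slice each partition's statements into chunks of at most max_size.
        for start in range(0, len(group), max_size):
            batches.append((partition_key, group[start:start + max_size]))
    return batches


writes = [("user:1", f"stmt {i}") for i in range(250)]
writes += [("user:2", "stmt x")]

for key, batch in make_batches(writes):
    print(key, len(batch))
```

Keeping each batch within one key is a design choice of this sketch, made so that changing max_size is the only variable when you measure.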
Step 9: Optimize for Column Families Instead of Tables and Rows
If you’re used to relational databases and looking at data as rows and tables, it may be a little difficult to wrap your head around column families.
A better way to think about it: in a relational database, a table has two fixed dimensions, rows and columns, and every row has the same columns. In Apache Cassandra, a column family (called a table in CQL) groups the columns stored under each key, and a row only stores the columns that actually have values.
This makes your design more flexible because rows in the same column family do not all have to carry the same set of columns.
Step 10: Understand When to Use NODESET versus KEYSET Queries
Querying is such a fundamental skill in Apache Cassandra that you need to know how to do it right. This guide distinguishes two ways to perform queries, NODESET and KEYSET. Note that these are labels for query patterns, not CQL keywords.
With a NODESET query, the data you ask for is served by one specific node rather than by the whole cluster. That makes the query faster and more efficient when you are looking for a small subset of data, because not every node has to be contacted.
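The idea that a single node can answer for a given key can be sketched with a toy token ring. This assumes that "one specific node" means the node that owns the key's position on the ring; the Ring class and node names are hypothetical, and MD5 is used only to keep the example within the standard library (it is not Cassandra's default partitioner, and replication is ignored).

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: each node owns the keys that hash up to its token."""

    def __init__(self, nodes):
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        # MD5 keeps the sketch stdlib-only; it is a stand-in hash function.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the single node responsible for this key."""
        i = bisect.bisect_left(self._tokens, self._token(key))
        # Wrap around to the first node when the key hashes past the last token.
        return self._ring[i % len(self._ring)][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1"))
```

Because the owner is computed from the key alone, a client or coordinator that knows the ring can go straight to that node instead of asking every node.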