AWS Database Blog
Compress and conquer with Amazon Keyspaces (for Apache Cassandra)
Amazon Keyspaces (for Apache Cassandra) is a scalable, highly available, and managed Apache Cassandra-compatible database service that enables you to run Cassandra workloads more easily by using a serverless, pay-as-you-go solution.
With Amazon Keyspaces, you don’t have to worry about configuring and optimizing your Cassandra cluster for your mission-critical, operational workloads. Amazon Keyspaces provides you with single-digit-millisecond response times at any scale. You can build applications with virtually unlimited throughput and storage that can serve thousands of requests per second. To help deliver fast performance, Amazon Keyspaces has a 1 MB row-size quota (for more information, see Quotas for Amazon Keyspaces (for Apache Cassandra)).
However, you may already have rows larger than 1 MB in your existing Cassandra tables. To reduce the size of these rows, you can compress one or more large columns. Compressing large columns reduces your storage costs, improves performance, reduces I/O and network usage, and enables you to fit the data within the Amazon Keyspaces row quota.
In this post, I show you how to compress your data using freely available compression tools and store that compressed data in Amazon Keyspaces.
Prerequisites
To get started, let’s first create one keyspace and two tables.
Amazon Keyspaces stores data durably across multiple AWS Availability Zones using a replication factor of three for high availability. You don’t have to specify a replication factor when creating a new keyspace in Amazon Keyspaces; the service configures these settings automatically.
The following code creates one keyspace:
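(A minimal sketch; the keyspace name `compression_blog` is only an example and is reused in the later snippets. Amazon Keyspaces accepts the `SingleRegionStrategy` replication class and manages the replication settings for you.)

```sql
CREATE KEYSPACE IF NOT EXISTS compression_blog
    WITH REPLICATION = {'class': 'SingleRegionStrategy'};
```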
Compression algorithms such as Snappy, LZ4, GZIP, and ZSTD generate binary output that you can store in a BLOB column in your table. For this post, I chose the Snappy compression algorithm for the following reasons:

- Fast compression and decompression rates of 200–560 MB per second
- Low memory footprint; `SnappyOutputStream` uses only 32 KB+ by default
- Portable across operating systems; Snappy-Java contains native libraries built for Windows, Mac, and Linux (I ran all tests on an Amazon Elastic Compute Cloud (Amazon EC2) instance with 8 vCPUs and 16 GB of RAM)
- Simple API
- Free for commercial and non-commercial use
- Compression ratio of 2.073
Now that you have the keyspace, you need to create two tables to benchmark Snappy compression with Amazon Keyspaces. The following script creates two key-value tables with a `timeuuid` partition key. Choosing `timeuuid` gives you naturally distributed, conflict-free partition key values and spreads the workload evenly across the table.
The following code creates two tables with the timeuuid and blob columns:
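(A minimal sketch: the table names appear later in this post, while the keyspace `compression_blog` and the column names `id` and `json` are assumptions carried through the other snippets. Both tables store the payload as a BLOB so that the compressed and uncompressed runs are directly comparable.)

```sql
CREATE TABLE IF NOT EXISTS compression_blog.table_with_compressed_json (
    id   timeuuid PRIMARY KEY,
    json blob
);

CREATE TABLE IF NOT EXISTS compression_blog.table_with_uncompressed_json (
    id   timeuuid PRIMARY KEY,
    json blob
);
```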
Preparing the data
To work through the tests in this post, you need to download JSON objects of different sizes. For this post, I use 11,876 JSON objects that I downloaded from OpenFDA, but you can use any data source available to you.
For our use case, I prepared 937 JSON objects less than or equal to 1 KB, 10,168 JSON objects between 1–4 KB, 527 JSON objects between 4–20 KB, and 244 JSON objects between 20–67 KB.
Using Snappy and Amazon Keyspaces
First, import `org.xerial.snappy.Snappy` and a Cassandra driver into your Java project, and then use `Snappy.compress(byte[])` and `Snappy.uncompress(byte[])` to compress and decompress bytes. For example:
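(A self-contained round-trip sketch; the inline JSON string is made up, and in the actual test the bytes come from the OpenFDA objects described in the preceding section.)

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.xerial.snappy.Snappy;

public class SnappyRoundTrip {
    public static void main(String[] args) throws IOException {
        String json = "{\"status\":\"Terminated\",\"product_type\":\"Devices\"}";

        // Compress the serialized JSON before writing it to the blob column.
        byte[] compressed = Snappy.compress(json.getBytes(StandardCharsets.UTF_8));

        // Decompress the bytes after reading them back from Amazon Keyspaces.
        byte[] uncompressed = Snappy.uncompress(compressed);

        System.out.printf("original=%d bytes, compressed=%d bytes%n",
            json.getBytes(StandardCharsets.UTF_8).length, compressed.length);
        System.out.println(new String(uncompressed, StandardCharsets.UTF_8));
    }
}
```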
Running the write performance test
To run the write performance tests, complete the following steps:
- Download the Maven project from the AWS Samples GitHub repository.
- From the top directory, run `mvn install`.
- Configure `resources/config.properties`:
  - Set `contactPoint` to the service endpoint. For example, `cassandra.us-east-1.amazonaws.com`.
  - Set `port` to `9142`.
  - Set `region` to your Region. For example, `us-east-1`.
  - Set `input_jsons` to the path of your JSON file. For example, `resources/device-enforcement-0001-of-0001.json`.
  - Set `output_partitions_compressed` to the path of the compressed partitions file used to read compressed data back by ID. For example, `resources/compressed_partitions.out`.
  - Set `output_partitions_uncompressed` to the path of the uncompressed partitions file used to read uncompressed data back by ID. For example, `resources/uncompressed_partitions.out`.
- Run the write performance test:
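The run command itself comes with the sample project. Under the hood, the write test compresses each JSON object with Snappy and writes it to `table_with_compressed_json`, and writes the raw bytes to `table_with_uncompressed_json` for comparison. The following is a minimal sketch of the compressed write path using the DataStax Java driver 4.x; the keyspace name `compression_blog` and the column names `id` and `json` are the same assumptions used in the earlier CQL sketches, not values taken from the sample project.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.uuid.Uuids;
import org.xerial.snappy.Snappy;

public class CompressedWriter {

    // Compresses one JSON document and inserts it into the compressed table.
    // Assumes the CqlSession is already connected to Amazon Keyspaces (TLS, port 9142)
    // and that the keyspace and table from the earlier CQL sketch exist.
    public static void writeCompressed(CqlSession session, String json) throws IOException {
        PreparedStatement insert = session.prepare(
            "INSERT INTO compression_blog.table_with_compressed_json (id, json) VALUES (?, ?)");

        byte[] compressed = Snappy.compress(json.getBytes(StandardCharsets.UTF_8));

        // Amazon Keyspaces requires LOCAL_QUORUM consistency for writes.
        session.execute(insert.bind(Uuids.timeBased(), ByteBuffer.wrap(compressed))
            .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
    }
}
```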
Running the performance test might incur costs to your account. For more information, see Amazon Keyspaces pricing.
Running the read performance test
To run the read performance tests, enter the following code:
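As with the write test, the exact command comes with the sample project. Conceptually, the read test looks up each partition ID recorded in the output files and decompresses the returned blob, roughly as in the following sketch (same assumed keyspace and column names as before).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import org.xerial.snappy.Snappy;

public class CompressedReader {

    // Reads one compressed JSON document by its timeuuid partition key and
    // returns the decompressed text (or null if the partition doesn't exist).
    public static String readCompressed(CqlSession session, UUID id) throws IOException {
        Row row = session.execute(SimpleStatement.newInstance(
            "SELECT json FROM compression_blog.table_with_compressed_json WHERE id = ?", id)).one();
        if (row == null) {
            return null;
        }

        // Copy the blob out of the driver's ByteBuffer and decompress it with Snappy.
        ByteBuffer blob = row.getByteBuffer("json");
        byte[] compressed = new byte[blob.remaining()];
        blob.get(compressed);

        return new String(Snappy.uncompress(compressed), StandardCharsets.UTF_8);
    }
}
```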
Running the performance test might incur costs to your account. For more information, see Amazon Keyspaces pricing.
Configuring write and read latency metrics
In this section, we walk you through configuring write and read latency metrics in Amazon CloudWatch.
Write latency
To add the write latencies for both tables, complete the following steps:
- On the Amazon Keyspaces console, on the Tables page, choose `table_with_compressed_json` or `table_with_uncompressed_json`.
- On the Capacity tab, choose Add to CloudWatch.
- Choose Widget Action and Write units per second.
- On the CloudWatch dashboard, choose Edit.
- Choose All Metrics and AWS/Cassandra.
- Choose Keyspace, Operation, and TableName.
- On the drop-down menu, choose the INSERT operation for `table_with_compressed_json` and `table_with_uncompressed_json` with the metric name `SuccessfulRequestLatency`.
- Choose Graph metrics.
The dashboard shows write latencies for the compressed and uncompressed writes.
- In the Statistic column, choose the `p99` metric.
- Choose Update widget.
If you change the graph type from line to number, you see the write latencies as absolute numbers, as shown in the following screenshots.
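If you prefer the command line, you can pull the same p99 latency with the AWS CLI instead of the console. A sketch, using the assumed keyspace name `compression_blog` from the earlier snippets and a placeholder time window; change `Value=INSERT` to `Value=SELECT` to get the read latencies described in the next section:

```bash
aws cloudwatch get-metric-statistics \
    --namespace "AWS/Cassandra" \
    --metric-name SuccessfulRequestLatency \
    --dimensions Name=Keyspace,Value=compression_blog \
                 Name=TableName,Value=table_with_compressed_json \
                 Name=Operation,Value=INSERT \
    --start-time 2021-06-01T00:00:00Z --end-time 2021-06-01T01:00:00Z \
    --period 60 \
    --extended-statistics p99
```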
Read latency
To add the read latencies for both tables, complete the following steps:
- Choose All Metrics and AWS/Cassandra.
- Choose Keyspace, Operation, and TableName.
- On the drop-down menu, choose the SELECT operation for `table_with_compressed_json` and `table_with_uncompressed_json` with the metric name `SuccessfulRequestLatency`.
- Choose Graph metrics.
The dashboard shows read latencies for the compressed and uncompressed reads.
- In the Statistic column, choose the `p99` metric.
- Choose Update widget.
If you change the graph type from line to number, you see the read latencies as absolute numbers, as shown in the following screenshot.
Analyzing the data
The preceding graphs show that the p99 write latency of compressed objects improved by 18.5%, and the p99 read latency improved by 19.5%.
The following table shows statistics I collected from the application side. I divided the file sizes across four ranges to show how the distribution changed after compression. The number of large objects between 20–67 KB decreased by 77.5%, objects between 4–20 KB decreased by 22%, and objects between 1–4 KB decreased by 24.2%, while the number of objects less than or equal to 1 KB increased by 74.6%; those objects fit perfectly into 1 write capacity unit (WCU).
| JSON Objects | <=1 KB | 1–4 KB | 4–20 KB | 20–67 KB |
| --- | --- | --- | --- | --- |
| Uncompressed | 937 | 10,168 | 527 | 244 |
| Compressed | 3,700 | 7,706 | 415 | 55 |
After capturing all the metrics, delete `table_with_compressed_json` and `table_with_uncompressed_json` to avoid extra costs to your account.
Considerations
When implementing your solution, you should consider the compression overhead (see "Compression overhead" below) and the workarounds you can use to read compressed data by using cqlsh (see "Reading data stored in BLOB columns" below).
Compression overhead
Compression can improve the performance of writing and reading data by reducing I/O and network usage. However, compression adds processing time to each run. In this example, the total elapsed time to compress the 11,876 JSON objects was 3,434 milliseconds, and the total time to decompress them was 1,093 milliseconds. Therefore, the average compression overhead was 0.3 milliseconds (2.86%) per JSON object and the average decompression overhead was 0.09 milliseconds (0.91%) per object.
Reading data stored in BLOB columns
To help you access and read data stored in a BLOB column by using the Amazon Keyspaces console or cqlsh, I've created a helper wrapper, `cqlsh-experimental.sh`, that you can use from our Amazon Keyspaces developer toolkit.
- Clone the Amazon Keyspaces developer toolkit (the clone, build, and connect commands are sketched after this list).
- Replace the Dockerfile in the cloned repository with `src/cqlsh-experimental/Dockerfile`.
- Build the Docker image.
- Connect to Amazon Keyspaces with the additional parameter `-d snappy` to decompress data.
- Run your CQL query.
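The toolkit's own documentation has the authoritative commands; the following is only a hedged sketch, assuming the toolkit lives in the aws-samples/amazon-keyspaces-toolkit repository and using an illustrative image tag, endpoint, and credentials:

```bash
# Step 1: clone the developer toolkit (repository location assumed).
git clone https://github.com/aws-samples/amazon-keyspaces-toolkit.git
cd amazon-keyspaces-toolkit

# Step 3: build the image after swapping in src/cqlsh-experimental/Dockerfile (tag is illustrative).
docker build -t cqlsh-experimental .

# Step 4: connect through the wrapper, passing -d snappy so BLOB columns are decompressed on read.
# Replace the endpoint and the service-specific credentials with your own values.
docker run --rm -ti cqlsh-experimental cassandra.us-east-1.amazonaws.com 9142 \
    -u "my-service-user" -p "my-service-password" --ssl -d snappy
```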
All BLOB columns in your select statement are automatically decompressed, as shown in the following screenshot.
Storage savings
In this performance test, I processed 11,876 JSON objects that aren't uniform in size. The original serialized data size is 30,153,018 bytes, and the compressed data size is 19,695,386 bytes. Using compression decreased the data size by 35%, with a compression ratio of 1.53.
Conclusion
The approach outlined in this post offers an effective way to compress and decompress data and store large objects in Amazon Keyspaces. You can use the examples in this post to implement a similar compression approach in your application to improve read and write performance and reduce your storage costs.
There are other options you can consider to store large rows in Amazon Keyspaces. For example, if your compressed data is still larger than 1 MB or if you prefer not to compress your data, you can break up your data across multiple rows. You can also store large objects in Amazon Simple Storage Service (Amazon S3) and store the S3 object pointers in your Amazon Keyspaces columns, or replace verbose JSON or XML formats with Concise Binary Object Representation (CBOR) to reduce the size of objects.
Please submit your questions and requests in the comments section.
About the Author
Nikolai Kolesnikov is a Sr. Data Architect who helps AWS Professional Services customers build highly scalable applications using Amazon Keyspaces. He also leads Amazon Keyspaces ProServe customer engagements.