AWS Database Blog

Compress and conquer with Amazon Keyspaces (for Apache Cassandra)

Amazon Keyspaces (for Apache Cassandra) is a scalable, highly available, and managed Apache Cassandra-compatible database service that enables you to run Cassandra workloads more easily by using a serverless, pay-as-you-go solution.

With Amazon Keyspaces, you don’t have to worry about configuring and optimizing your Cassandra cluster for your mission-critical, operational workloads. Amazon Keyspaces provides you with single-digit-millisecond response times at any scale. You can build applications with virtually unlimited throughput and storage that can serve thousands of requests per second. To help deliver fast performance, Amazon Keyspaces has a 1 MB row-size quota (for more information, see Quotas for Amazon Keyspaces (for Apache Cassandra)).

However, you may already have rows larger than 1 MB in your existing Cassandra tables. To reduce the size of these rows, you can compress one or more large columns. Compressing large columns reduces your storage costs, improves performance, reduces I/O and network usage, and enables you to fit the data within the Amazon Keyspaces row quota.

In this post, I show you how to compress your data using freely available compression tools and store that compressed data in Amazon Keyspaces.

Prerequisites

To get started, let’s first create one keyspace and two tables.

Amazon Keyspaces stores data durably across multiple AWS Availability Zones using a replication factor of three for high availability. You don’t have to specify a replication factor when creating a new keyspace in Amazon Keyspaces; the service configures these settings automatically.

The following code creates one keyspace:

CREATE KEYSPACE compression
  WITH replication = {'class': 'com.amazonaws.cassandra.DefaultReplication'}
  AND durable_writes = true;

Compression algorithms such as Snappy, LZ4, GZIP, and ZSTD generate binary output that you can store in a BLOB column in your table. For this post, I chose the Snappy compression algorithm to compress my data, for the following reasons:

  • Fast compression and decompression rates of 200–560 MB per second
  • Low memory footprint; SnappyOutputStream uses only 32 KB+ by default
  • Portable across various operating systems; Snappy-Java contains native libraries built for Windows, macOS, and Linux (I ran all tests on an Amazon Elastic Compute Cloud (Amazon EC2) instance with 8 vCPUs and 16 GB of RAM)
  • Simple API (see the short sketch after this list)
  • Free for commercial and non-commercial use
  • A compression ratio of 2.073
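
To illustrate how simple the API is, the following minimal sketch (assuming the snappy-java library is on your classpath) compresses a repetitive UTF-8 payload, decompresses it, and prints the sizes and resulting ratio:

import org.xerial.snappy.Snappy;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SnappyRoundTrip {
    public static void main(String[] args) throws Exception {
        // Repetitive sample payload so the effect of compression is visible;
        // very small payloads may not shrink at all
        byte[] original = "Compress and conquer with Amazon Keyspaces! "
                .repeat(100)
                .getBytes(StandardCharsets.UTF_8);

        byte[] compressed = Snappy.compress(original);
        byte[] restored = Snappy.uncompress(compressed);

        System.out.printf("original=%d bytes, compressed=%d bytes, ratio=%.2f%n",
                original.length, compressed.length,
                (double) original.length / compressed.length);
        System.out.println("lossless round trip: " + Arrays.equals(original, restored));
    }
}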

Now that you have the keyspace, you need to create two tables to benchmark Snappy compression with Amazon Keyspaces data. The following script creates two key-value tables with a timeuuid partition key. Choosing timeuuid gives you naturally distributed, conflict-free partition key values and spreads the workload evenly across the table.

The following code creates two tables with the timeuuid and blob columns:

CREATE TABLE compression.table_with_compressed_json (
  id timeuuid PRIMARY KEY,
  data blob);

CREATE TABLE compression.table_with_uncompressed_json (
  id timeuuid PRIMARY KEY,
  data blob);

Preparing the data

To work through our tests in this post, you need to download JSON objects with different sizes. For this post, I use 11,876 JSON objects that I downloaded from OpenFDA, but you can use any source of data available to you.

For our use case, I prepared 937 JSON objects less than or equal to 1 KB, 10,168 JSON objects between 1–4 KB, 527 JSON objects between 4–20 KB, and 244 JSON objects between 20–67 KB.
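
How your own data set breaks down will differ; the following sketch shows one way to compute such a size distribution, assuming a hypothetical directory that contains one serialized JSON object per file:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SizeHistogram {
    public static void main(String[] args) throws IOException {
        // Hypothetical directory with one serialized JSON object per file
        Path dir = Path.of("resources/json-objects");
        int[] buckets = new int[4]; // <=1 KB, 1-4 KB, 4-20 KB, >20 KB

        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                long size;
                try {
                    size = Files.size(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                if (size <= 1_024) buckets[0]++;
                else if (size <= 4_096) buckets[1]++;
                else if (size <= 20_480) buckets[2]++;
                else buckets[3]++;
            });
        }
        System.out.printf("<=1 KB: %d, 1-4 KB: %d, 4-20 KB: %d, >20 KB: %d%n",
                buckets[0], buckets[1], buckets[2], buckets[3]);
    }
}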

Using Snappy and Amazon Keyspaces

Let’s first import org.xerial.snappy.Snappy and the Cassandra driver in your Java project, and then use Snappy.compress(byte[]) and Snappy.uncompress(byte[]) to compress and decompress bytes. For example:

import org.xerial.snappy.Snappy;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.uuid.Uuids;

import software.aws.mcs.auth.SigV4AuthProvider;

import javax.net.ssl.SSLContext;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

String originalData = "Compress and Conquer with Amazon Keyspaces!";

List<InetSocketAddress> contactPoints = Collections.singletonList(
        InetSocketAddress.createUnresolved("service-endpoint", 9142));

// Create the CqlSession
CqlSession session = CqlSession.builder()
        .addContactPoints(contactPoints)
        .withSslContext(SSLContext.getDefault())
        .withLocalDatacenter("your_region")
        .withAuthProvider(new SigV4AuthProvider("your_region"))
        .build();

// compressedData is the binary payload persisted into Amazon Keyspaces as a BLOB
byte[] compressedData = Snappy.compress(originalData.getBytes(StandardCharsets.UTF_8));

// Prepare the write statement
PreparedStatement writePs = session.prepare(
        "INSERT INTO compression.table_with_compressed_json (id, data) VALUES (?, ?)");
// Generate a time-based UUID for the partition key
UUID uuid = Uuids.timeBased();
// Bind the key and the compressed payload
BoundStatement writeBoundStatement = writePs
        .bind(uuid, ByteBuffer.wrap(compressedData))
        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
// Persist into Amazon Keyspaces
session.execute(writeBoundStatement);

// Read the compressed data back from Amazon Keyspaces
PreparedStatement readPs = session.prepare(
        "SELECT data FROM compression.table_with_compressed_json WHERE id = ?");
// Prepare the read bound statement
BoundStatement readBoundStatement = readPs.bind(uuid);
ResultSet resultSet = session.execute(readBoundStatement);
// Get the row that contains the compressed payload
Row data = resultSet.one();

// Copy the BLOB contents into a byte array and decompress
ByteBuffer rawBytes = data.getByteBuffer("data");
byte[] compressedBytes = new byte[rawBytes.remaining()];
rawBytes.get(compressedBytes);
String result = new String(Snappy.uncompress(compressedBytes), StandardCharsets.UTF_8);
System.out.println(result);

Running the write performance test

To run the write performance tests, complete the following steps:

  1. Download the Maven project from the AWS Samples GitHub repository.
  2. From the top directory, run mvn install.
  3. Configure resources/config.properties (see the example file after these steps):
    1. Set contactPoint to the service endpoint. For example, cassandra.us-east-1.amazonaws.com.
    2. Set port to 9142.
    3. Set region to your Region. For example, us-east-1.
    4. Set input_jsons to the path of your JSON file. For example, resources/device-enforcement-0001-of-0001.json.
    5. Set output_partitions_compressed to the path of the compressed partitions file used to read compressed data back by ID. For example, resources/compressed_partitions.out.
    6. Set output_partitions_uncompressed to the path of the uncompressed partitions file used to read uncompressed data back by ID. For example, resources/uncompressed_partitions.out.
  4. Run the write performance test:
    java -cp SnappyKeyspaces-1.0-SNAPSHOT-jar-with-dependencies.jar PerformanceTestWriteRunner
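
Assuming the property names listed in step 3, a completed resources/config.properties might look like the following (adjust the endpoint, Region, and file paths for your environment):

contactPoint=cassandra.us-east-1.amazonaws.com
port=9142
region=us-east-1
input_jsons=resources/device-enforcement-0001-of-0001.json
output_partitions_compressed=resources/compressed_partitions.out
output_partitions_uncompressed=resources/uncompressed_partitions.out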

Running the performance test might incur costs to your account. For more information, see Amazon Keyspaces pricing.

Running the read performance test

To run the read performance tests, enter the following code:

java -cp SnappyKeyspaces-1.0-SNAPSHOT-jar-with-dependencies.jar PerformanceTestReadRunner

Running the performance test might incur costs to your account. For more information, see Amazon Keyspaces pricing.

Configuring write and read latency metrics

In this section, we walk you through configuring write and read latency metrics in Amazon CloudWatch.

Write latency

To add the write latencies for both tables, complete the following steps:

    1. On the Amazon Keyspaces console, on the Tables page, choose table_with_compressed_json or table_with_uncompressed_json.
    2. On the Capacity tab, choose Add to CloudWatch.
    3. Choose Widget Action and Write units per second.
    4. On the CloudWatch dashboard, choose Edit.
    5. Choose All Metrics and AWS/Cassandra.
    6. Choose Keyspace, Operation, and TableName.
    7. On the drop-down menu, choose the INSERT operation for table_with_compressed_json and table_with_uncompressed_json with the metric name SuccessfulRequestLatency.
    8. Choose Graph metrics.

The dashboard shows write latencies for the compressed and uncompressed writes.

    1. In the Statistic column, choose the p99 metric.
    2. Choose Update widget.

If you change the graph type from line to number, you see the write latencies as absolute numbers, as shown in the following screenshots.
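
If you prefer to retrieve the same p99 numbers programmatically rather than from the console, the following sketch (an illustration assuming the AWS SDK for Java 2.x CloudWatch client) queries SuccessfulRequestLatency for the INSERT operation on one table in the AWS/Cassandra namespace:

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Datapoint;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;

import java.time.Duration;
import java.time.Instant;

public class LatencyReport {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/Cassandra")
                    .metricName("SuccessfulRequestLatency")
                    .dimensions(
                            Dimension.builder().name("Keyspace").value("compression").build(),
                            Dimension.builder().name("TableName").value("table_with_compressed_json").build(),
                            Dimension.builder().name("Operation").value("INSERT").build())
                    .startTime(Instant.now().minus(Duration.ofHours(1)))
                    .endTime(Instant.now())
                    .period(60)                 // one datapoint per minute
                    .extendedStatistics("p99")  // same statistic as in the console
                    .build();

            GetMetricStatisticsResponse response = cw.getMetricStatistics(request);
            for (Datapoint dp : response.datapoints()) {
                System.out.printf("%s p99=%.2f%n",
                        dp.timestamp(), dp.extendedStatistics().get("p99"));
            }
        }
    }
}

To compare read latencies the same way, repeat the call with the Operation dimension set to SELECT and the TableName dimension set to the other table.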

Read latency

To add the read latencies for both tables, complete the following steps:

    1. Choose All Metrics and AWS/Cassandra.
    2. Choose Keyspace, Operation, and TableName.
    3. On the drop-down menu, choose the SELECT operation for table_with_compressed_json and table_with_uncompressed_json with the metric name SuccessfulRequestLatency.
    4. Choose Graph metrics.

The dashboard shows read latencies for the compressed and uncompressed reads.

    1. In the Statistic column, choose the p99 metric.
    2. Choose Update widget.

If you change the graph type from line to number, you see the read latencies as absolute numbers, as shown in the following screenshot.

Analyzing the data

The preceding graphs show that the p99 write latency of compressed objects improved by 18.5%, and the p99 read latency improved by 19.5%.

The following table shows statistics I collected from the application side. I divided the file sizes across four ranges to show how the distribution changed after compression. The number of large objects between 20–67 KB dropped by 77.5%, the number of objects between 4–20 KB by 22%, and the number of objects between 1–4 KB by 24.2%, while the number of objects less than or equal to 1 KB increased by 74.6%; these objects fit perfectly into 1 write capacity unit (WCU).

JSON objects     <=1 KB    1–4 KB    4–20 KB    20–67 KB
Uncompressed     937       10,168    527        244
Compressed       3,700     7,706     415        55

After capturing all metrics, delete table_with_compressed_json and table_with_uncompressed_json to avoid extra costs to your account.

Considerations

When implementing your solution, consider the compression overhead (see Compression overhead below) and the workaround you can use to read compressed data with cqlsh (see Reading data stored in BLOB columns below).

Compression overhead

Compression can improve the performance of writing and reading data by reducing I/O and network usage. However, compression increases the processing time for each run. In this example, the total elapsed time to compress data was 3,434 milliseconds, and to decompress data was 1,093 milliseconds. Therefore, the average compression overhead was 0.3 milliseconds (2.86%) for each JSON object and the average decompression overhead was 0.09 milliseconds (0.91%).
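
These figures came from instrumentation in the sample application; a minimal sketch of how you might capture similar timings (not the exact harness used for this post) is to wrap the Snappy calls with System.nanoTime():

import org.xerial.snappy.Snappy;

import java.io.IOException;
import java.util.List;

public class CompressionTimer {
    // Returns {totalCompressMillis, totalUncompressMillis} for a list of serialized JSON objects
    public static long[] measure(List<byte[]> jsonObjects) throws IOException {
        long compressNanos = 0;
        long uncompressNanos = 0;
        for (byte[] original : jsonObjects) {
            long t0 = System.nanoTime();
            byte[] compressed = Snappy.compress(original);
            compressNanos += System.nanoTime() - t0;

            long t1 = System.nanoTime();
            Snappy.uncompress(compressed);
            uncompressNanos += System.nanoTime() - t1;
        }
        return new long[] {compressNanos / 1_000_000, uncompressNanos / 1_000_000};
    }
}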

Reading data stored in BLOB columns

To help you access and read data stored in a BLOB column by using the Amazon Keyspaces console or cqlsh, I’ve created a helper wrapper cqlsh-experimental.sh that you can use from our Amazon Keyspaces developer toolkit.

    1. Clone the Amazon Keyspaces developer toolkit with the following code:
      git clone https://github.com/aws-samples/amazon-keyspaces-toolkit.git
    2. Replace the Dockerfile in the cloned repository with src/cqlsh-experimental/Dockerfile.
    3. Build the Docker image:
      docker build --tag amazon/keyspaces-snappy --build-arg CLI_VERSION=latest .
    4. Connect to Amazon Keyspaces with the additional parameter -d snappy to decompress data:
      docker run --rm -ti --entrypoint cqlsh-experimental.sh \
      amazon/keyspaces-snappy cassandra.us-east-1.amazonaws.com 9142 \
      -u "SERVICEUSERNAME" -p "SERVICEPASSWORD" -d snappy --ssl
    5. Run your CQL query.

All BLOB columns in your select statement are automatically decompressed, as shown in the following screenshot.

Storage savings

In this performance test, I processed 11,876 JSON objects that aren’t uniform in size. The original serialized data size is 30,153,018 bytes, and the compressed data size is 19,695,386 bytes. Using compression decreased the data size by 35%, with a compression ratio of 1.53.

Conclusion

The approach outlined in this post offers an effective way to compress and decompress data and store large objects in Amazon Keyspaces. You can use the examples in this post to implement a similar compression approach in your application to improve read and write performance and reduce your storage costs.

There are other options you can consider to store large rows in Amazon Keyspaces. For example, if your compressed data is still larger than 1 MB or if you prefer not to compress your data, you can break up your data across multiple rows. You can also store large objects in Amazon Simple Storage Service (Amazon S3) and store the S3 object pointers in your Amazon Keyspaces columns, or replace JSON/XML verbose format with concise binary object representation (CBOR) to reduce the size of objects.
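
As an illustration of the Amazon S3 pointer pattern, a minimal sketch (the bucket name, key scheme, and large_objects table here are hypothetical, and the AWS SDK for Java 2.x S3 client is assumed) might look like the following:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.uuid.Uuids;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.util.UUID;

public class LargeObjectWriter {
    public static void storeLargeObject(CqlSession session, S3Client s3, byte[] payload) {
        // Hypothetical bucket and key naming scheme
        String bucket = "my-keyspaces-large-objects";
        String key = "objects/" + UUID.randomUUID();

        // Store the large payload in Amazon S3
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(payload));

        // Store only the S3 pointer in Amazon Keyspaces (hypothetical table with a text column)
        session.execute(
                "INSERT INTO compression.large_objects (id, s3_pointer) VALUES (?, ?)",
                Uuids.timeBased(), "s3://" + bucket + "/" + key);
    }
}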

Please submit your questions and requests in the comments section.


About the Author

Nikolai Kolesnikov is a Sr. Data Architect and helps AWS Professional Services customers build highly scalable applications using Amazon Keyspaces. He also leads Amazon Keyspaces ProServe customer engagements.