AWS Storage Blog

Amazon S3 Express One Zone delivers cost and performance gains for ChaosSearch customers

ChaosSearch is an Amazon S3-native database built on a serverless, stateless compute architecture within AWS that delivers live search, SQL, and Generative AI analytics. At ChaosSearch, the speed and performance of our architecture are important to us and our customers, because time to results is the difference between success and failure, and we rely on only the best technology to continue to revolutionize live data analytics at scale.

As the Founder and Chief Technology and Science Officer at ChaosSearch, I am responsible for both strategy and innovation. Therefore, I was eager to try out the new Amazon S3 Express One Zone storage class, the fastest storage class in the cloud, and bolster the ChaosSearch design, a next-generation database that runs entirely on cloud storage, to provide even faster near real-time analytics.

In this post, I discuss the unique approach we took to transform shared-everything storage like Amazon S3 into serverless and stateless compute architecture, resulting in a third-generation database. Leveraging S3 Express One Zone’s consistent single-digit millisecond request latency enables our customers to run queries up to 60% faster without any code changes to their workloads. S3 Express One Zone is a true game changer, allowing us to unify all workloads (at scale) across both operational and business analyses, such as security, observability, and deep application insights, as shown in the following figure.

New approach to databases

Cloud computing like AWS and object storage like Amazon S3 offer significant advantages for scalability, availability, and database costs. This is especially true when databases are truly designed for the cloud from the ground up. However, many databases ported to the cloud carry some first- and second-generation designs and administration overhead, especially in the areas of deployment, configuration, partitioning, and synchronization:

  • Manually provisioning, configuring, and load balancing, including stateful clusters of nodes with respect to quorum and data synchronization.
  • Recurring expenses for production resources, particularly around failover and recovery compute that usually sit idle until trouble occurs.
  • Increased volume, such as during the Black Friday shopping holiday, becomes a capacity-replanning event where administrators predict, add, and rebalance compute and storage in advance, applying them during service windows and periods of reduced performance.

For an analysis of third-generation architectures, the post "Cloud Object Storage-based Architectures are Natively Scalable and Available" examines how ChaosSearch overcomes the limitations of both first- and second-generation database architectures.

Figure 1: Database architecture evolution

The basic idea is that ChaosSearch nodes essentially do not communicate state directly; instead, they leverage strongly consistent, distributed cloud storage like Amazon S3 as the point of synchronization. As a result, third-generation databases inherently have deterministic controls and procedures. Think of cloud object storage as a data fabric: a location that stores both the data and the state.

ChaosSearch made an early strategic commitment back in 2016 to leverage Amazon S3 as the backbone to support analytics at true scale. Now Amazon S3 Express One Zone, which can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard while scaling to process millions of requests per minute, has turbocharged the ChaosSearch database platform. The results are unequivocal: without any tuning, and simply by porting to S3 Express One Zone, ChaosSearch achieved round-trip times 2-3x faster than Amazon S3 Standard. Queries that once took hundreds of milliseconds to fully complete can now take 25 to 50 milliseconds, with no caching or local storage. We have observed this to be comparable to local instance storage performance.

Our customers’ experiences have shown that a third-generation database achieves scale with performance at one-tenth of the cost when compared to first and second generations. With cloud storage as the “enabler” for this scalability and availability, you have a solid foundation to make many strong architectural decisions.

ChaosSearch solution

In today’s ever-changing digital era and its tsunami of data, the rise of movements like Generative AI is more than just a trend – it’s a wakeup call. The industry is at a point where time, cost, and complexity are blocking the next generation of data-driven value. Organizations are increasingly gravitating toward cloud object storage services, like Amazon S3, due to their inherent security, scalability, and cost-effectiveness, as shown in the following figure.

Figure 2: Chaos LakeDB solution

Recognizing an enormous opportunity, ChaosSearch brought a novel and innovative approach that turns static cloud repositories into a live lake database. This is called Chaos LakeDB, and it provides a unified data fabric across Search, SQL, and Generative AI interfaces. The core principles of this transformational architecture are aggregation, automation, and activation. It emphasizes live data ingestion, where schema detection and indexing are self-governed, removing labor-intensive ETL pipelines and ultimately democratizing value-driven analytics.

Along with cloud object storage, ChaosSearch shines with its multi-model access. Users are not just restricted to SQL queries. They are empowered to delve deep into this live unified data lake, harnessing everything from Full-Text Search to Natural Language AI-driven analytics. This broad-capability spectrum is foundationally supported by stateless scalability and availability. As the Big Bang of Big Data continues to expand, ChaosSearch delivers peak performance without manual interventions or huge costs.

The science behind live and variable ingestion with multi-model access is our index technology and architecture. As data flows into Amazon S3, ChaosSearch automatically indexes it, enabling the data to become instantly searchable and ready for analysis. Coupled with an intelligent query engine, ChaosSearch indexing allows a fast "first read" of data stored within Amazon S3, with no need for caching or local storage. The result is a solution that is far more dynamic, at greatly reduced cost and complexity.

A new centerpiece of the ChaosSearch access offering is a portal to Conversational AI analytics through Large Language Models (LLMs). Recognizing the treasure trove that data lakes present for AI workloads, ChaosSearch allows businesses to extract profound insights directly from their data lake repositories. This eliminates the need for external tools or complicated low-level interfaces. Users can have a conversation with their data, asking questions as if they were talking to internal security, business, or product experts.

“ChaosSearch is a distinctive component of our data lake architecture. With Chaos LakeDB, we’ve greatly simplified data ingestion, schema creation and management of some of our most challenging data sets and dynamic use cases. Our inherently challenging data is instantly accessible for queries, offering unique flexibility and robustness – all at Cisco’s scale and with significant cost savings.”

– Lee Jones, Architect & Distinguished Engineer, Cisco Systems

Amazon S3 Express One Zone test results

We are always looking to improve; speeding up our indexing and analytics solutions to deliver even faster results and insights to customers is a constant endeavor.

During my initial testing, I found the experience with Amazon S3 Standard prepared me well for adopting S3 Express One Zone. I was able to use the Python SDK for initial setup and bucket creation, and the Java SDK to evaluate and study the underlying nature of PUT, GET, and LIST operations.
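For context, S3 Express One Zone uses directory buckets, whose names must embed the Availability Zone ID. The following is a minimal sketch of that Python SDK setup step; the base name and AZ ID are placeholders, not the values we used in testing.

```python
# Sketch: building the create_bucket arguments for an S3 Express One Zone
# directory bucket. The bucket base name and AZ ID below are placeholders.

def directory_bucket_request(base_name: str, az_id: str) -> dict:
    """Directory bucket names embed the AZ ID and end in --x-s3."""
    return {
        "Bucket": f"{base_name}--{az_id}--x-s3",
        "CreateBucketConfiguration": {
            "Location": {"Type": "AvailabilityZone", "Name": az_id},
            "Bucket": {
                "Type": "Directory",
                "DataRedundancy": "SingleAvailabilityZone",
            },
        },
    }
```

In an account with the appropriate permissions, the returned arguments can be passed to `boto3.client("s3").create_bucket(**directory_bucket_request("chaos-indices", "use1-az5"))`.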

Today, I am excited to announce that ChaosSearch can officially use Amazon S3 Express One Zone in combination with S3 Standard. As data lands in a customer’s S3 Standard bucket, ChaosSearch is notified of its existence through Amazon Simple Queue Service (Amazon SQS). The solution then automatically and fully indexes the raw data and writes the indices back into a specified customer Amazon S3 bucket as well as S3 Express One Zone, as shown in Figure 3. The duration that these indices (i.e., segments) reside in either storage class is configurable, and the ChaosSearch query planner prefers S3 Express One Zone over S3 Standard when available.
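As a sketch of that storage-class preference — illustrative names and window length, not ChaosSearch internals — a planner might resolve each segment to its Express One Zone copy while the segment is inside a configurable residency window, falling back to the S3 Standard copy afterward:

```python
# Illustrative sketch of preferring S3 Express One Zone over S3 Standard.
# All names and the residency window are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Segment:
    key: str
    age_days: float  # time since the segment was indexed

EXPRESS_RESIDENCY_DAYS = 7.0  # hypothetical configurable window

def bucket_for(segment: Segment, express_bucket: str, standard_bucket: str) -> str:
    """Prefer the Express One Zone copy while it is within the window."""
    if segment.age_days <= EXPRESS_RESIDENCY_DAYS:
        return express_bucket
    return standard_bucket
```

A recently indexed segment resolves to the directory bucket; an older one resolves to the S3 Standard bucket that holds the durable copy.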

Figure 3: ChaosSearch data workflow in AWS

Chaos performance benchmarks were run across a variety of data sources to validate the benefits of Amazon S3 Express One Zone. These data sources comprise billions of rows and hundreds of gigabytes of data. The primary focus was to quantify Search (Elastic) and SQL (Presto) API query planning. The report centers on GET operations, which are what matter most in database benchmarking.

As part of any Search and/or Query, there are three types of data and thus three types of GETs:

  • The first GET is during the Scoping phase of the query plan (that is, which segments are required to resolve a particular search/query).
  • The second and third GETs occur in the Execution phase (that is, where the work is actually accomplished at the compute fabric edge).

Each of these GETs is relatively small. For example, the Scoping GET averages tens of kilobytes in size, whereas the two Execution GETs are in the tens of megabytes. No GET exceeds 100 MB by design, which lends itself to the low latency of Amazon S3 Express One Zone.

With that said, for each GET there is work to be done on its data. If Amazon S3 Express One Zone can return a GET in 5 milliseconds, that does not mean the associated Chaos Index processing will also complete in 5 milliseconds. It means that a ChaosSearch task that once completed in 100-250 milliseconds of total materialization can now complete in 40-100 milliseconds: a more than 2x reduction in the total time to complete the task. In other words, each task is a GET/INSTANTIATE action on an object.

Performance improvements

The test results show the three types of GETs where the total time of a query is:

GET OBJECT + OBJECT WORK = COMPLETION TIME

The work is executed on a distributed compute fabric. The total time includes scoping (segment identification) and execution (segment work). For example, if there are 10 workers and 100 segments, each worker processes 10 segments, and the results are aggregated to be returned.
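That arithmetic can be sketched as a simple model (illustrative only, not the actual ChaosSearch scheduler): scoping is serial, and segment work is divided evenly across workers, so completion time is the scoping time plus one worker's share of segments.

```python
import math

def completion_time_ms(num_segments: int, num_workers: int,
                       scoping_ms: float, per_segment_ms: float) -> float:
    """GET OBJECT + OBJECT WORK, aggregated: scoping is not parallelized,
    and each worker processes ceil(segments / workers) segments in turn."""
    per_worker = math.ceil(num_segments / num_workers)
    return scoping_ms + per_worker * per_segment_ms
```

With 100 segments, 10 workers, and hypothetical times of 20 ms for scoping and 50 ms per segment, completion is 20 + 10 × 50 = 520 ms; doubling the workers to 20 drops it to 270 ms.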

1st GET (Index Scoping – not parallelized)

Figure 4: Index scoping comparison

This data shows that the scoping time was reduced on average by 2-3x. Note that scope parallelization was not turned on, to better quantify throughput. With parallelism enabled, the Test 4 time could be cut to a half or a third, which might change the results slightly.

2nd GET (Index Symbols – individual request)

Figure 5: GET symbols comparison

3rd GET (Index Locality – individual request)

Figure 6: GET locality comparison

Once again, we see an average latency reduction of 2-3x. These tests also used different sizes of locality to get a range of latency and performance.

WORK TIME (GETs and Work – individual request)

Figure 7: GET segment and work comparison

This data shows that the latency reduction is more variable. There is not a clear 2-3x range; the tests show reductions of 1.5x, 4.0x, 2.7x, and 1.3x respectively. This is a result of the type of query performed: the GETs are consistent in their reduction based on the size and number of segments, but the actual work to be done differs. What is interesting is how faster GETs let workers finish sooner and move on to the next item of work for particular queries. For example, Test 2 has a particular query execution plan (e.g., correlations), so there is a great advantage to using Amazon S3 Express One Zone (4x in performance).

TOTAL TIME (From User request to User response)

Figure 8: Total time comparison

The total time of a query shows the same variability of benefit based on the type of test that is run. There is a strong correlation between WORK TIME and TOTAL TIME. The very good news is that when WORK TIME is faster, TOTAL TIME is also faster, and it can be reduced even further by adding workers. In other words, the "Theoretical Minimum" time a query could take is based on its WORK TIME duration. Therefore, each test can achieve a response time of less than 1 second given enough workers running in parallel over the compute fabric.
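To illustrate the "Theoretical Minimum" point — the floor is scoping plus one segment's work, no matter how many workers are added — here is a sketch, under the same simplified model and with hypothetical numbers, of finding the smallest worker count that meets a response-time target:

```python
import math

def workers_for_target(num_segments: int, scoping_ms: float,
                       per_segment_ms: float, target_ms: float = 1000.0):
    """Smallest worker count whose completion time meets the target.
    The theoretical minimum is scoping_ms + per_segment_ms (one segment
    per worker); below that, no worker count can help."""
    if scoping_ms + per_segment_ms > target_ms:
        return None  # target is below the theoretical minimum
    workers = 1
    while scoping_ms + math.ceil(num_segments / workers) * per_segment_ms > target_ms:
        workers += 1
    return workers
```

For example, 100 segments at 90 ms each with 100 ms of scoping needs 10 workers to come in at 1 second, since each worker then handles 10 segments (100 + 10 × 90 = 1,000 ms).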

Cost savings

The computing resources used for these tests varied between 7 and 12 c7g.2xlarge EC2 instances, each priced on-demand at $0.289 per hour (see the Amazon EC2 pricing page for the latest prices). Each of these instances accommodates 6 workers. We observed that, on average, 42 workers using the Amazon S3 Express One Zone storage class achieved performance equivalent to 72 workers using the S3 Standard storage class. This reduction in compute amounts to a savings of roughly $1,000 per month. Therefore, even though S3 Express One Zone has higher storage costs than S3 Standard, the savings in compute costs are the primary factor in overall database expenses. Moreover, customers can achieve up to 60% savings for comparable performance using S3 Express One Zone.
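The savings arithmetic can be checked with the numbers above; the price and workers-per-instance ratio are from the test setup, while the 730 hours per month is an assumed average:

```python
import math

PRICE_PER_HOUR = 0.289        # c7g.2xlarge on-demand (see EC2 pricing page)
WORKERS_PER_INSTANCE = 6
HOURS_PER_MONTH = 730         # assumed average month

def monthly_compute_cost(workers: int) -> float:
    """Instances are provisioned whole, so round the worker count up."""
    instances = math.ceil(workers / WORKERS_PER_INSTANCE)
    return instances * PRICE_PER_HOUR * HOURS_PER_MONTH

# 72 workers (S3 Standard) vs. 42 workers (S3 Express One Zone):
savings = monthly_compute_cost(72) - monthly_compute_cost(42)
```

That is 12 instances versus 7, a difference of 5 × $0.289 × 730 ≈ $1,055 per month, consistent with the roughly $1,000 per month stated above.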

Although we are huge believers in Amazon S3 Express One Zone, there is no better way to explain its value than from the voice of our mutual customer, Cisco:

“For Cisco Talos to provide the industry-leading threat protection and interdiction we are known for at our scale, speed remains critical. We use ChaosSearch’s Amazon S3-native live analytics database to index against high impact sets of our massive, unrivaled telemetry data. ChaosSearch has always been fast, but with the S3 Express One Zone storage class processing hundreds of thousands of transactions per second we can supercharge our analytics workloads to the tune of 60% better performance or 60% lower costs for our current performance. This combination of S3 Express One Zone and ChaosSearch enables us to rethink our data lake architecture and overall data strategy.”

– Lee Jones, Architect & Distinguished Engineer, Cisco Systems

Conclusion

At ChaosSearch, we always saw object storage as a hot backend, and we created technology and architecture to achieve low-latency database queries on it. With Amazon S3 Express One Zone, our bet continues to pay off. We now resolve queries in seconds to sub-seconds, without the need for local storage or caching, providing our customers upwards of 60% in performance improvements or 60% in cost reduction.

With all the advantages of data lake simplification, elasticity, durability, scale, and the security of Amazon S3, S3 Express One Zone is one more level up in our performance. Although the S3 Express One Zone storage class has a higher price point than S3 Standard, do not forget that compute is where the real costs add up, so S3 Express One Zone can actually help you reduce your overall TCO. Amazon S3 Express One Zone and ChaosSearch are a match that delivers live data analytics at scale and in a highly cost-performant manner.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.