ClickHouse has enabled real-time analytics and cost savings for our observability platform but needs smoother inserts and cleanup
What is our primary use case?
ClickHouse has been used for more than a year as the primary solution for an observability data platform hosted on AWS EC2 instances. Data flows from S3 through Kinesis Data Streams and is processed in Databricks using a Medallion architecture with bronze, silver, and gold layers. Once the gold layer is finalized, the data is sent to the ClickHouse cluster hosted on EC2 nodes.
ClickHouse serves as the data sink in the workflow, processing JSON data from more than 9,000 CVS Health retail stores and making it meaningful for analytical purposes. The data is stored in ClickHouse across multiple tables, such as logs, events, metrics, and traces, and is used for reporting in Grafana and for ML model training.
ClickHouse also serves as the database behind the front-end application. A chatbot has been built on top of this hosted data, with a model trained on it. When a user sends a text prompt describing the query they need, the model generates the appropriate generalized or granular query and returns the desired results quickly, so the user does not have to write the query themselves.
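To make the reporting side of this concrete, the following is a minimal sketch of the kind of aggregate query a Grafana panel or the chatbot might run against the cluster, written with the clickhouse-connect Python client. The events table, its columns, and the host name are illustrative assumptions, not the actual schema.

    import clickhouse_connect

    # Hypothetical connection details; the real host and credentials differ.
    client = clickhouse_connect.get_client(host='clickhouse.internal', username='default')

    # Hourly error counts per store over the last day, the kind of aggregate a
    # Grafana panel or an ML feature pipeline might consume. The "events" table
    # and its columns (store_id, severity, event_time) are assumed for illustration.
    result = client.query("""
        SELECT
            store_id,
            toStartOfHour(event_time) AS hour,
            count() AS error_count
        FROM events
        WHERE severity = 'error'
          AND event_time >= now() - INTERVAL 1 DAY
        GROUP BY store_id, hour
        ORDER BY hour, store_id
    """)

    for store_id, hour, error_count in result.result_rows:
        print(store_id, hour, error_count)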
What is most valuable?
ClickHouse offers several valuable features, including the S3 engine and ReplicatedMergeTree. Instead of storing data on EBS volumes, data is stored mainly in S3 Express, which provides performance comparable to an EBS volume. The ability to configure ClickHouse through XML configs is also highly appreciated. With an 18-node cluster, multi-zone replication is configured across three availability zones in us-east-1. The configs allow data to be stored in multiple locations separately, and with ReplicatedMergeTree, data needs to be written to only six nodes and is replicated automatically to all 18. These features are beneficial, and since ClickHouse is an OLAP database, it provides fast analytical speeds, which suits this use case well.
For the S3 engine, a fault-tolerant storage pattern was needed because the data is highly sensitive and contains PII. The concern was that data should neither be lost nor exposed, so that even if a failure occurs it remains intact and inaccessible to unauthorized individuals. The S3 engine was considered an encrypted, fault-tolerant option, and it has proven very reliable for storing data in ClickHouse. Regarding ReplicatedMergeTree, the platform must handle high write and read rates because it powers a real-time analytical application, so it was important not to rely on all 18 nodes for reads and writes simultaneously. ReplicatedMergeTree maintains an isolated environment for writing on specific nodes while allowing all nodes to participate in read queries, which is how both the S3 engine and ReplicatedMergeTree have helped.
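As a rough illustration of how these two pieces fit together, here is a minimal sketch of a replicated, S3-backed table definition issued through the clickhouse-connect client. The Keeper path, the {shard} and {replica} macros, the storage policy name s3_policy, and the schema are all assumptions; the actual cluster configuration will differ.

    import clickhouse_connect

    client = clickhouse_connect.get_client(host='clickhouse.internal')

    # ReplicatedMergeTree coordinates replicas through the given Keeper path and
    # the {shard}/{replica} macros defined in the node configs, so a write to one
    # replica is copied to the others automatically. The storage_policy setting
    # points the table at an S3-backed disk instead of local EBS volumes.
    client.command("""
        CREATE TABLE IF NOT EXISTS logs
        (
            event_time DateTime,
            store_id   UInt32,
            message    String
        )
        ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs', '{replica}')
        PARTITION BY toYYYYMM(event_time)
        ORDER BY (store_id, event_time)
        SETTINGS storage_policy = 's3_policy'
    """)

With a definition like this, writes sent to a subset of replicas are propagated to the rest of the cluster by the replication machinery rather than by the client.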
ClickHouse provides great query speeds; as an OLAP database, it is naturally fast for analytics. For cost optimization, after deploying the cluster on self-managed EC2 nodes and using S3 Express, approximately 5x savings were achieved on data storage. A budget of $15,000 per month was initially anticipated for storage on EBS volumes, but after switching to S3 Express, storage costs dropped to $3,000. In terms of scalability, the observability data, which had been sitting idle, was expanded to multiple terabytes, and with security data from AWS, Tanium, Azure, and CrowdStrike incorporated, the platform scaled up to multiple petabytes. The resulting performance prompted the ClickHouse team to invite the organization to present its results at their annual conference.
What needs improvement?
ClickHouse could be improved with respect to data insertion, especially given the volume of data handled. Constant effort goes into tuning it, yet with merges and inserts, a single insert query allows only about 100,000 rows per second. It would be beneficial to insert more data with less user-managed configuration; ideally, ClickHouse would optimize these processes automatically, reducing the need to contact the ClickHouse support team for infrastructure optimization.
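In practice, the current workaround for this limit is to batch rows on the client side so that each insert statement carries a large block. The sketch below shows one way to do that with the clickhouse-connect client; the logs table, its columns, and the batch size are assumptions for illustration.

    import clickhouse_connect

    client = clickhouse_connect.get_client(host='clickhouse.internal')

    BATCH_SIZE = 100_000  # roughly the per-insert rate mentioned above

    def insert_in_batches(rows):
        """Insert (event_time, store_id, message) tuples in large blocks,
        since ClickHouse performs best with few, large insert statements."""
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                client.insert('logs', batch,
                              column_names=['event_time', 'store_id', 'message'])
                batch = []
        if batch:
            client.insert('logs', batch,
                          column_names=['event_time', 'store_id', 'message'])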
Additionally, delays are experienced when deleting databases with corrupt data; this takes too much time and has caused major outages that required contacting multiple teams across continents to resolve. The community around ClickHouse also seems limited, forcing a reliance on documentation, and there is a scarcity of developers working with ClickHouse, which hinders growth. If ClickHouse were more user-friendly and technically approachable, it would likely see broader adoption.
For how long have I used the solution?
More than four years have been spent working in this field, and ClickHouse itself has been in use for more than a year.
What do I think about the stability of the solution?
ClickHouse has been stable and reliable for most workloads.
What do I think about the scalability of the solution?
ClickHouse has handled growth and changes in data volume remarkably well.
How are customer service and support?
Overall, customer support has been positive, and the representatives are knowledgeable. However, during major issues, such as the three-to-four-day outage that was experienced, the support team was less available, possibly due to tight schedules. If more timely support had been provided during critical issues, they could have been resolved much more quickly, saving considerable time.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
ClickHouse was the first solution used; the product was built from scratch with ClickHouse from the very first stage, so there was no previous solution to switch from.
How was the initial setup?
ClickHouse is deployed in a private cloud setup. An AWS VPC is configured with multiple nodes across two environments: dev with a 12-node cluster and prod with an 18-node cluster. Each node has 128 GB of RAM and 128 GB of SSD, with S3 backing the data storage. A CHProxy layer of four nodes sits in front of each environment to handle read requests, while write requests go directly to node IPs, and AWS Route 53 and other DNS services handle communication within the VPC.
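From the application side, the read/write split described above might look roughly like the sketch below: read queries go to the CHProxy layer, while writes target a specific node directly by IP. The host names, the IP address, and the CHProxy port are assumptions for illustration.

    from datetime import datetime
    import clickhouse_connect

    # Reads go through the CHProxy layer, which balances queries across the cluster.
    read_client = clickhouse_connect.get_client(host='chproxy.internal', port=9090)

    # Writes go straight to a designated node over the ClickHouse HTTP interface.
    write_client = clickhouse_connect.get_client(host='10.0.1.15', port=8123)

    write_client.insert('logs',
                        [(datetime.now(), 101, 'example message')],
                        column_names=['event_time', 'store_id', 'message'])

    print(read_client.query('SELECT count() FROM logs').result_rows)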
Which other solutions did I evaluate?
Before choosing ClickHouse, Elastic was evaluated for observability data. However, since the organization had previously used Elastic and was scrapping it because of the higher costs of the cloud version, ClickHouse was chosen instead.
What other advice do I have?
This interview could be improved by not asking the same question multiple times. The same question was asked in different ways, and avoiding that repetition would spare both the interviewer and the interviewee unnecessary stress. If an answer has already been given, there is no need to ask for more detail again.
My advice to others considering ClickHouse is to opt for the cloud version instead of the self-managed version if budget permits. This saves considerable time on managing infrastructure and allows more focus on application and product development. Overall, the solution is rated six out of ten.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)