
Chronosphere SaaS
Monitors pipelines with real-time alerts and good documentation
What is our primary use case?
I work as a data engineer, and we have many streaming pipelines. We use Chronosphere to monitor metrics such as how much data each batch processes, the volume of incoming data, our consumption rates, and the time it takes to process each batch. We also set alerts in Chronosphere for situations like job failures or the number of processed records falling below a certain threshold. Sometimes we face silent failures, where our system appears to be working fine but isn't consuming any data because another system has stopped sending it. Chronosphere helps us detect these cases.
Another team member was involved in setting up a framework using Terraform on Chronosphere to monitor our job SLAs. We receive alerts on Slack or via email if any job fails to meet its SLA.
What is most valuable?
The alerting features are good because they provide many alerts you can't get from everyday tools like Airflow or other orchestration tools, which only notify you of job failures. With Chronosphere, I can set custom thresholds, such as on the number of records being processed.
I can also configure alerts for silent failures and jobs that take longer than expected. Chronosphere is loosely coupled with your pipeline, meaning it has no dependencies on it, which is pretty cool.
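The kinds of checks described above (job failure, a record count below a custom threshold, a silent failure, and a batch that runs too long) can be sketched in plain Python. This is only an illustration of the logic, not Chronosphere's API or configuration format: `BatchStats`, `evaluate_batch`, and the alert names are hypothetical, and treating "zero records consumed" as the silent-failure signal is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchStats:
    """Metrics for one pipeline batch (hypothetical shape, not a Chronosphere type)."""
    records_processed: int
    duration_seconds: float
    failed: bool = False


def evaluate_batch(
    stats: BatchStats,
    min_records: int,
    max_duration_seconds: float,
) -> List[str]:
    """Return the names of the alert conditions this batch triggers."""
    alerts: List[str] = []
    if stats.failed:
        # The orchestrator would also report this one.
        alerts.append("job_failed")
    elif stats.records_processed == 0:
        # The job ran without error but consumed nothing: a silent failure.
        alerts.append("silent_failure")
    elif stats.records_processed < min_records:
        # Custom threshold on processed records.
        alerts.append("low_record_count")
    if stats.duration_seconds > max_duration_seconds:
        # The batch took longer than expected.
        alerts.append("slow_batch")
    return alerts


if __name__ == "__main__":
    print(evaluate_batch(BatchStats(records_processed=0, duration_seconds=12.0),
                         min_records=100, max_duration_seconds=60.0))
```

In a real setup these conditions would live in the monitoring system as alert rules over the pushed metrics, which is what keeps the monitoring decoupled from the pipeline code.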
What needs improvement?
It isn't easy for everyone to use. A simpler version, like a data number version or an SQL version, would make it much easier. It's hard to debug if you don't know the syntax. Also, I saw the Slack alerts feature, which is pretty cool, but I cannot customize the messages in the Slack alerts. It would be great if it were possible to tag people in the alerts. At DoorDash, we have hundreds of pipelines, and if something fails, I want to tag specific people so they can start working on the issue immediately.
For how long have I used the solution?
I have been using Chronosphere for one year.
What do I think about the stability of the solution?
We occasionally noticed issues, because we use the push method to send our metrics to Chronosphere using Prometheus.
I rate the solution's stability a nine out of ten.
What do I think about the scalability of the solution?
We have 100 users on this solution and a significant number of metrics coming in. It's highly scalable and can handle the load efficiently.
How are customer service and support?
The documentation is pretty good.
What other advice do I have?
If you use something like Kafka or any other streaming platform, Chronosphere is the best option for monitoring your pipelines with real-time alerts. It is loosely coupled with your pipeline, adding no confusion or load. I recommend using it.
Overall, I rate the solution a nine out of ten.
An exceptional, pragmatic observability platform
* Support: knowledgeable, friendly support from day one. I assumed this was pre-contract wooing, but one year later, support is as great as ever.
* Ergonomics: tools are only useful if they're used. Their interfaces load quickly and make sense, and as a result, our engineers are happy to use them.
* Operations: the product just works; the undifferentiated heavy lifting is handled behind the scenes, without incident.
* Cost: they include tools to identify and tame anomalous and low-value data, leading to lower costs without sacrificing signal.
* Prometheus histograms are clunky. It sounds like this may be addressed soon.
A solid observability service
Additionally, some tips for people new to using it, like those in AWS CloudWatch Insights, would be helpful.
Many features
Chronosphere as Observability
A necessary, scalable observability platform run by a stellar team
Early in the onboarding process, I was blown away when we discussed the tools available in the aggregation tier. It appeared that giving us this level of control wouldn't be good for Chronosphere's revenue in the long run, but I soon realized it's part of their philosophy and mission to give the power back to the customers. As an operator coming from an ELK-based stack (which comes with plenty of operational toil), Chronosphere is a true SaaS where you don't need to worry about the underlying storage and query infrastructure.
The profiling tools that allow you to look at incoming data at various process stages have been handy in many cases. The backend ingest and query performance have been phenomenal, especially compared to our legacy stack. Being able to use rollups to extend the retention of data will prove helpful to us in the long run, which isn't something we've been able to do effectively in our legacy stack.
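As a rough illustration of why rollups make longer retention affordable, the sketch below averages fine-grained samples into coarser time buckets, so far fewer points need to be stored. It shows the general idea only; the review does not describe how Chronosphere implements rollups, and `rollup_mean` and its parameters are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rollup_mean(
    samples: List[Tuple[int, float]],
    bucket_seconds: int,
) -> List[Tuple[int, float]]:
    """Average (timestamp, value) samples into fixed-width time buckets.

    Each output point is (bucket_start_timestamp, mean_of_values_in_bucket).
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
    print(rollup_mean(raw, bucket_seconds=60))
```

The mean is just one choice of aggregation; sums, maxima, or percentiles may suit other metrics better.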
Product aside, the team has been highly supportive throughout the process, from onboarding to implementation to stabilization. They've been a solid partner for the complex project of moving observability stacks within a large engineering organization.
This challenge is somewhat outside of the sphere of responsibility of Chronosphere. Like the Chronosphere collector, I think there might be some opportunity for productized tooling on the library side to help solve common problems across all organizations working with Prometheus. Still, on the bright side, the team is working on backend features like the usage profiler to give us the next level of visibility.
Review from Kevin
Scalable metrics storage with M3DB
* Built on industry standards: Prometheus and Grafana
* Uses M3DB for scalable storage
* Customer is not required to manage storage scaling, sharding, or federation
* Knowledgeable, hands-on support by account managers
Chronosphere is a company that helps you walk through the complexity of Observability
A great tool for metrics at scale
Their internal observability tooling is great. Their profiler is excellent for discovering cost-saving opportunities (by adding aggregation or drop rules). They also provide out-of-the-box dashboards for understanding how you are using the system, which helps answer questions like "how much of my limit is being used by job X?"
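The question "how much of my limit is being used by job X?" boils down to counting active series per `job` label and dividing by the limit. The sketch below shows that arithmetic under simple assumptions: each series is represented by its label set, the limit is a plain series count, and `limit_share_by_job` is a hypothetical helper rather than anything Chronosphere provides.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def limit_share_by_job(
    series_labels: Iterable[Mapping[str, str]],
    limit: int,
) -> Dict[str, float]:
    """Return the fraction of a series limit consumed by each `job` label value."""
    # Series without a `job` label are grouped under "unknown".
    counts = Counter(labels.get("job", "unknown") for labels in series_labels)
    return {job: count / limit for job, count in counts.items()}


if __name__ == "__main__":
    active = [
        {"job": "ingest", "instance": "a"},
        {"job": "ingest", "instance": "b"},
        {"job": "api", "instance": "a"},
    ]
    print(limit_share_by_job(active, limit=8))
```

Jobs with a large share and little dashboard or alert usage are the natural candidates for the aggregation or drop rules mentioned above.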
Extremely fast to query.
Very easy to onboard if you're already using Prometheus. The Chronosphere collector is mostly compatible with existing Prometheus configs, and their frontend can import any of your existing Grafana dashboards.
Wonderful support. The Chronosphere team has been fantastic, not just during the implementation phase but also afterward. They've also been great at surfacing non-obvious issues to us in places where they haven't quite automated everything yet (for example, misconfigured notifiers and alerts).