
Overview

Gremlin's Reliability Management Platform builds upon the practice of Chaos Engineering. By giving teams a more guided and product-led way to achieve reliability goals, Gremlin's Reliability Management Platform lets you easily define services, integrate them with the Golden Signals in your APM tool, run a series of Reliability Tests, and receive a Reliability Score for each defined service. This helps groups approach their reliability efforts in a safe, secure, scalable, and standardized way.
Highlights
- Use Reliability Engineering to proactively get ahead of any infrastructure-related issues in your environment
- Use Reliability Scoring to get a comprehensive view of where your services rank from a reliability perspective
- Don't let velocity conflict with reliability. Build a regression set of Reliability Tests to understand how changes to your applications and infrastructure impact your underlying microservices and infrastructure
Details
Pricing
Free trial
| Dimension | Description | Cost/12 months |
|---|---|---|
| Reliability Management: 50 Agents | Reliability Management Platform - Unlimited Reliability Testing & Scoring, Fault Injection, Failure Flags - 50 Agents | $45,000.00 |
Vendor refund policy
No refunds.
Delivery details
Software as a Service (SaaS)
SaaS delivers cloud-based software applications directly to customers over the internet. You can access these applications through a subscription model. You will pay recurring monthly usage fees through your AWS bill, while AWS handles deployment and infrastructure management, ensuring scalability, reliability, and seamless integration with other AWS services.
Support
Vendor support
Email support is available 8am to 8pm PST, Monday through Friday, at support@gremlin.com or through the Zendesk widget in the app.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.

Customer reviews
Chaos testing has uncovered vulnerabilities and now drives stronger, more reliable infrastructures
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is chaos testing. I take my infrastructure and then I sabotage some things to see how they reach the goal. I try network or infrastructure attacks mainly, and I play every code on Gremlin Reliability Management Platform. Regarding a memorable incident, I found a lot of vulnerabilities in some SMTP servers, and I fixed them with Gremlin Reliability Management Platform. It is interesting because Gremlin Reliability Management Platform is not a penetration tester, but by disrupting other parts of the infrastructure and then running some other tests, it serves this purpose effectively.
What is most valuable?
The best feature Gremlin Reliability Management Platform offers in my experience is having everything in one dashboard and the ability to perform tests on every kind of infrastructure. The flexibility is one of the main things I found about Gremlin Reliability Management Platform, and it is really important. It is also important to be able to target either specific or wider parts of the infrastructure, and the way it lets you isolate things is simple and well thought out.
Using Gremlin Reliability Management Platform has raised the reliability of the infrastructure by more than fifty percent. I do not own a single infrastructure of my own because I am a freelancer, so I have many customer cases, but the average improvement is very large. Mainly, we notice fewer incidents and less downtime. There are really two pathways: fewer incidents because with Gremlin Reliability Management Platform we can make every part of the infrastructure more solid, and less downtime because we can test more architectures and things like how to set up high availability clusters. The impact in clients' environments is really significant, and it is one of the special things.
What needs improvement?
As an improvement, I think it would be important to have resources for self-directed study of Gremlin Reliability Management Platform. There is a short, simple certification, but it would be great if they added ways to learn and get certified for free, because the tool is very powerful and the documentation is very high quality. However, I do not think the documentation alone covers all the complexity of the tool. Some free learning paths, delivered by webinar, could help.
I think it would be useful to have some integration with Splunk or other log collectors, or maybe in the future, the ability to link Dynatrace or any other observability platform.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for about five years.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable.
What do I think about the scalability of the solution?
More than scalability, I thought about availability because it is a really important aspect of architecture tools, but I think it is also scalable with AWS.
How are customer service and support?
The customer support quality is very good. I would rate the customer support an eight on a scale of one to ten.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
I was born as a chaos tester with Gremlin Reliability Management Platform, and I think I will die with it professionally.
How was the initial setup?
I purchased Gremlin Reliability Management Platform through the AWS Marketplace.
What was our ROI?
I cannot share relevant metrics because my customers cover them with NDAs. However, it is not just a general impression; the numbers are impressive because, as I said previously, the reduction in downtime and, mainly, in failures is significant.
What's my experience with pricing, setup cost, and licensing?
It is not so cheap, but it has very powerful features. In my experience with pricing, setup cost, and licensing, the value is there.
Which other solutions did I evaluate?
I found Gremlin Reliability Management Platform and discovered the free learning path, so I dove into it before choosing Gremlin Reliability Management Platform.
What other advice do I have?
The main advice I would give to others looking into using Gremlin Reliability Management Platform would be to study it. Do not be shy to fail. Test everything and do lab architectures to test. It is very important to have hands-on experience with tools of this caliber. I would rate this review a nine out of ten.
Chaos experiments have revealed reliability risks and provide clear reliability scores
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is the Chaos Engineering part for software. A quick specific example of how I've used Gremlin Reliability Management Platform for Chaos Engineering in my work is with a web service we have, where we need to know the reliability score of it. We conducted chaos experiments with it, including a network experiment, black hole, CPU, and memory experiments, that create chaos for the service, and then we receive a reliability score reflecting the service's reliability, especially in a production environment.
Gremlin Reliability Management Platform is amazing with the reliability score. There is a built-in Chaos Engineering experiment that can help you to provide this to your service. You run it on your service, and then you receive the reliability score from Gremlin Reliability Management Platform, along with insights on the issues and risks present in your service that you can examine and work on.
What is most valuable?
One of my favorite features of Gremlin Reliability Management Platform is the built-in chaos experiments, which give you the reliability score of your service.
The built-in chaos experiments and reliability scoring have helped me in my day-to-day work by making it easier to run the experiments directly instead of doing them manually one by one. I can run scenarios against my web service; for CPU, for example, it drives the Kubernetes container from 25% to 75% CPU utilization, which gives me more insight into how reliable my system is and makes Chaos Engineering with Gremlin Reliability Management Platform easier.
Game Days can help you take a day with your team to experiment with your services in a production or pre-production environment, allowing you to see how reliable your system is, which is a great feature for the team to deep dive into Chaos Engineering.
Gremlin Reliability Management Platform has positively impacted my organization because clients have come to us to implement Gremlin Reliability Management Platform as a Chaos Engineering platform for their use cases, which has brought us a lot of potential client opportunities as a consulting company. The reliability scores have improved, as the built-in experiments give you the reliability scores along with insights on the risks you have and how you can manage and improve them, which is very helpful. In terms of faster incident response, especially in Kubernetes, if you have one container, Gremlin Reliability Management Platform flags the need for an HPA that will increase your reliability score for the service.
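For context on the HPA point above: the Kubernetes Horizontal Pod Autoscaler documentation gives its core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The Python sketch below illustrates only that formula. The function name, the 75% target, and the replica bounds are assumptions chosen for illustration, not values from the review or from Gremlin, and the sketch leaves out the autoscaler's tolerance and stabilization behavior.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10):
    """Apply the core HPA formula, clamped to hypothetical replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# A single container at 90% CPU against an assumed 75% target.
print(desired_replicas(1, 90, 75))
```

With a single replica and no autoscaler, nothing performs this calculation, which is one way to read why a lone container would be flagged.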
What needs improvement?
Gremlin Reliability Management Platform can be improved by introducing open-source features. It currently has a paid version, but introducing open-source features could encourage more people to use and try it.
The user interface is great, the integration is smooth, and Gremlin Reliability Management Platform has a fantastic support team that helps us a lot in many cases.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for around two years, and I am certified in Gremlin Reliability Management Platform.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable with good availability and is very reliable.
What do I think about the scalability of the solution?
Gremlin Reliability Management Platform scales smoothly for running more chaos experiments, adding more services, or supporting a larger team. It can easily scale up your experiments for many of your services, and it can provide other experiments for interconnected dependencies.
How are customer service and support?
When I have questions or run into issues with Gremlin Reliability Management Platform, their support team is helpful and responsive. They resolve our problems quickly and provide assistance through Zoom meetings, which has been very effective in troubleshooting.
How would you rate customer service and support?
Negative
Which solution did I use previously and why did I switch?
We did not use any other solutions; we only started with Gremlin Reliability Management Platform.
What was our ROI?
I can see a return on investment because we save a lot of time during our Chaos Engineering experiments. We do not need to look at all the day's metrics on Grafana dashboards; we run our chaos experiments in a production environment to see how reliable our product or service is.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing depends on the company. My role does not incur costs for us since we have an NFR for Gremlin Reliability Management Platform that we can use in our case.
Which other solutions did I evaluate?
I did not evaluate other options before choosing Gremlin Reliability Management Platform; the company did that, so I do not have an answer for that.
Chaos testing has increased confidence in Kubernetes reliability and reduced production issues
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is to test. We are running a Kubernetes cluster on GCP, and we want to check our clusters, especially node reliability for the HA use case. The way we used to check the Kubernetes cluster is that we have multiple nodes with multiple tags on nodes, and we are deploying different applications on different nodes to ensure that all the nodes are up. We are using Gremlin Reliability Management Platform for chaos engineering to check those nodes in a pre-prod environment. Sometimes, we also check EC2 instances on Amazon.
What is most valuable?
The best feature that Gremlin Reliability Management Platform offers for me is the prebuilt reliability test; I think that is the best feature along with the automated scheduling. These are the best features that I can mention.
Gremlin Reliability Management Platform has positively impacted our organization by providing us with more confidence in production. We are more confident about running chaos in production, and related to the prebuilt test, we have some scalability tests, especially regarding the infrastructure side, such as CPU tests, memory tests, or disk tests, as I mentioned earlier. If CPU pushes more than seventy-five percent, we make sure that services scale and behave correctly. If memory moves from the threshold of more than seventy-five percent to eighty percent, then we take action accordingly, and we also conduct some redundancy host tests that I mentioned before. We are more confident about the production environment, and we have significantly reduced our issues in production by thirty percent.
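The CPU check described above can be sketched as a simple pass/fail evaluation over metrics collected during a test. This is a hypothetical Python illustration: the `Sample` type, the function name, and the sample data are assumptions, and it does not call any Gremlin or Kubernetes API. It treats the test as passed only if the replica count grew after CPU utilization first exceeded 75%.

```python
from dataclasses import dataclass

CPU_THRESHOLD_PCT = 75.0  # threshold mentioned in the review


@dataclass
class Sample:
    cpu_pct: float  # observed CPU utilization at this moment
    replicas: int   # replica count at this moment


def scaled_under_load(samples, threshold=CPU_THRESHOLD_PCT):
    """Return True if replicas increased after CPU first crossed the threshold."""
    for i, sample in enumerate(samples):
        if sample.cpu_pct > threshold:
            return any(later.replicas > sample.replicas
                       for later in samples[i + 1:])
    return False  # threshold never crossed, so scaling was not exercised


# Hypothetical readings taken during a CPU test.
readings = [Sample(50, 2), Sample(80, 2), Sample(85, 3)]
print(scaled_under_load(readings))
```

A memory check against the 75% to 80% range could follow the same pattern with a different field and threshold.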
What needs improvement?
I think Gremlin Reliability Management Platform can be improved by integrating with more AWS or GCP services. I also think it could be integrated with machine learning or some sort of AI that uses natural language processing, so that it is easier for non-technical people to interact with it. We need more services and more prebuilt plugins for Gremlin Reliability Management Platform, especially for stress testing. I want to see how it can be integrated with machine learning, particularly on the NLP side. With natural language support, could we talk to Gremlin Reliability Management Platform and have it configure some of the basic settings, so that non-technical people can also work with tools like it? Even a QA person should be able to integrate it without needing any DevOps or cloud expertise.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for two years.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable; it is quite stable.
What do I think about the scalability of the solution?
The scalability of Gremlin Reliability Management Platform is good; it is scalable.
How are customer service and support?
The customer support for Gremlin Reliability Management Platform is good overall; the documentation is good.
Which solution did I use previously and why did I switch?
I did not previously use a different solution before Gremlin Reliability Management Platform.
What was our ROI?
We are seeing a return on investment from using Gremlin Reliability Management Platform because we are getting thirty percent fewer production issues, as I mentioned earlier, making it a great investment. Now we are free at least on long weekends, knowing what the issues are, and that is a great thing.
Which other solutions did I evaluate?
Before choosing Gremlin Reliability Management Platform, I did not evaluate other options.
What other advice do I have?
I rate Gremlin Reliability Management Platform an eight out of ten. I give it an eight because I want to see improvements on the machine learning side, particularly how it can be integrated with NLP.
I chose eight out of ten for Gremlin Reliability Management Platform because it is one of the best tools in terms of chaos engineering. It also has ready-made templates, and we are more confident about the production environment, which saves our time, especially during long weekends.
I advise others looking into using Gremlin Reliability Management Platform to run it for production-grade applications, specifically on Kubernetes, and run production Kubernetes at scale. That is how we are using it for multi-node clusters, multi-zone deployment, and microservices architecture. We can replicate some of the production issues by ensuring that a node is down, allowing us to deploy without issues while maintaining visibility. Reliability score is the main metric for the enterprise solution, and we have standardized tests and a history of tracking trends.
Platform has improved reliability metrics but still raises questions about overall value
What is our primary use case?
My main use case is the Enterprise Reliability Platform.
A quick specific example of how I use The Enterprise Reliability Platform to maintain reliability and efficiency is that we have our own internal system to track and maintain the reliability and efficiency.
What is most valuable?
The Enterprise Reliability Platform has positively impacted my organization as it has significantly increased the efficiency and reliability of our systems.
I measured that increase in efficiency. The metrics I noticed include latency, SLOs, and error budgets, including not burning through the error budgets.
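For readers unfamiliar with error budgets, the usual arithmetic is that an availability SLO of, say, 99.9% leaves 0.1% of requests as the allowed failures, and "burning through" the budget means using up that allowance. The sketch below is a minimal Python illustration under that common definition; the function name and the numbers are assumptions and are not taken from the reviewer's system.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error budget use for a request-based availability SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests, so no budget to measure")
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,       # 1.0 means the budget is spent
        "budget_remaining": 1 - consumed,  # negative means the SLO is missed
    }


# Hypothetical month: 99.9% SLO, one million requests, 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
print(round(report["budget_consumed"], 2))
```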
What needs improvement?
I have no recommendations for how The Enterprise Reliability Platform can be improved.
For how long have I used the solution?
I have been using The Enterprise Reliability Platform for one year.
What other advice do I have?
I have no answer regarding the best features The Enterprise Reliability Platform offers.
I would provide no advice to others looking into using The Enterprise Reliability Platform.
My company does not have a business relationship with this vendor other than being a customer.
I was not offered a gift card or incentive for this review.
I do not have any additional thoughts about The Enterprise Reliability Platform before we wrap up.
I gave this review a rating of 6.