Chaos experiments have revealed weak points and now provide controlled cost-saving tests
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is chaos engineering, and Gremlin helped us a lot in orchestrating the tests better.
A specific example of a chaos engineering test I've run using Gremlin is simulating a CPU spike on one of our servers. It was hard for us to reproduce a CPU spike in production on demand, and Gremlin let us trigger one when we needed it.
I have a lot to add about how I'm using Gremlin Reliability Management Platform, as many experiments have helped us. Auto-scaling was one thing we wanted to see in action. It was difficult for us to experiment with different auto-scaling strategies based on CPU utilization and to confirm whether they would automatically scale down. We wanted to see it happen live because it correlates directly with the cost of our services on the cloud. Using Gremlin Reliability Management Platform, when we launched some CPU spikes and intentionally reduced the utilization of an API, we were able to see the auto-scaling up and down. It helped us save a lot of costs and select the right instances.
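To make the idea of a CPU spike concrete, here is a minimal sketch of what such an experiment does at its simplest: keep one core busy for a fixed, bounded duration. This is a generic stand-in for illustration and not Gremlin's implementation or API; the function name `burn_cpu` and its parameter are hypothetical.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; return iterations done.

    Hypothetical helper for illustration only; not part of Gremlin.
    """
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # trivial work that keeps the core busy
    return iterations


if __name__ == "__main__":
    # The duration is short and fixed, so the load stops on its own.
    print(burn_cpu(0.5))
```

A real chaos tool adds what this sketch lacks: targeting of specific hosts or containers, a halt button, and control over how many cores and what percentage of them are loaded.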
What is most valuable?
The best features of Gremlin Reliability Management Platform are the safe failure injection, which is crucial because we can simulate failures knowing that these are just dummy tests and not actual issues. This covers CPU spikes, memory exhaustion, network latency, and server shutdown; server shutdown is one of my favorite features in Gremlin Reliability Management Platform. The controlled blast radius is another standout feature.
The controlled blast radius feature has helped my team because we wanted to target only one specific container among the Docker containers we deployed. It let us run tests in a very specific, isolated manner, with very limited impact, instead of launching a larger test across hundreds of servers at a time. Since ours is a very small team, we do not want to impact other servers. The controlled blast radius let us focus only on our servers and not impact any other team.
Gremlin Reliability Management Platform has positively impacted my organization because before Gremlin Reliability Management Platform, we did not even know how to conduct these chaos engineering tests. We heard about it, but we had no idea of how to do something of that nature. If there are ten servers, ten systems in our architecture and if suddenly something goes down, nobody knew what would happen next. We did not even know how to simulate these types of tests. This lack of confidence has been mitigated by using Gremlin Reliability Management Platform. Now we can confidently test and see which system is the most critical. If this goes down, what happens? How much business valuation are we going to impact? How much loss are we going to incur? All of this is now clearly visible and transparent.
Since using Gremlin Reliability Management Platform, we were able to reduce incidents by six percent after conducting our limited experiments. We were also able to increase uptime from ninety-eight to ninety-nine percent, an increase of one percentage point.
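A short worked calculation shows what moving from ninety-eight to ninety-nine percent uptime means in hours. It assumes the figures are availability percentages measured over a non-leap year, which the review does not state; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


before = downtime_hours_per_year(98.0)
after = downtime_hours_per_year(99.0)
print(round(before, 1), round(after, 1))  # prints "175.2 87.6"
```

Under that assumption, one extra percentage point of uptime halves the yearly downtime, from about 175 hours to about 88 hours.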
What needs improvement?
Gremlin Reliability Management Platform can be improved, as the pricing is a bit expensive and the learning curve for beginners is a bit difficult. It is not easy to get started with, and it takes a good amount of time to grasp the concepts before we can use it. The infrastructure also needs to be very mature; it should be set up properly, and that takes a lot of compliance and regulation time.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for a couple of years now. I think it has been two years since we started using it.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is quite stable, and I have not seen any downtime or issues with its behavior or performance.
What do I think about the scalability of the solution?
The scalability of Gremlin Reliability Management Platform depends on the scalability of the underlying infrastructure that we are hosting it on. So far for us, it has been pretty good and clean with no issues.
How are customer service and support?
I have not interacted with customer support often, as we never had any requirements where we needed their help. The platform was quite stable.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I did not use any different solution before Gremlin Reliability Management Platform. That is the only reliability management platform I have used, and it is pretty good.
How was the initial setup?
My experience with pricing, setup cost, and licensing is that it was a bit expensive, but most of it is handled by our team. I was not involved in the payment of it, as it was handled by the payments team.
What was our ROI?
I have seen a return on investment since using Gremlin Reliability Management Platform because fewer employees are needed now to conduct more reliable tests. If we needed ten people to do tests once upon a time, now, using Gremlin Reliability Management Platform, we can do it with a fifty percent reduction in employees. Only five people with Gremlin Reliability Management Platform can conduct much more reliable tests.
Which other solutions did I evaluate?
I did not evaluate any other platforms before choosing Gremlin Reliability Management Platform; we directly went to Gremlin Reliability Management Platform.
What other advice do I have?
There were a lot of good examples and great documentation for Gremlin Reliability Management Platform, which is something that I appreciate. It helped us a lot.
My advice for others looking into using Gremlin Reliability Management Platform is that in the starting stages, it will take some time to understand its capabilities, what it can do, and what it cannot do. The learning curve is a bit difficult, but once you understand it, it is a pretty great product to use. I rate Gremlin Reliability Management Platform a nine out of ten because, as I mentioned, the learning curve and the pricing made me reduce that one point.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Chaos testing has uncovered vulnerabilities and now drives stronger, more reliable infrastructures
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is chaos testing. I take my infrastructure and then I sabotage some things to see how they reach the goal. I try network or infrastructure attacks mainly, and I play every code on Gremlin Reliability Management Platform. Regarding a memorable incident, I found a lot of vulnerabilities in some SMTP servers, and I fixed it with Gremlin Reliability Management Platform. It is interesting because Gremlin Reliability Management Platform is not a penetration tester, but by disrupting other parts of the infrastructure and then running some other tests, it serves this purpose effectively.
What is most valuable?
The best feature Gremlin Reliability Management Platform offers in my experience is having everything in one dashboard and the ability to perform tests of every kind of infrastructure. The flexibility is one of the main things about Gremlin Reliability Management Platform that I found, and it is really important. It is also important to have the possibility of targeting even specific or wider parts of infrastructure, and it is simple and well-thought-out to isolate things or put it in a more reasonable way.
Using Gremlin Reliability Management Platform has raised infrastructure reliability by more than fifty percent. I am a freelancer, so I do not own a single infrastructure of my own and instead work across many customer environments, but the average improvement is very large. Mainly, we notice fewer incidents and less downtime. There are really two pathways: fewer incidents, because with Gremlin Reliability Management Platform we can make every part of the infrastructure more solid, and less downtime, because we can test more architectures and things such as how to set up high-availability clusters. The impact in clients' environments is really significant, and it is one of the special things.
What needs improvement?
I think it would be an improvement to have resources for self-directed study of Gremlin Reliability Management Platform. There is a small, fast, and simple certification, but it would be great if they added ways to learn and get certified for free, because the tool is very powerful and the documentation is very high quality. However, I do not think the documentation alone can cover all the complexity of the tool. Some free learning paths, delivered by webinar, could help.
I think it would be useful to have some integration with Splunk or other log collectors, or maybe in the future, the ability to link Dynatrace or any other observability platform.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for about five years.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable.
What do I think about the scalability of the solution?
More than scalability, I focused on availability, because it is a really important property of architecture tools, but I think it is also scalable with AWS.
How are customer service and support?
The customer support quality is very good. I would rate the customer support an eight on a scale of one to ten.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I was born as a chaos tester with Gremlin Reliability Management Platform, and I think I will die with it professionally.
How was the initial setup?
I purchased Gremlin Reliability Management Platform through the AWS Marketplace.
What was our ROI?
I cannot share relevant metrics because my customers cover it with an NDA. However, it is not a general impression; the numbers are impressive because, as I said previously, reducing downtime and mainly reducing failures is significant.
What's my experience with pricing, setup cost, and licensing?
It is not so cheap, but it has very powerful features. In my experience with pricing, setup cost, and licensing, the value is there.
Which other solutions did I evaluate?
I found Gremlin Reliability Management Platform and discovered the free learning path, so I dove into it before choosing Gremlin Reliability Management Platform.
What other advice do I have?
The main advice I would give to others looking into using Gremlin Reliability Management Platform would be to study it. Do not be shy to fail. Test everything and do lab architectures to test. It is very important to have hands-on experience with tools of this caliber. I would rate this review a nine out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Chaos experiments have revealed reliability risks and provide clear reliability scores
What is our primary use case?
My main use case for Gremlin Reliability Management Platform is Chaos Engineering for software. A specific example of how I've used Gremlin Reliability Management Platform for Chaos Engineering in my work is a web service we have, for which we need to know the reliability score. We ran chaos experiments against it, including network, black hole, CPU, and memory experiments, and then received a reliability score reflecting the service's reliability, especially in a production environment.
Gremlin Reliability Management Platform is amazing with the reliability score. There is a built-in set of Chaos Engineering experiments that produces this score for your service. You run it on your service, and then you receive the reliability score from Gremlin Reliability Management Platform, along with insights on the issues and risks present in your service that you can examine and work on.
What is most valuable?
One of my favorite features of Gremlin Reliability Management Platform is the built-in chaos experiments, which give you the reliability score of your service.
The built-in chaos experiments and reliability scoring have helped me in my day-to-day work by making it easier to run the experiments directly instead of doing them manually one by one. For my web service, for example, the CPU scenario drives the container on Kubernetes from 25% to 75% CPU utilization, which gives me more insight into how reliable my system is and makes my approach to Gremlin Reliability Management Platform and Chaos Engineering easier.
Game Days can help you take a day with your team to experiment with your services in a production or pre-production environment, allowing you to see how reliable your system is, which is a great feature for the team to deep dive into Chaos Engineering.
Gremlin Reliability Management Platform has positively impacted my organization because we had clients come to us to implement Gremlin Reliability Management Platform as a Chaos Engineering platform for their use cases, which has gained us a lot of potential client opportunities as a consulting company. The reliability scores have improved, as built-in experiments give you the reliability scores, along with insights on risks you have, how you can manage and improve them, which is very helpful. In terms of faster incident response, especially in Kubernetes, if you have one container, Gremlin Reliability Management Platform flags the need for an HPA that will increase your reliability score for the service.
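To illustrate why an HPA matters when a CPU experiment pushes utilization from 25% to 75%, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The 50% target and the replica bounds are assumptions chosen for the example, and the sketch leaves out the autoscaler's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float,
                     target_cpu: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count from the documented HPA ratio, clamped to bounds.

    The bounds and the target value are illustrative assumptions.
    """
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas at 75% CPU against a 50% target scale out to three.
print(desired_replicas(2, 75.0, 50.0))  # prints 3
# Four replicas at 25% CPU against a 50% target scale in to two.
print(desired_replicas(4, 25.0, 50.0))  # prints 2
```

With a single container and no HPA, there is nothing to absorb the 75% load, which is the kind of risk the review says gets flagged.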
What needs improvement?
Gremlin Reliability Management Platform can be improved by introducing open-source features. It currently has a paid version, but introducing open-source features could encourage more people to use and try it.
The user interface is great, the integration is smooth, and Gremlin Reliability Management Platform has a fantastic support team that helps us a lot in many cases.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for around two years, and I am certified in Gremlin Reliability Management Platform.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable with good availability and is very reliable.
What do I think about the scalability of the solution?
Gremlin Reliability Management Platform scales smoothly for running more chaos experiments, adding more services, or supporting a larger team. It can easily scale up your experiments for many of your services, and it can provide other experiments for interconnected dependencies.
How are customer service and support?
When I have questions or run into issues with Gremlin Reliability Management Platform, their support team is helpful and responsive. They resolve our problems quickly and provide assistance through Zoom meetings, which has been very effective in troubleshooting.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We did not use any other solutions; we only started with Gremlin Reliability Management Platform.
What was our ROI?
I can see a return on investment because we save a lot of time during our Chaos Engineering experiments. We do not need to look at all the day's metrics on Grafana dashboards; we run our chaos experiments in a production environment to see how reliable our product or service is.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing depends on the company. My role does not incur costs for us since we have an NFR for Gremlin Reliability Management Platform that we can use in our case.
Which other solutions did I evaluate?
I did not evaluate other options before choosing Gremlin Reliability Management Platform; the company did that, so I do not have an answer for that.
Which deployment model are you using for this solution?
On-premises
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Platform has improved reliability metrics but still raises questions about overall value
What is our primary use case?
The Enterprise Reliability Platform serves as my main use case for the next question.
A quick specific example of how I use The Enterprise Reliability Platform to maintain reliability and efficiency is that we have our own internal system to track and maintain the reliability and efficiency.
What is most valuable?
The Enterprise Reliability Platform has positively impacted my organization as it has significantly increased the efficiency and reliability of our systems.
I measured that increase in efficiency, and the metrics I tracked include latency, SLOs, and error budgets, specifically whether we avoid burning through the error budgets.
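For readers unfamiliar with error budgets, here is a minimal sketch of the usual arithmetic: the budget is the share of requests the SLO allows to fail, and the remaining budget is what has not yet been spent. The SLO value and request counts are hypothetical, and the function name is illustrative; the review does not give its actual numbers.

```python
def error_budget_remaining(slo: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO of 100% or zero traffic leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 99% SLO, 100,000 requests, 250 failures.
remaining = error_budget_remaining(0.99, 100_000, 250)
print(round(remaining, 2))  # prints 0.75
```

In this example a 99% SLO allows about 1,000 failures, so 250 failures spend a quarter of the budget and leave three quarters.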
What needs improvement?
I have no recommendations for how The Enterprise Reliability Platform can be improved.
For how long have I used the solution?
I have been using The Enterprise Reliability Platform for one year.
What other advice do I have?
I have no answer regarding the best features The Enterprise Reliability Platform offers.
I would provide no advice to others looking into using The Enterprise Reliability Platform.
My company does not have a business relationship with this vendor other than being a customer.
I was not offered a gift card or incentive for this review.
I do not have any additional thoughts about The Enterprise Reliability Platform before we wrap up.
I gave this review a rating of 6.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Feature-full graph traversal language
What do you like best about the product?
Gremlin is quite easy to learn and use. I like that it supports both graph traversals and graph pattern matching (aka declarative traversals). In many cases, I would prefer the Gremlin syntax to the SPARQL syntax.
What do you dislike about the product?
I would be interested to see inferencing support (materialized or not) in the future. Mixing features like declarative and non-declarative traversals could be a bit cumbersome.
What problems is the product solving and how is that benefiting you?
There are many use cases where the property graph data model and graph traversals with Gremlin can be very useful. I used Gremlin to solve problems related to fraud detection, real-time recommendations and customer 360.
Gremlin is one of the few good Chaos Engineering providers, with continuous improvements
What do you like best about the product?
Support for Chaos Engineering on Cloud Platforms for testing weak points in infra availability, resilience & security. Support is good for new implementations.
What do you dislike about the product?
The only thing I can think of is that supporting new technologies such as serverless requires continuous updates as cloud platforms change, so a maturity model is quite tough to maintain for such chaos products.
What problems is the product solving and how is that benefiting you?
We try to find out how the system will behave under inevitable failures, such as a specific service or VM going down: how long recovery takes (MTTR) and how a resilient architecture handles random failures.
Emulating black hole attacks to simulate different service failures helps us understand cascading failures that you might not have anticipated during design: how interconnected services behave or misbehave when a dependency fails, and how to prevent data loss when middleware fails.
Go-to solution to get started with Chaos Engineering
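The cascading-failure question can be sketched as a graph walk: given which services depend on which, find everything that is directly or transitively affected when one service is black-holed. The topology below is hypothetical, and the sketch assumes every dependency is a hard one with no fallback or cache, which real experiments are meant to verify rather than assume.

```python
from collections import deque

# Hypothetical topology: service -> services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "reports": ["db"],
    "db": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so each dependency maps to its dependents.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen: set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk outward from the failed service
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(impacted_by("db", DEPENDS_ON)))    # ['api', 'auth', 'reports', 'web']
print(sorted(impacted_by("auth", DEPENDS_ON)))  # ['api', 'web']
```

Comparing this predicted blast radius with what a black hole experiment shows in practice highlights hidden dependencies and services that degrade gracefully.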
What do you like best about the product?
Easy-to-use chaos engineering tool, minimal installation, great for newcomers to chaos engineering concepts. Easy cloud integration. Lots of documentation to get started quickly.
What do you dislike about the product?
Has limited support for on-premises chaos injection, and it requires a subscription to run multi-point chaos experiments. An open-source version of the product is not available.
What problems is the product solving and how is that benefiting you?
We use Gremlin to run chaos tests against our K8s workloads hosted on AWS. Our teams can get onboarded quickly with chaos engineering concepts. Great tool for our new joiners.