My main use case for Gremlin Reliability Management Platform is chaos engineering. We wanted to orchestrate our chaos tests better, and Gremlin helped us a lot with that.
One specific chaos engineering test I have run with Gremlin simulated a CPU spike on one of our servers. It was hard for us to produce a CPU spike on demand in production, and Gremlin let us trigger one when we needed it.
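To make the idea of a CPU spike concrete, here is a minimal sketch in plain Python. It is not how Gremlin implements its CPU experiment; the function name `burn_cpu` is hypothetical, and the loop only saturates a single core for a fixed duration.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Keep one CPU core busy for roughly `seconds`; return the loop count."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # throwaway work whose only purpose is to occupy the core
    return iterations
```

Calling `burn_cpu(5.0)` would hold one core near full utilization for about five seconds. A real experiment would also need a safe way to stop early and a way to target several cores, which is the kind of control a dedicated tool provides.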
I have a lot to add about how I use Gremlin Reliability Management Platform, as many experiments have helped us. Auto-scaling was one thing we particularly wanted to observe. It was difficult for us to experiment with different auto-scaling strategies based on CPU utilization and to check whether they would automatically scale back down. We wanted to see this happen live because it correlates directly with the cost of our services on the cloud. Using Gremlin Reliability Management Platform, we launched CPU spikes and then intentionally reduced the load on an API, and we were able to watch the auto-scaling go up and down. This helped us save a lot of cost and select the right instances.
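The scale-up and scale-down behavior we were watching for can be sketched with a simple target-tracking rule. This is only an illustration, assuming a rule of the form desired = ceil(current × observed CPU / target CPU), similar to the one the Kubernetes Horizontal Pod Autoscaler documents. The function name, the 60 percent target, and the instance limits are hypothetical and are not our actual policy.

```python
import math


def desired_instances(current: int, cpu_percent: float, target_percent: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the instance count a target-tracking rule would ask for."""
    if current < 1 or target_percent <= 0:
        raise ValueError("current must be >= 1 and target_percent must be > 0")
    # Scale the fleet in proportion to how far CPU is from the target.
    raw = math.ceil(current * cpu_percent / target_percent)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_instances, min(max_instances, raw))
```

With a 60 percent target, four instances running at 90 percent CPU would be scaled up to six, and six instances idling at 20 percent would be scaled down to two. Injecting a CPU spike and then removing it is a way to check that the real policy makes both moves.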
The best feature of Gremlin Reliability Management Platform is safe failure injection, which is crucial because we can simulate failures knowing they are just test failures and not actual issues. It covers CPU spikes, memory exhaustion, network latency, and server shutdown. Server shutdown is my favorite feature in Gremlin Reliability Management Platform. The controlled blast radius is another standout feature.
The controlled blast radius feature has helped my team because we wanted to target only one specific container among the Docker containers we had deployed. It let us run tests in a very specific, isolated manner, with very limited impact, instead of launching a larger test across hundreds of servers at a time. Since ours is a very small team, we do not want to impact other servers. The controlled blast radius let us focus only on our own servers without affecting any other team.
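The idea of limiting the blast radius can be sketched as filtering candidates by label and capping how many are hit. This is a generic illustration and not Gremlin's targeting API; the `Container` type, the label names, and `select_targets` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    labels: dict = field(default_factory=dict)


def select_targets(containers, required_labels, max_targets=1):
    """Pick at most `max_targets` containers whose labels match exactly."""
    matching = [
        c for c in containers
        if all(c.labels.get(key) == value for key, value in required_labels.items())
    ]
    # Sort by name so the same containers are chosen on every run.
    matching.sort(key=lambda c: c.name)
    return matching[:max_targets]
```

For example, selecting on `{"team": "checkout", "service": "api"}` with `max_targets=1` would pick a single container owned by that team, and containers belonging to other teams would never be candidates.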
Gremlin Reliability Management Platform has positively impacted my organization. Before it, we did not even know how to conduct chaos engineering tests. We had heard about them, but we had no idea how to do something of that nature. If there are ten systems in our architecture and one suddenly goes down, nobody knew what would happen next, and we did not know how to simulate that kind of failure. Gremlin Reliability Management Platform has addressed this lack of confidence. Now we can confidently test and see which system is the most critical. If this one goes down, what happens? How much business value is affected? How much loss are we going to incur? All of this is now clearly visible and transparent.
Since using Gremlin Reliability Management Platform, we have been able to reduce incidents by six percent after conducting our limited experiments. We were also able to increase uptime from ninety-eight percent to ninety-nine percent, which is a gain of one percentage point.
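To put the uptime figures in perspective, the following short calculation converts an uptime percentage into allowed downtime per year. It assumes a 365-day year, and the function name is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def yearly_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service is down at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100
```

Under that assumption, ninety-eight percent uptime allows about 175 hours of downtime per year and ninety-nine percent allows about 88 hours, so a one point gain in uptime cuts the downtime roughly in half.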
Gremlin Reliability Management Platform could be improved in two areas: the pricing is a bit expensive, and the learning curve is steep for beginners. It is not easy to get started with, and we needed a good amount of time to understand the concepts before we could use it. The infrastructure also needs to be very mature. It has to be set up properly, and that takes a lot of time for compliance and regulatory work.
I have been using Gremlin Reliability Management Platform for about two years now.
Gremlin Reliability Management Platform is quite stable, and I have not seen any downtime or issues with its behavior or performance.
The scalability of Gremlin Reliability Management Platform depends on the scalability of the underlying infrastructure that we are hosting it on. So far for us, it has been pretty good and clean with no issues.
I have not interacted with customer support very often, as we never had a situation where we needed their help. The platform was quite stable.
I did not use any different solution before Gremlin Reliability Management Platform. That is the only reliability management platform I have used, and it is pretty good.
My experience with pricing, setup cost, and licensing is that it was a bit expensive. I was not involved in the payment myself, as that was handled by our payments team.
I have seen a return on investment since using Gremlin Reliability Management Platform because fewer employees are now needed to conduct more reliable tests. If we once needed ten people to run tests, we can now do it with five, a fifty percent reduction, and those five people can conduct much more reliable tests.
I did not evaluate any other platforms before choosing Gremlin Reliability Management Platform; we went directly to it.
There were a lot of good examples and great documentation for Gremlin Reliability Management Platform, which is something that I appreciate. It helped us a lot.
My advice for others looking into Gremlin Reliability Management Platform is that in the early stages it will take some time to understand its capabilities, what it can do, and what it cannot do. The learning curve is a bit steep, but once you understand it, it is a great product to use. I rate Gremlin Reliability Management Platform a nine out of ten; as I mentioned, the learning curve and the pricing cost it one point.