    Gremlin Reliability Management Platform

    Sold by: Gremlin 
    Deployed on AWS
    Free Trial
    AWS Free Tier
    Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages through the practice of Reliability Engineering. As organizations build more and more cloud-native systems, it's critical for these organizations to be able to fully understand and provide data points on what will happen to their systems if they experience any sort of degradation. This may include situations such as a spike in CPU, added latency on a service, a service being completely unreachable or other situations that result in a poor user experience or unplanned outage.
    4.3

    Overview

    Gremlin's Reliability Management Platform builds upon the practice of Chaos Engineering. By giving teams a more guided, product-led way to achieve reliability goals, the platform lets you easily define services, integrate them with the Golden Signals in your APM tool, run a series of Reliability Tests, and receive a Reliability Score for each defined service. This helps groups approach reliability in a safe, secure, scalable, and standardized way.

    Highlights

    • Use Reliability Engineering to proactively get ahead of any infrastructure related issues in your environment
    • Use Reliability Scoring to get a comprehensive view of where your services rank from a reliability perspective
    • Don't let velocity conflict with reliability. Build a regression set of Reliability tests to understand how changes to your applications and infrastructure impact your underlying microservices and infrastructure

    Details

    Sold by

    Delivery method

    Deployed on AWS

    Features and programs

    Financing for AWS Marketplace purchases

    AWS Marketplace now accepts line of credit payments through the PNC Vendor Finance program. This program is available to select AWS customers in the US, excluding NV, NC, ND, TN, & VT.

    Pricing

    Free trial

    Try this product free according to the free trial terms set by the vendor.

    Gremlin Reliability Management Platform

    Pricing is based on the duration and terms of your contract with the vendor. This entitles you to a specified quantity of use for the contract duration. If you choose not to renew or replace your contract before it ends, access to these entitlements will expire.
    Additional AWS infrastructure costs may apply. Use the AWS Pricing Calculator to estimate your infrastructure costs.

    12-month contract (1)

    Dimension: Reliability Management: 50 Agents
    Description: Reliability Management Platform - Unlimited Reliability Testing & Scoring, Fault Injection, Failure Flags - 50 Agents
    Cost/12 months: $45,000.00
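
    As a rough guide to what the contract price means per agent, here is a minimal sketch. It assumes the cost is spread evenly across the 50 agents and the 12 months, which the listing does not state; the function and variable names are illustrative only.

```python
# Figures taken from the listing: one 12-month contract covering 50 agents.
CONTRACT_COST_USD = 45_000.00
AGENTS = 50
MONTHS = 12


def per_agent_costs(total, agents, months):
    """Return (cost per agent for the whole contract, cost per agent per month).

    Assumes an even split across agents and months (an assumption,
    not a pricing rule stated by the vendor).
    """
    per_agent = total / agents
    return per_agent, per_agent / months


per_agent_contract, per_agent_month = per_agent_costs(CONTRACT_COST_USD, AGENTS, MONTHS)
print(f"Per agent, 12 months: ${per_agent_contract:,.2f}")
print(f"Per agent, per month: ${per_agent_month:,.2f}")
```

    Additional AWS infrastructure costs are not included in these figures.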

    Vendor refund policy

    No refunds.


    Legal

    Vendor terms and conditions

    Upon subscribing to this product, you must acknowledge and agree to the terms and conditions outlined in the vendor's End User License Agreement (EULA).

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Usage information


    Delivery details

    Software as a Service (SaaS)

    SaaS delivers cloud-based software applications directly to customers over the internet. You can access these applications through a subscription model. You will pay recurring monthly usage fees through your AWS bill, while AWS handles deployment and infrastructure management, ensuring scalability, reliability, and seamless integration with other AWS services.

    Resources

    Support

    Vendor support

    Email support is available 8am - 8pm PST, Monday - Friday, at support@gremlin.com or through the in-app Zendesk widget.

    AWS infrastructure support

    AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.

    Product comparison

    Updated weekly

    Accolades

    Top 25 in Testing, Network Infrastructure
    Top 50 in Compliance and Auditing, Monitoring and Observability
    Top 10 in Hybrid Monitoring

    Customer reviews

    Sentiment is AI generated from actual customer reviews on AWS and G2.

    [Sentiment comparison chart. Columns: Reviews, Functionality, Ease of use, Customer service, Cost effectiveness. Review counts shown: 6 reviews, 0 reviews, 2 reviews; all other listed entries read "Insufficient data". Legend: Positive reviews, Mixed reviews, Negative reviews.]

    Overview

    AI generated from product descriptions
    Chaos Engineering Framework
    Simulates real outages and infrastructure degradation scenarios including CPU spikes, service latency, and complete service unavailability to test system resilience.
    Reliability Scoring System
    Generates comprehensive reliability scores for defined services based on integration with Golden Signals from APM tools to assess service reliability rankings.
    Guided Reliability Testing
    Provides a product-led approach to define services, integrate monitoring data, and execute a series of structured reliability tests in a standardized manner.
    Regression Test Suite
    Enables creation of regression test sets to measure and understand the impact of application and infrastructure changes on microservices and underlying systems.
    Proactive Issue Detection
    Identifies and helps prevent infrastructure-related issues before they occur through systematic reliability engineering practices and testing.
    Service Level Objective Management
    Defines and monitors customer-centric Service Level Objectives (SLOs) with flexible Error Budgets, Occurrences, and Time-slices configurations
    Multi-Source Data Integration
    Connects to existing observability and monitoring data sources through a library of integrations without requiring additional tooling
    SLO Analysis and Reporting
    Provides SLI Analyzer for defining SLOs based on historical metrics, intuitive reliability burndown reports, and composite SLOs that aggregate multiple individual SLOs
    Automated Alerting and Incident Response
    Triggers proactive automated alerts to existing incident response tools and executes customizable webhooks with runbooks based on SLOs-at-risk conditions
    SLO-as-Code and Extended Features
    Supports SLOs-as-code, OpenSLO compatibility, sloctl command-line tool, annotations for context, and replay functionality for historical analysis
    Automated Service Discovery and Modeling
    Lightweight, agentless, and scalable discovery of IT infrastructure, applications, and software components with automatic identification of dependencies and relationships across the IT landscape.
    AI-Powered Root Cause Analysis
    Causal AI technology to determine root causes and isolate issues across services, reducing mean time to resolution and eliminating manual war room investigations.
    Machine Learning-Based Event Correlation
    ML-powered situations for proactively correlating events and determining root causes across services with human-readable summaries and visual diagrams showing impact and diagnosis.
    Predictive Capacity Planning
    Saturation forecasting to predict up to 30 days in advance when infrastructure and application resources will run out of capacity, with what-if simulation capabilities for business event planning.
    Generative AI-Powered Remediation Recommendations
    Patented Best Action Recommendation engine powered by generative AI that recommends resolution steps based on similar past incidents and delivers code templates including Ansible runbooks and Bash scripts for automated remediation.

    Contract

    Standard contract
    No
    No

    Customer reviews

    Ratings and reviews

    4.3 (7 ratings)
    5 star: 57%
    4 star: 29%
    3 star: 14%
    2 star: 0%
    1 star: 0%
    3 AWS reviews | 4 external reviews
    External reviews are from G2 and PeerSpot.
    Elena

    Chaos testing has uncovered vulnerabilities and now drives stronger, more reliable infrastructures

    Reviewed on Mar 02, 2026
    Review from a verified AWS customer

    What is our primary use case?

    My main use case for Gremlin Reliability Management Platform is chaos testing. I take my infrastructure and then I sabotage some things to see how they reach the goal. I try network or infrastructure attacks mainly, and I play every code on Gremlin Reliability Management Platform. Regarding a memorable incident, I found a lot of vulnerabilities in some SMTP servers, and I fixed it with Gremlin Reliability Management Platform. It is interesting because Gremlin Reliability Management Platform is not a penetration tester, but by disrupting other parts of the infrastructure and then running some other tests, it serves this purpose effectively.

    What is most valuable?

    The best feature Gremlin Reliability Management Platform offers in my experience is having everything in one dashboard and the ability to perform tests of every kind of infrastructure. The flexibility is one of the main things about Gremlin Reliability Management Platform that I found, and it is really important. It is also important to have the possibility of targeting even specific or wider parts of infrastructure, and it is simple and well-thought-out to isolate things or put it in a more reasonable way.

    Using Gremlin Reliability Management Platform has raised more than fifty percent of the reliability of the infrastructure. I do not own a single infrastructure of my own because I am a freelancer, and so I have many cases of customers, but the percentage of the average improvement is very huge. Mainly, we notice fewer incidents and less downtime. There are really two pathways along: fewer incidents because with Gremlin Reliability Management Platform, we can make every part of the infrastructure more solid, and less downtime because we can test more architectures and then things like how to put in high availability clusters. The impact in clients' environments is really significant, and it is one of the special things.

    What needs improvement?

    I think that it will be important to have resources to perform self-directed studies on Gremlin Reliability Management Platform as an improvement. There is a small and fast and simple certification, but if they add possibilities to learn and get certified for free, it would be great because it is very powerful and the documentation is very high quality. However, I do not think that only with the documentation you can reach all the complexity of the tool. Some learning paths, free and by webinar, could help.

    I think it would be useful to have some integration with Splunk or other log collectors, or maybe in the future, the ability to link Dynatrace or any other observability platform.

    For how long have I used the solution?

    I have been using Gremlin Reliability Management Platform for about five years.

    What do I think about the stability of the solution?

    Gremlin Reliability Management Platform is stable.

    What do I think about the scalability of the solution?

    More than scalability, I thought about availability because it is a really important thing of the architecture tools, but I think it is also scalable with AWS.

    How are customer service and support?

    The customer support quality is very good. I would rate the customer support an eight on a scale of one to ten.

    How would you rate customer service and support?

    Positive

    Which solution did I use previously and why did I switch?

    I was born as a chaos tester with Gremlin Reliability Management Platform, and I think I will die with it professionally.

    How was the initial setup?

    I purchased Gremlin Reliability Management Platform through the AWS Marketplace.

    What was our ROI?

    I cannot share relevant metrics because my customers cover it with an NDA. However, it is not a general impression; the numbers are impressive because, as I said previously, reducing downtime and mainly reducing failures is significant.

    What's my experience with pricing, setup cost, and licensing?

    It is not so cheap, but it has very powerful features. For my experience with pricing, setup cost, and licensing, the value is there.

    Which other solutions did I evaluate?

    I found Gremlin Reliability Management Platform and discovered the free learning path, so I dove into it before choosing Gremlin Reliability Management Platform.

    What other advice do I have?

    The main advice I would give to others looking into using Gremlin Reliability Management Platform would be to study it. Do not be shy to fail. Test everything and do lab architectures to test. It is very important to have hands-on experience with tools of this caliber. I would rate this review a nine out of ten.

    Which deployment model are you using for this solution?

    Public Cloud

    If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

    reviewer2805747

    Chaos experiments have revealed reliability risks and provide clear reliability scores

    Reviewed on Mar 02, 2026
    Review from a verified AWS customer

    What is our primary use case?

    My main use case for Gremlin Reliability Management Platform is the Chaos Engineering part for software. A quick specific example of how I've used Gremlin Reliability Management Platform for Chaos Engineering in my work is with a web service we have, where we need to know the reliability score of it. We conducted chaos experiments with it, including a network experiment, black hole, CPU, and memory experiments, that create chaos for the service, and then we receive a reliability score reflecting the service's reliability, especially in a production environment.

    Gremlin Reliability Management Platform is amazing with the reliability score. There is a built-in Chaos Engineering experiment that can help you to provide this to your service. You run it on your service, and then you receive the reliability score from Gremlin Reliability Management Platform, along with insights on the issues and risks present in your service that you can examine and work on.

    What is most valuable?

    One of my favorite features of Gremlin Reliability Management Platform is the built-in chaos experiments, which give you the reliability score of your service.

    The built-in chaos experiments and reliability scoring have helped me in my day-to-day work by making it easier to run the experiments directly instead of doing them manually one by one. It allows running scenarios for my web service; for example, for CPU, it runs the Kubernetes container from 25% to 75% CPU utilization, giving me more insight into how reliable my system is and making Chaos Engineering with Gremlin Reliability Management Platform easier.

    Game Days can help you take a day with your team to experiment with your services in a production or pre-production environment, allowing you to see how reliable your system is, which is a great feature for the team to deep dive into Chaos Engineering.

    Gremlin Reliability Management Platform has positively impacted my organization because we had clients come to us to implement Gremlin Reliability Management Platform as a Chaos Engineering platform for their use cases, which has gained us a lot of potential client opportunities as a consulting company. The reliability scores have improved, as built-in experiments give you the reliability scores, along with insights on risks you have, how you can manage and improve them, which is very helpful. In terms of faster incident response, especially in Kubernetes, if you have one container, Gremlin Reliability Management Platform flags the need for an HPA that will increase your reliability score for the service.

    What needs improvement?

    Gremlin Reliability Management Platform can be improved by introducing open-source features. It currently has a paid version, but introducing open-source features could encourage more people to use and try it.

    The user interface is great, the integration is smooth, and Gremlin Reliability Management Platform has a fantastic support team that helps us a lot in many cases.

    For how long have I used the solution?

    I have been using Gremlin Reliability Management Platform for around two years, and I am certified in Gremlin Reliability Management Platform.

    What do I think about the stability of the solution?

    Gremlin Reliability Management Platform is stable with good availability and is very reliable.

    What do I think about the scalability of the solution?

    Gremlin Reliability Management Platform scales smoothly for running more chaos experiments, adding more services, or supporting a larger team. It can easily scale up your experiments for many of your services, and it can provide other experiments for interconnected dependencies.

    How are customer service and support?

    When I have questions or run into issues with Gremlin Reliability Management Platform, their support team is helpful and responsive. They resolve our problems quickly and provide assistance through Zoom meetings, which has been very effective in troubleshooting.

    How would you rate customer service and support?

    Negative

    Which solution did I use previously and why did I switch?

    We did not use any other solutions; we only started with Gremlin Reliability Management Platform.

    What was our ROI?

    I can see a return on investment because we save a lot of time during our Chaos Engineering experiments. We do not need to look at all the day's metrics on Grafana dashboards; we run our chaos experiments in a production environment to see how reliable our product or service is.

    What's my experience with pricing, setup cost, and licensing?

    My experience with pricing, setup cost, and licensing depends on the company. My role does not incur costs for us since we have an NFR for Gremlin Reliability Management Platform that we can use in our case.

    Which other solutions did I evaluate?

    I did not evaluate other options before choosing Gremlin Reliability Management Platform; the company did that, so I do not have an answer for that.

    Which deployment model are you using for this solution?

    On-premises

    If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

    Vinaykumar Vishwakarma

    Chaos testing has increased confidence in Kubernetes reliability and reduced production issues

    Reviewed on Feb 27, 2026
    Review provided by PeerSpot

    What is our primary use case?

    My main use case for Gremlin Reliability Management Platform is to test. We are running a Kubernetes cluster on GCP, and we want to check our clusters, especially node reliability for the HA use case. The way we used to check the Kubernetes cluster is that we have multiple nodes with multiple tags on nodes, and we are deploying different applications on different nodes to ensure that all the nodes are up. We are using Gremlin Reliability Management Platform for chaos engineering to check those nodes in a pre-prod environment. Sometimes, we also check EC2 instances on Amazon.

    What is most valuable?

    The best feature that Gremlin Reliability Management Platform offers for me is the prebuilt reliability test; I think that is the best feature along with the automated scheduling. These are the best features that I can mention.

    Gremlin Reliability Management Platform has positively impacted our organization by providing us with more confidence in production. We are more confident about running chaos in production, and related to the prebuilt test, we have some scalability tests, especially regarding the infrastructure side, such as CPU tests, memory tests, or disk tests, as I mentioned earlier. If CPU pushes more than seventy-five percent, we make sure that services scale and behave correctly. If memory moves from the threshold of more than seventy-five percent to eighty percent, then we take action accordingly, and we also conduct some redundancy host tests that I mentioned before. We are more confident about the production environment, and we have significantly reduced our issues in production by thirty percent.

    What needs improvement?

    I think Gremlin Reliability Management Platform can be improved by integrating with more AWS services or GCP services. I also think we can somehow integrate it with machine learning or perhaps some sort of AI by utilizing natural language processing so that it will be easier to interact with non-technical persons as well. We need more services and more prebuilt plugins for Gremlin Reliability Management Platform, especially for stress testing. I want to see how it can be integrated with machine learning, particularly on the NLP side. If we can integrate it with natural language, could we talk to Gremlin Reliability Management Platform and have it configure some of the basic settings so that non-technical persons can also work on Gremlin Reliability Management Platform-like tools? Even a QA person should be able to integrate it without needing any DevOps or cloud expertise.

    For how long have I used the solution?

    I have been using Gremlin Reliability Management Platform for two years.

    What do I think about the stability of the solution?

    Gremlin Reliability Management Platform is stable; it is quite stable.

    What do I think about the scalability of the solution?

    The scalability of Gremlin Reliability Management Platform is good; it is scalable.

    How are customer service and support?

    The customer support for Gremlin Reliability Management Platform is good overall; the documentation is good.

    How would you rate customer service and support?

    Which solution did I use previously and why did I switch?

    I did not previously use a different solution before Gremlin Reliability Management Platform.

    What was our ROI?

    We are seeing a return on investment from using Gremlin Reliability Management Platform because we are seeing thirty percent fewer production issues, as I mentioned earlier, making it a great investment. Now we are free at least on long weekends, knowing what the issues are, and that is a great thing.

    Which other solutions did I evaluate?

    Before choosing Gremlin Reliability Management Platform, I did not evaluate other options.

    What other advice do I have?

    I rate Gremlin Reliability Management Platform an eight out of ten. I give it an eight because I want to see improvements on the machine learning side, particularly how it can be integrated with NLP.

    I chose eight out of ten for Gremlin Reliability Management Platform because it is one of the best tools in terms of chaos engineering. It also has ready-made templates, and we are more confident about the production environment, which saves our time, especially during long weekends.

    I advise others looking into using Gremlin Reliability Management Platform to run it for production-grade applications, specifically on Kubernetes, and run production Kubernetes at scale. That is how we are using it for multi-node clusters, multi-zone deployment, and microservices architecture. We can replicate some of the production issues by ensuring that a node is down, allowing us to deploy without issues while maintaining visibility. Reliability score is the main metric for the enterprise solution, and we have standardized tests and a history of tracking trends.

    reviewer2783910

    Platform has improved reliability metrics but still raises questions about overall value

    Reviewed on Dec 03, 2025
    Review from a verified AWS customer

    What is our primary use case?

    The Enterprise Reliability Platform serves as my main use case for the next question.

    A quick specific example of how I use The Enterprise Reliability Platform to maintain reliability and efficiency is that we have our own internal system to track and maintain the reliability and efficiency.

    What is most valuable?

    The Enterprise Reliability Platform has positively impacted my organization as it has significantly increased the efficiency and reliability of our systems.

    I measured that increase in efficiency, and I can share that the metrics I noticed include latency and the SLOs, error budget, and not burning through the error budgets.

    What needs improvement?

    I have no recommendations for how The Enterprise Reliability Platform can be improved.

    For how long have I used the solution?

    I have been using The Enterprise Reliability Platform for one year.

    What other advice do I have?

    I have no answer regarding the best features The Enterprise Reliability Platform offers.

    I would provide no advice to others looking into using The Enterprise Reliability Platform.

    My company does not have a business relationship with this vendor other than being a customer.

    I was not offered a gift card or incentive for this review.

    I do not have any additional thoughts about The Enterprise Reliability Platform before we wrap up.

    I gave this review a rating of 6.

    Which deployment model are you using for this solution?

    Hybrid Cloud

    If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

    Computer Software

    Feature-full graph traversal language

    Reviewed on Mar 09, 2022
    Review provided by G2
    What do you like best about the product?
    Gremlin is quite easy to learn and use. I like that it supports both graph traversals and graph pattern matching (aka declarative traversals). In many cases, I would prefer the Gremlin syntax to the SPARQL syntax.
    What do you dislike about the product?
    I would be interested to see inferencing support (materialized or not) in the future. Mixing features like declarative and non-declarative traversals could be a bit cumbersome.
    What problems is the product solving and how is that benefiting you?
    There are many use cases where the property graph data model and graph traversals with Gremlin can be very useful. I used Gremlin to solve problems related to fraud detection, real-time recommendations and customer 360.