Spend Less Time on Calls and More on Innovation with Shoreline Incident Automation on AWS
By Anurag Gupta, Founder and CEO – Shoreline.io
By Shakthi Dakuri, Sr. Partner Solutions Architect – AWS
Shoreline.io was founded by Anurag Gupta, a former VP of engineering at Amazon Web Services (AWS), who built and ran more than 11 analytic and database services during his time at AWS.
Many of these businesses grew from nothing to hundreds of thousands of nodes and billions of dollars in revenue. To succeed, it’s important to get ahead of production incidents so businesses can scale and engineering teams can focus on innovation and growth.
There are thousands of observability products that help users find new incidents, and there are hundreds of products that figure out who to route a new incident to. There is almost nothing, however, that helps teams debug, repair, and code away issues once they are found. This is what Shoreline is focused on with its Incident Automation solution.
In this post, we will discuss how Shoreline empowers support and L1 teams to resolve incidents, provides rich debugging capabilities to engineering teams to accelerate root cause analysis, and equips developers with powerful debugging tools that run across their cloud deployments.
Improving Every Step of the Incident Resolution Process
When it comes to improving how teams handle production incidents, Shoreline helps in a variety of ways:
- Empowering support and L1 teams to debug and repair incidents with live, Jupyter-style notebooks.
- Delivering rich diagnostics to engineering when incidents are escalated.
- Making it easy for engineers to create self-healing automations.
- Providing a broad library of pre-built notebooks and automations. If one customer fixes a problem, everyone benefits.
- Empowering engineers with powerful debugging tools, like the ability to run distributed Linux commands across the fleet in seconds.
Figure 1 – Shoreline helps improve the incident resolution process.
Empowering Support and L1 Teams With Self-Service Tools
You typically don’t want your support and L1 teams to have Secure Shell (SSH) or kubectl access to your production infrastructure, but there are often basic issues they could identify and resolve.
Giving your team self-service tools that allow them to safely repair common issues with pre-approved actions allows them to resolve issues in a fraction of the time without escalating. This is a huge productivity win for engineering and the customer experience.
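To make the idea of pre-approved actions concrete, here is a minimal Python sketch of an allowlist. It is not Shoreline's API: the names `ApprovedAction`, `APPROVED_ACTIONS`, and `build_command`, the two example actions, and the parameter patterns are all assumptions made for illustration. The point is that an operator chooses an action by name and supplies validated parameters, rather than typing arbitrary shell commands.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ApprovedAction:
    """A fixed command template plus a validation pattern per parameter (hypothetical)."""
    name: str
    argv_template: Tuple[str, ...]
    param_patterns: Dict[str, str]


# Illustrative allowlist; a real deployment would define its own actions.
K8S_NAME = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
APPROVED_ACTIONS = {
    "restart_pod": ApprovedAction(
        name="restart_pod",
        argv_template=("kubectl", "delete", "pod", "{pod}", "--namespace", "{namespace}"),
        param_patterns={"pod": K8S_NAME, "namespace": K8S_NAME},
    ),
    "show_disk_usage": ApprovedAction(
        name="show_disk_usage",
        argv_template=("df", "-h", "{mount}"),
        param_patterns={"mount": r"/[A-Za-z0-9_/.-]*"},
    ),
}


def build_command(action_name: str, params: Dict[str, str]) -> List[str]:
    """Return the argv for an approved action, or raise if anything is off-list."""
    action = APPROVED_ACTIONS.get(action_name)
    if action is None:
        raise PermissionError(f"action {action_name!r} is not pre-approved")
    if set(params) != set(action.param_patterns):
        raise ValueError(f"expected parameters {sorted(action.param_patterns)}")
    for key, pattern in action.param_patterns.items():
        # fullmatch rejects values that smuggle in extra shell syntax.
        if re.fullmatch(pattern, params[key]) is None:
            raise ValueError(f"invalid value for {key!r}")
    return [part.format(**params) for part in action.argv_template]
```

Because the result is an argument list rather than a shell string, it can be handed to something like `subprocess.run` without shell interpretation, which is one reason to prefer templates over free-form commands in a sketch like this.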
Deliver Rich Diagnostics to Engineering
Everyone has been there. You escalate an issue and the first question from engineering is “Where is the diagnostic data?” Shoreline Notebooks ensure you capture all of the data you need in seconds and then automatically save it. All you need to do is share the URL with engineering.
Make it Easy for Engineers to Create Self-Healing Automations
Before Shoreline, creating self-healing automations included a lot of steps. You had to create a script, tie it to an alarm, ensure the script got deployed to every node, make sure the script had the right security privileges, and then create some kind of audit trail so someone could tell if the script actually ran.
The Shoreline platform eliminates most of this work. With Shoreline, you create the script and Shoreline takes care of the rest. One Shoreline customer was even able to cut their script by 90%. You can build so many more automations when you can do it in an afternoon instead of a month.
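For a sense of what the hand-rolled approach involves, the Python sketch below covers only three of the steps listed above: checking a condition, running a remediation, and writing an audit record. It is a hypothetical illustration, not Shoreline code. The function names, the disk-usage condition, and the JSON audit format are assumptions, and deployment to every node and security privileges are left out entirely.

```python
import json
import shutil
import time
from typing import Callable, Dict, List


def disk_used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def run_automation(
    path: str,
    threshold: float,
    remediate: Callable[[str], None],
    audit_log: List[str],
    measure: Callable[[str], float] = disk_used_fraction,
) -> Dict[str, object]:
    """Check one condition, remediate if needed, and always leave an audit record."""
    before = measure(path)
    record: Dict[str, object] = {
        "time": time.time(),
        "path": path,
        "used_before": round(before, 3),
        "action_taken": False,
    }
    if before >= threshold:
        remediate(path)  # e.g. delete old temp files; supplied by the caller
        record["action_taken"] = True
        record["used_after"] = round(measure(path), 3)
    # The audit entry is written whether or not the remediation ran,
    # so someone can later tell that the check itself executed.
    audit_log.append(json.dumps(record))
    return record
```

Even this small sketch still needs a scheduler or alarm to call it, a way to ship it to each node, and credentials to run the remediation, which is the work the paragraph above describes the platform taking over.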
Broad Library of Pre-Built Notebooks and Automations
Every time Shoreline fixes a common issue with one customer, it’s made available to every customer. Examples of these include restarting stuck pods, cleaning and resizing full disks, and capturing data for memory leaks. Shoreline has built over 35 notebooks and automations and is adding more every week.
Empower Engineers with Powerful Debugging Tools
When an engineer is assigned a new bug, they all too often have to SSH into box after box to uncover a needle-in-a-haystack type of problem. Shoreline transforms this experience with the ability to run targeted Linux commands across clouds, Kubernetes clusters, and cloud accounts.
Shoreline’s commands let you make requests like “Run a grep command on every box running the credit card processing app with CPU over 80%.” Customers have told Shoreline it’s “like having a database view into my infrastructure.”
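The “database view” idea can be pictured as filtering a table of hosts and then attaching a command to each match. The Python sketch below does this over an in-memory list. It does not use Shoreline's query language: the `Host` record, the `select_hosts` and `plan_commands` helpers, the host names, and the log path are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Host:
    """One row in a hypothetical fleet inventory."""
    name: str
    app: str
    cpu_percent: float


def select_hosts(fleet: List[Host], app: str, min_cpu: float) -> List[Host]:
    """Like a WHERE clause: hosts running `app` with CPU above `min_cpu`."""
    return [h for h in fleet if h.app == app and h.cpu_percent > min_cpu]


def plan_commands(hosts: List[Host], argv: List[str]) -> Dict[str, List[str]]:
    """Pair each selected host with the command that should run on it."""
    return {h.name: list(argv) for h in hosts}


fleet = [
    Host("node-a", "credit-card-processing", 91.0),
    Host("node-b", "credit-card-processing", 42.0),
    Host("node-c", "search", 88.0),
]
busy = select_hosts(fleet, app="credit-card-processing", min_cpu=80.0)
plan = plan_commands(busy, ["grep", "-c", "ERROR", "/var/log/app.log"])
```

Here only `node-a` satisfies both conditions, so the plan contains a single entry. Actually dispatching the command to the selected hosts is the distributed part that this sketch leaves out.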
How it Works
Shoreline implements an operations-at-the-edge architecture, which helps companies address cost, latency, and operational complexity better than centralized operations. This architecture employs an agent that runs as a container on cloud nodes and as a DaemonSet on Kubernetes clusters.
The agent analyzes data while running monitors, and when a monitor is triggered the runbook automation can take over to kick off diagnostic commands and remediations—all within your Amazon Virtual Private Cloud (VPC).
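The flow described above, where a monitor is evaluated on the node and hands off to diagnostics and remediation when it triggers, can be sketched in a few lines of Python. This is an illustration under assumptions, not the Shoreline agent: the `Monitor` class, the rule of firing after a number of consecutive breaches, and the callback signature are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Monitor:
    """A locally evaluated threshold check (hypothetical design)."""
    name: str
    threshold: float
    breaches_required: int  # assumed rule: fire after N consecutive breaches
    on_trigger: List[Callable[[str, float], None]]
    _streak: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        """Feed one sample; return True if this sample fired the monitor."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Fire exactly once per streak, at the moment it reaches the required length.
        if self._streak == self.breaches_required:
            for step in self.on_trigger:
                step(self.name, value)
            return True
        return False


events = []
monitor = Monitor(
    name="cpu_high",
    threshold=80.0,
    breaches_required=2,
    on_trigger=[
        lambda name, value: events.append(("diagnose", name, value)),
        lambda name, value: events.append(("remediate", name, value)),
    ],
)
fired = [monitor.observe(sample) for sample in [70.0, 85.0, 90.0, 95.0, 60.0]]
```

Because the samples are evaluated where they are produced, only the decision and its results need to leave the node, which is the cost and latency argument for running monitors at the edge.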
The second component of this architecture is the Shoreline control plane which can be deployed as software-as-a-service (SaaS) managed by Shoreline on AWS, or as a self-hosted appliance in the customer’s VPC.
The control plane enables easy node management that is intelligent enough to push down the data processing to the agents. In addition, it provides an intelligent query mechanism that can run distributed Linux commands across the Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Kubernetes Service (Amazon EKS) fleet. The control plane also provides webhooks to integrate with existing observability tooling such as Amazon CloudWatch.
Want to learn more? Check out this video to see Shoreline Incident Automation in action.
Customer Use Case
Dataiku Online knew from day one that providing an outstanding customer experience was its top priority. The company recognized that a reliable SaaS solution was required for any world-class service.
As Dataiku, an AWS Machine Learning Competency Partner, launched its new SaaS solution publicly in 2021, the organization saw usage spike. The infrastructure team got busy managing the environment to ensure performance and availability targets were exceeded for customers.
However, this work soon became quite repetitive as the same issues cropped up daily, or more often. For example:
- With a large and growing fleet under heavy use by data scientists, disks were going to fill up.
- With new infrastructure coming online all the time, some metadata would sometimes get corrupted.
- With free trials open to the public, some users would operate outside of the fair-use policy of the trial program.
The Dataiku team quickly realized that in order to scale the SaaS service, it needed to bring in automation for the remediation of these repetitive incidents. After a two-week proof of concept (PoC) of Shoreline, Dataiku realized over 20 FTE days of savings by automatically triggering 170 remediations.
Read the full case study to learn how Dataiku is saving days of DevOps work while improving app performance.
Shoreline’s cloud reliability platform leverages AWS services to increase productivity of DevOps teams by delivering rich diagnostics to engineering and making it easy to create self-healing infrastructure. Shoreline’s library of pre-built automations empowers support and L1 teams with self-service tools to provide a non-disruptive support experience to customers.
This solution has enabled Dataiku to scale its SaaS service by automatically triggering remediations for the most repetitive incidents. This has resulted in reduced manual intervention while increasing the productivity of the engineering team.
With Shoreline, available on AWS Marketplace, remediations that were previously tedious and time-consuming are transformed into an automated and interactive process. This brings increased reliability to cloud services and adds compliance, visibility, and auditability.
Shoreline – AWS Partner Spotlight
Shoreline is an AWS Partner that empowers support and L1 teams to resolve incidents, provides rich debugging capabilities to engineering teams to accelerate root cause analysis, and equips developers with powerful debugging tools that run across their cloud deployments.