What is site reliability engineering?

Site reliability engineering (SRE) is the practice of using software tools to automate IT infrastructure tasks such as system management and application monitoring. Organizations use SRE to ensure their software applications remain reliable amidst frequent updates from development teams. SRE especially improves the reliability of scalable software systems because managing a large system using software is more sustainable than manually managing hundreds of machines. 

Why is site reliability engineering important?

Site reliability describes the stability and quality of service that an application offers after being made available to end users. Software maintenance sometimes affects software reliability if technical issues go undetected. For example, when developers make new changes, they might inadvertently impact the existing application and cause it to crash for certain use cases.

The following are some benefits of site reliability engineering (SRE) practices.

Improved collaboration

SRE improves collaboration between development and operations teams. Developers often have to make rapid changes to an application to release new features or fix critical bugs. On the other hand, the operations team has to ensure seamless service delivery. Hence, the operations team uses SRE practices to closely monitor every update and promptly respond to any issues that arise due to changes.

Enhanced customer experience

Organizations use an SRE model to ensure software errors do not impact the customer experience. For example, software teams use SRE tools to automate the software development lifecycle. This reduces errors, meaning the team can prioritize new feature development over bug fixes.

Improved operations planning

The SRE team accepts that there's a realistic chance for software to fail. Therefore, the team plans for the appropriate incident response to minimize the impact of downtime on the business and end users. They can also better estimate the cost of downtime and understand the impact of such incidents on business operations. 

What are the key principles in site reliability engineering?

The following are some key principles of site reliability engineering (SRE).

Application monitoring

SRE teams accept that errors are a part of the software deployment process. Instead of striving for a perfect solution, they monitor software performance in terms of service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs). They observe and monitor performance metrics after deploying the application in production environments. 

Gradual change implementation

SRE practices encourage the release of frequent but small changes to maintain system reliability. SRE automation tools use consistent and repeatable processes to do the following:

  • Reduce risks due to changes.
  • Provide feedback loops to measure system performance.
  • Increase speed and efficiency of change implementation.

Automation for reliability improvement

SRE uses policies and processes that embed reliability principles in every step of the delivery pipeline. Some strategies that automatically resolve problems include the following:

  • Developing quality gates based on service-level objectives to detect issues earlier
  • Automating build testing using service-level indicators
  • Making architectural decisions that ensure system resiliency at the outset of software development
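As an illustration of the first strategy, a quality gate can compare indicators measured on a candidate build against SLO thresholds and block the release when any threshold is violated. The following is a minimal sketch, not the interface of any particular SRE tool. The metric names, the thresholds, and the `evaluate_gate` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """A hypothetical SLO threshold for one indicator."""
    name: str
    threshold: float
    higher_is_better: bool


def evaluate_gate(measured: dict[str, float], objectives: list[Objective]) -> list[str]:
    """Return the names of the objectives that the measured values violate."""
    violations = []
    for obj in objectives:
        value = measured.get(obj.name)
        if value is None:
            # A missing measurement counts as a violation: the gate cannot pass blind.
            violations.append(obj.name)
        elif obj.higher_is_better and value < obj.threshold:
            violations.append(obj.name)
        elif not obj.higher_is_better and value > obj.threshold:
            violations.append(obj.name)
    return violations


# Illustrative thresholds only (assumed values).
OBJECTIVES = [
    Objective("success_rate_percent", 99.9, higher_is_better=True),
    Objective("p95_latency_ms", 300.0, higher_is_better=False),
]

if __name__ == "__main__":
    candidate = {"success_rate_percent": 99.95, "p95_latency_ms": 420.0}
    failed = evaluate_gate(candidate, OBJECTIVES)
    print("gate passed" if not failed else f"gate failed: {failed}")
```

In a delivery pipeline, a non-empty result would typically stop the deployment step so that the issue is caught before it reaches production.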

What is observability in site reliability engineering?

Observability is a process that prepares the software team for uncertainties when the software goes live for end users. Site reliability engineering (SRE) teams use tools to detect abnormal behaviors in the software and, more importantly, collect information that helps developers understand what causes the problem. Observability involves collecting the following information with SRE tools. 

Metrics 

Metrics are quantifiable values that reflect an application's performance or system health. SRE teams use metrics to determine if the software consumes excessive resources or behaves abnormally.

Logs

SRE software generates detailed, time-stamped information called logs in response to specific events. Software engineers use logs to understand the chain of events that leads to a particular problem. 
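To make this concrete, the following sketch emits time-stamped log lines with Python's standard `logging` module. The service name `checkout-service`, the `build_logger` helper, and the event fields are assumptions chosen for illustration.

```python
import logging
import sys


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes time-stamped lines to the given stream."""
    logger = logging.getLogger("checkout-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called more than once
    logger.propagate = False  # keep output on this stream only
    handler = logging.StreamHandler(stream)
    # %(asctime)s supplies the timestamp that lets engineers order events.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stderr)
    log.info("payment authorized order_id=%s", "A-1001")
    log.error("vendor submission failed order_id=%s reason=%s", "A-1001", "timeout")
```

Reading the two lines in timestamp order shows that the payment succeeded before the vendor submission failed, which narrows down where to investigate.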

Traces 

Traces are observations of the code path of a specific function in a distributed system. For example, checking out an order cart might involve the following:

  • Tallying the price with the database
  • Authenticating with the payment gateway
  • Submitting the orders to vendors

Traces consist of an ID, name, and time. They help software developers detect latency issues and improve software performance. 
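The checkout example above can be sketched as a list of timed spans that share one trace ID. The field names, span names, and timings below are hypothetical. Real tracing tools define their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One timed step within a trace (field names are illustrative)."""
    trace_id: str
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_span(spans: list[Span]) -> Span:
    """Return the span that took the longest, a first hint at where latency comes from."""
    return max(spans, key=lambda s: s.duration_ms)


# Hypothetical timings for the three checkout steps.
CHECKOUT_TRACE = [
    Span("trace-42", "tally_price_with_database", 0.0, 35.0),
    Span("trace-42", "authenticate_with_payment_gateway", 35.0, 410.0),
    Span("trace-42", "submit_orders_to_vendors", 410.0, 520.0),
]

if __name__ == "__main__":
    worst = slowest_span(CHECKOUT_TRACE)
    print(f"{worst.name}: {worst.duration_ms:.0f} ms")
```

With these assumed timings, the payment gateway step accounts for most of the request time, so that is where a developer would look first.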

What is monitoring in site reliability engineering?

Monitoring is a process of observing predefined metrics in an application. Developers decide which parameters are critical in determining the application health and set them in monitoring tools. Site reliability engineering (SRE) teams collect critical information that reflects the system performance and visualize it in charts.

In SRE, software teams monitor these metrics to gain insight into system reliability.

Latency 

Latency describes the delay when the application responds to a request. For example, a form submission on a website takes 3 seconds before it directs users to an acknowledgment webpage. 

Traffic

Traffic measures the number of users concurrently accessing your service. It helps software teams budget computing resources so that they can maintain a satisfactory service level for all users.

Error

Error is a condition where the application fails to perform or deliver according to expectations. For example, a webpage fails to load or a transaction does not go through. SRE teams use software tools to automatically track and respond to errors in the application. 

Saturation

Saturation indicates how much of the application's capacity is in use in real time. A high level of saturation usually results in degraded performance. Site reliability engineers monitor the saturation level and ensure it stays below a particular threshold. 
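The four metrics above can be derived from a window of request records plus a capacity reading. The sketch below makes several assumptions: the `Request` record shape is hypothetical, traffic is approximated as requests per second rather than concurrent users, latency is summarized as a nearest-rank 95th percentile, and saturation is the used share of some capacity unit, such as CPU cores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (a hypothetical record shape)."""
    latency_ms: float
    ok: bool


def summarize(requests: list[Request], window_seconds: float,
              used_capacity: float, total_capacity: float) -> dict[str, float]:
    """Compute latency, traffic, error rate, and saturation for one window."""
    if not requests or window_seconds <= 0 or total_capacity <= 0:
        raise ValueError("need at least one request and positive window and capacity")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid rounding surprises.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if not r.ok)
    return {
        "p95_latency_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": used_capacity / total_capacity,
    }


if __name__ == "__main__":
    sample = [Request(100.0, True)] * 19 + [Request(3000.0, False)]
    for name, value in summarize(sample, 10.0, 6.0, 8.0).items():
        print(f"{name}: {value}")
```

A monitoring tool would typically chart these values over time and raise an alert when one crosses a threshold that the team has chosen.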

What are the key metrics for site reliability engineering?

Site reliability engineering (SRE) teams measure the quality of service delivery and reliability using the following metrics. 

Service-level objectives

Service-level objectives (SLOs) are specific and quantifiable goals that you are confident the software can achieve at a reasonable cost to other metrics, such as the following: 

  • Uptime or the time a system is in operation
  • System throughput or output
  • Download rate or the speed at which the application loads

An SLO is a commitment about the level of service that the software delivers to the customer. For example, you set a 99.95% uptime target for your company's food delivery app.

Service-level indicators

Service-level indicators (SLIs) are the actual measurements of the metric an SLO defines. In real-life situations, you might get values that match or differ from the SLO. For example, your application is up and running 99.92% of the time, which is lower than the promised SLO. 
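The comparison in this example can be sketched as follows. The 30-day measurement window and the function names are assumptions. Teams choose their own windows and definitions of downtime.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Return the measured availability (the SLI) as a percentage of the period."""
    if period_minutes <= 0 or not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie between 0 and the period length")
    return (period_minutes - downtime_minutes) / period_minutes * 100.0


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured indicator is at least the objective."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    period = 30 * 24 * 60  # assumed 30-day window, in minutes
    sli = availability_percent(period, downtime_minutes=34.56)
    status = "met" if meets_slo(sli, 99.95) else "missed"
    print(f"SLI {sli:.2f}% vs SLO 99.95%: {status}")
```

Under these assumptions, about 34.56 minutes of downtime in 30 days corresponds to the 99.92% figure in the example, which falls short of a 99.95% objective.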

Service-level agreements

Service-level agreements (SLAs) are legal documents that state what happens when one or more SLOs are not met. For example, the SLA states that the technical team will resolve your customer's issue within 24 hours after a report is received. If your team cannot resolve the problem within the specified duration, you might be obligated to refund the customer.

Error budgets

Error budgets are the noncompliance tolerance for the SLO. For example, an uptime of 99.95% in the SLO means that the allowed downtime is 0.05%. If the software downtime exceeds the error budget, the software team devotes all resources and attention to stabilizing the application.
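As a worked example of this arithmetic, the sketch below converts an uptime objective into allowed downtime and checks whether the budget is spent. The 30-day window and the function names are assumptions. Over such a window, a 99.95% objective allows roughly 21.6 minutes of downtime.

```python
def allowed_downtime_minutes(slo_percent: float, period_minutes: float) -> float:
    """Return the error budget: the downtime the SLO tolerates in the period."""
    if not 0 < slo_percent <= 100 or period_minutes <= 0:
        raise ValueError("SLO must be in (0, 100] and the period must be positive")
    return period_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining_minutes(slo_percent: float, period_minutes: float,
                             downtime_minutes: float) -> float:
    """Return the unspent budget; a negative value means the budget is exceeded."""
    return allowed_downtime_minutes(slo_percent, period_minutes) - downtime_minutes


def should_freeze_releases(slo_percent: float, period_minutes: float,
                           downtime_minutes: float) -> bool:
    """Hypothetical policy: pause new changes once the budget is exceeded."""
    return budget_remaining_minutes(slo_percent, period_minutes, downtime_minutes) < 0


if __name__ == "__main__":
    period = 30 * 24 * 60  # assumed 30-day window, in minutes
    print(f"budget: {allowed_downtime_minutes(99.95, period):.1f} min")
    print(f"freeze after 34.56 min down: {should_freeze_releases(99.95, period, 34.56)}")
```

Expressing the budget in minutes makes the trade-off tangible: each incident spends part of the budget, and the remaining amount indicates how much risk the team can still take on new releases.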

How does site reliability engineering work?

Site reliability engineering (SRE) involves the participation of site reliability engineers in a software team. The SRE team sets the key metrics for SRE and creates an error budget determined by the system's level of risk tolerance. If the number of errors is low, the development team can release new features. However, if the errors exceed the permitted error budget, the team puts new changes on hold and solves existing problems.

For example, a site reliability engineer uses a service to monitor performance metrics and detect anomalous application behavior. If there are issues with the application, the SRE team submits a report to the software engineering team. The developers fix the reported cases and publish the updated application.

DevOps

DevOps is a software culture that breaks down the traditional boundary between development and operations teams. With DevOps, developers and operations engineers no longer work in silos. Instead, they use software tools to improve collaboration and keep up with the rapid pace of software update releases.

SRE compared to DevOps 

SRE is the practical implementation of DevOps. DevOps provides the philosophical foundation of what must be done to maintain software quality amid increasingly short development timelines. Site reliability engineering offers the answers to how to achieve DevOps success. SRE ensures that the DevOps team strikes the right balance between speed and stability. 

What are the responsibilities of a site reliability engineer?

A site reliability engineer is an IT expert who uses automation tools to monitor and observe software reliability in the production environment. They are also experienced in finding problems in software and writing code to fix them. They are typically former system administrators or operations engineers with good coding skills. The following are some site reliability responsibilities.

Operations

Site reliability engineers spend up to half of their time in operations work. This includes several tasks, such as the following: 

  • Emergency incident response
  • Change management
  • IT infrastructure management

The engineers use site reliability engineering (SRE) tools to automate several operations tasks and increase team efficiency.

System support

Site reliability engineers work closely with the development team to create new features and stabilize production systems. They create an SRE process for the entire software team and are on hand to support escalated issues. More importantly, site reliability teams provide documented procedures to customer support to help them effectively deal with complaints. 

Process improvement

Site reliability engineers improve the software development life cycle by holding post-incident reviews. The SRE team documents all software problems and respective solutions in a shared knowledge base. This helps the software team efficiently respond to similar issues in the future. 

What are the common site reliability engineering tools?

Site reliability engineering (SRE) teams use different types of tools to facilitate monitoring, observation, and incident response. 

Container orchestrator 

Software developers use a container orchestrator to run containerized applications on various platforms. Containerized applications store their code files and related resources within a single package called a container. For example, software engineers use Amazon Elastic Kubernetes Service (Amazon EKS) to run and scale cloud applications. 

On-call management tools 

On-call management tools are software that allows SRE teams to plan, arrange, and manage support personnel who deal with reported software problems. SRE teams use the software to ensure there is always a support team on standby to receive timely alerts on software issues. 

Incident response tools 

Incident response tools ensure a clear escalation pathway for detected software issues. SRE teams use incident response tools to categorize the severity of reported cases and deal with them promptly. The tools can also provide post-incident analysis reports to prevent similar problems from happening again. 

Configuration management tools

Configuration management tools are software that automates software workflows. SRE teams use these tools to remove repetitive tasks and become more productive. For example, site reliability engineers use AWS OpsWorks to automatically set up and manage servers on AWS environments. 

How does AWS help with site reliability engineering?

AWS Management and Governance services provide the necessary tools for the software team to build, scale, and deploy distributed applications without compromising system reliability. The site reliability engineering (SRE) team uses various AWS Management and Governance services to monitor and govern AWS and on-premises computing resources.

  • AWS Service Catalog allows SRE teams to catalog, manage, and quickly deploy IT services.
  • AWS Systems Manager provides a centralized management hub for site reliability engineers to gain operational insights into software computing resources.
  • AWS Proton is an automated management tool for deploying containerized and serverless applications.

Get started with site reliability engineering on AWS by creating an AWS account today.
