The Amazon Builders' Library

How Amazon builds and operates software
Explore the library
  • Software Delivery & Operations

    Resilience lessons from the lunch rush

    Author: Mike Haken

    Strategies for making systems more resilient and controlling excessive load.

  • Software Delivery and Operations

    Level 300

    My CI/CD pipeline is my release captain

    Author: Clare Liguori

    Learn how Amazon continuously releases changes to production safely using practices such as trunk-based development, immutable deployment artifacts, and proactive rollbacks.

    PDF

  • Software Delivery & Operations

    Level 300

    Using dependency isolation to contain…

    Author: David Yanacek

    Containing the impact of a failing dependency so that it affects only the relevant functionality in an application.

    PDF

  • Architecture

    Level 300

    Minimizing correlated failures in distributed…

    Author: Joe Magerramov

    Continuing to operate even if some servers fail, while using relatively inexpensive commodity servers.

    PDF

  • Architecture

    Level 300

    Reliability, constant work, and a good cup…

    Author: Colm MacCarthaigh
    Simplifying systems to deliver stability by avoiding scaling during times of stress.

    PDF

  • Architecture

    Level 300

    Making retries safe with idempotent APIs

    Author: Malcolm Featonby
    Strategies for using idempotent APIs to reduce complexity and manage retries.

    PDF

  • Software Delivery & Operations

    Level 200

    Hands-off: Automating continuous delivery…

    Author: Clare Liguori

    In this session, learn about Amazon’s automated approach to continuous delivery that helps release code safely and quickly, with pipelines that enable developers to focus on building solutions rather than managing deployments.

  • Software Delivery & Operations

    Level 200

    Amazon's approach to production services…

    Author: David Yanacek

    This session covers the full spectrum of monitoring at Amazon, from how teams assess system health at a high level to how they zoom in to understand the details of a single request. Also, learn how Amazon thinks about percentiles, dimensionality of metrics, dashboards, log analysis, and distributed tracing.

  • Architecture

    Level 400

    Fairness in multi-tenant systems

    Author: David Yanacek
    Building fairness into multi-tenant systems to provide predictable performance and availability.

    PDF

  • Architecture

    Level 300

    Avoiding overload in distributed systems by…

    Author: Joe Magerramov
    Strategies for preventing a larger service from overloading a smaller one by putting the smaller service in control of the pace of interactions.

    PDF

  • Software Delivery and Operations

    Level 300

    Building dashboards for operational visibility

    Author: John O'Shea
    Building dashboards to monitor, dive deep, audit, and review distributed services and automated systems.

    PDF

  • Software Delivery and Operations

    Level 300

    Automating safe, hands-off deployments

    Author: Clare Liguori
    Strategies for continuously deploying to production while balancing safety and speed.

    PDF

  • Software Delivery and Operations

    Level 400

    Using load shedding to avoid overload

    Author: David Yanacek
    Strategies for maintaining predictable, consistent performance in the face of overload.

    PDF | Kindle

  • Architecture

    Level 400

    Workload isolation using shuffle-sharding

    Author: Colm MacCarthaigh
    Shuffle Sharding is one of our core techniques for drastically limiting the scope of impact of operational issues.

    PDF | Kindle

  • Architecture

    Level 400

    Architecting and operating resilient…

    Author: David Yanacek
    In this video, we cover what AWS does to build reliable and resilient services, including avoiding modes and overload, performing bounded work, throttling at multiple layers, guarding concurrency, sending idempotent requests, applying backpressure and fairness in queueing, and performing shuffle sharding.
  • Software Delivery and Operations

    Level 400

    Amazon's approach to high-availability…

    Author: Peter Ramensky
    In this video, learn the continuous-delivery practices that we invented that help raise the bar and prevent costly deployment failures.
  • Architecture

    Level 400

    Avoiding insurmountable queue backlogs

    Author: David Yanacek
    Prioritizing the quick draining of important workloads from queue backlogs, and avoiding backlogs in the first place.

    PDF | Kindle

  • Architecture

    Level 300

    Caching challenges and strategies

    Authors: Matt Brinkley, Jas Chhabra
    Improving latency and availability with caching while avoiding the modal behavior that caches can introduce.

    PDF | Kindle

  • Architecture

    Level 300

    Amazon’s approach to security during…

    Author: Colm MacCarthaigh
    In this video, learn about how AWS teams both minimize security risks in our products and respond to security issues proactively.
  • Architecture

    Level 200

    Timeouts, retries and backoff with jitter

    Author: Marc Brooker
    Building resilient systems and dealing with failures by using timeouts, retries, and backoff with jitter.

    PDF | Kindle

  • Software Delivery and Operations

    Level 300

    Going faster with continuous delivery

    Author: Mark Mansour
    Automating the software testing and deployment process for speed and reliability.

    PDF | Kindle

  • Architecture

    Level 400

    Beyond five 9s: Lessons from our highest…

    Author: Colm MacCarthaigh
    In this video, hear lessons from how AWS has built and architected Amazon Route 53 and the AWS authentication system, designed to survive cataclysmic failures, enormous load increases, and more.
  • Architecture

    Level 300

    Static stability using availability zones

    Authors: Becky Weiss, Mike Furr
    Architecting to use multiple availability zones for high availability and ensuring systems are statically stable.

    PDF | Kindle

  • Software Delivery and Operations

    Level 400

    Implementing health checks

    Author: David Yanacek
    Automatically detecting and mitigating server failures without unintended consequences from fleet-wide false positives.

    PDF | Kindle
