Amazon's approach to production services monitoring
This session covers the full spectrum of monitoring at Amazon, from how teams assess system health at a high level to how they zoom in to understand the details of a single request. Also, learn how Amazon thinks about percentiles, dimensionality of metrics, dashboards, log analysis, and distributed tracing.
Operational Excellence at Amazon
In this session, learn about Amazon’s operational practices and how the habits that teams have adopted, such as handling retrospectives, sharing knowledge, and regularly reviewing operational metrics, led them to innovate, build better tools, and make architectural shifts.
Architecting and operating resilient serverless systems at scale
In this video, we cover what AWS does to build reliable and resilient services, including avoiding modes and overload, performing bounded work, throttling at multiple layers, guarding concurrency, sending idempotent requests, applying backpressure and fairness in queueing, and performing shuffle sharding.
Implementing health checks
Automatically detecting and mitigating server failures without unintended consequences from fleet-wide false positives.
Instrumenting distributed systems for operational visibility
Gaining operational visibility into production systems and troubleshooting failures with software instrumentation.
Using load shedding to avoid overload
Strategies for maintaining predictable, consistent performance in the face of overload.
Using dependency isolation to contain concurrency overload
Containing the impact caused by a failing dependency to affect only the relevant functionality in an application.
Fairness in multi-tenant systems
Building fairness into multi-tenant systems to provide predictable performance and availability.
Avoiding insurmountable queue backlogs
Prioritizing the draining of important workloads from queue backlogs quickly, and avoiding backlogs in the first place.
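As a rough illustration of the last entry, the sketch below shows one possible way to drain important work first: messages older than a cutoff are sidelined to a separate backlog queue, and the backlog is only worked on when no fresh messages remain. This is a minimal sketch under stated assumptions, not the method described in the linked article; the names `Message`, `drain`, `max_age`, and `budget` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    enqueued_at: float  # seconds, on the same clock as `now` below


def drain(fresh: deque, backlog: deque, now: float, max_age: float, budget: int) -> list:
    """Process up to `budget` messages, preferring fresh work over stale work.

    Hypothetical policy: a message older than `max_age` is moved to `backlog`
    without consuming budget; the backlog is drained only once `fresh` is empty.
    """
    processed = []
    while budget > 0 and (fresh or backlog):
        if fresh:
            msg = fresh.popleft()
            if now - msg.enqueued_at > max_age:
                backlog.append(msg)  # sideline stale work for later
                continue
        else:
            msg = backlog.popleft()  # spare capacity goes to the backlog
        processed.append(msg.body)  # stand-in for real message handling
        budget -= 1
    return processed
```

With two stale and two fresh messages and a budget of three, the fresh messages are handled first and only one stale message is picked up with the remaining capacity.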