Optimize your applications for scale and reliability on Amazon Bedrock
As generative AI applications scale to serve more users and handle increasingly complex workloads, understanding how to protect your application's availability through proper error handling becomes essential. Two error types—429 ThrottlingException and 503 ServiceUnavailableException—are important signals that your application is reaching operational thresholds that require attention. While these errors are typically retriable, how you handle them directly impacts user experience. Delays in responding can disrupt a conversation's natural flow and reduce user engagement. The difference between a reliable, production-ready application and one that struggles under load often comes down to implementing the right error handling strategies and quota management practices from the start.
This post provides practical strategies for building reliable applications on Amazon Bedrock. We’ll explore proven patterns for error handling, quota optimization, and architectural resilience that help your applications scale reliably. Whether you’re launching your first AI feature or optimizing a mature production system, you’ll find actionable guidance for operating confidently at any scale. Amazon Bedrock is designed to scale with your applications, offering access to industry-leading foundation models with built-in capabilities like cross-region inference and configurable throughput. The patterns in this post help you take full advantage of these capabilities while building applications that remain responsive as demand grows.
Prerequisites
- AWS account with Amazon Bedrock access
- Python 3.x and boto3 installed
- Basic understanding of AWS services
- IAM Permissions: Make sure you have the following minimum permissions:
  - `bedrock:InvokeModel` or `bedrock:InvokeModelWithResponseStream` for your specific models
  - `cloudwatch:PutMetricData` and `cloudwatch:PutMetricAlarm` for monitoring
  - `sns:Publish` if using SNS notifications
- Follow the principle of least privilege – grant only the permissions needed for your use case
Note: This walkthrough uses AWS services that may incur charges, including Amazon CloudWatch for monitoring and Amazon SNS for notifications. See AWS pricing pages for details.
Quick Reference: 503 vs 429 Errors
The following table compares these two error types:
| Aspect | 503 ServiceUnavailable | 429 ThrottlingException |
|---|---|---|
| Primary Cause | Transient service unavailability | Exceeded account quotas (RPM/TPM) |
| Quota Related | Not Quota Related | Directly quota-related |
| Resolution Time | Transient, refreshes faster | Requires waiting for quota refresh |
| Retry Strategy | Immediate retry with exponential backoff | Must sync with 60-second quota cycle |
| User Action | Wait and retry, consider alternatives | Optimize request patterns, increase quotas |
Deep dive into 429 ThrottlingException
A 429 ThrottlingException means Amazon Bedrock is deliberately rejecting some of your requests to keep overall usage within the quotas you have configured or that are assigned by default. In practice, you will most often see three flavors of throttling: rate‑based, token‑based, and model‑specific.
1. Rate-Based Throttling (RPM – Requests Per Minute)
Error Message:
ThrottlingException: Too many requests, please wait before trying again.
Or:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many requests, please wait before trying again
What this indicates
Rate‑based throttling is triggered when the total number of Bedrock requests per minute to a given model and Region exceeds the RPM quota for your account. The key detail is that this limit is enforced across all callers within an account, not just per individual application or microservice.
Imagine a shared queue at a coffee shop: it does not matter who is standing in line; the barista can only serve a fixed number of drinks per minute. When demand exceeds the barista’s capacity, some customers are told to wait or come back later. That “come back later” message is your 429.
Multi-application spike scenario
Suppose you have three production applications, all calling the same Amazon Bedrock model in the same Region:
- App A normally peaks around 2,000 requests per minute.
- App B also peaks around 2,000 rpm.
- App C usually runs at about 2,000 rpm during its own peak.
Ops has requested a quota of 6,000 RPM for this model, which seems reasonable since 2,000 + 2,000 + 2,000 = 6,000 and historical dashboards show that each app stays around its expected peak.
However, in reality your traffic is not perfectly flat. During a flash sale or a marketing campaign, App A might briefly spike to 2,400 rpm while B and C stay at 2,000. The combined total for that minute becomes 6,400 rpm, which is above your 6,000 rpm quota, and some requests start failing with ThrottlingException.
You can also experience throttling when demand shifts higher on any of the applications (while the others remain constant). Imagine a new pattern where peak traffic looks like this:
- App A: 3,000 rpm
- App B: 2,000 rpm
- App C: 2,000 rpm
Your new true peak is 7,000 rpm even though the original quota was sized for 6,000. In this situation, you will see 429 errors when all three applications are at peak traffic, even if average daily traffic still looks “fine.”
For rate‑based throttling, the mitigation has two components: client behavior and quota management.
On the client side:
- Implement request rate limiting to cap how many calls per second or per minute each application can send. APIs, SDK wrappers, or sidecars like API gateways can enforce per‑app budgets so one noisy client does not starve others.
- Use exponential backoff with jitter on 429 errors so that retries can become gradually less frequent and are de‑synchronized across instances. AWS recommends using jitter (a random amount of time before making or retrying a request) to help prevent large bursts by spreading out the arrival rate.
- Implement retry strategies that account for the quota refresh period: because RPM is enforced per 60-second window, spreading retries throughout the next minute increases success likelihood. AWS recommends distributing requests across multiple seconds within a 1-minute period and ensuring retry backoff lasts a full minute when you are hitting per-minute quotas.

On the quota side:
- Analyze CloudWatch metrics for each application to determine true peak RPM rather than relying on averages.
- Sum those peaks across the apps for the same model/Region, add a safety margin, and request an RPM increase through AWS Service Quotas if needed.
In the previous example, if App A peaks at 3,000 rpm and B and C peak at 2,000 rpm, you should plan for at least 7,000 rpm and realistically target something like 8,000 rpm to provide room for growth and unexpected bursts.
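The quota-sizing arithmetic above can be sketched as a small helper. The 15% safety margin is an assumption; choose one that fits your growth plans and burst patterns:

```python
import math

def recommended_rpm_quota(per_app_peaks, margin=0.15):
    """Sum true per-app peak RPM and add headroom for bursts."""
    true_peak = sum(per_app_peaks)
    return math.ceil(true_peak * (1 + margin))

# 3,000 + 2,000 + 2,000 = 7,000 true peak; 15% margin suggests ~8,050 RPM
print(recommended_rpm_quota([3000, 2000, 2000]))  # 8050
```

The key point is to feed this with observed per-app peaks from CloudWatch, not averages.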
2. Token-Based Throttling (TPM – Tokens Per Minute)
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many tokens, please wait before trying again.
Why token limits matter
Even if your request count is modest, a single large prompt or a model that produces long outputs can consume thousands of tokens at once. Token‑based throttling occurs when the sum of input and output tokens processed per minute exceeds your account’s TPM quota for that model.
For example, if your application uses Claude Opus 4.6 with a default quota of 2,000,000 tokens per minute (TPM), and you’re generating responses that average 1,000 output tokens, you could theoretically handle 2,000 requests per minute. However, you also need to account for input tokens in your rate calculations. If each request includes 500 input tokens, your effective capacity becomes approximately 1,333 requests per minute (2M TPM ÷ 1,500 tokens per request). All major Bedrock models have TPM quotas ranging from 200K to 8M TPM depending on the model, with newer models like Claude Sonnet 4.6 offering 5M TPM. These quotas are adjustable through AWS Service Quotas, allowing you to request increases as your application scales.
What this looks like in practice
You may notice that your application runs smoothly under normal workloads, but suddenly starts failing when users paste large documents, upload long transcripts, or run bulk summarization jobs. These are symptoms that token throughput, not request frequency, is the bottleneck. For bulk jobs, you should be using batch inference, which has separate quotas (up to 10,000 records per batch with a 24-hour processing window) and offers a 50% price reduction compared to on-demand inference.
How to respond
To mitigate token‑based throttling:
- Monitor token usage by tracking `InputTokenCount` and `OutputTokenCount` metrics and logs for your Bedrock invocations.
- Implement a token‑aware rate limiter that maintains a sliding 60‑second window of tokens consumed and only issues a new request if there is enough budget left.
- Break large tasks into smaller, sequential chunks, so you spread token consumption over multiple minutes instead of exhausting the entire budget in one spike.
- Use streaming responses when appropriate; streaming often gives you more control over when to stop generation, so you do not produce unnecessarily long outputs.
For consistently high‑volume, token‑intensive workloads, you should also evaluate requesting higher TPM quotas or using models with larger context windows and better throughput characteristics.
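Chunking large tasks can be as simple as the following sketch. The 4-characters-per-token heuristic is a rough assumption for illustration; use your model's tokenizer for accurate counts:

```python
def chunk_text(text, max_tokens_per_chunk, chars_per_token=4):
    """Split text into pieces that each fit within a token budget,
    so token consumption spreads across multiple minutes."""
    max_chars = max_tokens_per_chunk * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_text("long document " * 1000, max_tokens_per_chunk=500)
```

Each chunk can then be submitted sequentially, pausing between chunks when the minute's token budget is exhausted.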
3. How max_tokens Influences Token-Based Throttling
What is happening behind the scenes
While monitoring InputTokenCount and OutputTokenCount helps you understand actual token consumption, it’s also important to note that the max_tokens parameter plays a role in how tokens are managed during request processing.
The token lifecycle
To understand this, let's trace the token lifecycle. When you make a request to Bedrock, the quota system manages tokens through three stages:
- At request start: Total input tokens + `max_tokens` are deducted from your TPM quota
- During processing: The quota consumed is dynamically adjusted based on actual output tokens generated
- At request end: Final quota consumption is calculated as `InputTokenCount` + `CacheWriteInputTokens` + (`OutputTokenCount` × burndown rate), and any unused tokens are replenished
Why this matters for token-based throttling
If you’re hitting TPM quotas earlier than expected despite modest actual token usage, the max_tokens parameter may be causing this. Consider this example using a model with a 5x burndown rate:
- Setting `max_tokens=32,000` with 8,000 input tokens initially deducts 40,000 tokens from your quota

If the model only generates 1,000 output tokens, the final adjusted deduction is 9,000 tokens. During that initial period, the high `max_tokens` value temporarily reduces the concurrent request capacity available in your token bucket. Compare this to an optimized scenario:
- Setting `max_tokens=1,250` for the same request initially deducts 9,250 tokens
The final adjusted deduction is still 9,000 tokens, but your concurrent request capacity remains higher throughout the request lifecycle. This is why tracing the `max_tokens` parameter matters. While CloudWatch metrics track `InputTokenCount` and `OutputTokenCount`, the `max_tokens` parameter can be captured through CloudWatch Logs when model invocation logging is enabled. You can analyze these patterns to identify gaps between your `max_tokens` settings and actual output token usage.
Optimization strategies
To optimize max_tokens for better quota utilization:
- Right-size based on use case: Set `max_tokens` to approximate your expected completion size rather than using arbitrarily high values
- Use CloudWatch metrics: Examine `InputTokenCount` and `OutputTokenCount` patterns to guide your `max_tokens` decisions
- Vary by request type: Adjust `max_tokens` dynamically based on the specific request rather than using a single high value for all scenarios
- Account for burndown rates: Remember that models like Claude Opus 4, Sonnet 4.5, Sonnet 4, Claude 3.7 Sonnet, and Haiku 4.5 have a 5x burndown rate for output tokens
Implementing robust retry and rate limiting
Once you understand the types of throttling, the next step is to encode that knowledge into reusable client-side components.
Exponential backoff with jitter
When handling throttling, exponential backoff with jitter is essential for graceful recovery. AWS provides two approaches: built-in boto3 retry configuration (recommended) and custom retry logic.
Recommended: Built-in Boto3 Retry Configuration
The simplest and most reliable approach is to use boto3’s built-in retry mechanism with adaptive mode, which automatically handles throttling with exponential backoff and jitter:
The adaptive retry mode intelligently adjusts retry behavior based on throttling patterns, providing better performance than fixed exponential backoff. This approach requires no additional error handling code and is maintained by AWS as part of the SDK.
Alternative: Custom Retry Implementation
For scenarios requiring custom retry logic (e.g., specific logging, metrics collection, or non-standard retry patterns), you can implement your own retry mechanism:
This pattern prevents overwhelming the service immediately after a throttling event and helps distribute retry attempts across time, reducing the likelihood of synchronized retries from multiple instances. However, for most use cases, the built-in boto3 retry configuration is preferred as it’s simpler, well-tested, and automatically maintained by AWS.
Token-Aware Rate Limiting
For token‑based throttling, the following class maintains a sliding window of token usage and gives your caller a simple yes/no answer on whether it is safe to issue another request:
```python
import time
from collections import deque

class TokenAwareRateLimiter:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.token_usage = deque()

    def can_make_request(self, estimated_tokens):
        now = time.time()
        # Drop usage entries older than 1 minute from the sliding window
        while self.token_usage and self.token_usage[0][0] < now - 60:
            self.token_usage.popleft()
        current_usage = sum(tokens for _, tokens in self.token_usage)
        return current_usage + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens_used):
        self.token_usage.append((time.time(), tokens_used))
```
In practice, you would estimate tokens before sending the request, call can_make_request, and only proceed when it returns True, then call record_usage after receiving the response.
Important: Multi-application Quota Sharing
The above implementation works for a single application, but Amazon Bedrock quotas are account-level and region-specific, meaning all applications within the same AWS account and region share the same quota pool. If you have multiple applications, this local rate limiter won’t prevent quota exhaustion because each application only tracks its own usage.
Recommended Practices for Multi-application environments:
The best approach depends on your organizational structure and isolation requirements:
Separate AWS Accounts (Recommended for most organizations): Deploy each application or team in its own AWS account to get independent quota allocations, eliminating quota contention entirely. This aligns with AWS best practices for account isolation, provides clear cost attribution, and simplifies security boundaries. This is particularly important for production workloads where one application’s usage shouldn’t impact another’s availability.
Alternative approaches for specific scenarios:
Application Inference Profiles (AIPs): Best for organizations that need to share quotas but want granular cost tracking and monitoring per application. Use AIPs combined with CloudWatch alarms to monitor usage and trigger automated responses when thresholds are exceeded.
Centralized Rate Limiting: Suitable for development/testing environments or when you need fine-grained control over quota distribution. Implement a shared rate limiting service (using DynamoDB, Redis, or API Gateway) that all applications query before making Bedrock requests to ensure account-wide quota awareness.
Reserved Capacity (Provisioned Throughput): For predictable, high-volume workloads, reserve dedicated capacity for critical applications to ensure they aren’t affected by other applications’ usage, regardless of account structure.
Understanding 503 ServiceUnavailableException
A 503 ServiceUnavailableException tells you that Amazon Bedrock is temporarily unable to process your request. Unlike 429, this is not about your quota; it is about temporary conditions on the service side.
Temporary Service Resource Issues
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the InvokeModel operation: Service temporarily unavailable, please try again.
In this case, the Bedrock service is signaling a transient capacity or infrastructure issue, often affecting on-demand models during demand spikes. Here you should treat the error as a temporary outage and focus on retrying smartly and failing over gracefully:
- Use exponential backoff retries, similar to your 429 handling, but with parameters tuned for slower recovery.
- Consider using cross-Region inference or different service tiers to help get more predictable capacity envelopes for your most critical workloads.
Circuit Breaker Pattern
Advanced resilience for mission-critical systems: When you operate mission-critical systems, simple retries are not enough—you also want to avoid making a bad situation worse. The circuit breaker pattern is a standard distributed systems practice for any application that depends on external services. It helps your application respond gracefully during transient conditions by temporarily pausing requests rather than repeatedly attempting calls that are unlikely to succeed. This pattern is recommended for all integrations—whether calling databases, third-party APIs, or AI services—to maintain overall application stability. For detailed guidance, see the AWS Prescriptive Guidance on Circuit Breaker Pattern and the AWS blog post on Using the Circuit Breaker Pattern with AWS Step Functions and Amazon DynamoDB.
The circuit breaker prevents your application from continuously making failing requests. After detecting repeated failures, it automatically transitions to an “open” state, blocking new requests during a cooling-off period while the service recovers.
- CLOSED (Normal): Requests flow normally.
- OPEN (Failing): After repeated failures, new requests are rejected immediately, helping reduce pressure on the service and conserve client resources.
- HALF_OPEN (Testing): After a timeout, a small number of trial requests are allowed; if they succeed, the circuit closes again.
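The three states above can be sketched as a minimal illustrative class. The thresholds and timeout are assumptions for demonstration; as noted below, a maintained library is preferable in production:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # allow a trial request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result
```

Wrap your Bedrock invocation in `breaker.call(...)` so repeated 503s trip the breaker instead of hammering the service.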
Why This Matters for Bedrock
When any service experiences high demand, implementing circuit breakers helps maintain overall system stability and allows faster recovery. Circuit breakers:
- Reduce pressure on the service, helping it recover faster
- Fail fast instead of wasting time on requests that will likely fail
- Provide automatic recovery by periodically testing if the service is healthy again
- Improve user experience by returning errors quickly rather than timing out
Implementation Recommendation:
To keep maintenance overhead to a minimum, use established libraries rather than custom implementations. Well-maintained options include:
- pybreaker – Mature circuit breaker implementation with support for multiple failure detection strategies
- tenacity – Flexible retry library with circuit breaker capabilities and extensive configuration options
These libraries provide battle-tested implementations with proper state management, thread safety, and monitoring hooks. Custom implementations should only be considered when you have specific requirements that existing libraries cannot satisfy, such as integration with proprietary monitoring systems or unique failure detection logic that goes beyond standard error rate thresholds.
Note: The AWS SDK’s adaptive retry mode (discussed separately in this document) provides built-in token bucket rate limiting and automatic backoff, which addresses many throttling scenarios. Circuit breakers complement this by adding explicit state management and fail-fast behavior across your application layer.
Cross-Region Failover Strategy with CRIS
Amazon Bedrock Cross‑Region Inference (CRIS) adds another layer of resilience by giving you a managed way to route traffic across Regions.
- Global CRIS Profiles: Route traffic to any AWS commercial Region worldwide, offering the highest available throughput and approximately 10% cost savings compared to Geographic CRIS. Global CRIS represents the baseline pricing model for cross-region inference.
- Geographic CRIS Profiles: Confine traffic to specific geographies (for example, US‑only, EU‑only, APAC‑only, JP-only, and AU-only) to satisfy strict data residency or regulatory requirements. Geographic profiles incur standard pricing without the cost optimization benefits of Global CRIS, as they require additional infrastructure constraints to maintain data within geographic boundaries.
For applications without data residency requirements, global CRIS offers enhanced performance, reliability, and cost efficiency.
From an architecture standpoint:
- For non‑regulated workloads, using a global profile can significantly improve availability, absorb regional spikes, and reduce costs.
- For regulated workloads, configure geographic profiles that align with your compliance boundaries, and document those decisions in your governance artifacts.
Amazon Bedrock encrypts data in transit using TLS and does not store customer prompts or outputs by default. All data transmitted during cross-Region operations remains on the AWS network and does not traverse the public internet. Combine this with CloudTrail logging for compliance posture.
Monitoring and Observability for 429 and 503 Errors
You cannot manage what you cannot see, so robust monitoring is essential when working with quota-driven errors and service availability. Comprehensive Amazon CloudWatch monitoring enables proactive error management and helps maintain application reliability.
Note: CloudWatch custom metrics, alarms, and dashboards incur charges based on usage. Review CloudWatch pricing for details.
Essential CloudWatch Metrics
Monitor these CloudWatch metrics:
- Invocations: Successful model invocations
- InvocationClientErrors: 4xx errors including throttling
- InvocationServerErrors: 5xx errors including service unavailability
- InvocationThrottles: 429 throttling errors
- InvocationLatency: Response times
- InputTokenCount/OutputTokenCount: Token usage for TPM monitoring
For better insight, create dashboards that:
- Separate 429 and 503 into different widgets so you can see whether a spike is quota‑related or service‑side.
- Break down metrics by ModelId and Region to identify which models or Regions are experiencing elevated traffic.
- Show side‑by‑side comparisons of current traffic vs previous weeks to spot emerging trends before they become incidents.
Critical Alarms
Do not wait until users notice failures before you act. Configure CloudWatch alarms with Amazon SNS notifications based on thresholds such as:
For 429 Errors:
- A high number of throttling events in a 5-minute window.
- Consecutive periods with non-zero throttle counts, indicating sustained pressure.
- Quota utilization above a chosen threshold (for example, 80% of RPM/TPM).
To monitor quota utilization effectively, you’ll need to track both your actual usage and your Service Quota limits. For Bedrock, this requires publishing custom CloudWatch metrics that capture the `max_tokens` parameter alongside your input and output token counts. Bedrock’s token quota system reserves capacity at request start based on `max_tokens`, then applies model-specific burndown rates (1x for most models, 5x for Claude 4+ series) to calculate final consumption. By publishing these metrics to CloudWatch, you can create alarms that trigger when your calculated quota consumption approaches 80% of your Service Quota limits.
Once you have these custom metrics in place, set up CloudWatch alarms using metric math expressions to calculate your utilization percentage: current_usage/SERVICE_QUOTA()*100. Configure the alarm to enter ALARM state when this percentage exceeds your threshold (such as 80%) and attach an Amazon SNS topic to receive notifications via email, SMS, or other channels. For detailed implementation guidance, see the Visualizing service quotas and setting alarms documentation and the TPM & RPM Quota Monitoring Dashboard for Amazon Bedrock sample implementation.
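As a rough sketch, such an alarm definition might look like the following. The namespace, metric name, SNS topic ARN, and account ID are illustrative placeholders, and `SERVICE_QUOTA()` resolves only for supported usage metrics, so substitute your own quota value or custom metric where it is not available; you would pass this dict to `cloudwatch.put_metric_alarm(**alarm_params)`:

```python
# Alarm parameters (configuration fragment): fire when token usage
# exceeds 80% of the quota for 3 consecutive 1-minute periods.
alarm_params = {
    "AlarmName": "bedrock-tpm-utilization-above-80pct",
    "EvaluationPeriods": 3,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:bedrock-quota-alerts"],
    "Metrics": [
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "InputTokenCount",  # pair with OutputTokenCount in practice
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "utilization",
            "Expression": "usage / SERVICE_QUOTA(usage) * 100",
            "ReturnData": True,
        },
    ],
}
```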
For 503 Errors:
- Service success rate falling below your SLO (for example, 95% over 10 minutes).
Note: A Service Level Objective (SLO) is an internal performance target that defines the reliability you aim to achieve for your application. While AWS provides a Service Level Agreement (SLA) guaranteeing 99.9% monthly uptime for Amazon Bedrock, your application’s SLO should be more stringent to provide a buffer before reaching SLA thresholds. For instance, you might set an SLO of 95% success rate measured over 10-minute windows, meaning no more than 5% of requests should fail with errors during that period. To monitor this, track the ratio of successful requests to total requests using CloudWatch metrics. Calculate your success rate as `(TotalRequests - Errors) / TotalRequests * 100`, where Errors include 500-series responses such as 503 ServiceUnavailableException. Set CloudWatch alarms to trigger when this success rate drops below your SLO threshold. Since 503 errors in Bedrock typically indicate transient issues or temporary resource strain, breaching your SLO provides early warning to implement mitigation strategies such as exponential backoff, cross-Region inference routing, or switching service tiers before customer impact becomes severe.
- Sudden spikes in 503 counts correlated with specific Regions or models.
- Service availability (for example, <95% success rate)
- Signs of connection pool saturation on client metrics.
Alarm Configuration Best Practices
- Use Amazon Simple Notification Service (Amazon SNS) topics to route alerts to your team’s communication channels (Slack, PagerDuty, email)
- Set up different severity levels: Critical (immediate action), Warning (investigate soon), Info (trending issues)
- Configure alarm actions to trigger automated responses where appropriate
- Include detailed alarm descriptions with troubleshooting steps and runbook links
- Test your alarms regularly to make sure notifications are working correctly
- Do not include sensitive customer data in alarm messages
Log Analysis Queries
CloudWatch Logs Insights queries help you move from “we see errors” to “we understand patterns.” Examples include:
Find 429 error patterns:
```
fields @timestamp, @message
| filter @message like /ThrottlingException/
| stats count() by bin(5m)
| sort @timestamp desc
```
Analyze 503 error correlation with request volume:
```
fields @timestamp, @message
| filter @message like /ServiceUnavailableException/
| stats count() as error_count by bin(1m)
| sort @timestamp desc
```
For concurrent workloads, configure an appropriately sized connection pool in the client settings to keep the user experience responsive. The default connection pool size (10) may be insufficient for high-concurrency applications; monitor connection pool metrics and adjust based on workload characteristics for production-grade deployments.
Wrapping Up: Building Resilient Applications
We’ve covered a lot of ground in this post, so let’s bring it all together. Successfully handling Bedrock errors requires:
- Understand root causes: Distinguish quota limits (429) from transient issues (503)
- Implement appropriate retries: Use exponential backoff with different parameters for each error type
- Design for scale: Use connection pooling, circuit breakers, and Cross-Region failover
- Monitor proactively: Set up comprehensive CloudWatch monitoring and alerting
- Plan for growth: Request quota increases and implement fallback strategies
Conclusion
Handling 429 ThrottlingException and 503 ServiceUnavailableException errors effectively is a crucial part of running production‑grade generative AI workloads on Amazon Bedrock. By combining quota‑aware design, intelligent retries, client‑side resilience patterns, cross‑Region strategies, and strong observability, you can keep your applications responsive even under unpredictable load.
As a next step, identify your most critical Bedrock workloads, enable the retry and rate‑limiting patterns described here, and build dashboards and alarms that expose your real peaks rather than just averages. Over time, use real traffic data to refine quotas, fallback models, and regional deployments so your AI systems can remain both responsive and dependable as they scale. For teams looking to accelerate incident resolution, consider enabling AWS DevOps Agent—an AI-powered agent that can investigate Bedrock errors by correlating CloudWatch metrics, logs, and alarms just like an experienced DevOps engineer would. It learns your resource relationships, works with your observability tools and runbooks, and can significantly reduce mean time to resolution (MTTR) for 429 and 503 errors by automatically identifying root causes and suggesting remediation steps.
Learn More
- Amazon Bedrock Documentation
- Amazon Bedrock Quotas
- Cross-Region Inference
- Cross-Region Inference Security
- SNS Security
- AWS Logging Best Practices
- AWS Bedrock Security Best Practices
- AWS IAM Best Practices – Least Privilege