Optimize your applications for scale and reliability on Amazon Bedrock
As generative AI applications scale to serve more users and handle increasingly complex workloads, understanding how to protect your application's availability through proper error handling becomes essential. Two error types—429 ThrottlingException and 503 ServiceUnavailableException—are important signals that your application is reaching operational thresholds that require attention. While these errors are typically retriable, how you handle them directly impacts user experience. Delays in responding can disrupt a conversation's natural flow and reduce user engagement. The difference between a reliable, production-ready application and one that struggles under load often comes down to implementing the right error handling strategies and quota management practices from the start.
This post provides practical strategies for building reliable applications on Amazon Bedrock. We’ll explore proven patterns for error handling, quota optimization, and architectural resilience that help your applications scale reliably. Whether you’re launching your first AI feature or optimizing a mature production system, you’ll find actionable guidance for operating confidently at any scale. Amazon Bedrock is designed to scale with your applications, offering access to industry-leading foundation models with built-in capabilities like cross-region inference and configurable throughput. The patterns in this post help you take full advantage of these capabilities while building applications that remain responsive as demand grows.
Prerequisites
- AWS account with Amazon Bedrock access
- Python 3.x and boto3 installed
- Basic understanding of AWS services
- IAM Permissions: Make sure you have the following minimum permissions:
  - `bedrock:InvokeModel` or `bedrock:InvokeModelWithResponseStream` for your specific models
  - `cloudwatch:PutMetricData` and `cloudwatch:PutMetricAlarm` for monitoring
  - `sns:Publish` if using SNS notifications
- Follow the principle of least privilege – grant only the permissions needed for your use case
Note: This walkthrough uses AWS services that may incur charges, including Amazon CloudWatch for monitoring and Amazon SNS for notifications. See AWS pricing pages for details.
Quick Reference: 503 vs 429 Errors
The following table compares these two error types:
| Aspect | 503 ServiceUnavailable | 429 ThrottlingException |
|---|---|---|
| Primary Cause | Transient service unavailability | Exceeded account quotas (RPM/TPM) |
| Quota Related | Not Quota Related | Directly quota-related |
| Resolution Time | Transient, refreshes faster | Requires waiting for quota refresh |
| Retry Strategy | Immediate retry with exponential backoff | Must sync with 60-second quota cycle |
| User Action | Wait and retry, consider alternatives | Optimize request patterns, increase quotas |
Deep dive into 429 ThrottlingException
A 429 ThrottlingException means Amazon Bedrock is deliberately rejecting some of your requests to keep overall usage within the quotas you have configured or that are assigned by default. In practice, you will most often see three flavors of throttling: rate‑based, token‑based, and model‑specific.
1. Rate-Based Throttling (RPM – Requests Per Minute)
Error Message:
ThrottlingException: Too many requests, please wait before trying again.
Or:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many requests, please wait before trying again
What this indicates
Rate‑based throttling is triggered when the total number of Bedrock requests per minute to a given model and Region exceeds the RPM quota for your account. The key detail is that this limit is enforced across all callers within an account, not just per individual application or microservice.
Imagine a shared queue at a coffee shop: it does not matter who is standing in line; the barista can only serve a fixed number of drinks per minute. When demand exceeds the barista’s capacity, some customers are told to wait or come back later. That “come back later” message is your 429.
Multi-application spike scenario
Suppose you have three production applications, all calling the same Amazon Bedrock model in the same Region:
- App A normally peaks around 2,000 requests per minute.
- App B also peaks around 2,000 rpm.
- App C usually runs at about 2,000 rpm during its own peak.
Ops has requested a quota of 6,000 RPM for this model, which seems reasonable since 2,000 + 2,000 + 2,000 = 6,000 and historical dashboards show that each app stays around its expected peak.
However, in reality your traffic is not perfectly flat. During a flash sale or a marketing campaign, App A might briefly spike to 2,400 rpm while B and C stay at 2,000. The combined total for that minute becomes 6,400 rpm, which is above your 6,000 rpm quota, and some requests start failing with ThrottlingException.
You can also experience throttling when demand shifts higher on any of the applications (while the others remain constant). Imagine a new pattern where peak traffic looks like this:
- App A: 3,000 rpm
- App B: 2,000 rpm
- App C: 2,000 rpm
Your new true peak is 7,000 rpm even though the original quota was sized for 6,000. In this situation, you will see 429 errors when all three applications are at peak traffic, even if average daily traffic still looks “fine.”
For rate‑based throttling, the mitigation has two components: client behavior and quota management.
On the client side:
- Implement request rate limiting to cap how many calls per second or per minute each application can send. APIs, SDK wrappers, or sidecars like API gateways can enforce per‑app budgets so one noisy client does not starve others.
- Use exponential backoff with jitter on 429 errors so that retries can become gradually less frequent and are de‑synchronized across instances. AWS recommends using jitter (a random amount of time before making or retrying a request) to help prevent large bursts by spreading out the arrival rate.
- Implement retry strategies that account for the quota refresh period: because RPM is enforced per 60-second window, spreading retries throughout the next minute increases success likelihood. AWS recommends distributing requests across multiple seconds within a 1-minute period and ensuring retry backoff lasts a full minute when you are hitting per-minute quotas.

On the quota side:
- Analyze CloudWatch metrics for each application to determine true peak RPM rather than relying on averages.
- Sum those peaks across the apps for the same model/Region, add a safety margin, and request an RPM increase through AWS Service Quotas if needed.
In the previous example, if App A peaks at 3,000 rpm and B and C peak at 2,000 rpm, you should plan for at least 7,000 rpm and realistically target something like 8,000 rpm to provide room for growth and unexpected bursts.
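The quota-sizing arithmetic above can be sketched as a small helper. The 15% safety margin is an assumption; choose one that fits your growth plans and burst patterns:

```python
import math

def recommended_rpm_quota(per_app_peaks, margin=0.15):
    """Sum true per-app peak RPM and add headroom for bursts."""
    true_peak = sum(per_app_peaks)
    return math.ceil(true_peak * (1 + margin))

# 3,000 + 2,000 + 2,000 = 7,000 true peak; 15% margin suggests ~8,050 RPM
print(recommended_rpm_quota([3000, 2000, 2000]))  # 8050
```

The key point is to feed this with observed per-app peaks from CloudWatch, not averages.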
2. Token-Based Throttling (TPM – Tokens Per Minute)
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many tokens, please wait before trying again.
Why token limits matter
Even if your request count is modest, a single large prompt or a model that produces long outputs can consume thousands of tokens at once. Token‑based throttling occurs when the sum of input and output tokens processed per minute exceeds your account’s TPM quota for that model.
For example, if your application uses Claude Opus 4.6 with a default quota of 2,000,000 tokens per minute (TPM), and you’re generating responses that average 1,000 output tokens, you could theoretically handle 2,000 requests per minute. However, you also need to account for input tokens in your rate calculations. If each request includes 500 input tokens, your effective capacity becomes approximately 1,333 requests per minute (2M TPM ÷ 1,500 tokens per request). All major Bedrock models have TPM quotas ranging from 200K to 8M TPM depending on the model, with newer models like Claude Sonnet 4.6 offering 5M TPM. These quotas are adjustable through AWS Service Quotas, allowing you to request increases as your application scales.
What this looks like in practice
You may notice that your application runs smoothly under normal workloads, but suddenly starts failing when users paste large documents, upload long transcripts, or run bulk summarization jobs. These are symptoms that token throughput, not request frequency, is the bottleneck. For bulk jobs, you should be using batch inference, which has separate quotas (up to 10,000 records per batch with a 24-hour processing window) and offers a 50% price reduction compared to on-demand inference.
How to respond
To mitigate token‑based throttling:
- Monitor token usage by tracking `InputTokenCount` and `OutputTokenCount` metrics and logs for your Bedrock invocations.
- Implement a token‑aware rate limiter that maintains a sliding 60‑second window of tokens consumed and only issues a new request if there is enough budget left.
- Break large tasks into smaller, sequential chunks, so you spread token consumption over multiple minutes instead of exhausting the entire budget in one spike.
- Use streaming responses when appropriate; streaming often gives you more control over when to stop generation, so you do not produce unnecessarily long outputs.
For consistently high‑volume, token‑intensive workloads, you should also evaluate requesting higher TPM quotas or using models with larger context windows and better throughput characteristics.
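Chunking large tasks can be as simple as the following sketch. The 4-characters-per-token heuristic is a rough assumption for illustration; use your model's tokenizer for accurate counts:

```python
def chunk_text(text, max_tokens_per_chunk, chars_per_token=4):
    """Split text into pieces that each fit within a token budget,
    so token consumption spreads across multiple minutes."""
    max_chars = max_tokens_per_chunk * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_text("long document " * 1000, max_tokens_per_chunk=500)
```

Each chunk can then be submitted sequentially, pausing between chunks when the minute's token budget is exhausted.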
3. How max_tokens Influences Token-Based Throttling
What is happening behind the scenes
While monitoring InputTokenCount and OutputTokenCount helps you understand actual token consumption, it’s also important to note that the max_tokens parameter plays a role in how tokens are managed during request processing.
The token lifecycle
To understand this, let's trace the token lifecycle. When you make a request to Bedrock, the quota system manages tokens through three stages:
- At request start: Total input tokens + `max_tokens` are deducted from your TPM quota
- During processing: The quota consumed is dynamically adjusted based on actual output tokens generated
- At request end: Final quota consumption is calculated as `InputTokenCount` + `CacheWriteInputTokens` + (`OutputTokenCount` × burndown rate), and any unused tokens are replenished
Why this matters for token-based throttling
If you’re hitting TPM quotas earlier than expected despite modest actual token usage, the max_tokens parameter may be causing this. Consider this example using a model with a 5x burndown rate:
- Setting `max_tokens=32,000` with 8,000 input tokens initially deducts 40,000 tokens from your quota

If the model only generates 1,000 output tokens, the final adjusted deduction is 9,000 tokens. During that initial period, the high `max_tokens` value temporarily reduces the concurrent request capacity available in your token bucket. Compare this to an optimized scenario:
- Setting `max_tokens=1,250` for the same request initially deducts 9,250 tokens
The final adjusted deduction is still 9,000 tokens, but your concurrent request capacity remains higher throughout the request lifecycle. This is why tracing the `max_tokens` parameter matters. While CloudWatch metrics track `InputTokenCount` and `OutputTokenCount`, the `max_tokens` parameter can be captured through CloudWatch Logs when model invocation logging is enabled. You can analyze these patterns to identify gaps between your `max_tokens` settings and actual output token usage.
Optimization strategies
To optimize max_tokens for better quota utilization:
- Right-size based on use case: Set `max_tokens` to approximate your expected completion size rather than using arbitrarily high values
- Use CloudWatch metrics: Examine `InputTokenCount` and `OutputTokenCount` patterns to guide your `max_tokens` decisions
- Vary by request type: Adjust `max_tokens` dynamically based on the specific request rather than using a single high value for all scenarios
- Account for burndown rates: Remember that models like Claude Opus 4, Sonnet 4.5, Sonnet 4, Claude 3.7 Sonnet, and Haiku 4.5 have a 5x burndown rate for output tokens
Implementing robust retry and rate limiting
Once you understand the types of throttling, the next step is to encode that knowledge into reusable client-side components.
Exponential backoff with jitter
When handling throttling, exponential backoff with jitter is essential for graceful recovery. AWS provides two approaches: built-in boto3 retry configuration (recommended) and custom retry logic.
Recommended: Built-in Boto3 Retry Configuration
The simplest and most reliable approach is to use boto3’s built-in retry mechanism with adaptive mode, which automatically handles throttling with exponential backoff and jitter:
The adaptive retry mode intelligently adjusts retry behavior based on throttling patterns, providing better performance than fixed exponential backoff. This approach requires no additional error handling code and is maintained by AWS as part of the SDK.
Alternative: Custom Retry Implementation
For scenarios requiring custom retry logic (e.g., specific logging, metrics collection, or non-standard retry patterns), you can implement your own retry mechanism:
This pattern prevents overwhelming the service immediately after a throttling event and helps distribute retry attempts across time, reducing the likelihood of synchronized retries from multiple instances. However, for most use cases, the built-in boto3 retry configuration is preferred as it’s simpler, well-tested, and automatically maintained by AWS.
Token-Aware Rate Limiting
For token‑based throttling, the following class maintains a sliding window of token usage and gives your caller a simple yes/no answer on whether it is safe to issue another request:
```python
import time
from collections import deque

class TokenAwareRateLimiter:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.token_usage = deque()

    def can_make_request(self, estimated_tokens):
        now = time.time()
        # Drop usage entries older than 1 minute from the sliding window
        while self.token_usage and self.token_usage[0][0] < now - 60:
            self.token_usage.popleft()
        current_usage = sum(tokens for _, tokens in self.token_usage)
        return current_usage + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens_used):
        self.token_usage.append((time.time(), tokens_used))
```
In practice, you would estimate tokens before sending the request, call can_make_request, and only proceed when it returns True, then call record_usage after receiving the response.
Important: Multi-application Quota Sharing
The above implementation works for a single application, but Amazon Bedrock quotas are account-level and region-specific, meaning all applications within the same AWS account and region share the same quota pool. If you have multiple applications, this local rate limiter won’t prevent quota exhaustion because each application only tracks its own usage.
Recommended Practices for Multi-application environments:
The best approach depends on your organizational structure and isolation requirements:
Separate AWS Accounts (Recommended for most organizations): Deploy each application or team in its own AWS account to get independent quota allocations, eliminating quota contention entirely. This aligns with AWS best practices for account isolation, provides clear cost attribution, and simplifies security boundaries. This is particularly important for production workloads where one application’s usage shouldn’t impact another’s availability.
Alternative approaches for specific scenarios:
Application Inference Profiles (AIPs): Best for organizations that need to share quotas but want granular cost tracking and monitoring per application. Use AIPs combined with CloudWatch alarms to monitor usage and trigger automated responses when thresholds are exceeded.
Centralized Rate Limiting: Suitable for development/testing environments or when you need fine-grained control over quota distribution. Implement a shared rate limiting service (using DynamoDB, Redis, or API Gateway) that all applications query before making Bedrock requests to ensure account-wide quota awareness.
Reserved Capacity (Provisioned Throughput): For predictable, high-volume workloads, reserve dedicated capacity for critical applications to ensure they aren’t affected by other applications’ usage, regardless of account structure.
Understanding 503 ServiceUnavailableException
A 503 ServiceUnavailableException tells you that Amazon Bedrock is temporarily unable to process your request. Unlike 429, this is not about your quota; it is about temporary conditions on the service side.
Temporary Service Resource Issues
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the InvokeModel operation: Service temporarily unavailable, please try again.
In this case, the Bedrock service is signaling a transient capacity or infrastructure issue, often affecting on-demand models during demand spikes. Here you should treat the error as a temporary outage and focus on retrying smartly and failing over gracefully:
- Use exponential backoff retries, similar to your 429 handling, but with parameters tuned for slower recovery.
- Consider using cross-Region inference or different service tiers to help get more predictable capacity envelopes for your most critical workloads.
Circuit Breaker Pattern
Advanced resilience for mission-critical systems: When you operate mission-critical systems, simple retries are not enough—you also want to avoid making a bad situation worse. The circuit breaker pattern is a standard distributed systems practice for any application that depends on external services. It helps your application respond gracefully during transient conditions by temporarily pausing requests rather than repeatedly attempting calls that are unlikely to succeed. This pattern is recommended for all integrations—whether calling databases, third-party APIs, or AI services—to maintain overall application stability. For detailed guidance, see the AWS Prescriptive Guidance on Circuit Breaker Pattern and the AWS blog post on Using the Circuit Breaker Pattern with AWS Step Functions and Amazon DynamoDB.
The circuit breaker prevents your application from continuously making failing requests. After detecting repeated failures, it automatically transitions to an “open” state, blocking new requests during a cooling-off period while the service recovers.
- CLOSED (Normal): Requests flow normally.
- OPEN (Failing): After repeated failures, new requests are rejected immediately, helping reduce pressure on the service and conserve client resources.
- HALF_OPEN (Testing): After a timeout, a small number of trial requests are allowed; if they succeed, the circuit closes again.
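The three states above can be sketched as a minimal illustrative class. The thresholds and timeout are assumptions for demonstration; as noted below, a maintained library is preferable in production:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # allow a trial request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result
```

Wrap your Bedrock invocation in `breaker.call(...)` so repeated 503s trip the breaker instead of hammering the service.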
Why This Matters for Bedrock
When any service experiences high demand, implementing circuit breakers helps maintain overall system stability and allows faster recovery. Circuit breakers:
- Reduce pressure on the service, helping it recover faster
- Fail fast instead of wasting time on requests that will likely fail
- Provide automatic recovery by periodically testing if the service is healthy again
- Improve user experience by returning errors quickly rather than timing out
Implementation Recommendation:
To keep maintenance overhead to a minimum, use established libraries rather than custom implementations. Well-maintained options include:
- pybreaker – Mature circuit breaker implementation with support for multiple failure detection strategies
- tenacity – Flexible retry library with circuit breaker capabilities and extensive configuration options
These libraries provide battle-tested implementations with proper state management, thread safety, and monitoring hooks. Custom implementations should only be considered when you have specific requirements that existing libraries cannot satisfy, such as integration with proprietary monitoring systems or unique failure detection logic that goes beyond standard error rate thresholds.
Note: The AWS SDK’s adaptive retry mode (discussed separately in this document) provides built-in token bucket rate limiting and automatic backoff, which addresses many throttling scenarios. Circuit breakers complement this by adding explicit state management and fail-fast behavior across your application layer.
Cross-Region Failover Strategy with CRIS
Amazon Bedrock Cross‑Region Inference (CRIS) adds another layer of resilience by giving you a managed way to route traffic across Regions.
- Global CRIS Profiles: Route traffic to any AWS commercial Region worldwide, offering the highest available throughput and approximately 10% cost savings compared to Geographic CRIS. Global CRIS represents the baseline pricing model for cross-region inference.
- Geographic CRIS Profiles: Confine traffic to specific geographies (for example, US‑only, EU‑only, APAC‑only, JP-only, and AU-only) to satisfy strict data residency or regulatory requirements. Geographic profiles incur standard pricing without the cost optimization benefits of Global CRIS, as they require additional infrastructure constraints to maintain data within geographic boundaries.
For applications without data residency requirements, global CRIS offers enhanced performance, reliability, and cost efficiency.
From an architecture standpoint:
- For non‑regulated workloads, using a global profile can significantly improve availability, absorb regional spikes, and reduce costs.
- For regulated workloads, configure geographic profiles that align with your compliance boundaries, and document those decisions in your governance artifacts.
Amazon Bedrock encrypts data in transit using TLS and does not store customer prompts or outputs by default. All data transmitted during cross-Region operations remains on the AWS network and does not traverse the public internet. Combine this with CloudTrail logging for compliance posture.
Monitoring and Observability for 429 and 503 Errors
You cannot manage what you cannot see, so robust monitoring is essential when working with quota-driven errors and service availability. Comprehensive Amazon CloudWatch monitoring enables proactive error management and helps maintain application reliability.
Note: CloudWatch custom metrics, alarms, and dashboards incur charges based on usage. Review CloudWatch pricing for details.
Essential CloudWatch Metrics
Monitor these CloudWatch metrics:
- Invocations: Successful model invocations
- InvocationClientErrors: 4xx errors including throttling
- InvocationServerErrors: 5xx errors including service unavailability
- InvocationThrottles: 429 throttling errors
- InvocationLatency: Response times
- InputTokenCount/OutputTokenCount: Token usage for TPM monitoring
For better insight, create dashboards that:
- Separate 429 and 503 into different widgets so you can see whether a spike is quota‑related or service‑side.
- Break down metrics by ModelId and Region to identify which models or Regions are experiencing elevated traffic.
- Show side‑by‑side comparisons of current traffic vs previous weeks to spot emerging trends before they become incidents.
Critical Alarms
Do not wait until users notice failures before you act. Configure CloudWatch alarms with Amazon SNS notifications based on thresholds such as:
For 429 Errors:
- A high number of throttling events in a 5-minute window.
- Consecutive periods with non-zero throttle counts, indicating sustained pressure.
- Quota utilization above a chosen threshold (for example, 80% of RPM/TPM).
To monitor quota utilization effectively, you’ll need to track both your actual usage and your Service Quota limits. For Bedrock, this requires publishing custom CloudWatch metrics that capture the `max_tokens` parameter alongside your input and output token counts. Bedrock’s token quota system reserves capacity at request start based on `max_tokens`, then applies model-specific burndown rates (1x for most models, 5x for Claude 4+ series) to calculate final consumption. By publishing these metrics to CloudWatch, you can create alarms that trigger when your calculated quota consumption approaches 80% of your Service Quota limits.
Once you have these custom metrics in place, set up CloudWatch alarms using metric math expressions to calculate your utilization percentage: current_usage/SERVICE_QUOTA()*100. Configure the alarm to enter ALARM state when this percentage exceeds your threshold (such as 80%) and attach an Amazon SNS topic to receive notifications via email, SMS, or other channels. For detailed implementation guidance, see the Visualizing service quotas and setting alarms documentation and the TPM & RPM Quota Monitoring Dashboard for Amazon Bedrock sample implementation.
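As a rough sketch, such an alarm definition might look like the following. The namespace, metric name, SNS topic ARN, and account ID are illustrative placeholders, and `SERVICE_QUOTA()` resolves only for supported usage metrics, so substitute your own quota value or custom metric where it is not available; you would pass this dict to `cloudwatch.put_metric_alarm(**alarm_params)`:

```python
# Alarm parameters (configuration fragment): fire when token usage
# exceeds 80% of the quota for 3 consecutive 1-minute periods.
alarm_params = {
    "AlarmName": "bedrock-tpm-utilization-above-80pct",
    "EvaluationPeriods": 3,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:bedrock-quota-alerts"],
    "Metrics": [
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "InputTokenCount",  # pair with OutputTokenCount in practice
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "utilization",
            "Expression": "usage / SERVICE_QUOTA(usage) * 100",
            "ReturnData": True,
        },
    ],
}
```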
For 503 Errors:
- Service success rate falling below your SLO (for example, 95% over 10 minutes).
Note: A Service Level Objective (SLO) is an internal performance target that defines the reliability you aim to achieve for your application. While AWS provides a Service Level Agreement (SLA) guaranteeing 99.9% monthly uptime for Amazon Bedrock, your application’s SLO should be more stringent to provide a buffer before reaching SLA thresholds. For instance, you might set an SLO of 95% success rate measured over 10-minute windows, meaning no more than 5% of requests should fail with errors during that period. To monitor this, track the ratio of successful requests to total requests using CloudWatch metrics. Calculate your success rate as `(TotalRequests - Errors) / TotalRequests * 100`, where Errors include 500-series responses such as 503 ServiceUnavailableException. Set CloudWatch alarms to trigger when this success rate drops below your SLO threshold. Since 503 errors in Bedrock typically indicate transient issues or temporary resource strain, breaching your SLO provides early warning to implement mitigation strategies such as exponential backoff, cross-Region inference routing, or switching service tiers before customer impact becomes severe.
- Sudden spikes in 503 counts correlated with specific Regions or models.
- Service availability (for example, <95% success rate)
- Signs of connection pool saturation on client metrics.
Alarm Configuration Best Practices
- Use Amazon Simple Notification Service (Amazon SNS) topics to route alerts to your team’s communication channels (Slack, PagerDuty, email)
- Set up different severity levels: Critical (immediate action), Warning (investigate soon), Info (trending issues)
- Configure alarm actions to trigger automated responses where appropriate
- Include detailed alarm descriptions with troubleshooting steps and runbook links
- Test your alarms regularly to make sure notifications are working correctly
- Do not include sensitive customer data in alarm messages
Log Analysis Queries
CloudWatch Logs Insights queries help you move from “we see errors” to “we understand patterns.” Examples include:
Find 429 error patterns:
```
fields @timestamp, @message
| filter @message like /ThrottlingException/
| stats count() by bin(5m)
| sort @timestamp desc
```
Analyze 503 error correlation with request volume:
```
fields @timestamp, @message
| filter @message like /ServiceUnavailableException/
| stats count() as error_count by bin(1m)
| sort @timestamp desc
```
For concurrent workloads, configure an appropriately sized connection pool in the client settings to keep the user experience responsive. The default connection pool size (10) may be insufficient for high-concurrency applications; monitor connection pool metrics and adjust based on workload characteristics for production-grade deployments.
Wrapping Up: Building Resilient Applications
We’ve covered a lot of ground in this post, so let’s bring it all together. Successfully handling Bedrock errors requires:
- Understand root causes: Distinguish quota limits (429) from transient issues (503)
- Implement appropriate retries: Use exponential backoff with different parameters for each error type
- Design for scale: Use connection pooling, circuit breakers, and Cross-Region failover
- Monitor proactively: Set up comprehensive CloudWatch monitoring and alerting
- Plan for growth: Request quota increases and implement fallback strategies
Conclusion
Handling 429 ThrottlingException and 503 ServiceUnavailableException errors effectively is a crucial part of running production‑grade generative AI workloads on Amazon Bedrock. By combining quota‑aware design, intelligent retries, client‑side resilience patterns, cross‑Region strategies, and strong observability, you can keep your applications responsive even under unpredictable load.
As a next step, identify your most critical Bedrock workloads, enable the retry and rate‑limiting patterns described here, and build dashboards and alarms that expose your real peaks rather than just averages. Over time, use real traffic data to refine quotas, fallback models, and regional deployments so your AI systems can remain both responsive and dependable as they scale. For teams looking to accelerate incident resolution, consider enabling AWS DevOps Agent—an AI-powered agent that can investigate Bedrock errors by correlating CloudWatch metrics, logs, and alarms just like an experienced DevOps engineer would. It learns your resource relationships, works with your observability tools and runbooks, and can significantly reduce mean time to resolution (MTTR) for 429 and 503 errors by automatically identifying root causes and suggesting remediation steps.
Learn More
- Amazon Bedrock Documentation
- Amazon Bedrock Quotas
- Cross-Region Inference
- Cross-Region Inference Security
- SNS Security
- AWS Logging Best Practices
- AWS Bedrock Security Best Practices
- AWS IAM Best Practices – Least Privilege