Guidance for Low-Latency High-Throughput Model Inference Using Amazon ECS
Overview
How it works
The architecture diagram illustrates how to use this solution effectively. It shows the key components and their interactions, giving a step-by-step overview of the architecture's structure and functionality.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Amazon CloudWatch monitors the performance of the Amazon ECS cluster (including CPU and memory) along with the incoming requests sent through Network Load Balancer. Your CloudWatch dashboard, created as part of an AWS CloudFormation template, provides a comprehensive view of the number of incoming requests and their associated latency. By using CloudWatch to visualize and analyze performance and latency, you can more easily identify bottlenecks in your application.
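As an illustration of what such a dashboard can contain, the following Python sketch builds a CloudWatch dashboard body with one widget for Amazon ECS service CPU and memory and one for Network Load Balancer traffic. It is a minimal sketch under stated assumptions, not the dashboard this Guidance deploys: the function name, the widget layout, and the example cluster, service, and load balancer values are hypothetical. Latency is left out because this text does not say which metric the Guidance uses for it.

```python
import json


def build_dashboard_body(region, cluster, service, load_balancer):
    """Return a CloudWatch dashboard body (JSON string) with two metric widgets.

    All argument values are placeholders supplied by the caller; nothing here
    is taken from the Guidance's own CloudFormation template.
    """
    ecs_dims = ["ClusterName", cluster, "ServiceName", service]
    nlb_dims = ["LoadBalancer", load_balancer]
    widgets = [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "ECS service CPU and memory",
                "region": region,
                "stat": "Average",
                "period": 60,
                "metrics": [
                    ["AWS/ECS", "CPUUtilization", *ecs_dims],
                    ["AWS/ECS", "MemoryUtilization", *ecs_dims],
                ],
            },
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Network Load Balancer traffic",
                "region": region,
                "stat": "Sum",
                "period": 60,
                "metrics": [
                    ["AWS/NetworkELB", "NewFlowCount", *nlb_dims],
                    ["AWS/NetworkELB", "ProcessedBytes", *nlb_dims],
                ],
            },
        },
    ]
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(build_dashboard_body(
        "us-east-1", "inference-cluster", "inference-service",
        "net/inference-nlb/0123456789abcdef",
    ))
```

The resulting string could be passed as the dashboard body of a CloudFormation dashboard resource or to the CloudWatch API; which of these the Guidance uses is not stated here.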
Security
By scoping down all AWS Identity and Access Management (IAM) policies to the minimum permissions required for the services to function properly, you can limit unauthorized access to resources.
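To make "minimum permissions" concrete, the sketch below builds one narrowly scoped IAM policy document and includes a small helper that flags wildcard grants. The scenario (a task that only needs to write to a single CloudWatch Logs log group), the function names, and the account and log group values are assumptions for illustration; the actual policies used by this Guidance are not shown in this text.

```python
def logs_write_policy(region, account_id, log_group):
    """Build a policy that only allows writing log events to one log group."""
    resource = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": resource,
            }
        ],
    }


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def find_wildcards(policy):
    """Return (statement index, field, value) for overly broad Allow grants.

    Flags any action containing '*' and any resource that is exactly '*'.
    This is a simple heuristic, not a substitute for IAM Access Analyzer.
    """
    statements = policy["Statement"]
    if isinstance(statements, dict):
        statements = [statements]
    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            if "*" in action:
                findings.append((index, "Action", action))
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append((index, "Resource", resource))
    return findings


if __name__ == "__main__":
    # Hypothetical account ID and log group name.
    scoped = logs_write_policy("us-east-1", "123456789012", "/ecs/inference")
    print(find_wildcards(scoped))  # prints []
```

A policy that grants `ecs:*` on resource `*` would produce two findings from `find_wildcards`, which is the kind of grant that scoping down is meant to eliminate.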
Reliability
The Amazon ECS cluster runs a service definition that maintains a desired capacity of EC2 instances. If one of the instances becomes unavailable, a new instance launches automatically and registers with the Amazon ECS cluster as a healthy target to receive incoming requests routed by Network Load Balancer.
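The self-healing behavior described above can be pictured as a reconciliation step: drop unhealthy instances, then launch replacements until the desired capacity is met. The following is a toy model of that idea only. It does not call or reproduce any Amazon ECS or Auto Scaling API, and the function name and instance IDs are hypothetical.

```python
import itertools


def reconcile(healthy_by_id, desired, new_ids):
    """Return the instance IDs registered as targets after one reconcile pass.

    healthy_by_id maps instance ID -> health flag. Unhealthy instances are
    dropped, and replacement IDs are drawn from new_ids until `desired`
    targets are registered. Scale-in is out of scope for this toy model.
    """
    targets = [iid for iid, healthy in healthy_by_id.items() if healthy]
    while len(targets) < desired:
        targets.append(next(new_ids))
    return targets


if __name__ == "__main__":
    # Hypothetical fleet: one of three instances has failed its health check.
    fleet = {"i-a": True, "i-b": False, "i-c": True}
    replacements = (f"i-new{n}" for n in itertools.count(1))
    print(reconcile(fleet, 3, replacements))  # prints ['i-a', 'i-c', 'i-new1']
```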
Performance Efficiency
Network Load Balancer, which routes incoming requests to Amazon ECS, supports low-millisecond latency and high throughput, both of which suit this use case.
Cost Optimization
Amazon EC2 Auto Scaling groups let you run your application at the desired capacity while scaling dynamically with load. Automatic scaling adds or removes infrastructure based on the load and your scaling policy, which helps you control the costs associated with running your application.
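To show how a load-based policy translates into capacity, here is a simplified proportional model: capacity is adjusted so that the per-instance metric moves back toward a target value, then clamped to the group's minimum and maximum size. This is an assumption-laden sketch and not the exact algorithm Amazon EC2 Auto Scaling uses. The function name, the choice of metric, and all numbers are hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Estimate the capacity needed to bring a per-instance metric to target.

    current:       number of instances currently running
    metric_value:  observed average of the metric (for example, CPU percent)
    target_value:  the value the scaling policy tries to maintain
    """
    if current <= 0 or target_value <= 0:
        raise ValueError("current and target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    # Never go below the group's minimum or above its maximum.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    print(desired_capacity(4, 90.0, 60.0, 2, 10))   # prints 6 (scale out)
    print(desired_capacity(4, 30.0, 60.0, 2, 10))   # prints 2 (scale in)
    print(desired_capacity(4, 200.0, 50.0, 2, 10))  # prints 10 (capped at max)
```

The scale-in case is where the cost control comes from: when load halves, the model halves the fleet, subject to the configured minimum.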
Sustainability
The Amazon EC2-based Amazon ECS cluster lets you choose appropriate hardware types and configurations for specific workloads so that they run efficiently. As a result, you can maximize utilization and avoid overprovisioning resources. Because this Guidance is designed for low-latency, high-performance model inference workloads, it uses EC2 instance types powered by AWS Graviton3. Graviton3-based instances use up to 60 percent less energy for the same performance than comparable EC2 instances, helping you reduce your carbon footprint.