Overview
This solution enables high-performance deployment of the Llama-3.1-8B-Instruct model - an instruction-tuned, 8-billion-parameter transformer from Meta's Llama 3.1 series - on Intel® Xeon® 6 processors using a vLLM CPU-optimized Docker image. Llama-3.1-8B-Instruct is tuned for multilingual, assistant-style tasks such as conversational agents, summarization, question answering, code generation, and tool-enabled dialogues. Available on Hugging Face under the meta-llama/Llama-3.1-8B-Instruct model card, it supports a broad range of languages - including, but not limited to, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai - and performs well on both general and instruction-following tasks.
The deployment leverages vLLM, a high-throughput inference engine optimized for CPU environments. vLLM uses PagedAttention, tensor parallelism, and PyTorch 2.0 to deliver efficient memory usage and low-latency inference. The Docker image is tuned for Intel® Xeon® 6 processors, which feature advanced architectural enhancements including Efficient-cores (E-cores) and Performance-cores (P-cores), support for Intel® Advanced Matrix Extensions (Intel® AMX), and Intel® Deep Learning Boost (DL Boost). These features accelerate AI workloads and enable scalable deployment of LLMs in cloud, edge, and enterprise environments.
This containerized solution provides a plug-and-play experience for deploying Llama-3.1-8B-Instruct on CPU-only infrastructure, eliminating the need for GPUs while maintaining competitive performance. It supports RESTful APIs, batch inference, and integration into existing ML pipelines, making it ideal for developers, researchers, and enterprises seeking cost-effective, scalable, and production-ready LLM deployment.
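For readers who want to try the container outside of the CloudFormation deployment, it can in principle be launched directly with Docker. The sketch below uses a placeholder image name (<vllm-cpu-image>) and vLLM's standard OpenAI-compatible server flags; the CloudFormation template described under Usage instructions performs the equivalent steps automatically:
$ docker run -d --name vllm-cpu -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN> \
    <vllm-cpu-image> \
    --model meta-llama/Llama-3.1-8B-Instruct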
Highlights
- Run Llama-3.1-8B-Instruct on Intel® Xeon® 6: Deploy Hugging Face instruction-tuned LLM efficiently on CPU-only infrastructure using Intel® AMX and DL Boost.
- vLLM-Powered CPU Inference: Use vLLM with PyTorch 2.0 and PagedAttention for fast, scalable inference - no GPU required.
Details
Pricing
| Dimension | Cost/hour |
|---|---|
| r8i.8xlarge | $0.00 |
| r8i.12xlarge | $0.00 |
| r8i.16xlarge | $0.00 |
| r8i.24xlarge | $0.00 |
| r8i.32xlarge | $0.00 |
| r8i.96xlarge | $0.00 |
| r8i.metal-48xl | $0.00 |
| r8i.metal-96xl | $0.00 |
| r8i-flex.8xlarge | $0.00 |
| r8i-flex.12xlarge | $0.00 |
Vendor refund policy
N/A
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Production-grade LLM inference service via CloudFormation.
CloudFormation Template (CFT)
AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."
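As an illustration only (the stack for this product is normally launched from the AWS Marketplace console, as described under Usage instructions), an equivalent launch with the AWS CLI might look like the following; the stack name and template file are placeholders, while the parameter names match those required by this product:
$ aws cloudformation create-stack \
    --stack-name llama31-8b-vllm \
    --template-body file://<template-file>.yaml \
    --parameters ParameterKey=SubnetId,ParameterValue=<subnet-id> \
                 ParameterKey=SecurityGroupId,ParameterValue=<security-group-id> \
                 ParameterKey=HuggingFaceToken,ParameterValue=<hf-token>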
Version release notes
New Features:
- Model Deployment: Integrated support for deploying the Llama-3.1-8B-Instruct model from Hugging Face, optimized for instruction-following tasks across a broad range of languages - including, but not limited to, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Intel® Xeon® 6 Optimization: Enhanced performance on Intel® Xeon® 6 processors using Intel® AMX, DL Boost, and AVX-512 for accelerated CPU inference.
- vLLM Inference Engine: Utilizes vLLM with PyTorch 2.0, PagedAttention, and tensor parallelism for efficient memory usage and low-latency inference.
- Containerized Setup: Docker-based deployment with REST API support for easy integration into existing ML workflows and backend services.
- Tool Calling Support: Enabled tool calling, allowing the model to interact with external tools and APIs directly from within the application, enhancing automation and extensibility (see the example request after this list).
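The request below is a sketch of what a tool-calling call might look like against the deployed endpoint. It follows the standard OpenAI-compatible tools schema that vLLM supports; the get_weather function is a hypothetical example, not part of this product:
$ curl -X POST http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {
                "city": {"type": "string", "description": "City name"}
              },
              "required": ["city"]
            }
          }
        }
      ]
    }'
When the model chooses to invoke the tool, the response contains a tool_calls entry with the function name and JSON arguments; your application executes the call and returns the result to the model in a follow-up message.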
Additional details
Usage instructions
This product uses an AWS CloudFormation template to deploy the Llama-3.1-8B-Instruct model on an EC2 instance using a vLLM CPU-optimized Docker image. Follow the steps below to ensure a successful setup:
- Pre-requisites: Before launching the CloudFormation stack, ensure the following resources are available in your AWS account:
  1a. Subnet ID and Security Group ID: Required for provisioning the EC2 instance within your VPC. Ensure the Security Group has inbound rules that allow traffic on port 8000 (TCP) from your IP or trusted sources; this is necessary to access the model endpoint.
  1b. Hugging Face Access Token: Required to authenticate and pull the model from Hugging Face Hub. You can generate a token from your Hugging Face account at https://huggingface.co/settings/tokens .
- Launch the CloudFormation Stack: Subscribe to the product via AWS Marketplace and proceed to launch the CloudFormation template. Enter the required parameters: SubnetId, SecurityGroupId, HuggingFaceToken. Click Submit to deploy the stack.
- Access the Model Endpoint: Once the CloudFormation stack reaches the CREATE_COMPLETE state, navigate to the EC2 Console, locate the instance created by the stack, and copy its Public IP address. The model server will be accessible on port 8000. Because the template pulls the vLLM CPU-optimized Docker image and loads the model, the inference service may take a few minutes to fully initialize - please allow some time before sending requests.
- Query the Model: You can interact with the model using a simple HTTP POST request.
Example using curl:
$ curl -X POST http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of France?"
        }
      ]
    }'
Note: Replace <EC2_PUBLIC_IP> with the actual public IP of your EC2 instance.
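To confirm the service has finished initializing before sending chat requests, you can poll the model listing endpoint of vLLM's OpenAI-compatible API; once the server is ready, the response lists meta-llama/Llama-3.1-8B-Instruct:
$ curl http://<EC2_PUBLIC_IP>:8000/v1/models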
Support
Vendor support
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.