Overview
This solution enables high-performance deployment of the Qwen/Qwen3-14B model on Intel® Xeon® 6 processors using a vLLM CPU-optimized Docker image. Qwen3 is the latest generation in the Qwen LLM series, featuring both dense and Mixture-of-Experts (MoE) models. It introduces seamless switching between reasoning-intensive and general-purpose dialogue modes, significantly improving performance in math, coding, and logical tasks. Qwen3 also excels in human alignment, multilingual support (100+ languages), and agent-based tool integration, making it one of the most versatile open-source models available.
The deployment leverages vLLM, a high-throughput inference engine optimized for CPU environments. vLLM uses PagedAttention, tensor parallelism, and PyTorch 2.0 to deliver efficient memory usage and low-latency inference. The Docker image is tuned for Intel® Xeon® 6 processors, which feature advanced architectural enhancements including Efficient-cores (E-cores) and Performance-cores (P-cores), support for Intel® Advanced Matrix Extensions (Intel® AMX), and Intel® Deep Learning Boost (DL Boost). These features accelerate AI workloads and enable scalable deployment of LLMs in cloud, edge, and enterprise environments.
This containerized solution provides a plug-and-play experience for deploying Qwen3-14B on CPU-only infrastructure, eliminating the need for GPUs while maintaining competitive performance. It supports RESTful APIs, batch inference, and integration into existing ML pipelines, making it ideal for developers, researchers, and enterprises seeking cost-effective, scalable, and production-ready LLM deployment.
Highlights
- Run Qwen3-14B on Intel® Xeon® 6: Deploy the Hugging Face instruction-tuned LLM efficiently on CPU-only infrastructure using Intel® AMX and DL Boost.
- vLLM-Powered CPU Inference: Use vLLM with PyTorch 2.0 and PagedAttention for fast, scalable inference, with no GPU required.
Details
Pricing
Dimension (EC2 instance type) | Cost/hour
---|---
r8i.metal-96xl | $0.00
r8i.96xlarge | $0.00
r8i.metal-48xl | $0.00
r8i.12xlarge | $0.00
r8i-flex.12xlarge | $0.00
r8i.32xlarge | $0.00
r8i.48xlarge | $0.00
r8i.16xlarge | $0.00
r8i-flex.16xlarge | $0.00
r8i-flex.8xlarge | $0.00
Vendor refund policy
NA
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Production-grade LLM inference service via CloudFormation.
CloudFormation Template (CFT)
AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."
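For readers who script their deployments, launching the stack from the AWS CLI might look like the sketch below. The stack name, template URL, and parameter values are placeholders, not values published on this page; the parameter keys (SubnetId, SecurityGroupId, HuggingFaceToken) are those listed in the usage instructions further down.

# Hedged sketch: stack name, template URL, and parameter values are placeholders.
$ aws cloudformation create-stack \
    --stack-name qwen3-14b-vllm \
    --template-url "https://<template-s3-url>" \
    --parameters \
      ParameterKey=SubnetId,ParameterValue=subnet-0abc1234 \
      ParameterKey=SecurityGroupId,ParameterValue=sg-0abc1234 \
      ParameterKey=HuggingFaceToken,ParameterValue=hf_xxxxxxxx

Note that Marketplace products are normally launched from the console flow described in the usage instructions; the CLI route is an alternative for automated environments.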
Version release notes
New Features:
- Model Deployment: Integrated support for deploying the Qwen/Qwen3-14B model from Hugging Face, the latest generation in the Qwen LLM series, featuring both dense and Mixture-of-Experts (MoE) models.
- Intel® Xeon® 6 Optimization: Enhanced performance on Intel® Xeon® 6 processors using Intel® AMX, DL Boost, and AVX-512 for accelerated CPU inference.
- vLLM Inference Engine: Utilizes vLLM with PyTorch 2.0, PagedAttention, and tensor parallelism for efficient memory usage and low-latency inference.
- Containerized Setup: Docker-based deployment with REST API support for easy integration into existing ML workflows and backend services.
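For reference, starting the serving container by hand might look like the sketch below. The image name is a stand-in (this page does not publish the exact image the template pulls), the entrypoint is assumed to be vLLM's OpenAI-compatible server, and HUGGING_FACE_HUB_TOKEN and VLLM_CPU_KVCACHE_SPACE (KV-cache size in GiB) are standard Hugging Face and vLLM CPU-backend settings:

# Hedged sketch: <vllm-cpu-image> and the token value are placeholders.
$ docker run -d -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxx \
    -e VLLM_CPU_KVCACHE_SPACE=40 \
    <vllm-cpu-image> \
    --model Qwen/Qwen3-14B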
Additional details
Usage instructions
This product uses an AWS CloudFormation template to deploy the Qwen/Qwen3-14B model on an EC2 instance using a vLLM CPU-optimized Docker image. Follow the steps below to ensure a successful setup:
- Pre-requisites: Before launching the CloudFormation stack, ensure the following resources are available in your AWS account:
  1a. Subnet ID and Security Group ID: Required for provisioning the EC2 instance within your VPC. Ensure the Security Group has an inbound rule allowing TCP traffic on port 8000 from your IP or other trusted sources; this is necessary to reach the model endpoint (a CLI sketch for adding such a rule appears after the curl example below).
  1b. Hugging Face Access Token: Required to authenticate and pull the model from Hugging Face Hub. You can generate a token from your Hugging Face account at https://huggingface.co/settings/tokens .
- Launch the CloudFormation Stack: Subscribe to the product via AWS Marketplace and proceed to launch the CloudFormation template. Enter the required parameters: SubnetId, SecurityGroupId, and HuggingFaceToken. Click Submit to deploy the stack.
- Access the Model Endpoint: Once the CloudFormation stack reaches the CREATE_COMPLETE state, navigate to the EC2 Console, locate the instance created by the stack, and copy its Public IP address. The model server is accessible on port 8000. Because the template pulls the vLLM CPU-optimized Docker image and loads the model, the inference service may take a few minutes to fully initialize; allow some time before sending requests (a readiness check is shown after the curl example below).
- Query the Model: You can interact with the model using a simple HTTP POST request.
Example using curl:
$ curl -X POST "http://<EC2_PUBLIC_IP>:8000/v1/chat/completions"
-H "Content-Type: application/json"
--data '{
"model": "Qwen/Qwen3-14B",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
Note: Replace <EC2_PUBLIC_IP> with the actual public IP of your EC2 instance.
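Because vLLM serves an OpenAI-compatible API, you can confirm the model has finished loading before sending chat requests; once the server is ready, this endpoint lists the served model:

$ curl "http://<EC2_PUBLIC_IP>:8000/v1/models"

If requests cannot reach the instance at all, the missing inbound rule from prerequisite 1a is the usual culprit. The following sketch adds it with the AWS CLI; the security group ID and CIDR are placeholders, and you should restrict the CIDR to trusted sources rather than opening the port broadly:

# Hedged sketch: group ID and CIDR are placeholders.
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0abc1234 \
    --protocol tcp \
    --port 8000 \
    --cidr <YOUR_IP>/32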
Support
Vendor support
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.