Overview
This solution enables high-performance deployment of the Llama-3.1-8B-Instruct model - an instruction-tuned, 8-billion-parameter transformer from Meta's Llama 3.1 series - on Intel® Xeon® 6 processors using a vLLM CPU-optimized Docker image. Llama-3.1-8B-Instruct is tuned for multilingual, assistant-style tasks such as conversational agents, summarization, question answering, code generation, and tool-enabled dialogues. Available on Hugging Face under the meta-llama/Llama-3.1-8B-Instruct model card, it supports a broad range of languages - including, but not limited to, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai - and performs well on both general and instruction-following tasks.
The deployment leverages vLLM, a high-throughput inference engine optimized for CPU environments. vLLM uses PagedAttention, tensor parallelism, and PyTorch 2.0 to deliver efficient memory usage and low-latency inference. The Docker image is tuned for Intel® Xeon® 6 processors, which feature advanced architectural enhancements including Efficient-cores (E-cores) and Performance-cores (P-cores), support for Intel® Advanced Matrix Extensions (Intel® AMX), and Intel® Deep Learning Boost (DL Boost). These features accelerate AI workloads and enable scalable deployment of LLMs in cloud, edge, and enterprise environments.
This containerized solution provides a plug-and-play experience for deploying Llama-3.1-8B-Instruct on CPU-only infrastructure, eliminating the need for GPUs while maintaining competitive performance. It supports RESTful APIs, batch inference, and integration into existing ML pipelines, making it ideal for developers, researchers, and enterprises seeking cost-effective, scalable, and production-ready LLM deployment.
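For readers who want to try the container outside of the CloudFormation deployment, it can in principle be launched directly with Docker. The sketch below uses a placeholder image name (<vllm-cpu-image>) and vLLM's standard OpenAI-compatible server flags; the CloudFormation template described under Usage instructions performs the equivalent steps automatically:
$ docker run -d --name vllm-cpu -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN> \
    <vllm-cpu-image> \
    --model meta-llama/Llama-3.1-8B-Instruct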
Highlights
- Run Llama-3.1-8B-Instruct on Intel® Xeon® 6: Deploy Hugging Face instruction-tuned LLM efficiently on CPU-only infrastructure using Intel® AMX and DL Boost.
- vLLM-Powered CPU Inference: Use vLLM with PyTorch 2.0 and PagedAttention for fast, scalable inference - no GPU required.
Details
Pricing
| Dimension | Cost/hour |
|---|---|
| r8i.8xlarge | $0.00 |
| r8i.12xlarge | $0.00 |
| r8i.16xlarge | $0.00 |
| r8i.24xlarge | $0.00 |
| r8i.32xlarge | $0.00 |
| r8i.96xlarge | $0.00 |
| r8i.metal-48xl | $0.00 |
| r8i.metal-96xl | $0.00 |
| r8i-flex.8xlarge | $0.00 |
| r8i-flex.12xlarge | $0.00 |
Vendor refund policy
N/A
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Production-grade LLM inference service via CloudFormation.
CloudFormation Template (CFT)
AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."
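As an illustration only (the stack for this product is normally launched from the AWS Marketplace console, as described under Usage instructions), an equivalent launch with the AWS CLI might look like the following; the stack name and template file are placeholders, while the parameter names match those required by this product:
$ aws cloudformation create-stack \
    --stack-name llama31-8b-vllm \
    --template-body file://<template-file>.yaml \
    --parameters ParameterKey=SubnetId,ParameterValue=<subnet-id> \
                 ParameterKey=SecurityGroupId,ParameterValue=<security-group-id> \
                 ParameterKey=HuggingFaceToken,ParameterValue=<hf-token>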
Version release notes
New Features:
- Model Deployment: Integrated support for deploying the Llama-3.1-8B-Instruct model from Hugging Face, optimized for instruction-following tasks across a broad range of languages - including, but not limited to, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Intel® Xeon® 6 Optimization: Enhanced performance on Intel® Xeon® 6 processors using Intel® AMX, DL Boost, and AVX-512 for accelerated CPU inference.
- vLLM Inference Engine: Utilizes vLLM with PyTorch 2.0, PagedAttention, and tensor parallelism for efficient memory usage and low-latency inference.
- Containerized Setup: Docker-based deployment with REST API support for easy integration into existing ML workflows and backend services.
- Tool Calling Support: Enabled tool calling, allowing the model to interact with external tools and APIs directly from within the application, enhancing automation and extensibility (see the example request after this list).
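The request below is a sketch of what a tool-calling call might look like against the deployed endpoint. It follows the standard OpenAI-compatible tools schema that vLLM supports; the get_weather function is a hypothetical example, not part of this product:
$ curl -X POST http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {
                "city": {"type": "string", "description": "City name"}
              },
              "required": ["city"]
            }
          }
        }
      ]
    }'
When the model chooses to invoke the tool, the response contains a tool_calls entry with the function name and JSON arguments; your application executes the call and returns the result to the model in a follow-up message.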
Additional details
Usage instructions
This product uses an AWS CloudFormation template to deploy the Llama-3.1-8B-Instruct model on an EC2 instance using a vLLM CPU-optimized Docker image. Follow the steps below to ensure a successful setup:
- Pre-requisites: Before launching the CloudFormation stack, ensure the following resources are available in your AWS account:
  1a. Subnet ID and Security Group ID: Required for provisioning the EC2 instance within your VPC. Ensure the Security Group has inbound rules that allow traffic on port 8000 (TCP) from your IP or trusted sources; this is necessary to access the model endpoint.
  1b. Hugging Face Access Token: Required to authenticate and pull the model from Hugging Face Hub. You can generate a token from your Hugging Face account at https://huggingface.co/settings/tokens .
- Launch the CloudFormation Stack: Subscribe to the product via AWS Marketplace and proceed to launch the CloudFormation template. Enter the required parameters: SubnetId, SecurityGroupId, HuggingFaceToken. Click Submit to deploy the stack.
- Access the Model Endpoint: Once the CloudFormation stack reaches the CREATE_COMPLETE state, navigate to the EC2 Console, locate the instance created by the stack, and copy its Public IP address. The model server will be accessible on port 8000. Because the template pulls the vLLM CPU-optimized Docker image and loads the model, the inference service may take a few minutes to fully initialize - please allow some time before sending requests.
- Query the Model: You can interact with the model using a simple HTTP POST request.
Example using curl:
$ curl -X POST http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of France?"
        }
      ]
    }'
Note: Replace <EC2_PUBLIC_IP> with the actual public IP of your EC2 instance.
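To confirm the service has finished initializing before sending chat requests, you can poll the model listing endpoint of vLLM's OpenAI-compatible API; once the server is ready, the response lists meta-llama/Llama-3.1-8B-Instruct:
$ curl http://<EC2_PUBLIC_IP>:8000/v1/models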
Support
Vendor support
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.