Meta Llama 3.2 模型（11B 和 90B 版本）在视觉任务中的应用

本页导读

Llama 3.2 11B 和 90B 视觉模型概述解决方案概述前提条件在 Amazon Bedrock 中配置 Llama 3.2 进行视觉推理在 SageMaker 中配置 Llama 3.2 进行视觉推理文档问答基于财务报告幻灯片问答图形数学问答实体信息提取图像描述生成生成基础描述生成创意描述生成商品营销描述总结

难度

300 - 高级

时间

20 分钟

前提条件

海外区域：注册 / 登录亚马逊云科技

上次更新时间

2024 年 9 月 25 日

Llama 3.2 11B 和 90B 视觉模型概述

Llama 3.2 多模态和多语言大语言模型 (LLM) 系列包含不同参数规模的已预训练和指令微调的模型。11B 和 90B 模型具备多模态处理能力，不仅能够处理传统的文本输入/文本输出任务，还支持根据文本和图像的混合输入生成相应的文本输出。

Llama 3.2 11B 和 90B 是首批支持视觉任务的 Llama 模型，采用了新的模型架构，将图像编码器表征集成到语言模型中。这些新一代模型经过精心优化，显著提升了 AI 工作负载的处理效率，在降低延迟的同时大幅提高了整体性能，能够适应更广泛的应用场景。Llama 3.2 系列的所有模型均继承了 Llama 3.1 的长上下文处理能力，支持高达 128K token 的上下文。此外，新系列模型在多语言处理方面取得了显著进展，现已优化支持八种主要语言：英语、德语、法语、意大利语、葡萄牙语、印地语、西班牙语和泰语。

Llama 3.2 模型现已在 Amazon SageMaker JumpStart 和 Amazon Bedrock 上推出，你可以在这两个平台上直接使用。在 Amazon SageMaker JumpStart 平台上，Llama 3.2 模型率先在美国东部（俄亥俄）AWS 区域推出，且支持各种实例类型。现在，Amazon Bedrock 在美国西部（俄勒冈）区域也上线了 Meta Llama 3.2 模型的 90B 和 11B 版本。通过跨区域推理技术，用户也可以在美国东部（俄亥俄、弗吉尼亚北部）区域使用这些模型。Llama 3.2 模型的 1B 和 3B 版本已在更多地区部署，包括美国西部（俄勒冈）和欧洲（法兰克福）。通过跨区域推理技术，这些模型同样可在美国东部（俄亥俄、弗吉尼亚北部）和欧洲（爱尔兰、巴黎）区域使用。AWS 正积极规划进一步扩大这些模型的可用区域。

解决方案概述

接下来，我们将深入探讨如何在 Amazon Bedrock 和 Amazon SageMaker JumpStart 上配置 Llama 3.2 视觉模型，以进行基于视觉的推理。我们还将演示文档问答、实体信息提取和图像描述生成的使用场景。

除非另有说明，本文中展示的示例均使用 Llama 3.2 90B 模型。本文示例中的服饰类图像来自 Fashion Product Images Dataset。用于展示图像描述生成的图像来自 Human Preference Synthetic Dataset。室内设计和房地产相关图像来自 Interior design dataset。

前提条件

以下是进行示例实验操作所需的前提条件：

已有一个可以访问所有 AWS 资源的 AWS 账户。
已有一个可以访问 Amazon SageMaker 和 Amazon Bedrock 的 AWS Identity and Access Management (IAM) 角色。有关详细的 IAM 设置信息，请参阅 Amazon SageMaker 身份和访问管理以及 Amazon Bedrock 身份和访问管理。
已有 Amazon SageMaker Studio、或 SageMaker Notebook 实例、或如 PyCharm、Visual Studio Code 等常用的集成开发环境 (IDE)。

关于如何在 Amazon Bedrock 上配置 Llama 3.2 模型的访问权限，请参阅模型发布博文。有关如何通过 Amazon SageMaker JumpStart 创建模型端点的详细信息，请参阅模型发布博文。

在 Amazon Bedrock 中配置 Llama 3.2 进行视觉推理

在 Amazon Bedrock 平台上配置 Llama 3.2 模型以执行视觉推理任务，可参考如下代码段示例：

import boto3
import json
import base64
from botocore.config import Config

# Initialize the Bedrock client
config = Config(
            region_name = os.getenv("BEDROCK_REGION", "us-west-2"),
            )
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
MODEL_ID = " us.meta.llama3-2-90b-instruct-v1:0"

Amazon Bedrock 支持在 Converse API 中使用 messages 对象。与使用 SageMaker JumpStart 不同的是，使用 Converse API 时无需将图像转换为 base64 格式。

你可以运行以下 Python 代码来读取图像文件：

# Read and encode the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Encode the image bytes to base64
    image_data = image_bytes
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None

运行以下代码构造模型输入的 messages 对象：

# 构造模型输入 messages 对象

# Construct the messages for the model input
messages = [    
    {
        "role": "user",
        "content": [
            {                
                "text": prompt
            },
            {                
                "image": {
                    "format": "<your_file_format>",
                    "source": {
                        "bytes":image_data
                }
            }
        ]
    }
]

使用以下代码调用 Amazon Bedrock 的 Converse API：

try:
    # Invoke the SageMaker endpoint
    response = bedrock_runtime.converse(
        modelId=MODEL_ID, # MODEL_ID defined at the beginning
        messages=[
            messages
        ],
        inferenceConfig={
        "maxTokens": 4096,
        "temperature": 0,
        "topP": .1
        },        
    )
    
    # Read the response 
    print(response['output']['message']['content'][0]['text'])

except Exception as e:
    print(f"An error occurred while invoking the endpoint: {str(e)}")

在 SageMaker 中配置 Llama 3.2 进行视觉推理

以下代码段演示了如何在 SageMaker 端点上配置 Llama 3.2 视觉模型以执行视觉推理任务。（注：关于如何设置推理端点的详细说明，请参阅 SageMaker JumpStart 博文板块中 Llama 3.2 相关的文章）

import boto3
import json
import base64

# Initialize the SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')
endpoint_name = '<model-endpoint>'  # Replace with your actual endpoint name

与 Amazon Bedrock 的 Converse API 类似，SageMaker JumpStart 部署同样支持使用 messages 对象作为输入。在构造消息对象之前，需要先将图像文件转换为 base64 编码格式。

使用以下代码读取图像文件：

# Read and encode the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Encode the image bytes to base64
    image_data = base64.b64encode(image_bytes).decode('utf-8')
    image_media_type = 'image/jpeg'  # Adjust if using a different image format
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None
    image_media_type = None

使用以下代码构造 messages 对象：

# Create a data URL for the image
my_url = f"""data:image/jpeg;base64,{image_data}"""

# Construct the messages for the model input
messages = [    
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": my_url
                }
            }
        ]
    }
]

示例代码中的 prompt 变量代表我们希望模型回答的关于图像的具体问题，模型需要根据图像信息推理出对应的答案。

构造完 messages 对象后，可以将其作为有效负载 payload 发送到 SageMaker 端点。示例代码如下：

try:
    # Invoke the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    # Read the response body
    response_body = response['Body'].read()
    
    if response_body:
        try:
            # Parse the JSON response
            result = json.loads(response_body.decode('utf-8'))
            # Print the model's response
            print(result['choices'][0]['message']['content'])
        except json.JSONDecodeError as json_err:
            print(f"Failed to parse JSON: {json_err}")
            print(f"Raw response: {response_body['choices'][0]['message']['content']}")
    else:
        print("The response body is empty")

except Exception as e:
    print(f"An error occurred while invoking the endpoint: {str(e)}")

文档问答

在本小节中，我们将展示文档问答的示例使用场景。我们将重点关注两个使用场景：根据含有复杂图表和数据的财务报告幻灯片回答相关问题和基于图像内容解答数学问题。

基于财务报告幻灯片问答

Llama 3.2 视觉模型能够准确解读包含复杂图表、图形和表格的图像，并基于这些信息进行推理和回答问题。接下来，我们将模型用作金融分析助手，分析包含不同类型财务信息的幻灯片，并回答我们提出的问题。我们将使用 Amazon 2024 年第二季度财务业绩电话会议幻灯片作为上下文输入，然后基于这些幻灯片的内容向 Llama 3.2 模型提出一系列问题。

首先，我们使用以下这张幻灯片，并给出如下问题作为 Prompt：“How is the sales trending over the last twelve months?”（过去十二个月的销售趋势如何？）

模型给出了如下回答：

The sales are trending up 12% year over year. The bar graph shows that the sales have increased from $538,046 in Q2 2023 to $604,334 in Q1 2024, and further to $604,334 in Q2 2024. This indicates a steady growth in sales over the last twelve months.

下一个问题：“Can you please help me understand this slide? How are the operating and free cash flow trending the last 12 month period? What is the significant contributor to the free cash flow trend?”（请帮我解读这张幻灯片，并分析过去 12 个月的运营现金流和自由现金流的变化趋势，以及有哪些因素显著影响了自由现金流的变化趋势。）我们使用的是以下幻灯片。

模型给出了如下回答：

The slide presents a table with two main sections: "Operating Cash Flow" and "Free Cash Flow Less Equipment Finance Leases and Principal Repayments of All Other Finance Leases and Financing Obligations." The table is divided into five columns, each representing a quarter from Q2 2023 to Q2 2024.

**Operating Cash Flow Trend:**

* Q2 2023: $61,841

* Q3 2023: $71,654

* Q4 2023: $84,946

* Q1 2024: $99,147

* Q2 2024: $107,952

The operating cash flow has consistently increased over the five quarters, indicating a positive trend.

**Free Cash Flow Trend:**

* Q2 2023: $6,734

* Q3 2023: $20,241

* Q4 2023: $35,549

* Q1 2024: $48,778

* Q2 2024: $51,448

The free cash flow has also shown an upward trend, with a significant increase from Q2 2023 to Q2 2024.

**Significant Contributor to Free Cash Flow Trend:**

The significant contributor to the free cash flow trend is the "Principal repayments of all other finance leases" and "Principal repayments of financing obligations." These two items have consistently increased over the five quarters, contributing to the overall growth in free cash flow.

In summary, both operating cash flow and free cash flow have demonstrated a positive trend over the last 12-month period. The significant contributor to the free cash flow trend is the increase in principal repayments of all other finance leases and financing obligations.

图形数学问答

视觉问答的另一个有趣的使用场景是解答图形数学和逻辑难题。Llama 3.2 视觉模型在解决和回答图形数学问题方面表现出色。接下来，我们将通过几个具体示例来展示模型的这一能力。

首先，我们使用以下图片，并给出如下问题作为 Prompt：“The kangaroo is inside how many circles?”（袋鼠被多少个圆圈包围着？）

模型给出了如下回答：
The kangaroo is inside 3 circles.

接下来，我们针对以下图像提问 “Which of these figures differ from the other four?”（这些图形中哪一个与其他四个不同？）

模型给出了如下回答：
The figure labeled E is the only one with a different shape for the skirt. The other figures have triangular skirts, while figure E has a rectangular skirt.

实体信息提取

从产品图片中提取产品信息是零售行业的一个重要使用场景。在支持第三方卖家的平台上，产品图片是收集产品信息最直接的资源。因此，准确提取产品的相关信息对于产品营销成功至关重要。接下来，我们将通过实际案例，展示 Llama 3.2 模型如何从服装产品图片中提取关键信息，以便电商应用程序为该产品提供相应支持。

首先，我们使用以下图片，并提供如下 Prompt：“Can you please extract all information about the product in the image? Think very carefully in a step-by-step manner to construct a json file to organize all the information about the product.Please consider formulating the json file so that the extracted information can be stored in a database for further use in a retail ecommerce application.Return the json file as the output.”（请全面分析图片中的产品，提取所有相关信息。请仔细思考，分步骤执行，将提取的所有产品信息整理成 JSON 格式。在设计 JSON 结构时，请考虑提取的信息将存储在数据库中，以便在零售电商应用程序中进一步使用。最后，输出完整的 JSON 结果。）

模型分析粉色连衣裙图片后，输出了如下信息：
Based on the image, here is a JSON representation of the product information:

```
{
"product_id": "unknown",
"name": "Pink Floral Dress",
"description": "A sleeveless dress with a pink floral pattern and a pink trim at the bottom",
"category": "Dresses",
"sub_category": "Casual Dresses",
"color": "Pink",
"size": "unknown",
"price": "unknown",
"brand": "unknown",
"material": "unknown",
"features": [
"Sleeveless",
"Floral pattern",
"Pink trim at the bottom"
],
"images": [
{
"url": "image_url",
"alt_text": "Pink Floral Dress"
}
]
}
```

Please note that some information such as product_id, size, price, brand, and material are unknown as they are not visible in the image. Also, the image_url is a placeholder and should be replaced with the actual URL of the image.

模型分析蓝色帽子图片后，输出了如下信息：

Based on the image, here is a JSON representation of the product information:

```json
{
"product_id": "NIKE-CAP-001",
"name": "Nike Cap",
"description": "A blue baseball cap with a white Nike logo",
"category": "Accessories",
"subcategory": "Hats",
"brand": "Nike",
"color": "Blue",
"size": "One Size",
"material": "Cotton",
"price": 29.99,
"image_url": "https://example.com/nike-cap.jpg",
"features": [
"Adjustable strap in the back",
"Embroidered Nike logo on the front"
]
}
```

This JSON object includes the following fields:

* `product_id`: a unique identifier for the product
* `name`: the name of the product
* `description`: a brief description of the product
* `category` and `subcategory`: the categories and subcategories that the product belongs to
* `brand`: the brand of the product
* `color`: the color of the product
* `size`: the size of the product
* `material`: the material that the product is made of
* `price`: the price of the product
* `image_url`: the URL of the product image
* `features`: an array of features that describe the product

Note that this is just one possible way to represent the product information in JSON. Depending on the specific requirements of your application, you may need to add or remove fields, or use different data types for certain fields.

图像描述生成

视觉模型的另一个重要且有用的功能是自动为图像生成描述。在本小节中，我们将展示 Llama 3.2 模型在三种不同场景下的图像描述生成能力：基础描述、更具创意和幽默感的描述，以及营销风格的描述。

生成基础描述

对于测试模型的基础描述生成能力，我们使用了以下图片，并提供如下 Prompt：“Please provide a caption for this image.Be creative but respectful.”（请为这张图片生成一段描述。描述要富有创意但注意保持得体。）

模型给出了如下回答：

The image shows a young woman sitting in a dentist's chair, with a female dentist examining her teeth. The woman is wearing a bib to protect her clothes from any water or other substances that may be used during the examination. The dentist is wearing a white lab coat and a mask, and she is holding a dental mirror and a probe. The background of the image is a dental office, with various dental tools and equipment visible.

The overall atmosphere of the image is one of professionalism and care. The dentist appears to be focused on her work, and the woman seems to be relaxed and comfortable. The image suggests that the woman is receiving a routine dental checkup or cleaning, and that the dentist is taking good care of her.

生成创意描述

在第二个例子中，我们希望获得更具创意和幽默感的描述。我们使用同一张图片，但给出不同的 Prompt：“Please provide a short caption for this image.Be creative and funny but respectful.This is meant to be shared on a social network application.”（请为这张图片生成一段简短的描述。要富有创意和幽默感，但注意保持得体。这段描述将分享在社交网络应用上。）

模型给出了如下回答：

"Brace yourself for a sparkling smile, because this dentist is about to make your teeth shine brighter than a disco ball!"

生成商品营销描述

我们使用以下图片来测试模型的房产营销描述生成能力，并提供如下 Prompt：“You are a real estate marketing expert.Please create a listing description with the attached image of a property.”（你是一位房地产营销专家。请基于这张室内照片，撰写一段用于房产销售的描述。）

我们得到以下输出：

**Welcome to Your Dream Home**

This stunning property boasts an open-plan living area that seamlessly blends style and functionality. The spacious living room, dining area, and kitchen are perfect for entertaining, with plenty of natural light pouring in through the large windows. The sleek, modern kitchen features high-end appliances and ample counter space, making meal prep a breeze.

The living room is a cozy retreat, with a comfortable sofa and chairs arranged around a stylish coffee table. The adjacent dining area is perfect for hosting dinner parties, with a large table that seats eight.

But the real showstopper is the stunning view of the water, which can be enjoyed from the comfort of your own home. Imagine sipping your morning coffee or enjoying a glass of wine in the evening while taking in the breathtaking scenery.

This property is the perfect blend of form and function, offering the ultimate in comfort and style. Don't miss out on this incredible opportunity to make it your own.

**Key Features:**

* Open-plan living area with kitchen, dining area, and living room
* Modern kitchen with high-end appliances and ample counter space
* Comfortable living room with sofa and chairs
* Large dining table that seats eight
* Stunning view of the water
* Perfect for entertaining

**Don't miss out on this incredible opportunity. Contact us today to schedule a viewing.**

总结

在本文中，我们展示了 Meta Llama 3.2 11B 和 90B 模型的多种视觉处理能力。Llama 3.2 视觉模型能够满足多种使用场景，包括文档理解、数学和逻辑题求解、实体信息提取和图像描述生成。这些能力可以在多种使用场景助力提升企业生产力，包括电商（零售）、营销等等。

要了解更多关于 Amazon Bedrock 中 Llama 3.2 的特性和功能，请参阅模型发布博文、Amazon Bedrock 产品页面和 Amazon Bedrock 文档。要了解如何在 Amazon SageMaker JumpStart 中使用 Llama 3.2 模型，请参阅模型发布博文。有关如何在 Amazon SageMaker JumpStart 中使用基础模型，请参阅 Amazon SageMaker JumpStart 产品页面和 Amazon SageMaker JumpStart 文档。

免责声明：前述特定亚马逊云科技生成式人工智能相关的服务仅在亚马逊云科技海外区域可用，亚马逊云科技中国仅为帮助您了解行业前沿技术和发展海外业务选择推介该服务。

免费套餐

AWS 海外区域

拓展海外业务或个人体验
免费使用 100 余种云产品或服务，长达 12 个月

AWS 中国区域

发展中国业务
免费使用 40 余种核心云服务产品，长达 12 个月

Meta Llama 3.2 模型（11B 和 90B 版本）在视觉任务中的应用

Llama 3.2 11B 和 90B 视觉模型概述

解决方案概述

前提条件

在 Amazon Bedrock 中配置 Llama 3.2 进行视觉推理

在 SageMaker 中配置 Llama 3.2 进行视觉推理

文档问答

基于财务报告幻灯片问答

图形数学问答

实体信息提取

图像描述生成

生成基础描述

生成创意描述

生成商品营销描述

总结

更多教程

快速搭建容量高达 35GB 的免费个人网盘

构建企业专属智能客服机器人

使用生成式 AI 构建多语言问答知识库

免费套餐

AWS 海外区域

AWS 中国区域

Meta Llama 3.2 模型（11B 和 90B 版本）在视觉任务中的应用

Llama 3.2 11B 和 90B 视觉模型概述

解决方案概述

前提条件

在 Amazon Bedrock 中配置 Llama 3.2 进行视觉推理

在 SageMaker 中配置 Llama 3.2 进行视觉推理

文档问答

基于财务报告幻灯片问答

图形数学问答

实体信息提取

图像描述生成

生成基础描述

生成创意描述

生成商品营销描述

总结

更多教程

快速搭建容量高达 35GB 的免费个人网盘

构建企业专属智能客服机器人

使用生成式 AI 构建多语言问答知识库

免费套餐

AWS 海外区域

AWS 中国区域

终止对 Internet Explorer 的支持