亚马逊AWS官方博客

Amazon Bedrock Claude 3 多模态使用指南

0. 引言

在 Amazon Bedrock 最新发布的 Claude 3 模型中,相比于之前版本的模型一个重要更新就是支持了多模态的能力。多模态能力使得模型可以在理解文本输入的同时,也可以处理图片、视频类视觉输入。在多模态使用上也有一些新的技巧,因此本文对如何使用 Claude 3 多模态模型的做了一些基础介绍指引和详细介绍,其中部分参考 Anthropic 官方给出的示例和指引,以及在实际落地场景中的最佳实践经验。最后给出了在一些常见使用场景,如图片内容理解、视频问答、少样本学习、图片文字信息提取等方面的应用示例,以帮助您更快地将 Claude 3 多模态能力应用到业务实践当中。

1. 多模态使用基础

这里首先总体介绍的是通用场景下都适用基本指引,除特殊用途外所有任务都应尽量参考以下使用建议。

1.1 图片处理

相比之前版本 Claude 模型,Amazon Bedrock 上的 Claude 3 使用了最新的 message 形式的 api 来调用模型,并新增了视觉图像类的数据输入。

图片格式及编码

Claude 3 仅支持上传 base64 编码的图片,如果图片为网络 URL 图片,需要下载下来后编码再发送请求。

Claude 3 支持的处理图片格式有:JPEG,PNG,GIF,WebP,其他格式暂不支持需要图片格式转换后再请求。由于图片不同图片格式存储空间不同,为了尽量请求减少流量大小,这里推荐使用 WebP 格式作为图片编码格式。根据 WebP 官方给出的研究报告显示,WepP 格式在相同图片压缩质量的情况下,会比 JPEG 减少 25%-34% 的存储量。

图片分辨率

由于 Claude 3 处理图像的最大分辨率为长边 1568 像素,所以为了避免图片在后台压缩导致耗时增加,以及减少请求流量大小,尽量先将图片 resize 到 1568 像素以下再调用 API。

同时,由于过小分辨率的图片会导致无法正确理解图片内容,建议图片短边应保证在 200 像素以上。如果图像包含需要理解的文本,请至少确保人肉眼是可读的。

图片预处理代码示例

如下为一个使用以上指引在 Amazon Bedrock 上调用 Claude 3 模型的完整代码,包含图片类型检查及转换、图片分辨率压缩、图片编码为 base64 编码等处理等图片预处理步骤,其中图像使用 Pillow 库进行处理:

import io
import base64
import json

import httpx
from PIL import Image
import boto3

AVAILABLE_FORMAT = {"jpeg", "png", "gif", "webp"}
MAX_SIZE = 1568

def preprocessing_image(image_url, target_format=None, re_encoding=False):
    # download image from url
    image_data = httpx.get(image_url).content
    # or read from local
    # image_data = open(image_url, "rb").read()

    # load image to PIL
    image_pil = Image.open(io.BytesIO(image_data))

    # check image format if need re enconding
    image_format = image_pil.format.lower()
    target_format = target_format if target_format else image_format
    if target_format not in AVAILABLE_FORMAT:
        # set to webp by default
        target_format = "webp"
    re_encoding = re_encoding or (target_format != image_format)

    # check image size if need resize
    width, height = image_pil.size
    max_size = max(width, height)
    if max_size > MAX_SIZE:
        width = round(width * MAX_SIZE / max_size )
        height = round(height * MAX_SIZE / max_size )
        image_pil = image_pil.resize((width, height))
        re_encoding = True

    if re_encoding:
        buffer = io.BytesIO()
        # quality: 75 by default | 100 for lossless compression
        image_pil.save(buffer, format=target_format, quality=75)
        image_data = buffer.getvalue()

    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return image_base64, target_format


image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_base64, image_format = preprocessing_image(image_url, target_format="webp")
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": f"image/{image_format}",
                        "data": image_base64,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image."
                }
            ],
        }
    ],
}, ensure_ascii=False)

bedrock_runtime = boto3.client(service_name='bedrock-runtime', region_name="us-west-2")
response = bedrock_runtime.invoke_model(
    body=body, modelId="anthropic.claude-3-haiku-20240307-v1:0"
)
message = json.loads(response.get('body').read())["content"][0]["text"]
print(message)

1.2 多模态输入方式

单张图

根据 Anthropic 官方指引,Claude 3 在图像在文本之前时表现最佳,图像放在文本之后或插入文本中仍然会表现良好,但如果您的用例允许,我们建议使用图像-然后-文本的结构。以下是单张图输入时的一个示例:

Role Content
User [Image]
Describe this image.

API 请求 messages 参数如下:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image_media_type,
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ],
    }
]

多张图

多张图输入时,先输入多张图,每张图前可以用 Image 1 / 2 / 3 等来编号,后续提问以及 Claude 返回结果能利用到这些编号。Claude 3 一次最多处理 20 张图。

Role Content
User Image 1: [Image 1]
Image 2: [Image 2]
How are these images different?

API 请求 messages 参数如下:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Image 1:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image1_media_type,
                    "data": image1_data,
                },
            },
            {
                "type": "text",
                "text": "Image 2:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image2_media_type,
                    "data": image2_data,
                },
            },
            {
                "type": "text",
                "text": "How are these images different?"
            }
        ],
    }
]

系统指令

除了 messages 之外,也可以指定系统指令,与 GPT4 等模型通过 role 来指定不同,系统指令作为单独的调用参数传入。

Role Content
System Respond only in Spanish.
User Image 1: [Image 1] Image 2: [Image 2] How are these images different?

API 请求参数如下:

system="Respond only in Spanish.",
messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Image 1:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image1_media_type,
                    "data": image1_data,
                },
            },
            {
                "type": "text",
                "text": "Image 2:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image2_media_type,
                    "data": image2_data,
                },
            },
            {
                "type": "text",
                "text": "How are these images different?"
            }
        ],
    }
]

多轮对话

Claude 3 多模态也支持多轮对话,后续的对话中,也可以继续增加新的图片。

Role Content
User Image 1: [Image 1] Describe this image.
Assistant [Claude’s response]
User Image 2: [Image 2]  Are Image 2 similar to Image 1?

API 请求 messages 参数如下:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Image 1:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image1_media_type,
                    "data": image1_data,
                },
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "[Claude's response]"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Image 2:"
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image2_media_type,
                    "data": image2_data,
                },
            },
            {
                "type": "text",
                "text": "Are Image 2 similar to Image 1?"
            }
        ],
    }
]

1.3 输出控制

与纯文本大语言模型使用方式类似,可以使用多种方式优化提示词,实现输出控制以达到更好的效果。通常在纯文本模态上的 PE 经验在多模态上也适用,这里总结了几个最常见的使用技巧。

输出语言

Claude 3 原生支持多语种输入输出,但是因为训练语料分布的原因,模型倾向于使用英文回答,输入和输出均在英文上表现最好。因此,除非任务与特定语言中的一些描述相关,建议尽量用英文撰写提示词,以获取最佳的效果。同时,如果希望用特定语言回答,可以在提示词中指定来达到更加稳定的输出效果。例如:“Please answer in English words only”、“请使用中文回答”。

输出格式

对于文本,处理结果常见的格式有 XML、JSON、Markdown、HTML、CSV 等。如果希望获取指定格式的返回结果,请在 prompts 指定。同时,尽量为回答添加一个或多个返回值示例,这样更有利于提升输出格式控制的效果。

Claude 3 也支持输出预填充,即指定 Assistant 的返回值的初始字符,这样 Claude 模型就可以根据起始字符继续续写。例如可以指定 JSON 的起始关键字或者 XML 的其实标签,来避免输出无关的输出,让后处理可以直接解析模型的返回值。

例如,希望提取图片中的内容信息,并以 JSON 格式返回:

Role Content
User

[Image]

Transcribe this image into JSON format. Only output the JSON and nothing else.

Assistant
(prefill)
{

API 请求 messages 参数如下:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": image_media_type,
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Transcribe this image into JSON format. Only output the JSON and nothing else."
            }
        ]
    },
    {
        # assistant prefill
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "{"
            }
        ]
    }
]

引导思考

链式思考(Chain of Thought, CoT)可以通过让模型增加中间推理步骤以实现复杂的推理能力。您可以将其与少样本提示相结合,以获得更好的结果,以便在回答之前进行推理的更复杂的任务。如对于图片问答而言,可以参考如下提示词:

Role Content
User

[Image]

{{your question}}

Before providing the answer in <answer> tags, think step by step in <thinking> tags and analyze every part of the image.

Assistant
(prefill)
<thinking>

2. 应用示例

结合以上所述的多模态模型使用技巧,本节以图像理解类应用和文字识别类应用,分别给出几个示例说明。

2.1 图片理解

图片描述、标签

图片理解是多模态模型最常见的一类应用,通过给定的图片以及提示词问题,对图片内容的进行详细描述或者标签。最常用于商品标签生成,或者为特定任务(如模型训练)生成标注标签。在描述时,尽量包含希望描述的细节,以此来获得更贴合需求的答案。

Role Content
User
Please describe the contents of the image in detail, including as many details as possible about the style, artistic type, colors, etc. Skip the pleasantries and begin describing directly. You are a professional e-commerce operator. Please tag keywords for the product based on the product images, in order to achieve the best SEO effect. Please give 10 keywords in JSON format.
Assistant
(prefill)
{
“keywords”: [
Assistant
(response)

The image depicts a breathtaking coastal landscape at sunset. The sky is a stunning gradient of deep blue, vibrant purple, and warm orange hues, creating a serene and ethereal atmosphere. In the foreground, the rocky shoreline is prominently featured, with the waves gently lapping against the weathered, jagged rocks. The water appears calm and reflective, mirroring the vibrant colors of the sky above.

In the distance, a tall, slender lighthouse stands as a focal point, its bright white light contrasting against the darkening sky. The lighthouse’s presence adds a sense of scale and depth to the scene, emphasizing the vastness of the ocean and the tranquility of the moment.

The overall composition is well-balanced, with the rocky shoreline leading the viewer’s eye towards the lighthouse and the distant horizon. The long exposure technique used to capture the image has created a soft, dreamlike quality to the water, adding to the sense of serenity and timelessness.

The color palette is rich and vibrant, with the deep blues and purples of the sky complementing the warm oranges and pinks of the sunset. The contrast between the rugged, rocky coastline and the serene, reflective water creates a visually striking and harmonious scene.

This image is a stunning example of landscape photography, capturing the beauty and grandeur of the natural world in a captivating and evocative manner. The attention to detail and the skilled use of light and composition make this a truly remarkable and visually compelling work of art.

    "men's trousers",
    "gray trousers",
    "formal pants",
    "business attire",
    "professional clothing",
    "tailored pants",
    "classic style",
    "office wear",
    "fashion essentials",
    "menswear"
]}

视频问答

Claude 3 除了对图片进行理解问答以外,还可以对视频进行分析。

常见的视频输入到 Claude 3 模型的处理方式有两种:

  • 对视频拆帧后的每张图片分别编码,输入 Claude 3 模型处理,一次性处理的上限为 20 张图,并且每张图可以分别指定索引编号,方便在问答中进行提问和检索;
  • 对视频拆帧后的图片缩略图拼接,这里的好处是每张输入图可以包含多帧图像,这样一次性可以处理的视频帧数量更多。但是这里缩略图会导致图片尺寸被压缩,在需要对图片细节进行问答的情况下不应采用此方法。同时应注意不要拼接后的图片长边尺寸限制,不建议拼接成长宽比过大的情况;

视频拆帧也可以通过多种方式进行,比如固定某个时间间隔,或者按照视频编码时的关键帧信息进行拆帧。

例如使用 python 的 pyav 库,对视频进行关键帧拆帧的代码示例如下:

import av

def parsing_video(video_path):
    # input: path to input video
    # output: PIL.Image list
    container = av.open(video_path)
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"
    images = []
    for index, frame in enumerate(container.decode(stream)):
        if index > 20:
            break
        # you could get timestamp with "frame.time"
        imgages.append(frame.to_image())
    return images

例如我们分析一段视频:https://www.pexels.com/video/19008656/,分析其中骑自行车的人出现的时间:

Role Content
User Timestamp 0.0 s
Timestamp 3.0 s
Timestamp 6.0 s
Timestamp 9.0 s
At what timestamp does a man riding bicycle that appear?
Assistant
(response)
A person riding a bicycle appears in the image at the 3.0 second timestamp.

少样本学习

因为 Claude 3 支持多轮多张图输入,可以使用少样本学习的方式给出一些图片示例,让模型继续根据示例的形式进行模仿回答。少样本学习本身还是依赖大模型的自身掌握的领域知识的能力,比较适合通过这种方式让模型学习到如何遵循示例规则和格式进行回答,但是在遇到特别特殊的领域知识时表现可能有限,这里需要谨慎评估采用。

如下展示了一个通过少样本学习来让模型学习如何估计水果的热量的一个用例。这里在不添加额外指引说明的情况下,仅通过两个示例,即可让掌握大模型如何对这一类问题作答。

Role Content
User
Assistant
(prefill)
Apple – 50 calories / 100g
User
Assistant
(prefill)
Banana – 125 calories / 100g
User
Assistant
(response)
Kiwi – 61 calories / 100g

2.2 文字识别

Claude 3 多模态模型,非常适合做图片文字内容提取。相比于传统 OCR 模型,Claude 3 做文字提取有如下几个优势:

  1. 具备纠错能力,传统 OCR 模型经常会出现单词中出现多余的空格等错误,Claude 3 模型会根据语意输出正确的文字信息;
  2. 能够处理较复杂的文档结构,包括复杂的页面结构、公式、表格等,不需要精细的后处理或者使用专用模型来识别;
  3. 具备信息理解能力,不需要设计复杂的后处理逻辑,就可以从文档中直接抽取特定关键字信息;

在实际使用场景中,例如中文文字场景,由于 Claude 3 的中文字符准确率不高,这时可以和特定语言的 OCR 模型结合,一方面利用专用 OCR 模型更高的字符识别准确性,一方面利用 Claude 3 更强的文档格式理解以及信息抽取能力,将两者优势结合。

图片文字转录

Claude 3 可以轻松识别文档图片中的文字,并转为标准的文档格式,例如可以对公式、表格信息进行精确识别。常使用的提示词为:“Transcribe this text …”

Role Content
User
Transcribe this text to Markdown. You should keep the layout of this document, include Bold, Italic font format. You should convert math equations to Latex format in $$ symbol. You should convert tables to Markdown table. Only output Markdown document the and nothing else. Transcribe this text to Markdown. You should keep the layout of this document, include Bold, Italic font format. You should convert math equations to Latex format in $$ symbol. You should convert tables to Markdown table. Only output Markdown document the and nothing else.
Assistant
(response)

## 3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients 4. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

| Hyperparameters | Stage 1 | Stage 2 | Stage 3 | Stage 1 | Stage 2 | Stage 3 |
| — | — | — | — | — | — | — |
| Learning rate | $1.0 \times 10^{-3}$ | $3 \times 10^{-5}$ | $2.0 \times 10^{-5}$ | $1.0 \times 10^{-3}$ | $4.2 \times 10^{-5}$ | $2.0 \times 10^{-5}$ |
| LR scheduler | Cosine | Step | Cosine | Cosine | Step | Cosine |
| Weight decay | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gradient clip | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Optimizer | AdamW($\beta_1 = 0.9, \beta_2 = 0.95$) | AdamW($\beta_1 = 0.9, \beta_2 = 0.95$) |
| Warm-up steps | 128 | 2000 | 256 | 128 | 2000 | 256 |
| Training steps | 15000 | 96000 | 10000 | 15000 | 42000 | 10000 |
| Batch size | 256 | 1024 | 256 | 256 | 2304 | 256 |
| Sequence length | 512 | 4096 | 4096 | 512 | 4096 | 4096 |
| Sequence packing | × | ✓ | × | × | ✓ | × |
| Pipeline parallelism | × | × | × | × | ✓ | ✓ |

图片信息提取

除了直接获取原始文字内容外,Claude 还可以对文档内容做语意理解。例如直接提取复杂图表中的关键信息,并直接输出为 JSON。

Role Content
User
Extract infomation from this form image. Only output the JSON according to the example and nothing else.
<example>

{
  "Name": xxx,
  "Address": xxx,
  "Phone": xxx,
  "Date of report": YYYY-MM-DD,
  "Date of incident": YYYY-MM-DD,
  "Location": xxx,
  "Description": xxx,
}

</example>

Extract infomation from this form image. Only output the JSON according to the example and nothing else.
<example>

{
  "Buyer": ...,
  "Seller": ...,
  "Order NO": ...,
  "Date": YYYY-MM-DD,
  "Currency": ...,
  "Total Amount": float,
  "Details" [
    {
        "Item": ...,
        "Count": int,
        "Price": float,
        "Subtotal": float
    },
    ...
  ]
}

</example>

Assistant
(prefill)
{ {
Assistant
(response)
    "Name": "John Doe",
    "Address": "123 Main St",
    "Phone": "(678) 999-8212",
    "Date of report": "2024-02-29",
    "Date of incident": "2024-02-29",
    "Location": "Corner of 2nd and 3rd",
    "Description": "Red car t-boned blue car"
}
  "Buyer": "Steel Bear Productions Ltd.",
  "Seller": "Paragon Steel",
  "Order NO": "PO-SteelBearLtd.-00135",
  "Date": "2018-10-04",
  "Currency": "CAD",
  "Total Amount": 3010.0,
  "Details": [
    {
      "Item": "SD356 Brass Pipes",
      "Count": 5,
      "Price": 70.0,
      "Subtotal": 350.0
    },
    {
      "Item": "27289 Bronze Bars",
      "Count": 10,
      "Price": 250.0,
      "Subtotal": 2500.0
    },
    {
      "Item": "40.2144W Hot Rod Steel Bars",
      "Count": 5,
      "Price": 32.0,
      "Subtotal": 160.0
    }
  ]
}

3. 总结

通过本文的介绍,我们了解了 Claude 3 多模态模型的新特性和使用技巧,以及如何在实际应用场景中更好地利用这些新特性。无论是图像视频理解问答、详情描述、还是文字信息提取等,Claude 3 多模态模型为我们提供了全新的解决方案。通过以上实践中的最佳做法和应用示例,相信您已经对如何将 Claude 3 多模态能力融入到自身业务中有了更深入的理解,Claude 3 多模态模型也必将在更多领域发挥重要作用。

参考资料

  1. Anthropic 多模态文档:https://docs.anthropic.com/claude/docs/vision
  2. Anthropic PE 技巧:https://docs.anthropic.com/claude/docs/prompt-engineering
  3. Anthropic Cookbook:https://github.com/anthropics/anthropic-cookbook/
  4. https://developers.google.com/speed/webp/docs/webp_study
  5. https://pyav.basswood-io.com/docs/develop/

本篇作者

富宸

亚马逊云科技 GenAI 解决方案技术专家,负责 GenAI 多模态方向解决方案的设计和推广。曾任职于腾讯进行 AI 应用技术研究工作,在计算机视觉以及多模态领域有丰富的应用落地经验。