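Interactive Text-to-Image Generation with Amazon Bedrock and LangChain Agents

Background

With the rise of generative AI, generating images from text descriptions with models such as Stable Diffusion has greatly lowered the barrier to image creation. The approach is already used in production in areas such as creative and marketing image generation, where it improves productivity. At the same time, these tools introduce a new learning curve, especially in prompt engineering: to write a good prompt, users have to learn which prompts suit which model and then iterate by trial and error. Prompts that only work in English are a further obstacle for users in non-English-speaking countries. This post combines the large language models and the Stable Diffusion model available on Amazon Bedrock with a LangChain Agent to implement interactive text-to-image generation. Users can ask the agent to repeatedly adjust the prompt according to their instructions and generate images, which gives fairly fine-grained control over image generation without any prompt engineering on their part. Before getting into the details, let's look at the key services and components used in this post.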

Amazon Bedrock

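Amazon Bedrock is a fully managed service that offers high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities for building generative AI applications, simplifying development while maintaining privacy and security. With Amazon Bedrock you can easily experiment with a variety of popular FMs, privately customize them with your own data using techniques such as fine-tuning and retrieval-augmented generation (RAG), and create managed agents that carry out complex business tasks, from booking travel and processing insurance claims to creating ad campaigns and managing inventory, all without writing any code. Because Amazon Bedrock is serverless, you do not have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you already know.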

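LangChain

LangChain is a framework for developing applications powered by language models. It enables applications that are:

Context-aware: they connect a language model to sources of context (prompt instructions, few-shot examples, content to ground its response in, and so on).
Able to reason: they rely on a language model to reason (about how to answer based on the provided context, what actions to take, and so on).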

Chain

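Using a language model (LLM) on its own is fine for simple applications, but more complex applications require chaining LLMs, either with each other or with other components.

LangChain provides two high-level frameworks for chaining components. The legacy approach is the Chain interface; the newer approach is the LangChain Expression Language (LCEL). The LangChain documentation recommends LCEL for chain composition in new applications, but many useful built-in Chains are still supported, so both frameworks are documented. Chains can also be used inside LCEL, so the two are not mutually exclusive.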
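
To make the idea of chaining concrete, here is a minimal plain-Python sketch that deliberately does not use LangChain. The names make_prompt, fake_llm, parse_output and compose are hypothetical and exist only for this illustration; fake_llm stands in for a real model call.

```python
from typing import Callable


def make_prompt(topic: str) -> str:
    """Step 1: fill a prompt template."""
    return f"Describe an image of {topic}."


def fake_llm(prompt: str) -> str:
    """Step 2: hypothetical stand-in for a model call; it just echoes in upper case."""
    return f"Here is the prompt: {prompt.upper()}"


def parse_output(text: str) -> str:
    """Step 3: keep only the text after the first colon."""
    return text.split(":", 1)[-1].strip()


def compose(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Chain the steps so that each step's output feeds the next one."""
    def run(value: str) -> str:
        for step in steps:
            value = step(value)
        return value
    return run


chain = compose(make_prompt, fake_llm, parse_output)
print(chain("a cat"))
```

The order of the steps is fixed in code, which is the property that distinguishes a chain from an agent.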

Agents

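The core idea of agents is to use a language model (LLM) to choose a sequence of actions to take. In chains, the sequence of actions is hard-coded. In agents, the language model is used as a reasoning engine that determines which actions to take and in what order. Two components matter most here: the Agent and the Tool.

Agent: the Agent decides which step the chain takes next and is driven by a language model and a prompt. Different agents differ in how they reason, how they encode inputs, and how they parse outputs. This post uses a ReAct agent, a general paradigm that combines reasoning and acting with LLMs. Through its prompt design, ReAct has the LLM produce verbal reasoning traces and actions for a task. This lets the application reason dynamically while it creates, maintains, and adjusts a plan of action, and lets it interact with external environments to bring additional information into the reasoning.
Tool: Tools are the functions an Agent can call, for example to access different data sources. LangChain ships a broad set of tools to get started with and also makes it easy to define your own.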
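
The reason-act-observe loop can be sketched without any framework. In the following illustration the reasoning step is a scripted function rather than an LLM, and scripted_reasoner, TOOLS, run_agent and the example.com URL are all hypothetical; it only shows the control flow, not LangChain's implementation.

```python
from typing import Callable, Dict, List, Tuple


def scripted_reasoner(question: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: returns (action, action_input).

    Hypothetical policy: write a prompt first, then draw, then finish.
    """
    if not observations:
        return "generate_prompt", question
    if len(observations) == 1:
        return "text_to_image", observations[-1]
    return "finish", observations[-1]


# Toy tools; a real agent would call a model or an image API here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "generate_prompt": lambda q: f"A detailed photo of {q}",
    "text_to_image": lambda p: f"https://example.com/img?prompt_length={len(p)}",
}


def run_agent(question: str, max_iterations: int = 5) -> str:
    """Loop: reason about the next action, run the tool, record the observation."""
    observations: List[str] = []
    for _ in range(max_iterations):
        action, action_input = scripted_reasoner(question, observations)
        if action == "finish":
            return action_input
        observations.append(TOOLS[action](action_input))
    return "Agent stopped: iteration limit reached."


print(run_agent("a cat playing football"))
```

The iteration cap plays the same role as a maximum-iterations setting on an agent: it bounds the loop when the reasoner never decides to finish.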

Memory

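Most language model applications have a conversational interface. An essential part of a conversation is the ability to refer to information introduced earlier in it. At a minimum, a conversational system should be able to access a window of previous messages directly. A more complex system needs a continuously updated world model that lets it maintain information about entities and their relationships. This ability to store information about past interactions is called "Memory". LangChain provides many utilities for adding Memory to a system. They can be used on their own or integrated seamlessly into a Chain.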
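
As a rough illustration of the simplest kind of memory, a buffer that replays every earlier turn, consider the sketch below. BufferMemory is a hypothetical class written for this example and is not LangChain's implementation; the "Human" and "Assistant" prefixes are assumptions chosen to match the turn labels that Claude-style prompts use.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BufferMemory:
    """Keeps every (user, assistant) turn and renders them as a transcript."""
    human_prefix: str = "Human"
    ai_prefix: str = "Assistant"
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def save(self, user_input: str, ai_output: str) -> None:
        """Record one completed exchange."""
        self.turns.append((user_input, ai_output))

    def as_text(self) -> str:
        """Render the history so it can be prepended to the next prompt."""
        lines = []
        for user_input, ai_output in self.turns:
            lines.append(f"{self.human_prefix}: {user_input}")
            lines.append(f"{self.ai_prefix}: {ai_output}")
        return "\n".join(lines)


memory = BufferMemory()
memory.save("Draw a cat playing football", "Here is a prompt for that scene.")
memory.save("Make it cyberpunk", "Here is the updated prompt.")
print(memory.as_text())
```

Because the whole transcript is replayed on every turn, the prompt grows with the conversation, which is worth keeping in mind for long sessions.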

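Implementation

Architecture

The test scripts in this post target macOS; Windows and Linux environments need corresponding adjustments. The architecture of the interaction between LangChain and Bedrock is shown in the diagram below:

Prerequisites

IAM user

In this post we call Bedrock models and store the images produced during the interaction in S3, so the IAM user needs the corresponding permissions in advance. The following policy can serve as a reference: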

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:*",
				"bedrock:*"
			],
			"Resource": "*"
		}
	]
}

You also need to enable access to the models you plan to use in Amazon Bedrock.
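
The policy above allows every S3 and Bedrock action on every resource, which is convenient for a test account but broader than this walkthrough needs. As a sketch of a narrower alternative, the snippet below builds a policy limited to the calls the notebook makes: bedrock:InvokeModel for both models, s3:PutObject for the upload, and s3:GetObject so that the presigned URL can be used. The build_policy helper, the Sid values, and the img/ prefix scoping are assumptions of this example, and the policy has not been validated against the full walkthrough.

```python
import json


def build_policy(bucket: str) -> dict:
    """Return a narrower IAM policy document for the given bucket (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeBedrockModels",
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": "*",
            },
            {
                "Sid": "ReadWriteGeneratedImages",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                # Only objects under the img/ prefix used by the notebook.
                "Resource": f"arn:aws:s3:::{bucket}/img/*",
            },
        ],
    }


print(json.dumps(build_policy("YOUR_S3_BUCKET_TO_STORE_IMAGE"), indent=2))
```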

Installing Anaconda

Reference: the Anaconda installation documentation.

Environment setup

Create the following two files:

#requirements.txt
langchain
jupyter
boto3
Pillow
matplotlib
python-dotenv

#.env
profile=YOUR_AWS_PROFILE
bucket=YOUR_S3_BUCKET_TO_STORE_IMAGE

Then run the following in a terminal:

# Python 3.10 environment
conda create -n py310 python=3.10 -y
conda activate py310
pip install -r requirements.txt
conda install -c conda-forge jupyterlab
jupyter lab

Create a notebook in JupyterLab. First, we import the required dependencies and initialize the Bedrock and S3 clients.

import base64
import io
import re
from PIL import Image
from langchain.chat_models import BedrockChat
from langchain.chains import LLMChain
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.tools import tool
from langchain.prompts import PromptTemplate
import os
import boto3
import json
import matplotlib.pyplot as plt
from langchain.memory import ConversationBufferMemory
from dotenv import load_dotenv

# Load environment variables including your AWS profile and the S3 bucket to use. See https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
load_dotenv()

# Initialize the bedrock and s3 client
session = boto3.Session(profile_name=os.environ.get('profile'))

bedrock_client = session.client('bedrock-runtime')
s3_client = session.client('s3')

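Next we create generate_prompt_api, the tool that generates prompts. Inside the tool we use an LLMChain to guide the large language model in writing the prompt, and we chose Anthropic's Claude model for this. The docstring of generate_prompt_api describes the tool's purpose, input, and output. The LangChain Agent picks up this information and uses it to decide when and how to call the tool. We also adjusted the LLMChain prompt template, prompt_template, to better match what Claude expects; see https://docs.anthropic.com/claude/docs/constructing-a-prompt for details. Finally, we trim the return value so that only the generated prompt remains.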

# Define the chat model. You can choose from anthropic.claude-instant-v1, anthropic.claude-v2 or other text models supported by Bedrock. We will use the chat model in both the prompt generation and the agent.
model_id="anthropic.claude-instant-v1"
# model_id="anthropic.claude-v2"
chat_model = BedrockChat(
    client=bedrock_client,
    model_id=model_id,
    model_kwargs = {"max_tokens_to_sample": 8000}
)

# Define the prompt generation tool
@tool(return_direct=True)
def generate_prompt_api(query: str) -> str:
    """Useful for generating a detailed prompt describing the scene.
    Input: User's query.
    Output: Stable diffusion prompts used to generate the image"""
    prompt_template = """Give me one good and detailed prompt to generate the image using Stable Diffusion for {query}. The result should just include the prompt. No explanation. Example: <example>H: Give me one good and detailed prompt to generate the image using Stable Diffusion for an animal swimming in a lake. \n\n A: A photo of a brown bear swimming through a large, clear blue mountain lake surrounded by tall evergreen trees and snow capped mountains in the background. The bear is mid-stroke, with its head above the water and its front legs extended. Sparkling sunlight reflects off the smooth water. The image is sharply focused and highly detailed.</example>"""
    llm_chain = LLMChain(
            llm=chat_model,
            prompt=PromptTemplate.from_template(prompt_template),

        )

    result = llm_chain.predict(query=query)

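    # The Claude model is a bit chatty; it might start the response with "Here is the prompt:". We need to strip that off.
    # Note: this removes everything up to the first colon on the first line, even when no preamble is present.
    # print(result)
    stripped = re.sub(r'^.*?:', '', result).strip()
    # print(stripped)
    return stripped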
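
The pattern r'^.*?:' used above also removes legitimate text when the model answers without a preamble but the prompt itself contains a colon. One possible refinement, shown here as a sketch rather than as part of the original notebook, is to strip only a leading phrase that looks like "Here is ...:" or "Here's ...:". The strip_preamble helper and its pattern are hypothetical and cover only preambles of that shape.

```python
import re

# Matches a leading "Here is ...:", "Here are ...:" or "Here's ...:" on the first line.
_PREAMBLE = re.compile(r"^\s*here(?:'s| is| are)\b[^\n:]*:\s*", re.IGNORECASE)


def strip_preamble(text: str) -> str:
    """Remove a chatty "Here is the prompt:" style lead-in, if there is one."""
    return _PREAMBLE.sub("", text, count=1).strip()


print(strip_preamble("Here is the prompt: A cat playing football at sunset"))
print(strip_preamble("A cat: black fur, green eyes, playing football"))
```

The second call leaves the text untouched because it does not start with a recognized lead-in, whereas the original pattern would have cut off "A cat:".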

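We also need to define the text-to-image tool, text_to_image_api, which uses the Stable Diffusion XL model internally. Because Bedrock returns the image base64-encoded when invoking Stable Diffusion, we decode it before displaying it. We also store the image in S3 and generate a temporary presigned URL so that it can be shared easily.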

# Define the image generation tool. 
@tool(return_direct=True)
def text_to_image_api(query: str) -> str:
    "Useful for when you need to generate an image with a prompt."
    "Input: A detailed text-2-image prompt describing an image"
    "Output: Image url"
    # generate random integer values
    from random import randint
    body = json.dumps({
        "text_prompts": [
            { 
            "text": query 
            }
        ],
        "cfg_scale":10,
        "seed":randint(0, 100000),
        "steps":40,
        })
    # Here we use stable diffusion xl as our model
    modelId = "stability.stable-diffusion-xl-v0" 
    accept = "application/json"
    contentType = "application/json"

    response = bedrock_client.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )

    # Parsing the response (in base64) and plot the image
    response_body = json.loads(response.get("body").read())
    base_64_img_str = response_body["artifacts"][0].get("base64")
    img_bytes = base64.decodebytes(bytes(base_64_img_str, "utf-8"))
    img = Image.open(io.BytesIO(img_bytes))
    plt.imshow(img)
    plt.title(query[:80])

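    # Save the image to S3 for later retrieval. The key is a content hash,
    # because the leading base64 characters are mostly the PNG header, can
    # collide across images, and may contain '/' or '+'.
    import hashlib
    bucket = os.environ.get('bucket')
    key = f'img/{hashlib.sha256(img_bytes).hexdigest()}.png'
    s3_client.put_object(Body=img_bytes, Bucket=bucket, Key=key)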

    # Generate a presigned url for temporary public access
    generated_url = s3_client.generate_presigned_url(
        ClientMethod='get_object',
        Params={
            'Bucket': bucket,
            'Key': key
        },
        ExpiresIn=3600 # one hour in seconds, increase if needed
    )

    return generated_url
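
The response handling can be exercised offline, without calling Bedrock or S3. The sketch below assumes the response body has the shape read by the tool above (an "artifacts" list whose first entry carries a "base64" field) and feeds it a fabricated payload. The extract_image helper is hypothetical, and deriving the object key from a SHA-256 digest of the image bytes is an assumption of this sketch.

```python
import base64
import hashlib
import json
from typing import Tuple


def extract_image(response_body: bytes) -> Tuple[bytes, str]:
    """Decode the first artifact and derive an S3 object key from its content."""
    payload = json.loads(response_body)
    img_bytes = base64.b64decode(payload["artifacts"][0]["base64"])
    key = f"img/{hashlib.sha256(img_bytes).hexdigest()}.png"
    return img_bytes, key


# Fabricated payload for the demo: a PNG signature followed by dummy bytes.
fake_png = b"\x89PNG\r\n\x1a\n" + b"demo"
fake_body = json.dumps(
    {"artifacts": [{"base64": base64.b64encode(fake_png).decode("ascii")}]}
).encode("utf-8")

img_bytes, key = extract_image(fake_body)
print(key)
```

A content hash yields a key that is safe to use in a URL path and differs whenever the image bytes differ.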

Finally, we assemble the tools and the memory into a LangChain Agent. The agent also uses Anthropic's Claude model as its "brain".

# Define a memory for the langchain agent so that it knows the whole context. The ai_prefix="Assistant" is needed as Claude uses this prefix for model generated content.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, ai_prefix="Assistant")

# Define the tools that can be used by the agent
tools = [text_to_image_api, generate_prompt_api]

# Define the mrkl agent
mrkl = initialize_agent(
    tools, chat_model, agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION, verbose=True, max_iterations=5, memory=memory
)

# Let's see the prompt template it will use. You can also customize the prompt template if needed. See https://github.com/langchain-ai/langchain/issues/10721
print(mrkl.agent.llm_chain.prompt.messages)

Experiments

English prompt test

1. First, we ask the agent to generate a good prompt.

output = mrkl.run("Generate some good prompt for an image with a cat playing football")

2. The prompt looks good, so let's generate an image from it.

3. We ask it to generate a better prompt.

output = mrkl.run("Can you generate some better prompt?")

4. We want to switch to a cyberpunk style.

output = mrkl.run("Can you make the image more cyberpunk?")

5. The image generated above no longer contains a football, so we require that it does.

output = mrkl.run("There isn't any football, please regenerate prompt that the cat is playing football in the image")

6. We ask it to change the cat's color to black.

output = mrkl.run("Fix the last prompt. the cat's color is black")

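These experiments show that the agent recognizes instructions fairly well, adjusts the prompt based on the context, and generates images. However, as the number of turns grows and the context gets longer, it may forget some key information, and we then need to remind it explicitly in a later instruction.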
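
One lightweight way to do this reminding is to restate the requirements that must survive in every follow-up instruction. The helper below is a hypothetical sketch of that idea, not part of the original notebook: with_reminders appends a list of must-keep requirements to the next instruction before it is passed to the agent.

```python
from typing import List


def with_reminders(instruction: str, requirements: List[str]) -> str:
    """Append the requirements that must be preserved to a follow-up instruction."""
    if not requirements:
        return instruction
    reminder = "; ".join(requirements)
    return f"{instruction} Keep these requirements: {reminder}."


must_keep = ["the cat is playing football", "the cat is black"]
message = with_reminders("Can you make the image more cyberpunk?", must_keep)
print(message)
```

The resulting string would then be sent to the agent in place of the bare instruction.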

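Chinese prompt test

1. First, we test generating an English prompt from Chinese input (the instruction asks for an English prompt for a painting of a water buffalo playing in water). The result below looks good.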

output = mrkl.run("生成水牛玩水的画的英文提示词")

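2. Next we adjust the requirement and replace the buffalo with a hippo, using the instruction "提示词中把水牛改成河马，其他不变" ("change the buffalo to a hippo in the prompt and keep everything else unchanged"). The result shows that the buffalo was successfully replaced with a hippo.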

output = mrkl.run("提示词中把水牛改成河马，其他不变")

3. We add a further requirement and push the style more strongly toward oil painting.

output = mrkl.run("修改提示词，更加的油画风格")

4. We add some Impressionist elements.

output = mrkl.run("再加一些印象派风格")

5. We add some of Monet's style.

output = mrkl.run("再加一些莫奈的风格")

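At this point we have tested interactive prompt generation followed by image generation with both English and Chinese instructions, and the results are acceptable. That said, given the current capabilities of large language models, the success rate of multi-step reasoning is not very high, so for now we let each tool return its result directly (return_direct=True). We expect this flow to become simpler as model capabilities improve.

Summary

Because Amazon Bedrock is a fully managed service, users can run LLM inference with nothing more than API calls, which greatly lowers the barrier to working with AI, and its serverless model also greatly reduces the cost of getting started. By calling the Bedrock API and orchestrating the calls with the LangChain framework, this post implements a multilingual, interactive text-to-image workflow in which users adjust their requirements for the image based on the conversation context until they get the image they want.

References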

https://python.langchain.com/docs/get_started

https://aws.amazon.com/cn/bedrock/?nc1=h_ls

AWS Official Blog