Amazon AWS Official Blog

Conversational Data Analysis Across Multiple Data Sources with Amazon Bedrock + PandasAI

At enterprise customers, data often lives in different sources such as databases, data warehouses, and data lakes. On AWS, Amazon Athena or Amazon Redshift federated queries can be used to analyze these sources together. For local data files, data analysts are used to mining data with pandas, the powerful open-source Python data analysis library. With PandasAI, we can not only query local and cloud data together, but also use Amazon Bedrock and the generative AI models behind it, such as Claude 3, to analyze multiple data sources through conversation. The following example demonstrates this capability.

Example Architecture

Data Preparation

Create the following tables and data in the hr database using these scripts (a sketch for loading them follows the links):

https://www.sqltutorial.org/wp-content/uploads/2020/04/mysql.txt
https://www.sqltutorial.org/wp-content/uploads/2020/04/mysql-data.txt
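If you prefer to load the scripts from the command line rather than through a SQL client, the sketch below pipes them into MySQL via the mysql command-line client. It assumes the two scripts have been downloaded locally as mysql.txt and mysql-data.txt, the client is on your PATH, and the hr database already exists; replace the connection placeholders with your own values.

import subprocess

# Load the schema script first, then the data script, into the hr database
for script in ["mysql.txt", "mysql-data.txt"]:
    with open(script, "rb") as f:
        subprocess.run(
            ["mysql", "-h", "YOUR_HOST", "-u", "YOUR_USER_NAME", "-pYOUR_PASSWORD", "hr"],
            stdin=f,
            check=True,  # stop if an import fails
        )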

Upload employees.csv to S3 (a boto3 upload sketch follows the file contents):

"employee_id","first_name","last_name","email","phone_number","hire_date","job_id","salary","manager_id","department_id"
"100","Steven","King","steven.king@sqltutorial.org","515.123.4567","1987-06-17","4","24000.00",,"9"
"101","Neena","Kochhar","neena.kochhar@sqltutorial.org","515.123.4568","1989-09-21","5","17000.00","100","9"
"102","Lex","De Haan","lex.de haan@sqltutorial.org","515.123.4569","1993-01-13","5","17000.00","100","9"
"103","Alexander","Hunold","alexander.hunold@sqltutorial.org","590.423.4567","1990-01-03","9","9000.00","102","6"
"104","Bruce","Ernst","bruce.ernst@sqltutorial.org","590.423.4568","1991-05-21","9","6000.00","103","6"
"105","David","Austin","david.austin@sqltutorial.org","590.423.4569","1997-06-25","9","4800.00","103","6"
"106","Valli","Pataballa","valli.pataballa@sqltutorial.org","590.423.4560","1998-02-05","9","4800.00","103","6"
"107","Diana","Lorentz","diana.lorentz@sqltutorial.org","590.423.5567","1999-02-07","9","4200.00","103","6"
"108","Nancy","Greenberg","nancy.greenberg@sqltutorial.org","515.124.4569","1994-08-17","7","12000.00","101","10"
"109","Daniel","Faviet","daniel.faviet@sqltutorial.org","515.124.4169","1994-08-16","6","9000.00","108","10"
"110","John","Chen","john.chen@sqltutorial.org","515.124.4269","1997-09-28","6","8200.00","108","10"
"111","Ismael","Sciarra","ismael.sciarra@sqltutorial.org","515.124.4369","1997-09-30","6","7700.00","108","10"
"112","Jose Manuel","Urman","jose manuel.urman@sqltutorial.org","515.124.4469","1998-03-07","6","7800.00","108","10"
"113","Luis","Popp","luis.popp@sqltutorial.org","515.124.4567","1999-12-07","6","6900.00","108","10"
"114","Den","Raphaely","den.raphaely@sqltutorial.org","515.127.4561","1994-12-07","14","11000.00","100","3"
"115","Alexander","Khoo","alexander.khoo@sqltutorial.org","515.127.4562","1995-05-18","13","3100.00","114","3"
"116","Shelli","Baida","shelli.baida@sqltutorial.org","515.127.4563","1997-12-24","13","2900.00","114","3"
"117","Sigal","Tobias","sigal.tobias@sqltutorial.org","515.127.4564","1997-07-24","13","2800.00","114","3"
"118","Guy","Himuro","guy.himuro@sqltutorial.org","515.127.4565","1998-11-15","13","2600.00","114","3"
"119","Karen","Colmenares","karen.colmenares@sqltutorial.org","515.127.4566","1999-08-10","13","2500.00","114","3"
"120","Matthew","Weiss","matthew.weiss@sqltutorial.org","650.123.1234","1996-07-18","19","8000.00","100","5"
"121","Adam","Fripp","adam.fripp@sqltutorial.org","650.123.2234","1997-04-10","19","8200.00","100","5"
"122","Payam","Kaufling","payam.kaufling@sqltutorial.org","650.123.3234","1995-05-01","19","7900.00","100","5"
"123","Shanta","Vollman","shanta.vollman@sqltutorial.org","650.123.4234","1997-10-10","19","6500.00","100","5"
"126","Irene","Mikkilineni","irene.mikkilineni@sqltutorial.org","650.124.1224","1998-09-28","18","2700.00","120","5"
"145","John","Russell","john.russell@sqltutorial.org",,"1996-10-01","15","14000.00","100","8"
"146","Karen","Partners","karen.partners@sqltutorial.org",,"1997-01-05","15","13500.00","100","8"
"176","Jonathon","Taylor","jonathon.taylor@sqltutorial.org",,"1998-03-24","16","8600.00","100","8"
"177","Jack","Livingston","jack.livingston@sqltutorial.org",,"1998-04-23","16","8400.00","100","8"
"178","Kimberely","Grant","kimberely.grant@sqltutorial.org",,"1999-05-24","16","7000.00","100","8"
"179","Charles","Johnson","charles.johnson@sqltutorial.org",,"2000-01-04","16","6200.00","100","8"
"192","Sarah","Bell","sarah.bell@sqltutorial.org","650.501.1876","1996-02-04","17","4000.00","123","5"
"193","Britney","Everett","britney.everett@sqltutorial.org","650.501.2876","1997-03-03","17","3900.00","123","5"
"200","Jennifer","Whalen","jennifer.whalen@sqltutorial.org","515.123.4444","1987-09-17","3","4400.00","101","1"
"201","Michael","Hartstein","michael.hartstein@sqltutorial.org","515.123.5555","1996-02-17","10","13000.00","100","2"
"202","Pat","Fay","pat.fay@sqltutorial.org","603.123.6666","1997-08-17","11","6000.00","201","2"
"203","Susan","Mavris","susan.mavris@sqltutorial.org","515.123.7777","1994-06-07","8","6500.00","101","4"
"204","Hermann","Baer","hermann.baer@sqltutorial.org","515.123.8888","1994-06-07","12","10000.00","101","7"
"205","Shelley","Higgins","shelley.higgins@sqltutorial.org","515.123.8080","1994-06-07","2","12000.00","101","11"
"206","William","Gietz","william.gietz@sqltutorial.org","515.123.8181","1994-06-07","1","8300.00","205","11"

Prepare departments.csv locally:

"department_id","department_name","location_id"
"1","Administration","1700"
"2","Marketing","1800"
"3","Purchasing","1700"
"4","Human Resources","2400"
"5","Shipping","1500"
"6","IT","1400"
"7","Public Relations","2700"
"8","Sales","2500"
"9","Executive","1700"
"10","Finance","1700"
"11","Accounting","1700"

Installing PandasAI

The following commands install PandasAI and its dependencies on an EC2 instance running Ubuntu 20:

conda create -n pandasai python=3.10
conda activate pandasai
conda install psycopg2
pip install 'pandasai[connectors,bedrock]'
pip install s3fs

In SageMaker Studio JupyterLab with the Python 3 kernel, run the following commands in a Jupyter notebook to install the dependencies:

!conda install -y psycopg2
!pip install 'pandasai[connectors,bedrock]'
!pip install s3fs
!pip uninstall -y pydantic
!pip install pydantic==1.10.8

For installation on other platforms, see https://github.com/Sinaptik-AI/pandas-ai

Examples

Query the highest-paid employee in each department

# Save this file as highest_salary.py and run "python highest_salary.py", or execute it directly in a Jupyter notebook
from pandasai import SmartDatalake
from pandasai.connectors import MySQLConnector
import pandas as pd
from pandasai.llm.bedrock_claude import BedrockClaude
import boto3
import os

# Replace YOUR_REGION with an AWS Region that supports Bedrock Claude, e.g. us-east-1 or us-west-2
os.environ['AWS_DEFAULT_REGION'] = 'YOUR_REGION'

# Replace YOUR_HOST, YOUR_USER_NAME, and YOUR_PASSWORD with your own values
countries = MySQLConnector(
    config={
        "host": "YOUR_HOST",
        "port": 3306,
        "database": "hr",
        "username": "YOUR_USER_NAME",
        "password": "YOUR_PASSWORD",
        "table": "countries",
    }
)

departments = pd.read_csv("departments.csv")

dependents = MySQLConnector(
    config={
        "host": "YOUR_HOST",
        "port": 3306,
        "database": "hr",
        "username": "YOUR_USER_NAME",
        "password": "YOUR_PASSWORD",
        "table": "dependents",
    }
)

# Replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with your AWS access key and secret key
employees = pd.read_csv(
    "s3://YOUR_S3_BUCKET_AND_PATH/employees.csv",
    storage_options={
        "key": 'YOUR_ACCESS_KEY',
        "secret": 'YOUR_SECRET_KEY',
    },
)

jobs = MySQLConnector(
    config={
        "host": "YOUR_HOST",
        "port": 3306,
        "database": "hr",
        "username": "YOUR_USER_NAME",
        "password": "YOUR_PASSWORD",
        "table": "jobs",
    }
)

locations = MySQLConnector(
    config={
        "host": "YOUR_HOST",
        "port": 3306,
        "database": "hr",
        "username": "YOUR_USER_NAME",
        "password": "YOUR_PASSWORD",
        "table": "locations",
    }
)

regions = MySQLConnector(
    config={
        "host": "YOUR_HOST",
        "port": 3306,
        "database": "hr",
        "username": "YOUR_USER_NAME",
        "password": "YOUR_PASSWORD",
        "table": "regions",
    }
)

session = boto3.Session(aws_access_key_id='YOUR_ACCESS_KEY', aws_secret_access_key='YOUR_SECRET_KEY')
bedrock_runtime_client = session.client('bedrock-runtime')
# BedrockClaude uses Claude 3 Sonnet by default. For more configuration options see https://github.com/Sinaptik-AI/pandas-ai/blob/main/docs/LLMs/llms.md#amazon-bedrock-models
llm = BedrockClaude(bedrock_runtime_client)

df = SmartDatalake([countries,departments,dependents,employees,jobs,locations,regions], config={"llm": llm})
result = df.chat('Give me the employee names, their job titles and the salaries with the highest salary in each department in data frame.')
result.to_csv('highest-salary.csv', index=False)

View highest-salary.csv:

first_name,last_name,job_title,salary,department_name
Steven,King,President,24000.0,Executive
Alexander,Hunold,Programmer,9000.0,IT
Nancy,Greenberg,Finance Manager,12000.0,Finance
Den,Raphaely,Purchasing Manager,11000.0,Purchasing
Adam,Fripp,Stock Manager,8200.0,Shipping
Jennifer,Whalen,Administration Assistant,4400.0,Administration
Michael,Hartstein,Marketing Manager,13000.0,Marketing
Susan,Mavris,Human Resources Representative,6500.0,Human Resources
Hermann,Baer,Public Relations Representative,10000.0,Public Relations
Shelley,Higgins,Accounting Manager,12000.0,Accounting

Compare this with a hand-written SQL query:

WITH TEMP AS
  (SELECT e.employee_id,
          e.first_name,
          e.last_name,
          e.salary,
          d.department_id,
          d.department_name,
          j.job_id,
          j.job_title
   FROM employees e
   INNER JOIN departments d ON e.department_id = d.department_id
   INNER JOIN jobs j ON e.job_id = j.job_id)
SELECT t.first_name,
       t.last_name,
       t.department_name,
       t.job_title,
       t.salary
FROM TEMP t
INNER JOIN
  (SELECT department_id,
          Max(salary) AS salary
   FROM TEMP GROUP BY department_id) AS g ON t.department_id = g.department_id
AND t.salary = g.salary;
+------------+-----------+------------------+---------------------------------+----------+
| first_name | last_name | department_name  | job_title                       | salary   |
+------------+-----------+------------------+---------------------------------+----------+
| Jennifer   | Whalen    | Administration   | Administration Assistant        |  4400.00 |
| Michael    | Hartstein | Marketing        | Marketing Manager               | 13000.00 |
| Den        | Raphaely  | Purchasing       | Purchasing Manager              | 11000.00 |
| Susan      | Mavris    | Human Resources  | Human Resources Representative  |  6500.00 |
| Adam       | Fripp     | Shipping         | Stock Manager                   |  8200.00 |
| Alexander  | Hunold    | IT               | Programmer                      |  9000.00 |
| Hermann    | Baer      | Public Relations | Public Relations Representative | 10000.00 |
| John       | Russell   | Sales            | Sales Manager                   | 14000.00 |
| Steven     | King      | Executive        | President                       | 24000.00 |
| Nancy      | Greenberg | Finance          | Finance Manager                 | 12000.00 |
| Shelley    | Higgins   | Accounting       | Accounting Manager              | 12000.00 |
+------------+-----------+------------------+---------------------------------+----------+

Replace the last three lines of highest_salary.py with the code below and run the script again to have PandasAI render the result as a chart:

df = SmartDatalake([countries,departments,dependents,employees,jobs,locations,regions], config={"llm": llm, "save_charts": True})
df.chat('Give me a plot with the employee names, their job titles and the salaries with the highest salary in each department')

The generated chart can be found under exports/charts.
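If you would rather control where the charts are written, the PandasAI config also appears to accept a save_charts_path option alongside save_charts (an assumption based on the PandasAI documentation; verify it against the version you installed):

# save_charts_path is assumed to redirect generated chart files to the given directory
df = SmartDatalake(
    [countries, departments, dependents, employees, jobs, locations, regions],
    config={"llm": llm, "save_charts": True, "save_charts_path": "./my_charts"},
)
df.chat('Give me a plot with the employee names, their job titles and the salaries with the highest salary in each department')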

Going a step further, we can ask Claude 3 to do some what-if analysis to see the impact of different salary adjustment plans:

result = df.chat('How much salary I can increase for the employees having the lowest salaries in their departments with a total budget of 1000? Draw a plot of the total budget of each deparment with the increase amount stacking on the original amount and display the amount on the bars. Use abbreviation if necessary.')
result = df.chat('If I want to increase the salary for every employee except the ones having the highest salary in each department by 5%. Draw a plot of the total budget of each deparment with the increased amount stacking on the original amount and display the amount on the bars. Use abbreviation if necessary.')

The pandasai.log file records the prompt that PandasAI sent, the code that Bedrock/Claude 3 returned, and the code that was actually executed. This code is a useful starting point for making further adjustments to the chart (see the sketch after the generated code below).

For example, PandasAI used the following prompt for the last request:

<dataframe name="countries">
dfs[0]:25x3
country_id,country_name,region_id
HK,HongKong,3
DE,Germany,1
AU,Australia,3
</dataframe>
<dataframe>
dfs[1]:11x3
department_id,department_name,location_id
7,Accounting,1400
6,Finance,1700
9,Sales,1500
</dataframe>
<dataframe name="dependents">
dfs[2]:30x5
dependent_id,first_name,last_name,relationship,employee_id
29,Alec,Partners,Child,146
1,Penelope,Gietz,Child,206
7,Grace,Faviet,Child,109
</dataframe>
<dataframe>
dfs[3]:40x10
employee_id,first_name,last_name,email,phone_number,hire_date,job_id,salary,manager_id,department_id
113,Guy,Gietz,niyl07@yahoo.com,515.127.4561,1997-07-24,8600.0,24000.0,123,3.0
107,Payam,Livingston,w0plcke2nga7@aol.com,"N,2000-01-04""",1994-08-17,7.0,2600.0,102,
206,Kimberely,Hunold,am07y388km@gmail.com,590.423.4567,1998-03-07,12.0,10000.0,8,9.0
</dataframe>
<dataframe name="jobs">
dfs[4]:19x4
job_id,job_title,min_salary,max_salary
14,Purchasing Manager,8000.0,15000.0
10,Marketing Manager,9000.0,15000.0
7,Finance Manager,8200.0,16000.0
</dataframe>
<dataframe name="locations">
dfs[5]:7x6
location_id,street_address,postal_code,city,state_province,country_id
1400,2014 Jabberwocky Rd,26192,Southlake,Texas,US
2500,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,Oxford,UK
1700,2004 Charade Rd,98199,Seattle,Washington,US
</dataframe>
<dataframe name="regions">
dfs[6]:4x2
region_id,region_name
3,Asia
1,Europe
4,Middle East and Africa
</dataframe>




Update this initial code:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

```



### QUERY
 If I want to increase the salary for every employee except the ones having the highest salary in each department by 5%. Draw a plot of the total budget of each deparment with the increased amount stacking on the original amount and display the amount on the bars. Use abbreviation if necessary.

Variable `dfs: list[pd.DataFrame]` is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.


Generate python code and return full updated code:

From this prompt, Bedrock Claude generated the following code:

import pandas as pd
import matplotlib.pyplot as plt

# Merge the required dataframes
employees = dfs[3]
departments = dfs[1]
merged_df = employees.merge(departments, left_on='department_id', right_on='department_id', how='left')

# Group employees by department and find the maximum salary for each department
max_salaries = merged_df.groupby('department_id')['salary'].max()

# Increase salaries by 5% for employees who don't have the maximum salary in their department
mask = ~merged_df['salary'].isin(max_salaries)
merged_df.loc[mask, 'salary'] *= 1.05

# Calculate the total budget (original and increased) for each department
dept_budgets = merged_df.groupby('department_name')['salary'].sum().reset_index()
dept_budgets['increased_budget'] = dept_budgets['salary'] * 1.05

# Plot the total budget of each department
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(dept_budgets['department_name'], dept_budgets['salary'], label='Original Budget', color='skyblue')
ax.bar(dept_budgets['department_name'], dept_budgets['increased_budget'] - dept_budgets['salary'], bottom=dept_budgets['salary'], label='Increased Amount', color='lightgreen')

# Add labels and title
ax.set_xlabel('Department', fontsize=12)
ax.set_ylabel('Budget (USD)', fontsize=12)
ax.set_title('Department Budgets with Increased Salaries', fontsize=16)

# Add value labels
for i, v in enumerate(dept_budgets['increased_budget']):
    ax.text(i, v, f"{v:,.0f}", ha='center', va='bottom', fontsize=10)

# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')

# Add legend
ax.legend()

# Save the plot as a PNG file
plt.tight_layout()
plt.savefig('dept_budgets.png', dpi=300, bbox_inches='tight')

# Declare the result
result = {"type": "plot", "value": "dept_budgets.png"}
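
Because the generated code only depends on the dfs list that PandasAI injects at run time, it can be copied into a standalone script, tweaked (titles, colors, the budget calculation, and so on), and re-run without calling the model again. The sketch below makes that explicit: it rebuilds dfs from local copies of the two CSV files and then executes a file named plot_budgets.py that contains the (possibly edited) generated code; both file names are hypothetical.

import pandas as pd

# The generated code reads dfs[1] (departments) and dfs[3] (employees),
# so rebuild the list with the same positions; unused slots can stay None
departments = pd.read_csv("departments.csv")
employees = pd.read_csv("employees.csv")
dfs = [None, departments, None, employees]

# Run the copied snippet; it saves dept_budgets.png and sets the `result` dictionary
exec(open("plot_budgets.py").read())
print(result)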

Conclusion

By combining PandasAI with Amazon Bedrock and the Claude 3 model, we can greatly reduce the effort of data integration and data analysis, allowing business users to obtain data insights faster.

About the Authors

施俊

施俊 is a Solutions Architect at Amazon Web Services, responsible for architecture design and implementation for digital finance and enterprise customers on AWS, with more than 10 years of experience in financial software development and machine learning.

张尹

张尹 is a Technical Account Manager at Amazon Web Services, responsible for architecture and cost optimization and technical support for enterprise customers, with years of hands-on experience in big data architecture design and data warehouse modeling. Before joining Amazon Web Services, 张尹 was responsible for big data platform architecture design, data warehouse modeling, and operations at a leading e-commerce company.