In this blog post, we demonstrate prompt engineering techniques to generate accurate and relevant analysis of tabular data using industry-specific language. This is done by providing large language models (LLMs) in-context sample data with features and labels in the prompt. The results are similar to fine-tuning LLMs without the complexities of fine-tuning models. We used a method called Generative Tabular Learning (GTL) based on the whitepaper From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models and demonstrate the advantages of GTL using fully managed JupyterLab notebooks in Amazon SageMaker notebooks to interact with Meta Llama models hosted in Amazon SageMaker or Amazon Bedrock. You may check out additional reference notebooks on aws-samples for how to use Meta’s Llama models hosted on Amazon Bedrock.
Prerequisites
The following sections describes the prerequisites needed for this demonstration. You can implement these steps either from the AWS Management Console or using the latest version of the AWS Command Line Interface (AWS CLI).
- Access to LLMs such as Meta’s Llama models hosted on Amazon SageMaker or Amazon Bedrock
- Amazon SageMaker Domain configuration configured with JupyterLab notebooks and the necessary python libraries and packages to interact with the LLMs
- Sample tabular datasets from the financial industry formatted as structured data (we are using exchange-traded funds data from Kaggle) available for querying using a SQL engine like Amazon Athena.
- Knowledge of generative AI prompt engineering techniques to provide LLMs with relevant context and sample data
- Ability to evaluate and compare LLM-generated outputs for accuracy and relevance to the analysis task
- Understanding of financial industry data and knowledge of staging and querying this data in a structured tabular format consumable by LLMs
- Knowledge of the industry domain that the data belongs to in order to determine appropriate features and labels for sample data prompts
Financial industry data
In the financial industry data can be in the form of a table in PDF files or structured data in a database. The following is an example of a financial information dataset for exchange-traded funds (ETFs) from Kaggle in a structured tabular format that we used to test our solution.
A user can ask a business- or industry-related question for ETFs.
NOTE: Since we used an SQL query engine to query the dataset for this demonstration, the prompts and generated outputs mention SQL below.
# Business question
question = "Please provide a list of about 100 ETFs or ETNs names with exposure to US markets"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
After the data is retrieved from the dataset, it’s sent to the LLM hosted in Amazon Bedrock (refer to the list of supported models in Amazon Bedrock) for analysis and generates a response to the user’s question or query in natural language.
The question in the preceding example doesn’t require a lot of complex analysis on the data returned from the ETF dataset. We get a response from the LLM based on its analysis of the data in a satisfactory industry or business-relevant language:
LLM SQL Analysis:
After analyzing the provided SQL query results, I can conclude that the list of ETFs/ETNs does not primarily focus on US markets. Instead, it appears to be a comprehensive list of bond ETFs/ETNs with a global scope, covering various regions, currencies, and bond types.
Here are some key observations:
1. **Global coverage**: The list includes ETFs/ETNs tracking bond markets in Europe (e.g., Eurozone, UK), the US, and globally diversified indices.
2. **Bond types**: The list covers a range of bond types, including corporate bonds, government bonds, high-yield bonds, and green bonds.
3. **Currency exposure**: ETFs/ETNs are denominated in various currencies, such as EUR, USD, and GBP, with some offering hedged exposure to mitigate currency risks.
4. **ESG and SRI focus**: A significant portion of the list consists of ETFs/ETNs with an Environmental, Social, and Governance (ESG) or Socially Responsible Investing (SRI) focus, which suggests a emphasis on sustainable investing.
To answer the original question, I can identify a subset of ETFs/ETNs from the list that have exposure to US markets:
**US-focused ETFs/ETNs:**
1. xxxx USD Corporate Bond 0-3yr ESG UCITS ETF USD (Dist)
2. xxxx USD Corporate Bond ESG 0-3yr UCITS ETF EUR Hedged (Acc)
3. xxxx ESG USD High Yield (DR) UCITS ETF - Dist
4. xxxx USD High Yield Corporate Bond ESG UCITS ETF USD (Acc)
5. xxxx USD High Yield Corporate Bond ESG UCITS ETF USD (Dist)
6. xxxx Index US Corporate SRI UCITS ETF DR (C)
7. xxxx Index US Corporate SRI UCITS ETF DR Hedged EUR (D)
8. xxxx USD Corporate Bond ESG UCITS ETF (Acc)
9. xxxx USD Corporate Bond ESG UCITS ETF (Dist)
10. xxxx ESG USD High Yield Corporate Bond UCITS ETF 1C
11. xxxx ETF (LU) xxxx xxxx US Liquid Corporates Sustainable UCITS ETF (USD) A-dis
12. xxxx USD Corporate Green Bond UCITS ETF 2C Acc USD
Please note that this subset is not exhaustive, and there may be other ETFs/ETNs in the original list that have some exposure to US markets. Additionally, investors should carefully evaluate the investment objectives, risks, and characteristics of each ETF/ETN before making any investment decisions.
NOTE: Output ETF names do not represent the actual data in the dataset used in this demonstration.
NOTE: Outputs generated by LLMs are non-deterministic and may vary in your testing.
What would the LLM’s response or data analysis be when the user’s questions in industry specific natural language get more complex? To answer questions that require more complex analysis of the data with industry-specific context the model would need more information than relying solely on its pre-trained knowledge.
Solution overview
We encourage you to think about this question before starting: Can enhancing the context provided to the LLM in the prompt along with the user’s natural language question work in generating better outputs, before trying to fine-tuning the LLMs which requires setting up MLOPS processes and environments, collecting and preparing relevant and accurate labeled datasets, and more?
We propose an intermediate GTL framework using the Meta Llama model on Amazon Bedrock. The proposed framework is not meant to replace the fine-tuning option. The following diagram illustrates this framework of GTL for LLMs.
GTL is a type of few-shot prompting technique where we provide the following information about the data retrieved from the structured dataset as part of the prompt to the LLM:
- A personality for the LLM to use when generating the data analysis (which provides hints to the model to use industry-specific data it has already been pre-trained with)
- Data features and descriptions
- Data labels and descriptions
- A small sample dataset containing features
- A sample analysis as an example
The following is an example GTL prompt:
instructions = [
{
"role": "user",
"content": """Given the following SQL query results: {query_results}
And the original question: {question}
You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict how expensive the funds are for investors.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund,
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
Expensive: Whether the fund is expensive for investors or not. 0 means not expensive, 1 means expensive.
— data —
|isin|wkn|name|fundprovider|legalstructure|totalexpenseratio|Expensive|
|GB00BNRRxxxx |A3xxxx|xxxx Physical Staked Cardano|xxxx|ETN|0.0|0|
|BGPLWIG0xxxx|A2xxxx|xxxx Poland WIGxxx UCITS ETF|xxxx|ETF|0.0138|0|
|CH044568xxxx|A2Txxxx|xxxx Crypto Basket Index ETP|xxxx|ETN|0.025|1|
|CH1114873xxxx|A3Gxxxx|xxxx Solana ETP|xxxx|ETN|0.025|1|
|GB00BNRRxxxx|A3xxxx|xxxx Physical Staked Algorand|xxxx|ETN|0.0|<MASK>|
Please use the supplied data to predict the <MASK>. Fund is expensive[1] or not[0]?
Answer: 0
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
}
]
QNA_SYS_PROMPT = format_instructions(instructions)
In our GTL prompt, we are highlighting only the subset of columns in the dataset that are relevant to produce accurate and relevant analysis of the data based on the industry out of the possible 129 columns in the EFT dataset we have chosen and also examples of how to interpret the data.
Use case examples
Let’s look at a few sample prompts with generated analysis. The following question requires complex industry knowledge-based analysis of data from multiple columns in the ETF database. In this example the user is trying to find ETFs (funds) that provide higher dividends and lower volatility in value, which are a desired fund characteristic that investors look for in funds they want to invest in.
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
The following is our response without the GTL prompt:
LLM SQL Query:
SELECT
name,
yeardividendyield,
yearvolatilitycur,
fiveyearreturncur
FROM
etftable
WHERE
yeardividendyield > 3
AND yearvolatilitycur < 10
AND fiveyearreturncur > 5
ORDER BY
yearvolatilitycur ASC
LIMIT 10;
LLM SQL Analysis:
Based on the provided data, I will analyze the features and predict the GoodInvestment label for each fund. Then, I will identify
the least risk funds that yield higher dividends and aren't volatile.
**Analysis**
From the data, I observe the following:
1. **Low total expense ratio**: Funds with lower totalexpenseratio are more desirable, as they charge lower management fees.
A lower expense ratio indicates that the fund is cheaper to own.
2. **High five-year return**: Funds with higher fiveyearreturncur indicate better performance over the past 5 years.
3. **Non-zero yearly dividend yield**: Funds with non-zero yeardividendyield provide a regular income stream to investors.
**Prediction of GoodInvestment label**
Based on the analysis, I predict the GoodInvestment label for each fund as follows:
|isin|GoodInvestment|
| --- | --- |
|LU17815xxxx|1|
|LU17815xxxx7|1|
|IE00B5Bxxxx|0|
|LU14078xxxx|1|
|IE00B4xxxx|1|
|LU17815xxxx|1|
**Least risk funds that yield higher dividends and aren't volatile**
To identify the least risk funds, I will consider the following criteria:
1. **Low volatility**: Funds with lower fiveyearreturncur volatility are less risky.
2. **Higher dividend yield**: Funds with higher yeardividendyield provide a regular income stream to investors.
3. **Low expense ratio**: Funds with lower totalexpenseratio are cheaper to own.
Based on these criteria, I recommend the following funds:
1. **xxxxUK NT (DR) UCITS ETF - Dist (LU17815xxxx)**: This fund has a low expense ratio (0.04%), a relatively high dividend yield (0.0379),
and a moderate five-year return (0.2296).
2. **xxxxUS Equity (DR) UCITS ETF - Dist (LU17815xxxx)**: This fund has a low expense ratio (0.04%), a relatively high dividend yield (0.0196),
and a high five-year return (0.7337).
3. **xxxxEURO STOXX 50 UCITS ETF EUR (IE00B4Kxxxx)**: This fund has a moderate expense ratio (0.05%), a relatively high dividend yield (0.0297),
and a moderate five-year return (0.3927).
These funds offer a balance between dividend yield, low expense ratio, and moderate volatility, making them suitable for investors seeking a
relatively stable investment with a regular income stream.
We see that the data retrieval only uses the yearly volatility and not the 3-year and 5-year volatility information. Also, some of the funds don’t have volatility data in the dataset (no values for 1-year, 3-year, or 5-year volatility).
The following is a modified question requesting additional column considerations for 3-year and 5-year data.
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile based on five year, three year and one year volatiliy data"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
We use the following GTL prompt with labels to interpret 1-year, 3-year, and 5-year data or lack of data:
instructions = [
{
"role": "user",
"content": """Given the following SQL query results: {query_results}
And the original question: {question}
You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict best funds for investors to invest in.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
yeardividendyield: Yearly Dividend yield as a percentage of total investment
fiveyearreturncur: Returns over past 5 year period as a percentage of investment
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund,
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
volatile: The fund has low fiveyearvolatilitycur, threeyearvolatilitycur, yearvolatilitycur. 0 means not volatile, 1 means volatile, 2 means cannot be determined.
— data —
|isin|name|fiveyearvolatilitycur|threeyearvolatilitycur|yearvolatilitycur|Risk|
|LU033504xxxx|xxxx II EUR Overnight Rate Swap UCITS ETF 1D|8.0E-4|9.0E-4|0.0011|0|
|FR001051xxxx|Lyxor Euro Overnight Return UCITS ETF - Acc|8.0E-4|9.0E-4|0.0011|0|
|LU0290358xxxx|xxxxII EUR Overnight Rate Swap UCITS ETF 1C|9.0E-4|0.001|0.0013|0|
|BGCROEX03xxxx|xxxx UCITS ETF|0.675|0.8606|1.4787|1|
|IE000RN0xxxx|xxxxDisruptive Technology Real Estate UCITS ETF Acc||||2|
|GB00BNRRxxxx|xxxxPhysical Staked xxxx||||2|
|FR001075xxxx|xxxx ETF xxxx0-6 Months Euro Investment Grade UCITS ETF EUR (C)|0.0014|0.0016|0.0021|<MASK>|
Please use the supplied data to predict the <MASK>. Fund is volatile[1] or not[0] or cannot-be-determined[2]?
Answer: 1
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
}
]
We see that with additional prompting the model uses all of the volatility columns in the dataset (1-year, 3-year, and 5-year) and provides output suggestions for when data is present or missing in the volatility columns.
The following is our response with GTL prompts:
LLM SQL Query:
SELECT
name,
fundprovider,
yeardividendyield,
fiveyearvolatilitycur,
threeyearvolatilitycur,
yearvolatilitycur
FROM
etftable
WHERE
yeardividendyield > 0
AND fiveyearvolatilitycur < (SELECT AVG(fiveyearvolatilitycur) FROM etftable)
AND threeyearvolatilitycur < (SELECT AVG(threeyearvolatilitycur) FROM etftable)
AND yearvolatilitycur < (SELECT AVG(yearvolatilitycur) FROM etftable)
ORDER BY
yeardividendyield DESC,
fiveyearvolatilitycur,
threeyearvolatilitycur,
yearvolatilitycur
LIMIT 10;
LLM SQL Analysis:
Based on the provided data, I will analyze and interpret the results to answer the original question.
**Least Risk Funds with Higher Dividend Yields and Low Volatility**
To identify the least risk funds, I will consider the following criteria:
1. **Low Volatility**: Funds with low five-year, three-year, and one-year volatility (less than 0.1).
2. **Higher Dividend Yields**: Funds with a higher yearly dividend yield (greater than 0.07).
3. **Low Expense Ratio**: Funds with a low total expense ratio (less than 0.1).
After analyzing the data, I have identified the following funds that meet the above criteria:
1. **xxxx Global Select Dividend 100 Swap UCITS ETF 1D**:
* Yearly Dividend Yield: 0.0938
* Five-year Volatility: 0.1096
* Three-year Volatility: 0.1125
* One-year Volatility: 0.1236
* Expense Ratio: Not available in the data
2. **xxxx Emerging Markets Equity Income UCITS ETF**:
* Yearly Dividend Yield: 0.0797
* Five-year Volatility: 0.1532
* Three-year Volatility: 0.1471
* One-year Volatility: 0.1335
* Expense Ratio: Not available in the data
3. **xxxx FTSE EM High Dividend Low Volatility UCITS ETF**:
* Yearly Dividend Yield: 0.0787
* Five-year Volatility: 0.1566
* Three-year Volatility: 0.1621
* One-year Volatility: 0.1439
* Expense Ratio: Not available in the data
These funds have relatively low volatility and higher dividend yields compared to the other funds in the dataset. However, please note that the
expense ratio is not available in the data, which is an important factor to consider when evaluating the overall cost of investing in an ETF.
**Additional Considerations**
When evaluating these funds, investors should also consider other factors such as:
1. **Fund Provider**: The reputation and experience of the fund provider can impact the fund's performance and risk profile.
2. **Legal Structure**: The legal structure of the fund (ETF or ETN) can affect its tax efficiency, liquidity, and regulatory environment.
3. **Investment Strategy**: The fund's investment strategy and asset allocation can influence its risk profile and potential returns.
In conclusion, the three funds mentioned above appear to be the least risk funds with higher dividend yields and low volatility based
on the provided data. However, investors should conduct further research and consider additional factors before making an investment decision.
As we can see the data retrieval is more accurate. Additionally, the generated analysis has considered all of the volatility information in the dataset (1-year, 3-year, and 5-year) and accounted for present or missing data for volatility.
Based on this outcome, the recommendation is to build a curated set of GTL prompts along with the most common user questions pertaining to datasets that users will be asking. The prompts will need to be created by dataset specialists who have deep understanding of the dataset from industry perspective and can provide the right context to the LLMs. Organizations can use such a prompt library to build interactive applications that allow regular business users who may not have deep knowledge or understanding of underlying datasets to interact with and gain insights from these datasets using natural language questions.
Conclusion
As newer and larger LLMs are released, they get better at generating an analysis of structured datasets using industry-specific language. However, there is room for improvement in the analysis of data from structured datasets. One option is to fine-tune the LLM to improve relevance and language of the generated data analysis using specific business language. Fine-tuning requires additional efforts and costs (collecting relevant data, labeling the data, additional costs involved in procuring, and provisioning, and maintaining the fine-tuning compute environment).
In this post, we showcased a method with few-shot prompting using Meta Llama models available through Amazon Bedrock that can improve industry- or business-specific analysis of the data with just prompt engineering. (For certain use cases, fine-tuning may be required. Refer to Amazon Bedrock pricing for estimated costs with or without using fine-tuned models).
Try this solution with your own industry-specific use cases and datasets, and let us know your feedback and questions in the comments.
NOTE: Blog authors are not providing any financial or investment advice in this blog post, nor are they recommending this dataset or ETFs mentioned in this dataset.
About the Authors
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. In entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.
Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area focused on helping customers adopt and use AWS Cloud. He is focused on Big Data, Data Lakes, Streaming and batch Analytics services and generative AI technologies.
Ravi Ganesh is a Sr Solution Architect in AWS at Austin Texas Area, focused on helping customer address their business problems through adoption of Cloud, He is focussed on Analytics, Resiliency, Security and Generative AI technologies.
Varun Mehta is a Sr. Solutions Architect at AWS. He is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS Cloud. He works with strategic customers who are using AI/ML to solve complex business problems. Outside of work, he loves to spend time with his wife and kids