AWS HPC Blog

Using large-language models for ESG sentiment analysis using Databricks on AWS

This post was contributed by Ilan Gleiser, Principal ML Specialist, Global Impact Computing, AWS; Antoine Amend, Sr. Technical Director, Databricks; and Venkat Viswanathan, Senior Solutions Architect, AWS

Regulators worldwide recognize the threat of climate change to economies and financial systems, forcing public companies in the US and Europe to be aware of – and disclose – their greenhouse emissions. This has led to a surge in demand from various stakeholders, including asset managers, asset owners, investors, and regulators, for machine-learning models that can analyze the plethora of data on their environmental, social and governance (ESG) policies.

Alongside regulatory change, there is a financial incentive to this endeavor. Typically, ESG ratings have a positive correlation with both valuation and profitability, while showing a negative correlation with volatility.

In this post, we’re going to look at the challenge posed by ESG as a data and AI problem. Using the Databricks ESG Solution Accelerator, we will apply natural language processing (NLP) to sort through vast amounts of structured and unstructured data.

Why Databricks?

The main benefits of using the Databricks ESG Solution Accelerator are:

  1. Detect differences between sustainability reports and real-time news data in portfolios of securities. Gaps between the two may lead consumers and stakeholders to overvalue a company’s ESG score.
  2. View how companies’ business relationships with one another within a global marketplace can impact their respective ESG scores.
  3. Show that companies in the top decile of the ESG rank have half the volatility (as measured by value-at-risk) of companies at the bottom of the ESG ranks, along with better returns, indicating superior Sharpe ratios.

Users of the Databricks ESG Solution Accelerator can turn ESG risk into a source of excess returns (or alpha) if it is spotted early in the process. This is done by processing unstructured data, like PDFs and real-time news, and using machine learning (ML) algorithms that analyze the sentiment of the news and compare it with the sentiment of the companies’ sustainability reports on the same ESG policy.
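To make this comparison concrete, here is a minimal, illustrative sketch in Python. The word lists, scoring function, and example texts are toy stand-ins, not the accelerator’s actual NLP models:

```python
# Toy sketch: compare the sentiment of a company's sustainability-report
# text with the sentiment of news coverage on the same ESG policy.
# The lexicon and scorer below are illustrative, not the real models.

POSITIVE = {"reduce", "renewable", "improve", "commit", "achieve"}
NEGATIVE = {"spill", "violation", "fine", "lawsuit", "pollution"}

def sentiment(text: str) -> float:
    """Crude lexicon score in [-1, 1]: (pos - neg) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def esg_gap(report_text: str, news_text: str) -> float:
    """A positive gap means the report reads rosier than the news."""
    return sentiment(report_text) - sentiment(news_text)

gap = esg_gap(
    "We commit to renewable energy and will reduce emissions",
    "Regulator issues fine after pollution violation at plant",
)
```

A large positive gap flags a company whose self-reported ESG narrative diverges from its media coverage, which is the signal the accelerator surfaces at scale.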

The critical problem of unstructured data

The ESG world generates data that is inherently unstructured. Of the 40 ESG policies frequently disclosed by companies, only ten can be measured as tangible figures; the rest are policy initiatives expressed in textual form across various IT systems. This raises the question of how AI can be applied to quantify and compare organizations’ ESG policies in a more objective manner.

To help customers unlock the value of their sustainability data stored in the AWS cloud, we are highlighting the Databricks ESG Solution Accelerator, as described in “A Data-driven Approach to Environmental, Social, and Governance using the Databricks Solutions Accelerator”, and showing you, step by step, how to run the accelerator on AWS.

The notebooks show you how machine learning can enable asset managers, regulators, and investors to assess the sustainability exposure of their investments and empower their businesses with a holistic and data-driven view of their environmental, social, and corporate governance strategies.

Specifically, the first notebook extracts key ESG initiatives communicated in yearly sustainability PDF reports and compares these with actual real-time media coverage from news analytics data, as in Figure 1. The idea behind this approach is to spot differences between sustainability reports and real-time news data, offering decision makers the most accurate and timely information available.

Figure 1. Extract the key ESG initiatives as communicated in yearly PDF reports and compare these with the actual media coverage from news analytics data

Databricks Solution Accelerators on AWS: speedways to innovation

The Databricks ESG Solution Accelerator for AWS provides an expedited path to production for AWS customers who store their data on Amazon S3. Businesses can simplify the migration of their data and AI workloads to Databricks on AWS and quickly start using the accelerator notebooks. The Databricks ESG Solution Accelerator comes as a pair of notebooks that are easy to set up and provide prompt feedback, enabling asset management, MLOps governance, and risk teams to achieve their goals in a much shorter time frame. Customers can take advantage of AWS’s computing power and virtually limitless storage, resulting in increased flexibility, scalability, and dependability at a reduced cost compared to developing an in-house solution.

To support users from different lines of business, these accelerators are designed to meet the different goals of multiple teams – portfolio managers and asset owners can use them to maximize alpha or minimize risk; regulators can identify new potential hot spots; and corporations can gauge public sentiment of their coverage in the media.

ESG Accelerator Logic

Although all these accelerators work together, businesses can apply them as stand-alone projects or layer them. The process goes as follows:

Extract ESG initiatives from ESG reports using PyPDF2 and categorize them based on different topics. Then, tokenize and lemmatize the content for algorithmic ingestion. The next step is to automatically classify sentences extracted from ESG reports using a topic modeling algorithm. This classification allows users to compare different companies by clustering them side by side to identify key focus areas (or ESG policies) voluntarily disclosed in their sustainability reports.
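As a hedged sketch of the topic-modeling step, assuming scikit-learn is available (the sentences and topic count are toy stand-ins; the real pipeline first extracts text with PyPDF2 and lemmatizes it):

```python
# Cluster sentences from ESG reports into topics with LDA.
# Toy corpus: two 'renewable energy' and two 'code of conduct' sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "we invest in renewable energy and lower carbon emissions",
    "solar and wind power reduce our carbon footprint",
    "our code of conduct requires ethics training for employees",
    "employees complete annual ethics and conduct training",
]

# Bag-of-words counts feed the topic model
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# Two latent topics; fit_transform yields one probability row per sentence
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)
```

Each row of `doc_topics` is a distribution over topics; comparing companies by these distributions gives the side-by-side clustering described above.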

To create a data-driven ESG score, the Databricks ESG Solution Accelerator conducts a sentiment analysis on financial news articles related to each company, using the Global Database of Events, Language, and Tone (GDELT) files, which are updated every 15 minutes. The assumption is that the overall tone captured from financial news articles is a good proxy for a company’s ESG score. This approach generates scores for each company across all its ESG dimensions. A propagated weighted ESG (PW-ESG) metric is then calculated to provide a global view of risk by quantifying the links between companies and assessing their importance.
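A simplified sketch of the scoring and propagation steps follows. The company names, tones, and relationship weights are invented for illustration, and the real PW-ESG weighting is more involved than this linear blend:

```python
# (company, dimension, tone) tuples, as might be parsed from GDELT files
articles = [
    ("acme", "E", 2.1), ("acme", "E", -0.5), ("acme", "S", 1.0),
    ("globex", "E", -1.2),
]

def avg_tone(records, company, dim):
    """Average news tone for one company on one ESG dimension."""
    tones = [t for c, d, t in records if c == company and d == dim]
    return sum(tones) / len(tones) if tones else 0.0

# Base environmental score per company from news tone
base = {c: avg_tone(articles, c, "E") for c in ("acme", "globex")}

# Propagate: blend each company's own score with its partners',
# weighted by (assumed) business-relationship strength.
links = {"acme": {"globex": 0.3}, "globex": {"acme": 0.3}}
pw_esg = {
    c: (1 - sum(links[c].values())) * base[c]
       + sum(w * base[p] for p, w in links[c].items())
    for c in base
}
```

In this toy example, acme’s score is pulled down by its link to the poorly covered globex, which is exactly the “global view of risk” the PW-ESG metric aims to provide.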

Finally, to validate the assumption that high-ranked ESG companies offer better risk-adjusted returns than low-ranked ESG companies, the portfolio is split into two books – the best and worst 10% of ESG scores, respectively – and their historical returns and corresponding 95% value-at-risk are computed. The results show that the low-ranked ESG portfolio has twice the risk of the high-ranked ESG portfolio for the same level of returns.
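The two-book comparison can be sketched with synthetic daily returns, where the low-ESG book is given twice the volatility of the high-ESG book by construction (the parameters are assumptions for illustration, not market data):

```python
import random

random.seed(7)
# Synthetic daily returns: same mean, low-ESG book has twice the volatility
high_esg = [random.gauss(0.0005, 0.01) for _ in range(1000)]
low_esg = [random.gauss(0.0005, 0.02) for _ in range(1000)]

def var_95(returns):
    """Historical 95% value-at-risk: the loss exceeded on 5% of days."""
    return -sorted(returns)[int(0.05 * len(returns))]

# With equal expected returns, the low-ESG book's VaR comes out roughly
# twice the high-ESG book's, mirroring the result described above.
```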

Benefits of notebook 1 – spotting differences between sustainability reports and real-time news data

The goal of notebook #1 is to understand what the statements are about, learning an ESG-specific vocabulary with themes like diversity and inclusion, code of conduct, supporting communities, and renewable energy.

The ESG Solution Accelerator supports Databricks and AWS customers who host their data on AWS, offering a machine learning platform that learns a wide array of themes significant in today’s corporate social responsibility environment. These themes range from ‘diversity and inclusion’, ‘code of conduct’, and ‘supporting communities’ to ‘renewable energy’, ‘impact investing’, and ‘valuing employees’. This automated learning enables the algorithm to provide valuable insights in these specific fields, giving businesses the opportunity to better understand and incorporate these aspects into their investment and risk management strategies.

Figure 2. Cluster analysis of machine-learned policies emerging from unstructured sustainability report PDF files

Figure 3. Comparing organizations side by side, based on how much they disclose in each of those categories

In a separate blog post, we show you how you can fine-tune a large language model and accelerate hyperparameter grid search for sentiment analysis with BERT models using Weights & Biases, Amazon EKS, and TorchElastic.

Benefits of notebook 2 – understanding how ESG scores correlate with market risk and returns

The second notebook presents a method for constructing a synthetic portfolio and incorporating ESG insights into the market risk framework. This involves considering not only what a company says about itself, but also how it actually performs. Research and literature suggest that companies with strong ESG practices typically exhibit lower market volatility. Our examination of value-at-risk highlights that a synthetic portfolio lacking such practices can be twice as volatile. Ultimately, we aim to connect research and product by integrating our findings into a comprehensive BI-to-AI dashboard.

Combining all those insights into one platform helps us understand what a company says about ESG versus how much it actually does across the 24 machine-learned policies. This platform informs us in real time about news events that may positively or negatively affect the ESG score of every company, and subsequently, the impact it may have on its market performance.

ESG factors are among major market factors, like value, momentum, and volatility. The taxonomy of ESG factors has proved adaptive, as the market empirically prices new indicators. In addition, the recent advances in quantifying the effect of ESG factors on performance, in developing a regulatory and legal framework for ESG, and in establishing new ESG ratings should continue to have a positive effect on asset flows into ESG-related strategies.

Running the Accelerator on AWS

Databricks built the ESG Solution Accelerator using the Delta Lake Medallion Architecture to ingest data from raw, enriched, and purpose-built tables categorized as Bronze, Silver, and Gold Delta Lake tables.

Delta Lake is an open-source project built for data ‘lakehouses’ with compute engines including Apache Spark, Trino, PrestoDB, Flink, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python. Delta Lake is an ACID table storage layer over cloud object stores like S3 that provides data reliability features including, but not limited to, schema enforcement and evolution, time travel, scalable metadata handling, audit history, and DML operations, and that unifies stream and batch processing. You can store the data in Delta Lake tables in your encrypted S3 bucket.
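As a toy illustration of the medallion flow, plain Python dicts stand in here for the Bronze, Silver, and Gold Delta Lake tables that the accelerator writes with Spark:

```python
# Bronze: raw ingested news records, possibly malformed
bronze = [
    {"company": "Acme ", "tone": "1.5"},
    {"company": "acme", "tone": "bad"},   # unparseable tone, dropped later
    {"company": "Globex", "tone": "-0.8"},
]

def to_silver(rows):
    """Silver: normalize company names, drop rows with invalid tones."""
    out = []
    for r in rows:
        try:
            out.append({"company": r["company"].strip().lower(),
                        "tone": float(r["tone"])})
        except ValueError:
            continue
    return out

def to_gold(rows):
    """Gold: purpose-built aggregate, average tone per company."""
    agg = {}
    for r in rows:
        agg.setdefault(r["company"], []).append(r["tone"])
    return {c: sum(t) / len(t) for c, t in agg.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
```

In the accelerator, each stage is persisted as a Delta table on S3, gaining the schema enforcement, time travel, and ACID guarantees this in-memory sketch lacks.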

Figure 4. This schematic depicts how the Solutions Accelerator leverages corporate disclosures, news feeds, and portfolio ticker symbols to collect, transform, enrich, and apply machine learning models to create ESG scores from corporate disclosures and from the news feeds. It also depicts the gaps between the corporate disclosures and the news feeds.

A machine learning model is created using open-source MLflow to enrich the data with an ML-based approach. Databricks workloads run within the customer’s VPC on Amazon EC2 instances, using container images stored in Amazon Elastic Container Registry (Amazon ECR).

AWS native services and AWS partner services further consume the data from Delta Lake. Some examples include using Amazon QuickSight for BI and visualization, Amazon Athena for advanced analytics, AWS Glue for the data catalog, and Amazon SageMaker for additional models, inference, and serving.

Alongside these, ancillary components can be implemented, such as AWS Identity and Access Management (IAM) to control access and set guardrails for the applications and services, and Amazon CloudWatch to monitor the applications and services and ensure proper functioning.

Steps to use the template for the ESG Sentiment use case

  1. If you are new to AWS, create and activate a new account.
  2. If you are new to Databricks on AWS, create and set up a Databricks account.
  3. Log in to Databricks at <account_name>.cloud.databricks.com with your username and password.
  4. Clone the repo: Repos → Add Repo → select GitHub as the Git provider → enter https://github.com/databricks-industry-solutions/esg-scoring in the Git repository URL field → provide a name in the Repository name field → Submit.
  5. Go to Compute → Create Cluster → select ML and the 10.4 LTS ML (Scala 2.12, Spark 3.2.1) runtime or above → choose a Worker Type instance type (Min workers: 2, Max workers: 8) and set the Driver Type to the same as the Worker Type → check Enable autoscaling → check Terminate after 60 minutes of inactivity.
  6. Create your cluster.
  7. Repos → select your repo → open each of the 5 notebooks sequentially → attach a cluster from the Connect drop-down, selecting the one you created in step 6 → Run → Run All.

You can read the Databricks ESG solution accelerator explainer to learn more, or watch the video. And you can check out the Databricks page in AWS Marketplace to get started with Databricks on AWS, and visit the AWS Sustainability Solutions page for ready-to-deploy ESG solutions.

Conclusion

Through this blog, we have illustrated a streamlined method for summarizing complex documents into key ESG initiatives that offer a deeper comprehension of the sustainability aspects of your investments. With the implementation of machine learning methods powered by large language models (LLMs), we have introduced a unique approach to ESG that more effectively identifies the impact of global markets on both organizational strategy and reputational risk. Additionally, we have highlighted the significant economic influence of ESG factors on market risk calculation.

As a starting point for a data-driven ESG journey, this approach can be further improved by bringing in the internal data you hold about your various investments, along with additional metrics from third-party data, and propagating the risks through the propagated-weighted ESG framework to keep driving more sustainable finance and impactful investments.

Ilan Gleiser

Ilan Gleiser is a Principal Emerging Technologies Specialist on the AWS WWSO Advanced Computing team, focusing on circular economy, agent-based simulation, and climate risk. He is an Expert Advisor on digital technologies for the circular economy with the United Nations Environment Programme. Ilan’s background is in quant finance and machine learning.

Antoine Amend

Antoine Amend is a data practitioner passionate about distributed computing and advanced analytics. A graduate with a master’s degree in computational astrophysics and the author of “Mastering Spark for Data Science”, Antoine has been pushing the engineering and science disciplines side by side to extract commercial value from large datasets. As director of data science at Barclays UK, Antoine led their AI practice and their data and analytics transformation. With expertise in enterprise architecture and commercial experience delivering data science to production in a highly regulated environment, he joined Databricks as the Sr. Technical Director for financial services, helping customers redefine the future of banking and capital markets.

Venkat Viswanathan

Venkatavaradhan (Venkat) Viswanathan is a Senior Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.