Quant research at scale using AWS and Refinitiv data

This post is a follow-up to Analyzing impact of regulatory reform on the stock market using AWS and Refinitiv data. In that post, Boris, Pramod, and Alex analyzed the impact of SEC reform on the stock market using Refinitiv data.

Performing investment research on raw (tick) market data at scale is a difficult task because of the data's size: just three weeks amount to 300 TB and 80 billion rows. Even simple techniques such as data aggregations require parallel computing and a high degree of cost optimization at this scale. For this solution, we chose to stay within the Python ecosystem with dataframe and dataframe-like APIs. This is the standard toolset data scientists and quant researchers are accustomed to.
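As a rough sense of scale, the totals quoted above can be turned into per-row and per-day averages. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and 15 trading days in three weeks. Both are assumptions made for illustration, not figures taken from the dataset.

```python
# Back-of-envelope sizing from the totals quoted in the text.
TOTAL_BYTES = 300 * 10**12   # 300 TB, assuming decimal terabytes
TOTAL_ROWS = 80 * 10**9      # 80 billion rows
TRADING_DAYS = 15            # assumption: three weeks of five trading days

bytes_per_row = TOTAL_BYTES / TOTAL_ROWS
rows_per_day = TOTAL_ROWS / TRADING_DAYS
terabytes_per_day = TOTAL_BYTES / TRADING_DAYS / 10**12

print(f"average row size: {bytes_per_row:,.0f} bytes")
print(f"rows per trading day: {rows_per_day / 1e9:.2f} billion")
print(f"data per trading day: {terabytes_per_day:.0f} TB")
```

Under these assumptions, a single trading day is already far larger than the memory of one machine, which is why the rest of the post relies on a distributed engine.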

In this blog post, Alex, Pramod, and I will show how to install and use the infrastructure we built to perform quant research at scale. We made the stack and examples available in the public repository so you can use it in your own investment research.

Background and prerequisites

Following is a list of the major components of the stack and reasons why we selected them.

Apache Spark (Spark) is a popular distributed data processing framework, capable of handling terabytes of data by distributing it across multiple nodes in the cluster. It has a well-developed Python (PySpark) API with Spark data frames that provide functionality comparable to pandas, the open-source data analysis library widely used in data engineering and machine learning. For any custom development, Spark exposes user-defined function (UDF) functionality, including but not limited to pandas UDFs, which bring native pandas functionality to parallel computing.
Amazon EMR on EKS enables you to run Spark workloads using Amazon Elastic Kubernetes Service (Amazon EKS). This gives users access to any underlying Amazon Elastic Compute Cloud (Amazon EC2) fleet, including but not limited to cost-effective spot instances and out-of-the-box fault tolerance. Docker simplifies packaging, dependency management, and image customization. With this deployment option, you can focus on investment research while Amazon EMR on EKS builds, configures, and scales underlying compute resources up or down.
Choosing EKS enabled us to use Karpenter, a fast and cost-effective scaling solution, so we can combine on-demand and spot instances to reduce costs. We also automatically terminate EC2 cluster nodes once the job has finished.
We chose EMR Studio Notebooks for interactive data exploration and analysis to streamline the development process. It is not uncommon to install and manage additional Python dependencies within the notebook itself. However, using the custom Docker image feature of EMR on EKS promotes portability and simplifies dependency management. It also integrates better with industry-established build, test, and deployment processes.
AWS Data Exchange for Amazon S3 eliminates undifferentiated heavy lifting when accessing Refinitiv data.
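To illustrate the Spark data frame API mentioned in the component list above, here is a minimal sketch of one common aggregation, the volume-weighted average price (VWAP) per symbol. The column names `symbol`, `price`, and `size` are hypothetical and are not taken from the Refinitiv schema. The pure-Python function is only a reference for checking results on a small sample; the Spark function expresses the same aggregation for a cluster.

```python
from collections import defaultdict


def vwap_by_symbol(trades):
    """Pure-Python reference: volume-weighted average price per symbol.

    `trades` is an iterable of (symbol, price, size) tuples.
    """
    notional = defaultdict(float)
    volume = defaultdict(float)
    for symbol, price, size in trades:
        notional[symbol] += price * size
        volume[symbol] += size
    return {s: notional[s] / volume[s] for s in notional if volume[s] > 0}


def vwap_by_symbol_spark(df):
    """Same aggregation on a Spark DataFrame with the hypothetical
    columns `symbol`, `price`, and `size`.

    PySpark is imported lazily so the reference function above can be
    used on a machine without Spark installed.
    """
    from pyspark.sql import functions as F

    return df.groupBy("symbol").agg(
        (F.sum(F.col("price") * F.col("size")) / F.sum("size")).alias("vwap")
    )


# Tiny hand-made sample, not real market data.
sample = [("A", 10.0, 100), ("A", 11.0, 300), ("B", 5.0, 10)]
print(vwap_by_symbol(sample))
```

In a notebook attached to the cluster, `vwap_by_symbol_spark` would be called on a data frame read from your own data location.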

This combination of technologies enables you to quickly prototype, develop, and deploy data engineering and research jobs. In larger organizations, you can use the same stack to shorten time to market and improve interoperability between quant and data engineering teams.

This GitHub repository contains the source code to initialize the entire infrastructure for this solution with a single command, along with an example notebook.

Solution walkthrough: Quant research at scale using AWS and Refinitiv data

A. How to install the stack

The Bash scripts in the repository must be run on Linux. To avoid potential local package conflicts, we highly recommend using an AWS Cloud9 instance to deploy the stack.

To provision a Cloud9 instance, follow the steps outlined in AWS Cloud9: Creating an EC2 Environment. Then proceed with the following steps:

In Cloud9, open a Terminal window and clone the following repository:

git clone https://github.com/aws-samples/quant-research

To configure your deployment, in AWS Cloud9, open deployment/cdk/cdk.context.json in your AWS Cloud Development Kit (CDK) folder. The CDK configuration enables you to create multiple independent projects. Use the cdk-project parameter to set the name of the active project you want to deploy, and use the project=<cdk-project> key to specify the project-specific configuration.

By default, a single project named adx is created. Change the following parameters before deploying it in your environment:

- Replace project=adx.eks-role-arn with the IAM role you use in your AWS console. Otherwise, you will not be able to access EKS cluster metrics from the AWS console. For example, if you access your AWS console using an IAM role named Administrator and your AWS account ID is 111111222222, you would replace line 4 of cdk.context.json as follows:

"eks-role-arn": "arn:aws:iam::111111222222:role/Administrator",

- Change the project=adx.emrstudio[0].managed-endpoints[0].iam-policy IAM policy to configure access to the data and other AWS services your EMR Studio notebook can connect to. In the following policy, replace the <S3_BUCKET_NAME> placeholder with your S3 bucket name.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<S3_BUCKET_NAME>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<S3_BUCKET_NAME>/*"
      ]
    }
  ]
}
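If you prefer to script these two configuration changes rather than edit the file by hand, the following sketch shows one way to do it. It assumes the file layout implied by the key paths above (a top-level "project=adx" object containing "eks-role-arn" and an "emrstudio" list); the real cdk.context.json contains more keys, and the bucket name my-research-bucket is a made-up example.

```python
import copy
import json


def build_s3_policy(bucket_name):
    """Return the IAM policy shown above with the bucket name filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


def apply_project_settings(context, project, account_id, role_name, bucket_name):
    """Return a copy of the context with the role ARN and IAM policy set.

    Assumed layout (inferred from the key paths in the text):
      context["project=<name>"]["eks-role-arn"]
      context["project=<name>"]["emrstudio"][0]["managed-endpoints"][0]["iam-policy"]
    """
    updated = copy.deepcopy(context)
    section = updated[f"project={project}"]
    section["eks-role-arn"] = f"arn:aws:iam::{account_id}:role/{role_name}"
    endpoint = section["emrstudio"][0]["managed-endpoints"][0]
    endpoint["iam-policy"] = build_s3_policy(bucket_name)
    return updated


# Minimal stand-in for cdk.context.json, for illustration only.
example_context = {
    "cdk-project": "adx",
    "project=adx": {
        "eks-role-arn": "",
        "emrstudio": [{"managed-endpoints": [{"iam-policy": {}}]}],
    },
}

result = apply_project_settings(
    example_context, "adx", "111111222222", "Administrator", "my-research-bucket"
)
print(json.dumps(result, indent=2))
```

To apply this to the real file, you would load it with json.load, pass the resulting dictionary to apply_project_settings, and write the result back with json.dump.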

[Optional] Review and change src/Dockerfile to include any additional Python packages required for your workload. All contents of the src folder will be added to the Docker image and will be accessible from your EMR notebooks. The default version of the stack is shipped with Python 3.9 and popular Python libraries for data engineering, analysis and visualization.

RUN yum install -y libpng-devel sqlite-devel xz-devel gcc-c++ openssl-devel bzip2-devel libffi-devel tar gzip wget make
RUN wget https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tgz
RUN tar xzf Python-3.9.9.tgz 
RUN cd Python-3.9.9 && ./configure --enable-loadable-sqlite-extensions --enable-optimizations
RUN cd Python-3.9.9 && make altinstall
RUN pip3 freeze > requirements.txt
RUN pip3.9 install -r requirements.txt
RUN rm /usr/bin/python3
RUN ln -s /usr/local/bin/python3.9 /usr/bin/python3
ENV MPLLOCALFREETYPE 1
RUN pip3.9 install --upgrade matplotlib==3.2.2 kaleido backtrader pandas scikit-learn numpy pyarrow s3fs bokeh vectorbt pyEX alpaca-py pyfolio awswrangler boto3 yfinance

Deploy the CDK template using the Bash script provided. To do that, replace <ACCOUNT_ID> and <REGION> in the following snippet with your account ID and Region:

cd deployment/cdk/
bash ./deployment.sh <ACCOUNT_ID> <REGION>
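A typo in the account ID or Region only surfaces once the deployment is under way, so it can help to check the arguments first. The sketch below is a hypothetical wrapper around the command above. The Region pattern is a loose shape check that we introduce here for illustration; it does not confirm that the Region exists or is enabled for your account.

```python
import re
import subprocess

ACCOUNT_ID_RE = re.compile(r"\d{12}")
# Loose shape check only, e.g. us-east-1 or ap-southeast-2.
REGION_RE = re.compile(r"[a-z]{2}(-[a-z]+)+-\d")


def build_deploy_command(account_id, region):
    """Validate the arguments and return the deployment command as a list."""
    if not ACCOUNT_ID_RE.fullmatch(account_id):
        raise ValueError(f"account ID must be 12 digits, got {account_id!r}")
    if not REGION_RE.fullmatch(region):
        raise ValueError(f"unexpected Region format: {region!r}")
    return ["bash", "./deployment.sh", account_id, region]


def deploy(account_id, region, cdk_dir="deployment/cdk"):
    """Run deployment.sh from the CDK folder; raises if the script fails."""
    subprocess.run(build_deploy_command(account_id, region), cwd=cdk_dir, check=True)


print(build_deploy_command("111111222222", "us-east-1"))
```

Calling deploy("111111222222", "us-east-1") from the repository root would then run the same script as the two shell commands above.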

Once the deployment has finished successfully, switch back to your AWS Management Console. To do this, in the upper right of the AWS Cloud9 console, choose Go to Dashboard.

You can access your provisioned infrastructure, including your EKS Cluster, EMR Virtual Cluster, the EMR Managed Endpoint, and EMR Studio in the AWS Management Console. To do that, in the upper left, choose Services and then Containers. Alternatively, you can enter keywords such as Kubernetes in the search bar.

The following screenshot shows the AWS Management Console with the Services menu open and Containers selected. The center pane shows Elastic Container Registry, Elastic Container Service, Elastic Kubernetes Service, and Red Hat OpenShift Service on AWS. Elastic Kubernetes Service is starred and selected.


B. Configure the Jupyter notebook (EMR Studio)

To use the cluster you created in step A, you must set up a managed Jupyter notebook environment using EMR Studio. To do that, follow the EMR Studio instructions for creating a Workspace.
Assign your Workspace to the managed EmrOnEks endpoint that was provisioned by your CDK deployment in step A. To do that, in Amazon EMR, navigate to EMR Studio: Studios. In the far right column, choose the Studio Access URL for the Studio you created in step A. In the upper right, choose Create workspace.
As part of the stack, we also automatically create an AWS CodeCommit repository that you can use for your notebooks and scripts. The username and password combination is automatically generated and securely stored in AWS Secrets Manager, following the <PROJECT>-codecommit-<REGION> naming convention. To reveal the access details for AWS CodeCommit, in the AWS Management Console, navigate to AWS Secrets Manager. From the left navigation, choose Secrets and then choose the secret for your project and Region, for example adx-codecommit-us-east-1. Scroll down to Secret value. On the lower right, choose Retrieve secret value.
Link the repository to your EMR Studio Workspace by following Link Git-based repositories to an EMR Studio Workspace.
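The CodeCommit credentials described in the steps above can also be read programmatically. The following sketch builds the secret name from the <PROJECT>-codecommit-<REGION> convention and reads the value with the boto3 Secrets Manager client. It returns the raw secret string because the text does not specify the secret's internal format, and it assumes the caller's IAM identity is allowed to read the secret.

```python
def codecommit_secret_name(project, region):
    """Build the secret name from the <PROJECT>-codecommit-<REGION> convention."""
    return f"{project}-codecommit-{region}"


def get_codecommit_secret(project, region):
    """Return the raw secret string stored for the project's CodeCommit user.

    boto3 is imported lazily so the naming helper works without it.
    Requires AWS credentials with permission to read the secret.
    """
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=codecommit_secret_name(project, region))
    return response["SecretString"]


print(codecommit_secret_name("adx", "us-east-1"))
```

Avoid printing or committing the returned value; treat it the same way as the secret shown in the console.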

You are now ready to run the example notebook or work on your own research.

C. Customize the developer experience

AWS CDK lets you define parameters in the cdk.context.json or cdk.json files, which we used to give you the option to define multiple EmrOnEKS managed endpoints. Each endpoint can have its own IAM policy to manage access permissions and its own EMR version. Each endpoint can also have a separate Dockerfile that builds a custom Docker image encapsulating all of your project dependencies.

Using a custom Docker image improves the developer experience by avoiding the package conflicts that one-off installs can cause. If you still need to install a package directly in the notebook, you can use !pip install <PACKAGE>.
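Before falling back to an install in the notebook, you can check whether the modules you need are already present in the image. The sketch below uses only the standard library; the helper name missing_modules is our own. It takes top-level import names, which can differ from the pip package name (for example, the scikit-learn package is imported as sklearn).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately made up.
print(missing_modules(["json", "no_such_module_for_this_example"]))
```

Anything reported as missing is a candidate for the Dockerfile if you need it regularly, or for a notebook install if you need it once.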

If you change the JSON configuration or a Dockerfile for a managed endpoint, run the deployment.sh Bash script again. To do that, see step A.4.

bash ./deployment.sh <ACCOUNT_ID> <REGION>

The stack automatically detects the changes and re-deploys the changed resources.

Conclusion

In this blog post, Alex, Pramod, and I showed how to deploy and configure the investment research solution we used to perform the analysis for Analyzing impact of regulatory reform on the stock market using AWS and Refinitiv data. The stack available on GitHub automates installation, ongoing development, and the operational lifecycle for investment research teams.

About the authors

Pramod Nayak is the Director of Product Management in the Low Latency Group of LSEG. He focusses on the software, data, and platform products for the low-latency market data industry. Pramod is a former software engineer and is passionate about market data and quantitative trading.
Alex Tarasov is a Senior Solutions Architect working with Fintech Startup customers, helping them design and run their data workloads on AWS. He is a former data engineer and is passionate about all things data and machine learning.
Boris Litvin is a Principal Solutions Architect, responsible for Financial Services industry innovation. He is a former Quant and FinTech founder who is passionate about quantitative trading and data science.
Sercan Karaoglu is a Senior Solutions Architect, specialized in capital markets. He is a former data engineer and is passionate about quantitative investment research.
