AWS Public Sector Blog

Analytics Without Limits: FINRA’s Scalable and Secure Big Data Architecture – Part 1

A guest post by John Brady, CISSP, VP Cyber Security/CISO, Financial Industry Regulatory Authority

The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers. Every day, we watch over nearly 6 billion shares traded in U.S. equities markets—using technology powerful enough to help detect fraud, abuse, and insider trading. In fact, FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records.

FINRA enables flexible, scalable, and secure analytics in the cloud with an architecture built on Amazon Simple Storage Service (Amazon S3). We extended the data lake pattern with Amazon EMR, HBase, and Amazon S3 to allow interactive, random-access queries across trillions of records spanning 600+ terabytes of data.

Before the cloud, fixed capacity and provisioning lead-times were getting in the way of the analytics. With AWS, we can now expand online storage seamlessly and scale compute dynamically to meet the demands of our analysts and data scientists and keep pace when market volumes spike. We keep an archive copy of each dataset on Amazon S3, protect the data with encryption and access policies, process directly against data on Amazon S3 where possible, and transform or extract data for extra performance when we must.

But keeping track of 300 million+ objects in Amazon S3 can be a challenge. What data do we have? Where is the data used? How many versions of this data exist? What is the source of this data? What is the retention policy?

Enter herd, our open source data catalog and orchestration tool. With herd, we can efficiently track and catalog data in a unified data repository, capture audit and data lineage information for our highly regulated environment, and programmatically access this data. All of this allows us to separate compute from storage in AWS, enabling near-limitless scalability.
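To make the catalog idea concrete, here is a minimal sketch of how a catalog could answer the questions above: which datasets exist, how many versions there are, where data came from, where it is used, and when it expires. It is an illustration only and not herd's actual data model or API. Every name in it (DatasetVersion, Catalog, the dataset names, the S3 URIs) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (dataset name, version number)


@dataclass(frozen=True)
class DatasetVersion:
    """One registered version of a dataset (hypothetical schema, not herd's)."""
    name: str
    version: int
    s3_uri: str              # where the objects live (made-up URIs below)
    source: str              # who or what produced the data
    registered_on: date
    retention_days: int      # retention policy, in days
    parents: Tuple[Key, ...] = ()  # versions this one was derived from


class Catalog:
    """In-memory catalog that tracks versions, lineage, and retention."""

    def __init__(self) -> None:
        self._entries: Dict[Key, DatasetVersion] = {}

    def register(self, entry: DatasetVersion) -> None:
        key = (entry.name, entry.version)
        if key in self._entries:
            raise ValueError(f"{key} is already registered")
        for parent in entry.parents:
            if parent not in self._entries:
                raise KeyError(f"unknown parent {parent}")
        self._entries[key] = entry

    def datasets(self) -> List[str]:
        """What data do we have?"""
        return sorted({name for name, _ in self._entries})

    def versions(self, name: str) -> List[int]:
        """How many versions of this data exist?"""
        return sorted(v for n, v in self._entries if n == name)

    def used_by(self, name: str, version: int) -> List[Key]:
        """Where is the data used? Returns direct consumers only."""
        target = (name, version)
        return sorted(k for k, e in self._entries.items() if target in e.parents)

    def lineage(self, name: str, version: int) -> List[Key]:
        """What is the source of this data? Returns all ancestors, nearest first."""
        seen: List[Key] = []
        queue = list(self._entries[(name, version)].parents)
        while queue:
            key = queue.pop(0)
            if key not in seen:
                seen.append(key)
                queue.extend(self._entries[key].parents)
        return seen

    def expires_on(self, name: str, version: int) -> date:
        """What is the retention policy? Returns the date retention ends."""
        entry = self._entries[(name, version)]
        return entry.registered_on + timedelta(days=entry.retention_days)


def build_example() -> Catalog:
    """Register three hypothetical datasets that form a small lineage chain."""
    catalog = Catalog()
    day = date(2016, 1, 4)
    catalog.register(DatasetVersion(
        "orders_raw", 1, "s3://example-bucket/orders_raw/v1/", "exchange feed", day, 30))
    catalog.register(DatasetVersion(
        "orders_clean", 1, "s3://example-bucket/orders_clean/v1/", "normalization job",
        day, 30, parents=(("orders_raw", 1),)))
    catalog.register(DatasetVersion(
        "surveillance_report", 1, "s3://example-bucket/report/v1/", "analytics job",
        day, 30, parents=(("orders_clean", 1),)))
    return catalog
```

Because the catalog records only metadata and S3 locations, any number of compute clusters could look up the same dataset version and read the same objects, which is the separation of compute from storage described above.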

The Amazon S3 data lake architecture, coupled with herd, allows us to:

  • Leverage virtually unlimited, secure, and cost-effective storage with Amazon S3
  • Scale compute up and down independent of storage
  • Execute multiple simultaneous analytic workloads against the same copy of data
  • Provide a centralized dataset for diverse analytic platforms
  • Optimize costs by leveraging AWS Spot pricing

The data lake has removed obstacles and lowered the cost of curiosity. This allows analysts to quickly obtain a full picture of an order over time, helping to determine whether a rule violation has occurred. FINRA analysts are able to optimize batch and interactive workloads without compromise, and analyze years of historical market data within minutes or hours rather than weeks or months.

In addition to the big data and data lake use cases, FINRA will be moving approximately 200 relational databases to the cloud. By using Amazon RDS for PostgreSQL, we have put control back into the hands of our developers. They can now spin up instances to experiment and try new things rather than having to provision a new database instance. This allows us to troubleshoot issues more quickly and to experiment with newer versions and database technologies, such as Amazon Aurora.

Learn how FINRA overcame security concerns with our migration to AWS in the second part of this blog series.