AWS Startups Blog

Machine Learning on Limit Order Book Data for Learning and Compliance

Guest post by John Macpherson, CEO at BMLL Technologies

There are two key types of market participants: those who are trying to make money from the markets and those who are assigned to police those trying to make money. Examples of the former include investment banks, hedge funds and asset managers, while examples of the latter include in-house compliance teams, financial regulators and exchange surveillance teams.

Since the financial crisis, a tidal wave of legislation has hit financial firms. Barely a day goes by without a news story about a market participant receiving a fine, and the size of these fines means that preventing them in the first place is getting substantial attention within those organizations. At the same time, regulators are encouraging the move of trading volumes to centralized venues and increasing reporting requirements, resulting in even more data being generated. This has created the need to carry out surveillance and compliance based on empirical data science, as opposed to human-led intelligence.

Democratization of Data

Wall Street is driven by data; it is an information processing machine. Leading trading firms owe their performance to their ability to apply AI to big data, which allows them to understand complex market dynamics. Historically, these firms have run significant in-house architectures that are expensive to maintain. They’ve struggled to keep up with the rapid rate of change in the sector, and to a high degree we’re seeing this scenario replicated from firm to firm. Further to that, the high cost of these architectures has created a two-tier society between those that have the ability to access, process and analyze the data and those that do not. Financial regulators fall very much into the latter group.

BMLL Technologies is the first company to build an outsourced version of the data-analytics architectures that Tier 1 firms run in-house. As the industry-wide adoption of cloud services accelerates, we’re looking to capitalize on the cost-cutting trend of firms decomposing their businesses into outsourced “plumbing” while keeping “value-add” services in-house.

Extracting Value from Data

While quantitative analysis skills are prevalent in “front office” roles, surveillance and compliance teams tend to have legal or business backgrounds. This means they’re not able to interact with the data in its raw form, but must rely on derived representations of the data in order to extract value from it. While the data may be large in size and complex in structure, the derived findings must be consumable in an easy-to-access format. As with many machine-learning and big-data services, this means combining data pipelining with RESTful API technology and presenting the findings in web dashboards that use engaging visualization tools.

Surveillance and Compliance

The technology we offer enables individual firms to combine their order-flow and fill data (from sources such as BlackRock Aladdin, ULLink, EMSX and exchange drop-copy) with the order-by-order recording of the market data feed (the “L3 book”) to generate the “L4 book”. This combination takes place inside an individual firm’s AWS account, using a continuously deployed BMLL architecture.
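As a rough illustration of that combination (a minimal sketch, not BMLL's actual implementation), the Python below overlays a firm's own order identifiers onto a stream of order-level market events. The `L3Event` and `L4Event` schemas, their field names and the `build_l4` helper are all hypothetical, and the sketch assumes the firm's order IDs can be matched directly to the IDs in the market data feed, which in practice may require a mapping step.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class L3Event:
    """One order-level market data message (hypothetical schema)."""
    timestamp_ns: int
    order_id: str
    action: str   # e.g. "add", "modify", "cancel" or "fill"
    side: str     # "buy" or "sell"
    price: float
    quantity: int


@dataclass(frozen=True)
class L4Event:
    """An L3 event annotated with whether the firm owns the order."""
    event: L3Event
    is_own_order: bool


def build_l4(l3_events: Iterable[L3Event], own_order_ids: Set[str]) -> List[L4Event]:
    """Tag every market event with ownership, returned in time order."""
    ordered = sorted(l3_events, key=lambda e: e.timestamp_ns)
    return [L4Event(e, e.order_id in own_order_ids) for e in ordered]


# Illustrative data only: two market events, one of which belongs to the firm.
feed = [
    L3Event(2, "B1", "cancel", "buy", 10.0, 100),
    L3Event(1, "A2", "add", "sell", 10.1, 50),
]
l4_book = build_l4(feed, own_order_ids={"A2"})
```

Keeping the original event intact and adding only an ownership flag means downstream analysis can treat the firm's activity and the rest of the market in a single, time-ordered view.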

Pipelining with technologies such as the Apache Airflow scheduler in conjunction with the AWS Batch executor allows BMLL to apply a suite of proprietary pattern recognition algorithms that look for market abuse behavior such as spoofing, wash trades, layering and order-book fade. By using the latest advances in Bayesian mathematics, the solution is able to reduce the Type II errors seen in competitor offerings without reducing positive detection rates through parameter tuning. The output from these algorithms is provisioned on user request to a web front-end.
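To give a flavor of why a Bayesian approach can reduce missed detections, here is a generic textbook sketch. It is not one of BMLL's proprietary algorithms. It combines several weak indicators into one posterior probability using Bayes' rule in log-odds form. The `posterior_probability` function, the prior and the likelihood ratios are illustrative assumptions, and the indicators are assumed to be conditionally independent.

```python
import math
from typing import Sequence


def posterior_probability(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Update a prior probability of abuse with independent evidence.

    Each likelihood ratio is P(signal | abuse) / P(signal | no abuse).
    Posterior odds = prior odds * product of likelihood ratios.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers: a 1% prior and signals that are each 8x more
# likely under abusive behavior than under normal behavior.
one_signal = posterior_probability(0.01, [8.0])
three_signals = posterior_probability(0.01, [8.0, 8.0, 8.0])
```

With these made-up inputs, a single signal leaves the posterior at roughly 0.07, while three signals together raise it to roughly 0.84. A rule that thresholds each signal separately would miss the pattern, whereas accumulating the evidence catches it without loosening any individual threshold.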

In a second example, in the post-MiFID II world, firms are required to provide audit trails showing how their broker algorithms performed during execution, with any aberrations being recorded and flagged. To address this, we provide a suite of broker-specific, parameterized algorithms against which such results can be benchmarked. Further to that, by having a complete set of exchange-specific trade condition codes, the user is able to define expected algorithm behavior in a highly granular fashion.

The Technology

The global financial markets generate approximately 3 PB of raw data per year. We’ve built up several years of historical data for learning purposes, which consists of millions of security identifiers originating from hundreds of trading venues. While each trading venue has its own data structure, which changes several times a year, BMLL maps these structures to a common structure. Associated with each security is a vast and complex array of metadata fields describing the security. L3 data can be considered to be a highly multi-dimensional, time-varying matrix, so our technology stores this data on S3 in a highly sharded format to enable the application to rebuild the data structure on the fly.
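One way to picture a sharded layout of this kind is sketched below, under stated assumptions. The key scheme (venue, then date, then security) and the `shard_key` and `keys_for_range` helpers are hypothetical and are not BMLL's actual S3 layout. The point is only that when each shard has a predictable key, an application can work out exactly which objects it needs to rebuild a slice of the data.

```python
from datetime import date, timedelta
from typing import Iterator


def shard_key(venue: str, security_id: str, day: date) -> str:
    """Build the object key for one venue/day/security shard (assumed layout)."""
    return f"l3/venue={venue}/date={day.isoformat()}/security={security_id}"


def keys_for_range(venue: str, security_id: str,
                   start: date, end: date) -> Iterator[str]:
    """Yield the shard keys covering an inclusive range of calendar days."""
    day = start
    while day <= end:
        yield shard_key(venue, security_id, day)
        day += timedelta(days=1)


# Illustrative identifiers only.
needed = list(keys_for_range("XLON", "SEC123", date(2017, 1, 30), date(2017, 2, 1)))
```

Because the keys are derived rather than looked up, each shard can be fetched and processed independently, which suits the parallel processing described next.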

Due to the size of the data, parallelism is a key concern in our architecture. One of the core technologies we use to solve this problem is Apache Spark through the AWS EMR product. By wrapping the boto3 SDK, we’re able to expose this technology to both our internal data science teams and our customers in a user-friendly manner. Extensive use of the EC2 spot market allows both us and our customers to run analysis at scale in a cost-effective manner.
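A thin wrapper of this sort might look something like the following sketch. It is not BMLL's wrapper: the `spark_cluster_request` function and its defaults (instance type, release label, role names) are assumptions chosen for illustration. It builds the keyword arguments for boto3's EMR `run_job_flow` call, requesting Spark and placing the core nodes on the spot market.

```python
from typing import Any, Dict


def spark_cluster_request(name: str, core_nodes: int,
                          instance_type: str = "m5.xlarge",
                          release_label: str = "emr-6.15.0") -> Dict[str, Any]:
    """Build run_job_flow arguments for a Spark cluster with spot core nodes."""
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": release_label,          # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {   # Keep the master on-demand so the cluster survives spot reclaims.
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {   # Core nodes run on the cheaper EC2 spot market.
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",
                    "InstanceType": instance_type,
                    "InstanceCount": core_nodes,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = spark_cluster_request("l3-analysis", core_nodes=4)
```

With AWS credentials configured, the request could then be submitted with `boto3.client("emr").run_job_flow(**request)`. Separating the request builder from the API call keeps the user-facing interface down to a name and a node count, and lets the configuration be inspected without touching AWS.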

In Summary

Up to now, those involved in policing the financial markets have been out-gunned by those seeking to use them for wealth generation. We see the advent of cloud computing, RESTful API technology and platform-based businesses as game changers, enabling the diffusion of data and its analysis through the financial sector. Our hope is that this will lead to more efficient and transparent economic systems that benefit everyone.