AWS Machine Learning Blog

The tech behind the Bundesliga Match Facts xGoals: How machine learning is driving data-driven insights in soccer

It’s quite common to be watching a soccer match and, when seeing a player score a goal, surmise how difficult scoring that goal was. Your opinions may be further confirmed if you’re watching the match on television and hear the broadcaster exclaim how hard it was for that shot to find the back of the net. Previously, such judgments were made with the naked eye and colored by assumptions about the number of defenders present, where the goalkeeper was, or whether a player was in front of the net or angled to the side. Now, with xGoals (short for “Expected Goals”), one of the Bundesliga Match Facts powered by AWS, it’s possible to put data and insights behind the wow factor, showing the fans the exact probability of a player scoring a goal when shooting from any position on the playing field.

Deutsche Fußball Liga (DFL) is responsible for the organization and marketing of Germany’s professional soccer leagues, Bundesliga and Bundesliga 2. In every match, DFL collects more than 3.6 million data points for deeper insights into what’s happening on the playing field. The vision is to become the most innovative sports league by enhancing the experiences of over 500 million Bundesliga fans and more than 70 media partners around the globe. DFL aims to achieve its vision by using technology in new ways to provide real-time statistics driven by machine learning (ML), build personalized content for the fans, and turn data into insights and action.

xGoals is one of the two new Match Facts (Average Positions being the second) that DFL and AWS officially launched at the end of May 2020, enhancing global fan engagement with Germany’s Bundesliga, the top soccer league in the country, and the league with the highest average number of goals per match. Using Amazon SageMaker, a fully managed service to build, train, and deploy ML models, xGoals can objectively evaluate the goal-scoring chances of Bundesliga players shooting from any position on the playing field. xGoals can also determine whether passing the ball to a player in a better shooting position opened up a better opportunity than if the passer had taken the shot.

xGoals and other Bundesliga Match Facts are setting new standards by providing data-driven insights in the world of soccer.

Quantifying goal-scoring chances

The xGoals Match Facts debuted on May 26, 2020, during the Borussia Dortmund vs. FC Bayern Munich match, which was broadcast in over 200 countries worldwide. In a game where little was given away and everything had to be fought for, FC Bayern Munich’s Joshua Kimmich managed to take a remarkable shot. Given the distance to goal, angle of the strike, number of surrounding players, and other factors, his goal-scoring probability in this specific situation was only 6%.

The xGoals ML model produces a probability figure between 0 and 1, after which the values are displayed as a percentage. For example, an evaluation of ML models trained on Bundesliga matches showed that every penalty kick has an xGoals (or “xG”) value of 0.77, meaning that the goal-scoring probability is 77%. xGoals introduces a value that quantitatively measures how well a player or a team utilizes goal-scoring chances and provides information about their performance.

At the end of each match, an aggregation of xGoals values by both teams is also shown. This way, viewers get an objective metric of goal-scoring chances. The particular match mentioned before had a high probability of a draw, if not for that one successful shot from Kimmich. xGoals can enhance the viewing experience and provide insights in several ways, keeping fans engaged and enabling them to understand the potential of players and teams throughout a match or a season.
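The conversion from model output to displayed percentage, and the per-team aggregation shown at the end of a match, can be sketched in a few lines of Python. This is only an illustration: the function names and the shot list are hypothetical, and apart from the 0.06 and 0.77 values quoted in the text, the numbers are made up.

```python
from collections import defaultdict


def as_percentage(xg: float) -> str:
    """Format a model probability in [0, 1] as a whole-number percentage."""
    if not 0.0 <= xg <= 1.0:
        raise ValueError(f"xG must lie in [0, 1], got {xg}")
    return f"{round(xg * 100)}%"


def aggregate_xg(shots):
    """Sum per-shot xG values by team; shots are (team, xg) pairs."""
    totals = defaultdict(float)
    for team, xg in shots:
        totals[team] += xg
    return dict(totals)


# Hypothetical shots: only 0.06 and 0.77 are values quoted in the article.
shots = [
    ("Team A", 0.06),
    ("Team A", 0.77),
    ("Team B", 0.12),
    ("Team B", 0.31),
]
match_totals = aggregate_xg(shots)
```

Comparing each team's summed xG with the goals it actually scored gives a rough sense of whether the final score reflects the chances created.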

Given the highly dynamic circumstances around scoring attempts prior to a goal, it’s very hard to achieve an xG value above 70%. Player positions constantly change, and players must make split-second decisions with limited information, mostly relying on intuition. Thus, even when a player is positioned close to the goal, the difficulty of scoring can vary significantly depending on the situation. Therefore, it’s important to have a data-driven, holistic view of all the events on the playing field at any given moment. Only then is it possible to make accurate predictions by also taking into account other players’ positions when feeding this information into the xGoals ML model.

It all starts with data

To bring Match Facts to life, several checks and processes happen before, during, and after a match. Various stakeholders are involved in data acquisition, data processing, graphics, content creation (such as TV feed editing), and live commentary. Each one of the Bundesliga soccer stadiums is equipped with up to 20 cameras for automatic optical tracking of player and ball positions. An editorial team processes additional video data and picks the ideal camera angles and scenes to broadcast. This also includes the decision of when exactly to display Match Facts on TV.

Nearly all match events, such as penalty kicks and shots at goals, are documented live and sent to the DFL systems for remote verification. Human annotators categorize and supplement events with additional situation-specific information. For example, they can add player and team assignments and the type of the shot taken (such as blocking or assisting).

Eventually, all the raw match data is ingested into the Bundesliga Match Facts system on AWS to calculate the xGoals values, which are then distributed worldwide for broadcasting.

In the case of the official Bundesliga app and website, Match Facts are continuously displayed on end-user devices as soon as possible. The same applies to other external customers of DFL with third-party digital platforms, which also offer the latest insights and advanced statistics to soccer fans around the globe.

Real-time content distribution and fan engagement are especially important now, because Bundesliga matches are being played in empty stadiums, which has impacted the in-person soccer viewing experience.

Our ML journey: Bringing code to production

DFL’s leadership, management, and developers have been working hand-in-hand with AWS Professional Services Teams through this cloud-adoption journey, enabling ML for an enhanced viewer experience. The mission of AWS Data Science consultants is to accelerate customer business outcomes through the effective use of ML. Customer engagements start with an initial assessment and a closer look at desired outcomes and feasibility from both a business and technical perspective. AWS Professional Services consultants supplement customers’ existing teams with specialized skill sets and industry experience, developing proofs of concept (POCs) and minimum viable products (MVPs), and bringing ML solutions to production. At the same time, continued learning and knowledge transfer drive sustainable and directly attributable business value.

In addition to in-house experimentation and prototyping performed at DFL’s subsidiary Sportec Solutions, a well-established research community is already working on refining the performance and accuracy of xGoals calculations. Combining this domain knowledge with the right tech stack and establishing best practices allows for faster innovation and execution at scale while ensuring operational excellence, security, reliability, performance efficiency, and cost optimization.

Historical soccer match data is the foundation of state-of-the-art ML-based xGoals model-training approaches. We can use this data to train ML models to infer xGoals outcomes based on given conditions on the playing field. For data quality evaluations and initial experimentation, we need to perform exploratory data analysis, data visualization, data transformation, and data validation. This can be done, for example, in Amazon SageMaker notebooks. The next natural step is to move the ML workloads from research to development. Deploying ML models to production requires an interdisciplinary engineering approach involving a combination of data engineering, data science, and software development. Production settings require error handling, failover, and recovery plans. Overall, ML system development and operations (MLOps) necessitates code refactoring, re-engineering and optimization, automation, setting up the foundational cloud infrastructure, implementing DevOps and security patterns, end-to-end testing, monitoring, and proper system design. The goal should always be to automate as many system components as possible to minimize manual intervention and reduce the need for maintenance.

In the next sections, we further explore the tech stack behind the Bundesliga Match Facts powered by AWS and underlying considerations when streamlining the path to bring xGoals to production.

xGoals model training with Amazon SageMaker

Traditional xGoals ML models are based on event data only. This means that only the approximate position of a player and their distance to a goal are taken into account when evaluating goal-scoring chances. In the case of the Bundesliga, shot-at-goal events are combined with additional high-precision positional data obtained with a 25 Hz frame rate. This comes with additional overhead in data cleaning and data preprocessing within the necessary data stream analytics pipeline. However, the benefits of having more accurate results clearly outweigh the necessary engineering effort and complexity introduced. Based on the ball and player positions, which are constantly being tracked, the model can determine an array of additional features, such as the distance of a player to the goal, angle to the goal, a player’s speed, number of defenders in the line of shot, and goalkeeper coverage.
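To make a few of these features concrete, the following Python sketch derives distance to goal, goal angle, player speed, and the number of defenders in the line of shot from raw positions. It is an illustration under stated assumptions rather than the production feature pipeline: the coordinate system (metres, origin at the pitch centre, a 105 m long pitch, attacking the goal at positive x) and all function names are hypothetical. Only the 25 Hz frame rate comes from the text, and 7.32 m is the standard goal width.

```python
import math

GOAL_X = 52.5        # assumed: goal line x-coordinate on a 105 m long pitch
POST_Y = 7.32 / 2    # half of the standard 7.32 m goal width
FRAME_RATE_HZ = 25   # tracking frame rate stated in the article


def distance_to_goal(x, y):
    """Euclidean distance from (x, y) to the centre of the goal line."""
    return math.hypot(GOAL_X - x, y)


def angle_to_goal(x, y):
    """Opening angle (radians) between the two goalposts seen from (x, y)."""
    upper = math.atan2(POST_Y - y, GOAL_X - x)
    lower = math.atan2(-POST_Y - y, GOAL_X - x)
    return abs(upper - lower)


def speed(prev_pos, curr_pos):
    """Speed in m/s from two positions taken from consecutive frames."""
    dx, dy = curr_pos[0] - prev_pos[0], curr_pos[1] - prev_pos[1]
    return math.hypot(dx, dy) * FRAME_RATE_HZ


def _cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def defenders_in_line_of_shot(shooter, defenders):
    """Count defenders inside the triangle shooter / left post / right post."""
    left, right = (GOAL_X, POST_Y), (GOAL_X, -POST_Y)
    count = 0
    for d in defenders:
        signs = (_cross(shooter, left, d), _cross(left, right, d),
                 _cross(right, shooter, d))
        # A point is inside (or on the edge) if the signs do not disagree.
        if not (min(signs) < 0 and max(signs) > 0):
            count += 1
    return count
```

For a shot from the penalty spot, `distance_to_goal(41.5, 0.0)` gives 11 m, and a single defender standing at `(47.0, 0.0)` would be counted as being in the line of shot.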

For xGoals, we used the Amazon SageMaker XGBoost algorithm to train an ML model on over 40,000 historical shots at goals in the Bundesliga since 2017. This can either be performed with the default training script (XGBoost as a built-in algorithm) or extended by adding preprocessing and postprocessing scripts (XGBoost as a framework). The Amazon SageMaker Python SDK makes it easy to perform training programmatically with built-in scaling. It also abstracts away the complexity of resource deployment and management needed for automatic XGBoost hyperparameter optimization. It’s advisable to start developing with small subsets of the available data for faster experimentation and gradually evolve and optimize towards more complex ML models trained on the full dataset.
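As a rough sketch of what such a tuning setup might look like with the built-in XGBoost algorithm, the following assumes version 2.x of the SageMaker Python SDK. The container version, instance type, hyperparameter ranges, job counts, and the `build_tuner` helper are all illustrative assumptions, not the values used for xGoals.

```python
# Static XGBoost settings for a binary goal / no-goal classifier.
# The num_round value is an assumption made for this example.
STATIC_HYPERPARAMETERS = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "num_round": 200,
}


def build_tuner(role_arn, region, output_s3_uri, max_jobs=20, max_parallel_jobs=2):
    """Configure (but do not start) a Bayesian tuning job for built-in XGBoost."""
    # Imported lazily so this module loads without the SageMaker SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.tuner import (
        ContinuousParameter,
        HyperparameterTuner,
        IntegerParameter,
    )

    # Container version "1.2-1" is an assumption; use one available to you.
    image_uri = image_uris.retrieve("xgboost", region, version="1.2-1")
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # assumed instance type
        output_path=output_s3_uri,
        hyperparameters=STATIC_HYPERPARAMETERS,
    )
    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:auc",
        objective_type="Maximize",
        hyperparameter_ranges={  # illustrative search ranges
            "eta": ContinuousParameter(0.01, 0.3),
            "max_depth": IntegerParameter(3, 10),
            "min_child_weight": ContinuousParameter(1, 10),
        },
        strategy="Bayesian",
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )
```

Calling `fit` on the returned tuner with training and validation channels would start the tuning job. Keeping `max_jobs` small and pointing the channels at a subset of the data is one way to follow the advice above about starting small.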

An xGoals training job consists of a binary classification task with Area Under the ROC Curve (AUC) as the objective metric and a highly imbalanced training and validation dataset of shots at goals, which either did or didn’t lead to a goal being scored.
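The following sketch illustrates why AUC is a sensible objective for such imbalanced data. It computes AUC directly from its definition: the probability that a randomly chosen goal receives a higher score than a randomly chosen non-goal. The labels and scores are made up, and the quadratic pairwise loop is for readability only, not for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the share of (goal, non-goal) pairs ranked correctly.

    Ties between a goal and a non-goal count as half a correct pair.
    """
    goals = [s for label, s in zip(labels, scores) if label == 1]
    misses = [s for label, s in zip(labels, scores) if label == 0]
    if not goals or not misses:
        raise ValueError("AUC needs at least one goal and one non-goal")
    correct = sum((g > m) + 0.5 * (g == m) for g in goals for m in misses)
    return correct / (len(goals) * len(misses))


# Made-up, imbalanced example: 8 shots without a goal, 2 goals.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.06, 0.10, 0.12, 0.20, 0.31, 0.40, 0.35, 0.77]

auc = roc_auc(labels, scores)

# A trivial model that always predicts "no goal" is right 80% of the time
# here, which is why plain accuracy says little on imbalanced data.
always_no_goal_accuracy = labels.count(0) / len(labels)
```

In this example, 15 of the 16 goal/non-goal pairs are ranked correctly, so the AUC is 0.9375, whereas the always-negative baseline carries no ranking information at all despite its 80% accuracy.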

Given the various ML model candidates from the Bayesian search-based hyperparameter optimization job, the best-performing one is picked for deployment on an Amazon SageMaker endpoint. Due to differing resource requirements and longevity, ML model training is decoupled from hosting. The endpoint can be invoked from within applications such as AWS Lambda functions or from within Amazon SageMaker notebooks using an API call for real-time inference.
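A Lambda function that calls such an endpoint might look roughly like the sketch below. The endpoint name, the event shape, and the helper names are hypothetical. It assumes the endpoint hosts the built-in XGBoost container, accepts a single CSV row, and returns the predicted probability as plain text.

```python
import json

ENDPOINT_NAME = "xgoals-endpoint"  # hypothetical endpoint name


def predict_xg(features, runtime_client, endpoint_name=ENDPOINT_NAME):
    """Send one feature row as CSV and parse the returned probability."""
    payload = ",".join(str(value) for value in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream; a single prediction is assumed here.
    return float(response["Body"].read().decode("utf-8"))


def lambda_handler(event, context):
    """Hypothetical handler: expects event = {"features": [...]}."""
    import boto3  # imported lazily; provided by the AWS Lambda Python runtime

    client = boto3.client("sagemaker-runtime")
    xg = predict_xg(event["features"], client)
    return {"statusCode": 200, "body": json.dumps({"xg": xg})}
```

Passing the client into `predict_xg` keeps the parsing logic testable without AWS access, because a stub object with an `invoke_endpoint` method can stand in for the real client.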

However, training an ML model using Amazon SageMaker isn’t enough. Other infrastructure components are necessary to handle the full cloud ML pipeline, which consists of data integration, data cleaning, data preprocessing, feature engineering, and ML model training and deployment. In addition, other application-specific cloud components need to be integrated.

xGoals architecture: Serverless ML

Before designing the application architecture, we put a continuous integration and continuous delivery/deployment (CI/CD) pipeline in place. In accordance with the guidelines stated in the AWS Well-Architected Framework whitepaper, we followed a multi-account setup approach for independent development, staging, and production CI/CD pipeline stages. We paired this with an infrastructure as code (IaC) approach to provision these environments and have predictable deployments for each code change. This gives the team segregated environments, shortens release cycles, and facilitates testability of code. After the developer tools were in place, we started to draft the architecture for the application. The following diagram illustrates this architecture.

Data is ingested in two separate ways: AWS Fargate (a serverless compute engine for containers) is used for receiving positional and event data streams, and Amazon API Gateway is used for receiving additional metadata such as team compositions and player names. This incoming data triggers a Lambda function. This Lambda function takes care of a variety of short-lived, one-time tasks such as automatic de-provisioning of idle resources; data preprocessing; simple extract, transform, and load (ETL) jobs; and several data quality tests that occur every time new match data is consumed. We also use Lambda to invoke the Amazon SageMaker endpoint to retrieve the xGoals predictions given a set of input features.
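The data quality tests mentioned above could take many forms. As one hypothetical illustration, the sketch below checks a single tracking frame for a missing ball position and for player coordinates far outside the pitch. The frame layout, pitch dimensions, and tolerance are assumptions made for this example, not the format used by the DFL systems.

```python
PITCH_HALF_LENGTH = 52.5  # assumed 105 m x 68 m pitch, origin at the centre
PITCH_HALF_WIDTH = 34.0
MARGIN = 5.0              # assumed tolerance for players just off the pitch


def frame_issues(frame):
    """Return a list of human-readable problems found in one tracking frame.

    Assumed frame layout: {"ball": (x, y) or None, "players": {id: (x, y)}}.
    """
    issues = []
    if frame.get("ball") is None:
        issues.append("missing ball position")
    for player_id, (x, y) in frame.get("players", {}).items():
        out_of_bounds = (abs(x) > PITCH_HALF_LENGTH + MARGIN
                         or abs(y) > PITCH_HALF_WIDTH + MARGIN)
        if out_of_bounds:
            issues.append(f"player {player_id} out of bounds")
    return issues
```

A Lambda function could run a check like this on each incoming batch and skip or flag frames that report issues before any features are computed.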

We use two databases to store the match states: Amazon DynamoDB, a key-value database, and Amazon DocumentDB (with MongoDB compatibility), a document database. The latter makes it easy to query and index position and event data in JSON format with nested structures. This is especially suitable if workloads require a flexible schema for fast, iterative development. For central storage of official match data, we use Amazon Simple Storage Service (Amazon S3). Amazon S3 stores the historical data from all match days, which is used to iteratively improve the xGoals model. Amazon S3 also stores metadata on model performance, model monitoring, and security metrics.

To monitor the performance of the application, we use an AWS Amplify web application. This gives the operations team and business stakeholders an overview of the system health and status of Match Facts calculations and its underlying cloud infrastructure in the form of a user-friendly dashboard. Such operational insights are important to capture and incorporate in post-match retrospective analyses to ensure continuous improvements of the current system. This dashboard also allows us to collect metrics to measure and evaluate the achievement of desired business outcomes. Continuous monitoring of relevant KPIs, such as overall system load and performance, end-to-end latency, and other non-functional requirements, ensures a holistic view of the current system from both business and technical perspectives.

The xGoals architecture is built in a fully serverless fashion for improved scalability and ease of use. Fully managed services remove the undifferentiated heavy lifting of managing servers and other basic infrastructure components. The architecture allows us to scale dynamically with demand when matches start and release the resources at the end of the game without the need for manual actions, which reduces application costs and operational overhead.

Summary

Since the Bundesliga named AWS as its official technology provider in January 2020, the two have embarked on a journey together to bring advanced analytics to life for soccer fans and broadcasters in over 200 countries. Bundesliga Match Facts powered by AWS helps audiences better understand the strategy involved in decision-making on the pitch. xGoals allows soccer viewers to quantitatively evaluate goal-scoring probabilities based on several conditions on the playing field. Other use cases include aggregations of scoring chances into performance metrics for individual players and goalkeepers, and objective evaluations of whether or not the scoreline in a match is a fair reflection of what took place on the playing field.

AWS Professional Services has been working hand-in-hand with DFL and its subsidiary Sportec Solutions, advancing its digital transformation, accelerating business outcomes, and ensuring continuous innovation. Over the course of the coming seasons, DFL will introduce new Bundesliga Match Facts powered by AWS to keep fans engaged and entertained and to provide them with a world-class soccer viewing experience.

“We at Bundesliga are able to use this advanced technology from AWS, including statistics, analytics, and machine learning, to interpret the data and deliver more in-depth insights and better understanding of the split-second decisions made on the pitch. The use of Bundesliga Match Facts enables viewers to gain deeper insights into the key decisions in each match.”

— Andreas Heyden, Executive Vice President of Digital Innovations for the DFL Group


About the Authors

Marcelo Aberle is a Data Scientist in the AWS Professional Services team, working with customers to accelerate their business outcomes through the use of AI/ML. He was the lead developer of the Bundesliga Match Facts xGoals. He enjoys traveling for extended periods of time and is an avid admirer of minimalist design and architecture.

Mirko Janetzke is the Head of IT Development at Sportec Solutions GmbH, the DFL subsidiary responsible for data gathering, data and statistics systems, and soccer analytics within the DFL group. Mirko loves soccer and has been following the Bundesliga and his home team since he was a young boy. In his spare time, he likes to go hiking in the Bavarian Alps with his family and friends.

Lina Mongrand is a Senior Enterprise Services Manager at AWS Professional Services. Lina focuses on helping Media & Entertainment customers build their cloud strategies and approaches and guiding them through their transformation journeys. She is passionate about emerging technologies such as AI/ML and especially how these can help customers achieve their business outcomes. In her spare time, Lina enjoys mountaineering in the nearby Alps (she lives in Munich) with friends and family.

Luuk Figdor is a Data Scientist in the AWS Professional Services team. He works with clients across industries to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.