AWS for Industries
Building a scalable data science platform at EDF with Amazon SageMaker
EDF Energy (EDF) is one of the United Kingdom’s leading energy retailers, supplying over 3.5 million homes and businesses with low-carbon electricity and gas. EDF is on a mission to help Britain achieve net zero while being the United Kingdom’s biggest generator of zero-carbon electricity, with a mix of solar, wind, and nuclear generators.
The data science team at EDF develops data science products to serve customers better and to accelerate EDF’s net-zero mission. At a time of high energy costs, it has become increasingly important to use data to help customers, particularly those who are most vulnerable. One important project recently completed by the data science team involved developing a clustering algorithm to detect segments of financially vulnerable customers using smart-meter prepayment data, such as the amount and frequency of top-ups. This model has been used to help EDF proactively reach out to some of the most vulnerable customers to offer help and support.
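A simple illustration of this kind of segmentation is a k-means clustering over per-customer top-up features. The sketch below is illustrative only, assuming a pandas DataFrame of prepayment features with hypothetical column names; it is not EDF's production algorithm.

```python
# A minimal sketch of clustering customers by prepayment behaviour.
# Column names ("avg_topup_amount", "topups_per_month") are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def segment_customers(features: pd.DataFrame, n_segments: int = 5) -> pd.Series:
    """Assign each customer to a behavioural segment via k-means."""
    # Scale features so amount and frequency contribute comparably.
    X = StandardScaler().fit_transform(
        features[["avg_topup_amount", "topups_per_month"]]
    )
    model = KMeans(n_clusters=n_segments, n_init=10, random_state=42)
    return pd.Series(model.fit_predict(X), index=features.index, name="segment")
```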
The challenge
With successful trials complete, the data science team turned to deploying the model to production so that it could be integrated into key business processes. After spending several weeks developing a deployment solution on existing infrastructure, the team realized it needed a new approach if it was to scale up its model deployment capabilities.
A cross-functional product team consisting of DevOps engineers, MLOps engineers, data engineers, and data scientists was mobilized to develop a new scalable platform for machine learning operations (MLOps) at EDF. Because EDF was already using Amazon Web Services (AWS) for much of its wider data platform infrastructure, Amazon SageMaker, which developers use to build, train, and deploy machine learning (ML) models, was an obvious choice of technology for operationalizing ML. The enterprise data lake had also recently completed a migration to Snowflake, an AWS Partner, and the new Snowpark Python API was selected as the ideal tool for big data processing. Developing connectivity to Snowflake through Snowpark in Amazon SageMaker was straightforward, requiring only the creation of a custom image.
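As an illustration of how such connectivity can look, the following sketch creates a Snowpark session from a SageMaker notebook, with credentials held in AWS Secrets Manager. The secret name and configuration keys are hypothetical and will differ from EDF's actual setup.

```python
# A minimal sketch of connecting to Snowflake via Snowpark from SageMaker.
# The secret name and its JSON keys are hypothetical placeholders.
import json
import boto3
from snowflake.snowpark import Session

def snowpark_session(secret_name: str = "snowflake/data-science") -> Session:
    # Pull Snowflake credentials from AWS Secrets Manager rather than
    # hard-coding them in the notebook.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_name)
    creds = json.loads(secret["SecretString"])
    return Session.builder.configs({
        "account": creds["account"],
        "user": creds["user"],
        "password": creds["password"],
        "role": creds["role"],
        "warehouse": creds["warehouse"],
        "database": creds["database"],
        "schema": creds["schema"],
    }).create()
```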
The new product team began with a proof of concept, standing up Amazon SageMaker infrastructure and Snowflake connectivity in a matter of days. Amazon SageMaker's wealth of out-of-the-box MLOps tools then helped the team deploy the existing financial vulnerability detection algorithm in just a few days, a significant improvement on earlier deployment timescales.
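To give a flavor of that deployment workflow, the sketch below uses the SageMaker Python SDK to deploy a trained scikit-learn model as a real-time endpoint. The S3 artifact path, IAM role, and entry-point script are placeholders, not EDF's actual deployment code.

```python
# A hedged sketch of deploying a trained scikit-learn model to a
# SageMaker endpoint; paths and names are illustrative placeholders.
import sagemaker
from sagemaker.sklearn import SKLearnModel

role = sagemaker.get_execution_role()  # IAM role of the current SageMaker session

model = SKLearnModel(
    model_data="s3://my-bucket/models/vulnerability-clustering/model.tar.gz",
    role=role,
    entry_point="inference.py",   # script that loads the model and scores requests
    framework_version="1.2-1",
)

# Stand up a real-time HTTPS endpoint backed by one instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```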
The solution
EDF has since moved on to building a unified MLOps platform using Amazon SageMaker and Snowflake, which allows data scientists to process large volumes of data and develop and deploy new models to production, all using the tools that they are most familiar with, such as Python and Jupyter Notebooks.
The new platform consists of four environments, each comprising an AWS account with Amazon SageMaker provisioned, a Snowflake database with connectivity through Snowpark, and a custom Amazon SageMaker image.
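For illustration, the sketch below registers a custom Studio image (for example, one with snowflake-snowpark-python installed) with SageMaker using boto3. The image name, ECR URI, and role ARN are placeholders.

```python
# A minimal sketch of registering a custom SageMaker Studio image.
# Names, the ECR image URI, and the role ARN are hypothetical.
import boto3

sm = boto3.client("sagemaker")

# Register a logical image and point a version of it at a container in ECR.
sm.create_image(
    ImageName="snowpark-python",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerImageRole",
)
sm.create_image_version(
    ImageName="snowpark-python",
    BaseImage="123456789012.dkr.ecr.eu-west-1.amazonaws.com/snowpark-python:latest",
)

# Describe how the image's kernel should be launched inside Studio.
sm.create_app_image_config(
    AppImageConfigName="snowpark-python-config",
    KernelGatewayImageConfig={
        "KernelSpecs": [{"Name": "python3", "DisplayName": "Python 3 (Snowpark)"}],
    },
)
```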
The first environment is called discovery and is where initial development of new data science products takes place. Work in this environment is performed largely in Jupyter Notebooks using standard data science Python tools, such as pandas and scikit-learn, alongside Snowpark.
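The sketch below illustrates this style of discovery work: the heavy aggregation runs in Snowflake through Snowpark, and only the summarized features are pulled into pandas. Table and column names are hypothetical, and the session helper from the earlier sketch is assumed.

```python
# A sketch of Snowpark-side feature engineering; table and column names
# are hypothetical. Reuses the snowpark_session helper sketched earlier.
from snowflake.snowpark import functions as F

session = snowpark_session()

features = (
    session.table("SMART_METER_TOPUPS")
    .group_by("CUSTOMER_ID")
    .agg(
        F.avg("TOPUP_AMOUNT").alias("AVG_TOPUP_AMOUNT"),
        F.count("TOPUP_AMOUNT").alias("TOPUP_COUNT"),
    )
    .to_pandas()  # heavy lifting stays in Snowflake; pandas gets the summary
)
```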
When a model is ready for production, the work moves into the remaining three environments: three AWS accounts, each with the Amazon SageMaker API provisioned and a corresponding Snowflake database. Here the standard software development lifecycle applies: a development environment provides segregation for ML and data pipeline development, and a preproduction environment is used for testing code prior to release into the final production environment. In these environments, code moves out of Jupyter Notebooks and into scripts, and additional components, including Apache Airflow for orchestration and scheduling, supply the full set of MLOps capabilities.
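As a minimal illustration of that orchestration layer, the following Airflow DAG schedules a daily scoring task. The DAG name and scoring function are hypothetical placeholders rather than EDF's actual pipeline.

```python
# A minimal sketch of an Airflow DAG for a daily scoring pipeline.
# The DAG id and task body are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def score_customers():
    """Placeholder: invoke the SageMaker job or endpoint that scores customers."""
    ...

with DAG(
    dag_id="vulnerability_segmentation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="score_customers", python_callable=score_customers)
```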
Conclusion
Amazon SageMaker has helped EDF build a full-fledged data science platform in a matter of weeks. The new platform has cut the time to production for ML models from months to days and has led to a fourfold uplift in the number of data products projected for 2023. Additionally, the improved data access through the Snowflake integration and the flexible compute provided by Amazon SageMaker Studio notebooks have reduced the time to develop new data science projects from weeks to days.
The data science team is currently using the platform to accelerate the development and deployment of further products that serve EDF's 3.5 million customers and move Britain further toward net zero. One such product, EDF's implementation of the National Grid's Demand Flexibility Service, was built on the platform and helped mitigate the risk of blackouts across the United Kingdom during winter 2022; it continues to help customers reduce their energy use and save money.