Let’s Architect! Designing systems for batch data processing
When adding AI into products, you need to design and implement robust data pipelines to build datasets and generate reports for your business. But data pipelines for batch processing present common challenges: you have to guarantee data quality so that downstream systems receive good data, you need orchestrators to coordinate different big data jobs, and the architecture must scale to process terabytes of data.
With this edition of Let’s Architect!, we’ll cover key concepts to keep in mind when working in data engineering. Most of them come directly from the principles of system design and software engineering. We’ll show you how to extend beyond the basics to ensure you can handle datasets of any size — including for training AI models.
In software engineering, building robust and stable applications tends to correlate directly with overall organizational performance. Data engineering and machine learning add extra complexity: teams not only have to manage software, but also datasets, data and training pipelines, and models.
The data community is incorporating the core engineering best practices found in software communities, but there is still room for improvement. This video covers ways to leverage software engineering practices for data engineering and demonstrates how measuring key performance metrics can help build more robust and reliable data pipelines. You will learn from the direct experience of engineering teams and understand how they built their mental models.
Data quality is a fundamental requirement for data pipelines: it makes sure the downstream data consumers can run successfully and produce the expected output. For example, machine learning models are subject to garbage-in, garbage-out effects. If we train a model on a corrupted dataset, the model learns from inaccurate or incomplete data and may produce incorrect predictions that impact your business.
Checking data quality is fundamental to make sure the jobs in our pipeline produce the right output. Deequ is a library built on top of Apache Spark that defines “unit tests for data” to find errors early, before the data gets fed to consuming systems or machine learning algorithms. Check it out on GitHub. To find out more, read Test data quality at scale with Deequ.
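To make the “unit tests for data” idea concrete, here is a minimal, Spark-free sketch of the kind of checks Deequ runs. The column names and thresholds are illustrative assumptions, not Deequ’s actual API (which runs as Spark jobs, with a Python wrapper available as PyDeequ):

```python
# Simplified sketch of Deequ-style "unit tests for data".
# Real Deequ executes checks like these at scale on Apache Spark;
# the column names here (review_id, star_rating) are made up.

def completeness(rows, column):
    """Fraction of rows where `column` is present and non-null."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)

def is_unique(rows, column):
    """True if every non-null value in `column` appears exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))

def run_checks(rows):
    """Run all checks up front, before data reaches downstream consumers."""
    return {
        "review_id is complete": completeness(rows, "review_id") == 1.0,
        "review_id is unique": is_unique(rows, "review_id"),
        "star_rating in [1, 5]": all(
            1 <= r["star_rating"] <= 5
            for r in rows if r.get("star_rating") is not None
        ),
    }

rows = [
    {"review_id": "a1", "star_rating": 5},
    {"review_id": "a2", "star_rating": 3},
    {"review_id": "a3", "star_rating": None},
]
print(run_checks(rows))
```

If any check fails, the pipeline can stop the job and alert the team instead of silently feeding bad data to a model — the “find errors early” behavior that Deequ provides out of the box.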
Big data pipelines are often built on frameworks like Apache Spark for transforming and joining datasets for machine learning. This session explains Amazon EMR, a managed service to run compute jobs at scale on managed clusters, an excellent fit for running Apache Spark in production.
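The core pattern of such Spark jobs is joining datasets and aggregating the result. Here is a toy, single-machine sketch of that pattern in plain Python (table and field names are invented for illustration); on Amazon EMR, Spark distributes the same logic across a cluster:

```python
# Toy, single-machine sketch of the join-and-aggregate pattern a Spark
# batch job performs at scale. The datasets and fields are made up.

from collections import defaultdict

users = [
    {"user_id": 1, "country": "IT"},
    {"user_id": 2, "country": "US"},
]
events = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 5.0},
    {"user_id": 2, "amount": 7.5},
]

# Inner join on user_id -- what Spark expresses as users.join(events, "user_id").
by_user = {u["user_id"]: u for u in users}
joined = [{**by_user[e["user_id"]], **e}
          for e in events if e["user_id"] in by_user]

# Aggregate total amount per country -- like groupBy("country").sum("amount").
totals = defaultdict(float)
for row in joined:
    totals[row["country"]] += row["amount"]

print(dict(totals))  # {'IT': 15.0, 'US': 7.5}
```

The value of Spark on EMR is that this same join-and-aggregate logic keeps working when the lists become terabyte-scale tables partitioned across many machines.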
In this session, you’ll discover how to process over 250 billion events from broker-dealers and over 1.7 trillion events from exchanges within 4 hours. FINRA shares how they designed their system to improve the SLA for data processing and how they optimized their platform for cost and performance.
Apache Airflow is an open-source workflow management platform for data engineering pipelines: you can define your workflows as a sequence of tasks and let the framework orchestrate their execution.
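The essence of that model is a directed acyclic graph (DAG) of tasks with explicit dependencies. The following is a minimal sketch of the idea in plain Python, not Airflow’s actual API (task names are illustrative, and real Airflow adds scheduling, retries, sensors, and distributed execution on top):

```python
# Minimal sketch of the idea behind an Airflow DAG: tasks plus explicit
# dependencies, executed in dependency (topological) order. In Airflow
# this chain would be declared as extract >> transform >> load.

def extract():
    return "raw data"

def transform():
    return "clean data"

def load():
    return "loaded"

# Each task name maps to (callable, set of upstream tasks that must finish first).
dag = {
    "extract": (extract, set()),
    "transform": (transform, {"extract"}),
    "load": (load, {"transform"}),
}

def run(dag):
    """Execute tasks whose upstream dependencies are all done."""
    done, order = set(), []
    while len(done) < len(dag):
        ready = [name for name, (_, deps) in dag.items()
                 if name not in done and deps <= done]
        if not ready:
            raise RuntimeError("cycle detected in DAG")
        for name in ready:
            dag[name][0]()  # execute the task
            done.add(name)
            order.append(name)
    return order

print(run(dag))  # ['extract', 'transform', 'load']
```

Airflow’s orchestrator does this continuously on a schedule, tracks each task’s state, and retries failures — which is exactly the coordination work you don’t want to reimplement inside every pipeline.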
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow in the AWS Cloud. This workshop is a great starting point to learn more about Apache Airflow, understand how you can take advantage of it for your data pipelines, and get hands-on experience running it on AWS.
See you next time!
Thanks for reading! Next time, we’ll talk about stream data processing. To find all the posts from this series, check the Let’s Architect! page.