AWS Machine Learning Blog

Data processing options for AI/ML

This blog post was reviewed and updated in June 2022 to include new data processing features, such as the Amazon SageMaker Studio and Amazon EMR integration.

Training an accurate machine learning (ML) model requires many different steps, but none are potentially more important than data processing. Examples of processing steps include converting data to the input format expected by the ML algorithm, rescaling and normalizing, cleaning and tokenizing text, and many more. However, data processing at scale involves considerable operational overhead: managing complex infrastructure like processing clusters, writing code to tie all the moving pieces together, and implementing security and governance. Fortunately, AWS provides a wide variety of data processing options to suit every ML workload and teams’ preferred workflows. This set of options expanded even more at AWS re:Invent 2020, so now is the perfect time to examine how to choose between them.

In this post, we review the primary options and provide guidance on how to select one to match your use case and how your team prefers to work with Python, Spark, SQL, and other tools. Although the discussion centers around Amazon SageMaker for ML model building, training, and hosting, it’s equally applicable to workflows where other AWS services are used for these tasks, such as Amazon Personalize or Amazon Comprehend. The main assumption we make is that the decision is being made by those in data science, ML engineering, or MLOps roles. Other important factors in the decision are the team’s experience level and its inclination for writing code and managing infrastructure; less experience and inclination typically map to choosing a more fully managed option instead of a less managed or “roll your own” approach.

Prerequisite: Data lake or Lake House

Before we dive deep into the options, there is a question we must answer: how do we reconcile our chosen option with the preferred technology choices of data engineering teams? Different tools may be suited to different roles; the tools a data scientist may prefer for an ML workflow may have little overlap with the tools used by a data engineer to support analytics workloads such as reporting. The good news is that AWS makes it very easy for these roles to pick their own tools and apply them to their organization’s data without conflict. The key is to create a data lake in Amazon Simple Storage Service (Amazon S3) at the center of the organization’s architecture for all data. This separates data and compute and avoids the problem of each team having individual data silos.

With a data lake in the center of the architecture, data engineering teams can apply their own tools for analytics workloads. At the same time, data science teams can also use their own separate tools to access the same data for ML workloads. Multiple separate processing clusters run by various teams can access the same data, always keeping in mind the need to retain the raw data in Amazon S3 for all teams as a source of truth. Additionally, use of a feature store for transformed data, such as Amazon SageMaker Feature Store, by data science teams helps delineate the boundary with data engineering, as well as provide benefits such as feature discovery, sharing, updating, and reuse.
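As an illustration of that boundary, a data science team could register its transformed features using the SageMaker Python SDK. The following is a minimal sketch, assuming a SageMaker execution role is available; the feature group name, columns, and S3 location are hypothetical placeholders.

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Placeholder DataFrame of engineered features
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "avg_purchase": [42.0, 17.5, 88.1],
    "event_time": [time.time()] * 3,
})

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes an execution role is available

feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)
feature_group.create(
    s3_uri="s3://example-bucket/feature-store/",  # placeholder offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# Wait until the feature group finishes creating before ingesting records
while feature_group.describe().get("FeatureGroupStatus") == "Creating":
    time.sleep(5)

feature_group.ingest(data_frame=df, max_workers=3, wait=True)
```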

As an alternative to a “classic” data lake, the teams might build on top of a Lake House Architecture, an evolution of the concept of a data lake. Featuring support for ACID transactions, this architecture enables multiple users to concurrently insert, delete, and modify rows across tables, while still allowing other users to simultaneously run analytical queries and ML models on the same datasets. AWS Lake Formation recently added new features to support the Lake House Architecture (currently in preview).

Now that we’ve solved the conundrum of enabling data engineering and data science teams to use their separate, preferred tools without conflict, let’s examine the data processing options for ML workloads on AWS.

Options overview

In this post, we review the following processing options. They’re grouped into categories, ranked by the following formula: (user friendliness for data scientists and ML engineers) x (usefulness for ML-specific tasks).

  1. Amazon SageMaker features
  2. Low (or no) code solutions with other AWS services
  3. Spark in Amazon EMR
  4. Self-managed stack with Spark, Python, or R

Keep in mind that these are not mutually exclusive; you can use them in various combinations to suit your team’s preferred workflow. For example, some teams may prefer to use SQL as much as possible, while others may use Spark for some tasks in addition to Python frameworks like Pandas. Another point to consider is that some services have built-in data visualization capabilities, while others do not and require use of other services for visualization. Let’s discuss the specifics of each option.

Amazon SageMaker features

SageMaker is a fully managed service that helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. For data processing and data preparation, there are three capabilities: Amazon SageMaker Data Wrangler, Amazon SageMaker Processing, and a single universal notebook through built-in integration with Amazon EMR.

SageMaker Data Wrangler is a feature of SageMaker, enabled through SageMaker Studio, that makes it easy for data scientists and ML engineers to aggregate and prepare data for ML applications using a visual interface to accelerate data cleansing, exploration, and visualization. It allows you to easily connect to various data sources such as Amazon S3, Amazon Athena, and Amazon Redshift. SageMaker Data Wrangler enables a “no code” workflow where you can apply built-in transformations in lieu of writing code. Additionally, you have the option to write custom transformations in PySpark, Pandas, or SQL. SageMaker Data Wrangler also includes several convenient integrations with other SageMaker features such as SageMaker Clarify for bias detection, SageMaker Feature Store (mentioned above), and SageMaker Pipelines (purpose-built CI/CD for ML). Finally, you can export your entire data preparation workflow into executable Python code.
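As a concrete illustration of a custom transform, Data Wrangler’s Python (Pandas) option exposes the working dataset as a DataFrame named df and passes the modified df to the next step. The snippet below is a minimal sketch with hypothetical column names.

```python
# Custom Pandas transform in SageMaker Data Wrangler: the working dataset is
# provided as a DataFrame named `df`, and the modified `df` flows to the next step.
import pandas as pd

df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")  # coerce bad values to NaN
df = df.dropna(subset=["total_charges"])                                   # drop rows that failed conversion
df["tenure_years"] = df["tenure_months"] / 12.0                            # derive a new feature
```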

While SageMaker Data Wrangler is an interactive, visual data preparation and exploratory data analysis tool, SageMaker Processing lets you easily run your preprocessing, postprocessing, and model evaluation workloads on fully managed infrastructure. It includes prebuilt containers for Spark and Scikit-learn, and offers an easy path for integrating your custom containers for frameworks and languages such as R. Unlike SageMaker Data Wrangler, if you use SageMaker Processing on its own, you need to supply a separate visualization tool; any notebook environment, such as Amazon SageMaker Studio or SageMaker notebook instances, can be used. For a “lift and shift” migration of existing script-based processing workloads to SageMaker, SageMaker Processing may be a good fit.
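For example, a “lift and shift” of an existing scikit-learn preprocessing script might look like the following minimal sketch using the SageMaker Python SDK’s prebuilt scikit-learn container; the script name, S3 paths, and instance settings are placeholders.

```python
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()  # assumes an execution role is available

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# preprocessing.py reads from /opt/ml/processing/input and writes to /opt/ml/processing/output
processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://example-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://example-bucket/processed/")],
)
```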

While SageMaker Processing lets you run your processing scripts at scale and automates your data preparation workflow, the built-in integration with Amazon EMR enables you to do interactive data preparation and machine learning at petabyte scale right within a single universal SageMaker Studio notebook. It also enables you to monitor and debug your Apache Spark jobs running on Amazon EMR from SageMaker Studio notebooks. Additionally, you can discover, connect to, create, terminate, and manage EMR clusters directly from SageMaker Studio. For more details, refer to this blog post.
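Once a Studio notebook is attached to an EMR cluster, data preparation is ordinary PySpark code running on the cluster. The following is a minimal sketch, assuming the notebook session provides the active SparkSession as spark; the S3 paths and column names are placeholders.

```python
# Runs on the attached EMR cluster; `spark` is the active SparkSession in the notebook session
from pyspark.sql import functions as F

raw = spark.read.parquet("s3://example-bucket/raw/transactions/")  # placeholder path

prepared = (
    raw.dropna(subset=["amount"])
       .withColumn("amount_log", F.log1p(F.col("amount")))   # derive a new feature
       .filter(F.col("event_date") >= "2022-01-01")           # keep recent records only
)

prepared.write.mode("overwrite").parquet("s3://example-bucket/prepared/transactions/")
```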

The SageMaker features option is ranked first due to its ease of use for data scientists and ML engineers, and its usefulness for ML-specific tasks; it was built from the ground up specifically to support ML workloads. Additionally, in regard to cost optimization, SageMaker Data Wrangler and SageMaker Processing instances are eligible for Amazon SageMaker Savings Plans. This is a flexible pricing model for SageMaker that may provide substantial discounts in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a one- or three-year term.

Besides SageMaker features, several other options may be useful even though they weren’t developed solely for, or dedicated specifically to, ML workloads. Let’s review those options next.

Low (or no) code

For our purposes, we consider a solution that requires SQL queries to be a low code solution, and one that doesn’t require any code, even SQL, to be a no code solution.

In this category, we review several services outside SageMaker that are serverless: infrastructure details and management are hidden under the hood. This option may offer a relatively fast path to results, but it can also create greater workflow friction, because you may need to switch between multiple services, UIs, and tools to build a complete workflow, and it sacrifices some flexibility and ability to customize.

One low code possibility involves Amazon Athena, a serverless interactive query service, for transforming data using standard SQL queries, in combination with Amazon QuickSight, a serverless BI tool that offers no code, built-in visualizations. When evaluating this powerful combination, consider whether your data transformations can be accomplished with SQL. On the visualization side, an alternative to QuickSight is to use a library such as PyAthena to run SQL queries in SageMaker notebooks and visualize the results there with a declarative visualization library such as Altair (this might be considered “low code+” as it still involves minimal code).
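As a rough sketch of this “low code+” path, the following queries Athena via PyAthena into a Pandas DataFrame and charts the result with Altair; the staging bucket, database, table, and column names are placeholders.

```python
import pandas as pd
import altair as alt
from pyathena import connect

# Athena needs an S3 location to stage query results (placeholder bucket)
conn = connect(
    s3_staging_dir="s3://example-bucket/athena-results/",
    region_name="us-east-1",
)

# SQL does the heavy lifting; only the aggregated result comes back to the notebook
df = pd.read_sql(
    "SELECT product_category, AVG(price) AS avg_price "
    "FROM example_db.orders GROUP BY product_category",
    conn,
)

# Declarative bar chart of the aggregated result
chart = alt.Chart(df).mark_bar().encode(x="product_category", y="avg_price")
chart
```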

Another low code possibility involves AWS Glue, a serverless ETL service that catalogs your data and offers built-in transforms, along with the ability to write custom PySpark code. For visualizations, besides QuickSight, you can attach either SageMaker or Zeppelin notebooks to an AWS Glue development endpoint. In the case where AWS Glue built-in transforms don’t fully cover the desired set of data transforms, choosing between Athena and AWS Glue may depend upon a team’s preference for implementing custom transforms using standard SQL versus PySpark code.
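To illustrate mixing AWS Glue built-in transforms with custom PySpark, here is a minimal sketch of a Glue job script; the catalog database, table, column mappings, and output path are placeholders.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (placeholder database and table)
dyf = glue_context.create_dynamic_frame.from_catalog(database="example_db", table_name="orders")

# Built-in transform: select and cast columns
dyf = ApplyMapping.apply(
    frame=dyf,
    mappings=[("order_id", "string", "order_id", "string"),
              ("price", "double", "price", "double")],
)

# Drop down to plain PySpark for anything the built-in transforms don't cover
sdf = dyf.toDF().filter("price > 0")
dyf = DynamicFrame.fromDF(sdf, glue_context, "prepared")

# Write the prepared data back to S3 as Parquet (placeholder path)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/prepared/orders/"},
    format="parquet",
)
job.commit()
```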

A no code possibility is AWS Glue DataBrew, a visual data preparation tool that enables users to clean and normalize data without writing any code. You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values. The following table compares DataBrew and SageMaker Data Wrangler.

Spark in Amazon EMR

Many organizations use Spark for data processing and other purposes, such as the basis for a data warehouse. In these situations, Spark clusters are typically run in Amazon EMR, a managed service for Hadoop-ecosystem clusters, which reduces the need to do your own setup, tuning, and maintenance. From the perspective of a data scientist or ML engineer, Spark in Amazon EMR may be considered in the following circumstances:

  • Spark is already used for a data warehouse or other application with a persistent cluster. Unlike the other options we described, which only provision transient resources, Amazon EMR also enables creation of persistent clusters to support analytics applications.
  • The team already has a complete end-to-end pipeline in Spark and also the skillset and inclination to run a persistent Spark cluster for the long term. Otherwise, the SageMaker and AWS Glue options for Spark generally are preferable.

Another consideration is the wide range of instance types offered by Amazon EMR, including instances powered by AWS Graviton2 processors. You can also use Amazon EC2 Spot Instances for cost optimization.

For visualization with Amazon EMR, there are several choices. To keep your primary ML workflow within SageMaker, use SageMaker Studio and its built-in SparkMagic kernel to connect to Amazon EMR; you can start to query, analyze, and process data with Spark in a few steps. For added security, you can connect to EMR clusters using Kerberos authentication. Amazon EMR also features other integrations with SageMaker; for example, you can start a SageMaker model training job from a Spark pipeline in Amazon EMR. Another visualization possibility is to use Amazon EMR Studio, which provides access to fully managed Jupyter notebooks and includes the ability to log in via AWS Single Sign-On (AWS SSO). However, EMR Studio lacks the many SageMaker-specific UI integrations of SageMaker Studio.

There are other factors to consider when evaluating this option. Spark is based on the Scala/Java stack, with all the problems that entails in regard to dependency management and JVM issues that may be unfamiliar to data scientists. Also keep in mind that Spark’s PySpark API has often lagged behind its primary API in Scala, which is a language less familiar to data scientists. In this regard, if you prefer the alternative Pandas API-compatible Dask framework for your workloads, you can install Dask on your EMR clusters.
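If Dask fits your team better, its Pandas-compatible API keeps the code familiar. The following is a minimal sketch, assuming Dask (with S3 filesystem support) is installed on the cluster; the S3 path and column names are placeholders.

```python
import dask.dataframe as dd

# Lazily read partitioned Parquet data from S3 (placeholder path)
ddf = dd.read_parquet("s3://example-bucket/raw/events/")

# Pandas-style transformations, executed in parallel across the cluster
daily_counts = (
    ddf[ddf["status"] == "ok"]
    .groupby("event_date")["event_id"]
    .count()
    .compute()  # triggers distributed execution and returns a Pandas Series
)
```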

Self-managed stack using Spark, Python, or R

Another option is to “roll your own” solution by deploying Spark, Python frameworks such as Dask, or R-based code in a self-managed cluster using Amazon Elastic Compute Cloud (Amazon EC2) compute resources, or the container services Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Integration with SageMaker is most conveniently achieved using the Amazon SageMaker Python SDK.  Any machine with AWS Identity and Access Management (IAM) permissions to SageMaker can use the SageMaker Python SDK to invoke SageMaker functionality for model building, training, tuning, deployment, and more.
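For example, once a self-managed cluster has written processed data to Amazon S3, the same machine can launch a SageMaker training job with a few lines of the SageMaker Python SDK. The following is a minimal sketch; the training script, IAM role ARN, and S3 path are placeholders.

```python
from sagemaker.sklearn.estimator import SKLearn

# When running outside SageMaker, pass an IAM role ARN explicitly (placeholder ARN)
role = "arn:aws:iam::111122223333:role/ExampleSageMakerRole"

estimator = SKLearn(
    entry_point="train.py",        # your training script (placeholder)
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    role=role,
)

# Launch a fully managed training job against the processed data in S3
estimator.fit({"train": "s3://example-bucket/processed/train/"})
```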

This option provides the most flexibility to mix and match any data processing tools and frameworks. It also offers access to the widest range of EC2 instance types and storage options. As with Amazon EMR, you can use Spot Instances to optimize data processing costs for this option. Additionally, similarly to the SageMaker Savings Plans mentioned previously, you can use Amazon EC2 Instance Savings Plans or Compute Savings Plans; the latter apply not only to EC2 resources, but also to serverless compute with AWS Lambda and the AWS Fargate serverless compute engine for containers.

However, keep in mind that in regard to user-friendliness for data scientists and ML engineers, this option requires them to manage low-level infrastructure, a task better suited to other roles. Also, with respect to usefulness for ML-specific tasks, although many frameworks such as Spark and other tools can be layered on top of these services to ease management and provide ML-specific functionality, this option is still far less managed than the preceding options. It requires more personnel time to manage, tune, and maintain infrastructure and dependencies, and to write code to fill functionality gaps. As a result, this option is also likely to prove the most costly in the long run and to cause the most workflow friction.

Review and conclusion

Your choice of a data processing option for ML workloads typically depends on your team’s preference for tools (Spark, SQL, Python, etc.) and inclination for writing code and managing infrastructure. The following table summarizes the options across several relevant dimensions. The first column emphasizes that separate services or features may be used for processing and related visualization, and the third column refers to resources used to process data rather than for visualization, which tends to be run on lighter-weight resources.

Workloads evolve over time, and you don’t need to be locked in to one set of tools forever. You can mix and match according to your use case. When you use Amazon S3 at the center of your data lake and the fully managed SageMaker service for core ML workflow steps, it’s easy to switch tools as needed or desired to accommodate the latest technologies. Whichever option you choose now, AWS provides the flexibility to evolve your tool chain to best fit the then-current data processing needs of your ML workloads.


About the Author

Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.