Posted On: Jun 16, 2022
Today, we are making it faster and easier to prepare and visualize data using PySpark and Altair with support for code snippets in Amazon SageMaker Data Wrangler. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify data preparation and feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store, Databricks, and Snowflake.
Starting today, you can prepare and visualize data faster using PySpark and Altair code snippets in Amazon SageMaker Data Wrangler. PySpark is the Python interface for Apache Spark. Altair is a declarative statistical visualization library for Python that is based on Vega and Vega-Lite. Previously, data scientists who wanted to write PySpark or Altair code in Data Wrangler had to start from a blank editor or search the internet for code snippets. Now, data scientists writing a custom transform in PySpark can search more than 30 PySpark code snippets for common data processing needs, such as dropping rows, bulk renaming, casting, and reorganizing columns, and filtering text columns for values that include a specific string. In addition, data scientists writing Altair code can search Altair code snippets to create visualizations such as heat maps, binned scatter plots, and filled step charts, all without leaving SageMaker Data Wrangler.
To get started with the new capabilities of Amazon SageMaker Data Wrangler, upgrade to the latest release, open Amazon SageMaker Studio, and choose File > New > Flow from the menu or “new data flow” from the SageMaker Studio launcher. To learn more about the new features, read the blog and view the documentation.