Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation
Today, we’re extremely happy to launch Amazon SageMaker Processing, a new capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure.
Training an accurate machine learning (ML) model requires many different steps, but none is potentially more important than preprocessing your data set, e.g.:
- Converting the data set to the input format expected by the ML algorithm you’re using,
- Transforming existing features to a more expressive representation, such as one-hot encoding categorical features,
- Rescaling or normalizing numerical features,
- Engineering high level features, e.g. replacing mailing addresses with GPS coordinates,
- Cleaning and tokenizing text for natural language processing applications,
- And more!
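To make two of these steps concrete, here is a minimal pure-Python sketch of one-hot encoding and min-max rescaling. The helper names `one_hot_encode` and `min_max_scale` are hypothetical and only serve as illustration; a real job would more likely rely on pandas or scikit-learn.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 flags, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale numerical values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


colors = ["red", "green", "red", "blue"]
encoded, categories = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

ages = [20, 30, 40]
print(min_max_scale(ages))  # [0.0, 0.5, 1.0]
```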
These tasks involve running bespoke scripts on your data set and saving the processed version for later use by your training jobs. As you can guess, running them manually or having to build and scale automation tools is not an exciting prospect for ML teams. The same could be said about postprocessing jobs (filtering, collating, etc.) and model evaluation jobs (scoring models against different test sets).
Solving this problem is why we built Amazon SageMaker Processing. Let me tell you more.
Introducing Amazon SageMaker Processing
Amazon SageMaker Processing introduces a new Python SDK that lets data scientists and ML engineers easily run preprocessing, postprocessing and model evaluation workloads on Amazon SageMaker.
The SDK comes with a built-in container for scikit-learn. If you need something else, you can also use your own Docker images without having to conform to any Docker image specification: this gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS), or even on premises.
How about a quick demo with scikit-learn? Then, I’ll briefly discuss using your own container. Of course, you’ll find complete examples on GitHub.
Preprocessing Data With The Built-In Scikit-Learn Container
Here’s how to use the SageMaker Processing SDK to run your scikit-learn jobs.
First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements.
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_count=1,
                                     instance_type='ml.m5.xlarge')
Then, we can run our preprocessing script (more on this fellow in a minute) with the call shown after this list. Here’s what happens during the job:
- The data set (dataset.csv) is automatically copied inside the container under the destination directory (/opt/ml/processing/input). We could add additional inputs if needed.
- This is where the Python script (preprocessing.py) reads it. Optionally, we could pass command line arguments to the script.
- The script preprocesses the data set, splits it three ways, and saves the files inside the container under /opt/ml/processing/output/train, /opt/ml/processing/output/validation, and /opt/ml/processing/output/test.
- Once the job completes, all outputs are automatically copied to your default SageMaker bucket in S3.
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
             ProcessingOutput(source='/opt/ml/processing/output/validation'),
             ProcessingOutput(source='/opt/ml/processing/output/test')]
)
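As for the optional command line arguments, the script receives them like any other Python program would. Here is a minimal sketch using argparse from the standard library; the `--test-size` option is a hypothetical name chosen for illustration, and the job would then be launched with something like `arguments=['--test-size', '0.3']`.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments forwarded by the processing job to the script."""
    parser = argparse.ArgumentParser(description="Preprocess a data set")
    # Hypothetical option: share of rows held out for the test set.
    parser.add_argument("--test-size", type=float, default=0.2)
    return parser.parse_args(argv)


# Inside the container, parse_args() with no argument would read sys.argv.
# An explicit list is passed here so the sketch runs anywhere.
args = parse_args(["--test-size", "0.3"])
print(args.test_size)  # 0.3
```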
That’s it! Let’s put everything together by looking at the skeleton of the preprocessing script.
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Read data locally
df = pd.read_csv('/opt/ml/processing/input/dataset.csv')

# Preprocess the data set (placeholder: replace with your own transformation function)
downsampled = apply_mad_data_science_skills(df)

# Split data set into training, validation, and test
train, test = train_test_split(downsampled, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)

# Create local output directories if they don't exist yet
for split in ('train', 'validation', 'test'):
    os.makedirs(os.path.join('/opt/ml/processing/output', split), exist_ok=True)

# Save data locally
train.to_csv("/opt/ml/processing/output/train/train.csv")
validation.to_csv("/opt/ml/processing/output/validation/validation.csv")
test.to_csv("/opt/ml/processing/output/test/test.csv")

print('Finished running processing job')
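Note that applying test_size=0.2 twice does not give a 60/20/20 split: the second split takes 20% of the remaining 80%. The sketch below works out the resulting row counts. The `split_sizes` helper is hypothetical, and it assumes the held-out share is rounded up at each step, which is my understanding of how scikit-learn rounds.

```python
import math


def split_sizes(n_rows, test_size=0.2, validation_size=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    # First split: hold out the test set from the full data set.
    n_test = math.ceil(test_size * n_rows)
    remaining = n_rows - n_test
    # Second split: hold out the validation set from what is left.
    n_validation = math.ceil(validation_size * remaining)
    n_train = remaining - n_validation
    return n_train, n_validation, n_test


print(split_sizes(1000))  # (640, 160, 200)
```

In other words, the script above ends up with roughly 64% of the rows for training, 16% for validation, and 20% for testing.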
A quick look at the S3 bucket confirms that files have been successfully processed and saved. Now I could use them directly as input for a SageMaker training job.
$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker-scikit-learn-2019-11-20-13-57-17-805/output
2019-11-20 15:03:22 19967 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/test.csv
2019-11-20 15:03:22 64998 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/train.csv
2019-11-20 15:03:22 18058 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/validation.csv
Now what about using your own container?
Processing Data With Your Own Container
Let’s say you’d like to preprocess text data with the popular spaCy library. Here’s how you could define a vanilla Docker container for it.
FROM python:3.7-slim-buster

# Install spaCy, pandas, and an English language model for spaCy.
RUN pip3 install spacy==2.2.2 && pip3 install pandas==0.25.3
RUN python3 -m spacy download en_core_web_md

# Make sure python doesn't buffer stdout so we get logs ASAP.
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]
Then, you would build the Docker container, test it locally, and push it to Amazon Elastic Container Registry, our managed Docker registry service.
The next step would be to configure a processing job using the ScriptProcessor object, passing the name of the container you built and pushed.
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    image_uri='123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spacy-container:latest',
    # Command used to run the script inside the container
    command=['python3'],
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge')
Finally, you would run the job just like in the previous example.
script_processor.run(
    code='spacy_script.py',
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input_data')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/processed_data')],
    arguments=['tokenizer', 'lemmatizer', 'pos-tagger']
)
The rest of the process is exactly the same as above: copy the input(s) inside the container, copy the output(s) from the container to S3.
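From the script’s point of view, this contract boils down to reading files from one local directory and writing results to another. Here is a minimal sketch of that pattern using only the standard library. The `process_directory` helper and the `text` column are hypothetical, and the lowercase-and-split step is just a stand-in for a real spaCy pipeline. Temporary directories play the role of the container paths so that the sketch can run locally.

```python
import csv
import os
import tempfile


def process_directory(input_dir, output_dir, text_column="text"):
    """Read every CSV file in input_dir and write a processed copy to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(input_dir, name), newline="") as src, \
                open(os.path.join(output_dir, name), "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["tokens"])
            writer.writeheader()
            for row in reader:
                # Stand-in for the spaCy pipeline: lowercase and split on whitespace.
                row["tokens"] = " ".join(row[text_column].lower().split())
                writer.writerow(row)


# Local dry run: temporary directories stand in for
# /opt/ml/processing/input_data and /opt/ml/processing/processed_data.
with tempfile.TemporaryDirectory() as workdir:
    input_dir = os.path.join(workdir, "input_data")
    output_dir = os.path.join(workdir, "processed_data")
    os.makedirs(input_dir)
    with open(os.path.join(input_dir, "dataset.csv"), "w", newline="") as f:
        f.write("id,text\n1,Hello  World\n")
    process_directory(input_dir, output_dir)
    with open(os.path.join(output_dir, "dataset.csv"), newline="") as f:
        result = f.read()

print(result)
```

Keeping the directories as parameters makes the same function easy to test on a laptop before running it inside the container.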
Pretty simple, don’t you think? Again, I focused on preprocessing, but you can run similar jobs for postprocessing and model evaluation. Don’t forget to check out the examples on GitHub.