Reference Deployment

Predictive Data Science with Amazon SageMaker and a Data Lake on AWS

Store and transform data for building predictive and prescriptive applications

This Quick Start sets up a data lake environment for building, training, and deploying machine learning (ML) models with Amazon SageMaker on the Amazon Web Services (AWS) Cloud. The deployment takes about 10-15 minutes and uses AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon Kinesis Data Streams, and Amazon Kinesis Data Firehose.

Amazon SageMaker is a managed platform that enables developers and data scientists to build, train, and deploy ML models quickly and easily.

This Quick Start is for users who want to build predictive and prescriptive models from their data without having to configure complex ML hardware clusters. It supports end-to-end data science, starting with raw data and ending with a prediction REST API in a production system.

The Quick Start also provides a demo scenario developed by Pariveda Solutions. The demo shows how to store raw data in Amazon S3, transform the data for consumption in Amazon SageMaker, use Amazon SageMaker to build an ML model, and host the model in a prediction API for Amazon Elastic Compute Cloud (Amazon EC2) Spot pricing.


This Quick Start was developed by Pariveda Solutions, Inc., in collaboration with AWS. Pariveda is an AWS Partner Network (APN) Partner.

  •  What you'll build
  • This Quick Start sets up the following:

    • A structured data lake in Amazon S3 to hold the raw, modeled, enhanced, and transformed data.
    • A staging bucket for the feature-engineered and transformed data that will be ingested into Amazon SageMaker.
    • Data transformation code hosted on AWS Lambda to prepare the raw data for consumption and ML model training, and to transform data input and output.
    • Amazon SageMaker automation through Lambda functions to build, manage, and create REST endpoints for new models, based on a schedule or triggered by data changes in the data lake.
    • Amazon API Gateway endpoints to host public APIs that let developers retrieve historical data or predictions for their applications.
    • Amazon Kinesis Data Streams to enable real-time processing of new data across the Ingest, Model, Enhance, and Transform stages.
    • Amazon Kinesis Data Firehose to deliver the results of the Model and Enhance stages to Amazon S3 for durable storage.
    • An Amazon CloudWatch dashboard to provide monitoring of the data transformation, model training, and hosting components for the prediction endpoint.
    • An Amazon SageMaker notebook instance to enable data exploration by using a Jupyter notebook.
    • AWS Identity and Access Management (IAM) to enforce the principle of least privilege on each processing component. The IAM role and policy restrict access to only the resources that are necessary.
    • A demo scenario that builds and updates a predictive model for daily Amazon Elastic Compute Cloud (Amazon EC2) Spot pricing.
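
As a rough illustration of the data transformation component listed above, the following sketch shows how a Lambda function might decode records arriving from a Kinesis data stream and turn them into CSV rows for model training. The record fields (`instance_type`, `availability_zone`, `timestamp`, `spot_price`) and the names `to_feature_row`, `handler`, and `make_event` are assumptions made for this example and are not taken from the Quick Start's code. Only the event layout (base64-encoded payloads under `Records[].kinesis.data`) follows the standard Lambda event format for Kinesis.

```python
import base64
import json
from datetime import datetime


def to_feature_row(record: dict) -> str:
    """Turn one raw price record (hypothetical schema) into a CSV row."""
    ts = datetime.fromisoformat(record["timestamp"])
    fields = [
        record["instance_type"],
        record["availability_zone"],
        str(ts.weekday()),  # day of week, Monday = 0
        str(ts.hour),       # hour of day, 0-23
        f"{float(record['spot_price']):.4f}",
    ]
    return ",".join(fields)


def handler(event, context=None):
    """Lambda-style entry point: decode Kinesis records and return CSV rows."""
    rows = []
    for item in event.get("Records", []):
        # Kinesis payloads reach Lambda as base64-encoded bytes.
        payload = base64.b64decode(item["kinesis"]["data"])
        rows.append(to_feature_row(json.loads(payload)))
    return {"rows": rows}


def make_event(records):
    """Build a minimal Kinesis-shaped event from plain dicts, for local testing."""
    return {
        "Records": [
            {
                "kinesis": {
                    "data": base64.b64encode(
                        json.dumps(r).encode("utf-8")
                    ).decode("ascii")
                }
            }
            for r in records
        ]
    }
```

Calling `handler(make_event([...]))` locally exercises the same decode-and-transform path without any AWS resources. In a real deployment the rows would be written to the staging bucket rather than returned.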
  •  How to deploy
  • You can build your predictive data science environment with Amazon SageMaker and a data lake on AWS in about 10-15 minutes by following a few simple steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Launch the Quick Start.
    3. (Optional) Test the deployment with the demo scenario provided.
    4. (Optional) Train an ML model on your own.
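
If you prefer to script step 2 rather than launch from the console, one possible approach is to call AWS CloudFormation directly. The sketch below only assembles the request for the CloudFormation `CreateStack` operation. The template URL is a placeholder, the parameter key `NotebookInstanceType` is hypothetical, and the `CAPABILITY_IAM` acknowledgment is an assumption based on the IAM roles this Quick Start creates. Check the actual template for the real values.

```python
import json


def build_stack_request(stack_name, template_url, parameters):
    """Assemble keyword arguments for CloudFormation's CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(parameters.items())
        ],
        # Assumption: the stack creates IAM roles, so IAM capabilities
        # must be acknowledged. The real template may need a different value.
        "Capabilities": ["CAPABILITY_IAM"],
    }


if __name__ == "__main__":
    request = build_stack_request(
        stack_name="data-lake-sagemaker-demo",  # any name you choose
        template_url="https://example.com/path/to/quickstart.template",  # placeholder
        parameters={"NotebookInstanceType": "ml.t2.medium"},  # hypothetical key
    )
    print(json.dumps(request, indent=2))
    # With boto3 installed and credentials configured, the request could be sent with:
    #   boto3.client("cloudformation").create_stack(**request)
```

Keeping the request builder separate from the API call makes it easy to review the parameters, and their cost implications, before anything is deployed.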
  •  Cost and licenses
  • You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you will be using. Prices are subject to change.

    Because this Quick Start uses native AWS services, no additional licensing is required.