Reference deployment

Databricks on AWS

A collaborative workspace for data science, machine learning, and analytics

Databricks is a unified data-analytics platform for data engineering, machine learning, and collaborative data science. A Databricks workspace is a software-as-a-service (SaaS) environment for accessing all your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders and provides access to data and computational resources, such as clusters and jobs.

The Databricks platform helps cross-functional teams communicate securely. You can stay focused on your data science, data analytics, and data engineering tasks while Databricks manages many of the backend services.

This Quick Start is for IT infrastructure architects, administrators, and DevOps professionals who want to use the Databricks API to create Databricks workspaces on the Amazon Web Services (AWS) Cloud. This Quick Start creates a new workspace in your AWS account and sets up the environment for deploying more workspaces in the future.

IMPORTANT: This AWS Quick Start deployment requires that your Databricks account be on the E2 version of the platform. For questions about your Databricks account, contact your Databricks representative.


This Quick Start was created by Databricks in collaboration with AWS. Databricks is an AWS Partner.

  •  What you'll build
  •  How to deploy
  •  Cost and licenses

What you'll build

The Quick Start sets up the following, which constitutes the Databricks workspace:

    • A highly available architecture that spans at least three Availability Zones.
    • A Databricks-managed or customer-managed virtual private cloud (VPC) in the customer's AWS account. This VPC is configured with private subnets and a public subnet, according to AWS best practices, to provide you with your own virtual network on AWS.
    • In the private subnets:
      • Databricks clusters of Amazon Elastic Compute Cloud (Amazon EC2) instances.
      • One or more security groups to enable secure cluster connectivity.
    • In the public subnet:
      • A network address translation (NAT) gateway to allow outbound internet access.
    • Amazon CloudWatch for the Databricks workspace instance logs.
    • (Optional) A customer-managed AWS Key Management Service (AWS KMS) key to encrypt notebooks.
    • An Amazon Simple Storage Service (Amazon S3) bucket to store objects such as cluster logs, notebook revisions, and job results.
    • AWS Security Token Service (AWS STS) for requesting temporary, limited-privilege credentials to authenticate users.
    • A VPC endpoint for access to S3 artifacts and logs.
    • A cross-account AWS Identity and Access Management (IAM) role to enable Databricks to deploy clusters in the VPC for the new workspace. Depending on the deployment option you choose, you either create this IAM role during deployment or use an existing IAM role.
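The cross-account IAM role mentioned above works because its trust policy lets Databricks's AWS account assume the role, scoped by an external ID. The sketch below is illustrative only: the Quick Start template generates the real policy. Databricks publicly documents 414351767826 as its AWS account ID for E2 deployments, and the `ExternalId` placeholder stands in for your own Databricks account ID, which guards against the confused-deputy problem.

```python
import json

# Illustrative trust policy for the cross-account role; the actual policy is
# generated by the Quick Start template.
DATABRICKS_AWS_ACCOUNT = "414351767826"        # Databricks's documented AWS account ID for E2
DATABRICKS_ACCOUNT_ID = "your-databricks-account-id"  # placeholder for your account ID

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Allow Databricks's AWS account to assume this role...
            "Principal": {"AWS": f"arn:aws:iam::{DATABRICKS_AWS_ACCOUNT}:root"},
            "Action": "sts:AssumeRole",
            # ...but only when it presents your Databricks account ID as the
            # external ID, preventing other Databricks customers' workspaces
            # from assuming your role.
            "Condition": {
                "StringEquals": {"sts:ExternalId": DATABRICKS_ACCOUNT_ID}
            },
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

The role's permissions policy (not shown) then grants the EC2 and VPC actions Databricks needs to launch clusters in the workspace VPC.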
How to deploy

To deploy Databricks, follow the instructions in the deployment guide. Databricks needs access to a cross-account IAM role in your AWS account to launch clusters into the VPC of the new workspace. The deployment process, which takes about 15 minutes, includes these steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com, and sign in to your account.
    2. Launch the Quick Start, choosing whether to create a new cross-account IAM role during deployment or use an existing one.

    Amazon may share user-deployment information with the AWS Partner that collaborated with AWS on the Quick Start.  
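Launching the Quick Start amounts to creating an AWS CloudFormation stack from its template. The sketch below shows what that looks like with boto3, under stated assumptions: the template URL and parameter keys are placeholders, not the Quick Start's actual identifiers; take the real values from the deployment guide for the option you chose.

```python
# Illustrative sketch of launching the Quick Start as a CloudFormation stack.
# The TemplateURL and ParameterKey values are placeholders; use the real ones
# from the deployment guide.
STACK_REQUEST = {
    "StackName": "databricks-workspace",
    "TemplateURL": "https://example-bucket.s3.amazonaws.com/databricks-workspace.template.yaml",  # placeholder
    "Parameters": [
        {"ParameterKey": "AccountId", "ParameterValue": "your-databricks-account-id"},  # placeholder
        {"ParameterKey": "Username", "ParameterValue": "you@example.com"},              # placeholder
    ],
    # The template creates IAM roles, so CloudFormation requires this capability.
    "Capabilities": ["CAPABILITY_NAMED_IAM"],
}


def launch_workspace_stack(request=STACK_REQUEST):
    """Create the stack (requires AWS credentials and the boto3 package)."""
    import boto3  # imported lazily so the sketch can be read without boto3 installed

    return boto3.client("cloudformation").create_stack(**request)


# launch_workspace_stack()  # uncomment to deploy; completes in about 15 minutes
```

The AWS Management Console launch links in the deployment guide perform the same stack creation with the template and parameters pre-filled.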

Cost and licenses

You are responsible for the cost of the AWS services used while running this Quick Start. There is no additional cost for using the Quick Start.

    The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of the settings, such as the instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use. Prices are subject to change.

    Tip: After you deploy the Quick Start, enable the AWS Cost and Usage Report to deliver billing metrics to an Amazon S3 bucket in your account. It provides cost estimates based on usage throughout each month and aggregates the data at the end of the month. For more information, see What are AWS Cost and Usage Reports?

    For Databricks cost estimates, see the Databricks pricing page for product tiers and features.

    To launch the Quick Start, you need the following:

    • An AWS account.
    • An account ID for a Databricks account on the E2 version of the platform. If you have questions, contact your Databricks representative.
    • A Databricks user name and password.