AWS Big Data Blog

Build a Data Lake Foundation with AWS Glue and Amazon S3

A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data—structured and unstructured—in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema.

Many organizations understand the benefits of using Amazon S3 as their data lake. For example, Amazon S3 is a highly durable, cost-effective object store that supports open data formats while decoupling storage from compute, and it works with all the AWS analytic services. Although Amazon S3 provides the foundation of a data lake, you can add other services to tailor the data lake to your business needs. For more information about building data lakes on AWS, see What is a Data Lake?

Because one of the main challenges of using a data lake is finding the data and understanding the schema and data format, Amazon recently introduced AWS Glue. AWS Glue significantly reduces the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. AWS Glue automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.

This post walks you through the process of using AWS Glue to crawl your data on Amazon S3 and build a metadata store that can be used with other AWS offerings.

AWS Glue features

AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.

Because AWS Glue is integrated with Amazon S3, Amazon RDS, Amazon Athena, Amazon Redshift, and Amazon Redshift Spectrum—the core components of a modern data architecture—it works seamlessly to orchestrate the movement and management of your data.

The AWS Glue Data Catalog is compatible with Apache Hive Metastore and supports popular tools such as Hive, Presto, Apache Spark, and Apache Pig. It also integrates directly with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

In addition, the AWS Glue Data Catalog features the following extensions for ease-of-use and data-management functionality:

  • Discover data with search
  • Identify and parse files with classification
  • Manage changing schemas with versioning

For more information, see the AWS Glue product details.

Amazon S3 data lake

AWS Glue is an essential component of an Amazon S3 data lake, providing the data catalog and transformation services for modern data analytics.

In the preceding figure, data is staged for different analytic use cases. Initially, the data is ingested in its raw format, which is the immutable copy of the data. The data is then transformed and enriched to make it more valuable for each use case. In this example, the raw CSV files are transformed into Apache Parquet for use by Amazon Athena to improve performance and reduce cost.

The data can also be enriched by blending it with other datasets to provide additional insights. An AWS Glue crawler creates a table for each stage of the data based on a job trigger or a predefined schedule. In this example, an AWS Lambda function is used to trigger the ETL process every time a new file is added to the Raw Data S3 bucket. The tables can be used by Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data at any stage using standard SQL or Apache Hive. This configuration is a popular design pattern that delivers agile business intelligence, deriving business value from a variety of data quickly and easily.
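The Lambda trigger described above can be sketched in a few lines. This is a hedged example, not the post's actual function: the job name `nytaxi-csv-parquet` matches the job created later in this walkthrough, the event parsing follows the standard S3 event notification shape, and the `--source_path` argument is a hypothetical parameter your ETL script would read.

```python
import json


def extract_s3_object(event):
    """Pull (bucket, key) out of a standard S3 event notification record."""
    record = event["Records"][0]
    s3 = record["s3"]
    return s3["bucket"]["name"], s3["object"]["key"]


def lambda_handler(event, context):
    """Start the Glue ETL job whenever a new file lands in the raw bucket."""
    bucket, key = extract_s3_object(event)
    # boto3 is imported lazily so the pure helper above can be exercised
    # without the AWS SDK installed.
    import boto3

    glue = boto3.client("glue")
    # "nytaxi-csv-parquet" is the job created later in this walkthrough;
    # "--source_path" is a hypothetical job argument, not a Glue built-in.
    run = glue.start_job_run(
        JobName="nytaxi-csv-parquet",
        Arguments={"--source_path": f"s3://{bucket}/{key}"},
    )
    return {"statusCode": 200, "body": json.dumps(run["JobRunId"])}
```

Attaching this function to an S3 `ObjectCreated` event notification on the Raw Data bucket produces the trigger-per-file behavior described above.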

Walkthrough

In this walkthrough, you define a database, configure a crawler to explore data in an Amazon S3 bucket, create a table, transform the CSV file into Parquet, create a table for the Parquet data, and query the data with Amazon Athena.

Discover the data

Sign in to the AWS Management Console and open the AWS Glue console. You can find AWS Glue in the Analytics section. AWS Glue is currently available in US East (N. Virginia), US East (Ohio), and US West (Oregon). Additional AWS Regions are added frequently.

The first step to discovering the data is to add a database. A database is a collection of tables.

  1. In the console, choose Add database. In Database name, type nycitytaxi, and choose Create.
  2. Choose Tables in the navigation pane. A table consists of the names of columns, data type definitions, and other metadata about a dataset.
  3. Add a table to the database nycitytaxi. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
  4. To add a crawler, enter the data source: the Amazon S3 path s3://aws-bigdata-blog/artifacts/glue-data-lake/data/. This location contains the data file consisting of all the rides for the green taxis for the month of January 2017.
  5. Choose Next.
  6. For IAM role, choose the default role AWSGlueServiceRoleDefault in the drop-down list.
  7. For Frequency, choose Run on demand. The crawler can be run on demand or set to run on a schedule.
  8. For Database, choose nycitytaxi. It is important to understand how AWS Glue deals with schema changes so that you can select the appropriate method. In this example, the table is updated with any change. For more information about schema changes, see Cataloging Tables with a Crawler in the AWS Glue Developer Guide.
  9. Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now.

    When the crawler has finished, one table has been added.
  10. Choose Tables in the left navigation pane, and then choose data. This screen describes the table, including schema, properties, and other valuable information.
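The crawler configured in the console above can also be defined through the AWS SDK. The sketch below, assuming boto3, separates the request construction from the API calls so the configuration can be inspected on its own; the role and database names match the walkthrough, and the crawler name is illustrative.

```python
def crawler_request(name, role, database, s3_path):
    """Build the request body for glue.create_crawler (run-on-demand, no Schedule)."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update the catalog table in place when the schema changes,
        # matching the "table is updated with any change" behavior above.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def create_and_run_crawler(name, role, database, s3_path):
    """Create the crawler and start it, like choosing "Run it now" in the console."""
    import boto3  # deferred so crawler_request can be exercised offline

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_request(name, role, database, s3_path))
    glue.start_crawler(Name=name)
```

Calling `create_and_run_crawler("nytaxicrawler", "AWSGlueServiceRoleDefault", "nycitytaxi", "s3://aws-bigdata-blog/artifacts/glue-data-lake/data/")` mirrors steps 4 through 9.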

Transform the data from CSV to Parquet format

Now you can configure and run a job to transform the data from CSV to Parquet. Parquet is a columnar format that is well suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.

  1. Under ETL in the left navigation pane, choose Jobs, and then choose Add job.
  2. For the Name, type nytaxi-csv-parquet.
  3. For the IAM role, choose AWSGlueServiceRoleDefault.
  4. For This job runs, choose A proposed script generated by AWS Glue.
  5. Provide a unique Amazon S3 path to store the scripts.
  6. Provide a unique Amazon S3 directory for a temporary directory.
  7. Choose Next.
  8. Choose data as the data source.
  9. Choose Create tables in your data target.
  10. Choose Parquet as the format.
  11. Choose a new location (a new prefix location without any existing objects) to store the results.
  12. Verify the schema mapping, and choose Finish.
  13. View the job. This screen provides a complete view of the job and allows you to edit, save, and run the job. AWS Glue created this script. However, if required, you can create your own.
  14. Choose Save, and then choose Run job.
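The same job can also be defined through the API. The sketch below, assuming boto3, builds the request for glue.create_job; the script and temporary-directory S3 paths are placeholders standing in for the unique locations you chose in steps 5 and 6.

```python
def etl_job_request(name, role, script_path, temp_dir):
    """Build the request body for glue.create_job (a Spark ETL job)."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            # "glueetl" is the Spark ETL job type.
            "Name": "glueetl",
            # The proposed script generated by AWS Glue is stored here.
            "ScriptLocation": script_path,
        },
        # Glue writes intermediate results to this temporary directory.
        "DefaultArguments": {"--TempDir": temp_dir},
    }


def create_etl_job(name, role, script_path, temp_dir):
    import boto3  # deferred so etl_job_request can be exercised offline

    glue = boto3.client("glue")
    glue.create_job(**etl_job_request(name, role, script_path, temp_dir))
    glue.start_job_run(JobName=name)  # equivalent to "Run job" in the console
```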

Add the Parquet table and crawler

When the job has finished, add a new table for the Parquet data using a crawler.

  1. For Crawler name, type nytaxiparquet.
  2. Choose S3 as the Data store.
  3. Include the Amazon S3 path chosen in the ETL job as the data target.
  4. For the IAM role, choose AWSGlueServiceRoleDefault.
  5. For Database, choose nycitytaxi.
  6. For Frequency, choose Run on demand.

After the crawler has finished, there are two tables in the nycitytaxi database: a table for the raw CSV data and a table for the transformed Parquet data.
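You can confirm both tables programmatically with the Data Catalog API. A small sketch assuming boto3; the pure helper extracts table names from a glue.get_tables response page so it can be checked offline.

```python
def table_names(get_tables_page):
    """Extract sorted table names from one glue.get_tables response page."""
    return sorted(table["Name"] for table in get_tables_page["TableList"])


def list_catalog_tables(database):
    """List every table in a Data Catalog database, following pagination."""
    import boto3  # deferred so table_names can be exercised offline

    glue = boto3.client("glue")
    names = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        names.extend(table_names(page))
    return sorted(names)
```

After both crawlers have run, `list_catalog_tables("nycitytaxi")` should return the raw CSV table and the transformed Parquet table.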

Analyze the data with Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is capable of querying CSV data. However, the Parquet file format significantly reduces the time and cost of querying the data. For more information, see the blog post Analyzing Data in Amazon S3 using Amazon Athena.

To use AWS Glue with Amazon Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. For more information about upgrading your Athena data catalog, see this step-by-step guide.

  1. Open the AWS Management Console for Athena. The Query Editor displays both tables in the nycitytaxi database.

You can query the data using standard SQL.

  1. Choose the nytaxigreenparquet table.
  2. Type Select * From "nycitytaxi"."data" limit 10;
  3. Choose Run Query.
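The same query can be issued through the Athena API. A minimal sketch assuming boto3; the results output location is a placeholder you would replace with your own S3 bucket.

```python
def athena_query_request(sql, database, output_location):
    """Build the request body for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location (placeholder path).
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    import boto3  # deferred so athena_query_request can be exercised offline

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        **athena_query_request(sql, database, output_location)
    )
    return resp["QueryExecutionId"]
```

For example, `run_query('SELECT * FROM "nycitytaxi"."data" LIMIT 10;', "nycitytaxi", "s3://my-bucket/athena-results/")` submits the same query as the console steps above.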

Conclusion

This post demonstrates how easy it is to build the foundation of a data lake using AWS Glue and Amazon S3. By using AWS Glue to crawl your data on Amazon S3 and build an Apache Hive-compatible metadata store, you can use the metadata across the AWS analytic services and popular Hadoop ecosystem tools. This combination of AWS services is powerful and easy to use, allowing you to get to business insights faster.

If you have questions or suggestions, please comment below.

Additional reading

See the following blog posts for more information:

About the author

Gordon Heinrich is a Solutions Architect working with global systems integrators. He works with our partners and customers to provide them architectural guidance for building data lakes and using AWS analytic services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.