AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue simplifies and automates the difficult and time-consuming data discovery, conversion, mapping, and job scheduling tasks. AWS Glue guides you through the process of moving your data with an easy-to-use console that helps you understand your data sources, prepare the data for analytics, and load it reliably from data sources to destinations.
AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. AWS Glue automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git, and your favorite integrated development environment (IDE), and share them with other AWS Glue users. AWS Glue schedules your ETL jobs and provisions and scales all the infrastructure required so your ETL jobs run quickly and efficiently at any scale. There are no servers to manage, and you pay only for resources consumed by your ETL jobs.
For the latest information about service availability, sign up here and we will keep you updated via email.
Step 1. Build Your Data Catalog
First, you use the AWS Management Console to register your data sources with AWS Glue. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more. You can also add your own classifiers or choose classifiers from the AWS Glue community to add to your crawls.
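Beyond the console, crawlers can also be registered programmatically. The sketch below builds the request for the boto3 Glue `create_crawler` call; the crawler name, IAM role ARN, catalog database, and S3 path are all hypothetical placeholders, not values from this article.

```python
def crawler_request(name, role_arn, s3_path, database):
    """Build keyword arguments for glue.create_crawler().

    The crawler scans the S3 path with Glue's built-in classifiers
    (JSON, CSV, Parquet, and so on) and writes the inferred table
    schemas into the given Data Catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler assumes
        "DatabaseName": database,         # catalog database for the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

# With boto3 (not run here; requires AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   glue.create_crawler(**crawler_request(
#       "sales-data-crawler",
#       "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#       "s3://my-bucket/sales/", "sales_db"))
#   glue.start_crawler(Name="sales-data-crawler")
```

Once the crawler finishes, the inferred tables appear in the Data Catalog alongside any you registered through the console.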
Step 2. Generate and Edit Transformations
Next, select a data source and target, and AWS Glue will generate Python code to extract data from the source, transform the data to match the target schema, and load it into the target. The auto-generated code handles common error cases, such as bad data or hardware failures. You can edit this code using your favorite IDE and test it with your own sample data. You can also browse code shared by other AWS Glue users and pull it into your jobs.
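After you have edited and tested the generated script, you register it as a job so Glue can run it. A minimal sketch of the boto3 `create_job` request, assuming a hypothetical job name, IAM role, and script location in S3:

```python
def job_request(name, role_arn, script_location):
    """Build keyword arguments for glue.create_job().

    The script at script_location is the Python code Glue generated
    (possibly edited in your IDE); "glueetl" selects Glue's Spark-based
    ETL runtime.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
        },
    }

# With boto3 (not run here; requires AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   glue.create_job(**job_request(
#       "sales-to-redshift",
#       "arn:aws:iam::123456789012:role/GlueJobRole",
#       "s3://my-bucket/scripts/sales_to_redshift.py"))
```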
Step 3. Schedule and Run Your Jobs
Lastly, you can use AWS Glue’s flexible scheduler to run your flows on a recurring basis, in response to triggers, or even in response to AWS Lambda events. AWS Glue automatically distributes your ETL jobs on Apache Spark nodes, so that your ETL run times remain consistent as data volume grows. AWS Glue coordinates the execution of your jobs in the right sequence, and automatically retries failed jobs. AWS Glue elastically scales the infrastructure required to complete your jobs on time and minimize costs.
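A recurring schedule can be expressed as a Glue trigger. The sketch below builds the request for the boto3 `create_trigger` call using a cron expression; the trigger name, job name, and schedule are hypothetical.

```python
def schedule_trigger_request(name, job_name, cron):
    """Build keyword arguments for glue.create_trigger().

    Creates a SCHEDULED trigger that starts job_name on the given
    cron expression (Glue uses the six-field AWS cron syntax).
    """
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": "cron({})".format(cron),
        "Actions": [{"JobName": job_name}],
    }

# With boto3 (not run here; requires AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   # Hypothetical schedule: run nightly at 02:00 UTC
#   glue.create_trigger(**schedule_trigger_request(
#       "nightly-sales", "sales-to-redshift", "0 2 * * ? *"))
#   glue.start_trigger(Name="nightly-sales")
```

Triggers can also be chained so that one job's completion starts the next, which is how Glue sequences multi-job flows.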
That’s it! Once the ETL jobs are in production, AWS Glue helps you track changes to metadata such as schema definitions and data formats, so you can keep your ETL jobs up to date.
AWS re:Invent is the largest gathering of the global AWS community. The conference allows you to gain a deeper knowledge of AWS services and learn best practices. We announced AWS Glue at re:Invent 2016. Watch the sessions below to learn more about AWS Glue and other related analytics, or check out the entire big data breakout sessions playlist.
AWS Glue is a fully managed ETL service that makes it easy to understand your data sources, prepare the data for analytics, and load it reliably to your data stores. In this session, we introduce AWS Glue, provide an overview of its components, and discuss how you can use the service to simplify and automate your ETL process. We also talk about when you can try out the service and how to sign up for a preview.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.