What is ETL?
Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse. ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML). Through data analytics, you can then address specific business intelligence needs, such as predicting the outcome of business decisions, generating reports and dashboards, and reducing operational inefficiency.
Why is ETL important?
Organizations today have both structured and unstructured data from various sources including:
- Customer data from online payment and customer relationship management (CRM) systems
- Inventory and operations data from vendor systems
- Sensor data from Internet of Things (IoT) devices
- Marketing data from social media and customer feedback
- Employee data from internal human resources systems
By applying the process of extract, transform, and load (ETL), individual raw datasets can be prepared in a format and structure that is more consumable for analytics purposes, resulting in more meaningful insights. For example, online retailers can analyze data from points of sale to forecast demand and manage inventory. Marketing teams can integrate CRM data with customer feedback on social media to study consumer behavior.
How does ETL benefit business intelligence?
Extract, transform, and load (ETL) improves business intelligence and analytics by making the process more reliable, accurate, detailed, and efficient.
Historical context
ETL gives deep historical context to the organization’s data. An enterprise can combine legacy data with data from new platforms and applications. You can view older datasets alongside more recent information, which gives you a long-term view of data.
Consolidated data view
ETL provides a consolidated view of data for in-depth analysis and reporting. Managing multiple datasets demands time and coordination and can result in inefficiencies and delays. ETL combines databases and various forms of data into a single, unified view. The data integration process improves the data quality and saves the time required to move, categorize, or standardize data. This makes it easier to analyze, visualize, and make sense of large datasets.
Accurate data analysis
ETL gives more accurate data analysis to meet compliance and regulatory standards. You can integrate ETL tools with data quality tools to profile, audit, and clean data, ensuring that the data is trustworthy.
Task automation
ETL automates repeatable data processing tasks for efficient analysis. ETL tools automate the data migration process, and you can set them up to integrate data changes periodically or even at runtime. As a result, data engineers can spend more time innovating and less time managing tedious tasks like moving and formatting data.
How has ETL evolved?
Extract, transform, and load (ETL) originated with the emergence of relational databases that stored data in the form of tables for analysis. Early ETL tools attempted to convert data from transactional data formats to relational data formats for analysis.
Raw data was typically stored in transactional databases that supported many read and write requests but did not lend themselves well to analytics. You can think of each transaction as a row in a spreadsheet. For example, in an ecommerce system, the transactional database stored the purchased item, customer details, and order details in one transaction. Over a year, it accumulated a long list of transactions with repeat entries for the same customer who purchased multiple items during the year. Given the data duplication, it became cumbersome to analyze the most popular items or purchase trends in that year.
To overcome this issue, ETL tools automatically converted this transactional data into relational data with interconnected tables. Analysts could use queries to identify relationships between the tables, in addition to patterns and trends.
As ETL technology evolved, both data types and data sources increased exponentially. Cloud technology emerged to create vast databases (also called data sinks). Such data sinks can receive data from multiple sources and have underlying hardware resources that can scale over time. ETL tools have also become more sophisticated and can work with modern data sinks. They can convert data from legacy data formats to modern data formats. Examples of modern databases follow.
A data warehouse is a central repository that can store multiple databases. Within each database, you can organize your data into tables and columns that describe the data types in the table. The data warehouse software works across multiple types of storage hardware—such as solid state drives (SSDs), hard drives, and other cloud storage—to optimize your data processing.
With a data lake, you can store your structured and unstructured data in one centralized repository and at any scale. You can store data as is without having to first structure it based on questions you might have in the future. Data lakes also allow you to run different types of analytics on your data, like SQL queries, big data analytics, full-text search, real-time analytics, and machine learning (ML) to guide better decisions.
How does ETL work?
Extract, transform, and load (ETL) works by moving data from the source system to the destination system at periodic intervals. The ETL process works in three steps:
- Extract the relevant data from the source database
- Transform the data so that it is better suited for analytics
- Load the data into the target database
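The three steps above can be sketched as a tiny pipeline. This is a minimal illustration, not a production design: the source is a hard-coded list, the target is an in-memory SQLite database standing in for a data warehouse, and the function names, table name, and sample rows are all made up for the example.

```python
import sqlite3

def extract(source_rows):
    """Copy raw rows out of the source (here, just a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Normalize customer names and drop rows without an order amount."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append((row["customer"].strip().title(), float(row["amount"])))
    return cleaned

def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

# Hypothetical raw data with inconsistent formatting and a missing value.
source = [
    {"customer": "  jane doe ", "amount": "20.50"},
    {"customer": "JOHN ROE", "amount": None},
    {"customer": "jane doe", "amount": "4.50"},
]

warehouse = sqlite3.connect(":memory:")  # stand-in for the target database
load(warehouse, transform(extract(source)))

total = warehouse.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = 'Jane Doe'"
).fetchone()[0]
```

After the run, both of Jane Doe's orders sit under one consistently formatted name, so a single query can total them.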
What is data extraction?
In data extraction, extract, transform, and load (ETL) tools extract or copy raw data from multiple sources and store it in a staging area. A staging area (or landing zone) is an intermediate storage area for temporarily storing extracted data. Data staging areas are often transient, meaning their contents are erased after data extraction is complete. However, the staging area might also retain a data archive for troubleshooting purposes.
How frequently the system sends data from the data source to the target data store depends on the underlying change data capture mechanism. Data extraction commonly happens in one of the following three ways.
In update notification, the source system notifies you when a data record changes. You can then run the extraction process for that change. Most databases and web applications provide update mechanisms to support this data integration method.
In incremental extraction, the data source can't provide update notifications but can identify and extract data that has been modified over a given time period. In this case, the system checks for changes at periodic intervals, such as once a week, once a month, or at the end of a campaign. You only need to extract data that has changed.
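As a rough sketch of this periodic check, the snippet below keeps only the records modified since the previous run. It assumes each source record carries a modification timestamp; the `updated_at` field and the `extract_changed` helper are hypothetical names chosen for the example.

```python
from datetime import datetime

def extract_changed(records, since):
    """Return only the records modified after the previous extraction time."""
    return [r for r in records if r["updated_at"] > since]

# Hypothetical source records with a modification timestamp.
records = [
    {"id": 1, "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "updated_at": datetime(2024, 3, 1)},
]

last_extract = datetime(2024, 2, 1)  # when the previous run finished
changed = extract_changed(records, last_extract)
changed_ids = [r["id"] for r in changed]
```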
In full extraction, the source system can't identify data changes or give notifications, so reloading all data is the only option. This extraction method requires you to keep a copy of the last extract to check which records are new. Because this approach involves high data transfer volumes, we recommend you use it only for small tables.
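One way to compare a fresh full extract with the saved copy of the previous one is sketched below. It assumes both extracts fit in memory and are keyed by a record ID; the `diff_extracts` helper is illustrative, and reporting changed and deleted records goes slightly beyond the "new records" check described above.

```python
def diff_extracts(previous, current):
    """Compare two full extracts, each a dict keyed by record ID."""
    new = {k: v for k, v in current.items() if k not in previous}
    changed = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    deleted = [k for k in previous if k not in current]
    return new, changed, deleted

# Hypothetical snapshots of a small table.
previous = {1: {"name": "Jane"}, 2: {"name": "John"}}
current = {1: {"name": "Jane"}, 2: {"name": "Jon"}, 3: {"name": "Ana"}}

new, changed, deleted = diff_extracts(previous, current)
```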
What is data transformation?
In data transformation, extract, transform, and load (ETL) tools transform and consolidate the raw data in the staging area to prepare it for the target data warehouse. The data transformation phase can involve the following types of data changes.
Basic data transformation
Basic transformations improve data quality by removing errors, emptying data fields, or simplifying data. Examples of these transformations follow.
Data cleansing removes errors and maps source data to the target data format. For example, you can map empty data fields to the number 0, map the data value “Parent” to “P,” or map “Child” to “C.”
Deduplication in data cleansing identifies and removes duplicate records.
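The cleansing and deduplication rules above might look like the following in code. This is a sketch under simple assumptions: rows are flat dictionaries with hashable values, and the field names (`role`, `balance`) are invented for the example.

```python
ROLE_CODES = {"Parent": "P", "Child": "C"}

def cleanse(row):
    """Map role names to short codes and empty balances to 0."""
    return {
        "id": row["id"],
        "role": ROLE_CODES.get(row["role"], row["role"]),
        "balance": 0 if row["balance"] in ("", None) else row["balance"],
    }

def deduplicate(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical raw rows: two of them describe the same record.
raw = [
    {"id": 1, "role": "Parent", "balance": ""},
    {"id": 2, "role": "Child", "balance": 40},
    {"id": 1, "role": "Parent", "balance": None},
]

clean = deduplicate([cleanse(r) for r in raw])
```

Cleansing runs first on purpose: the two records for ID 1 only become identical, and therefore detectable as duplicates, once their empty balances are both mapped to 0.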
Data format revision
Format revision converts data, such as character sets, measurement units, and date/time values, into a consistent format. For example, a food company might have different recipe databases with ingredients measured in kilograms and pounds. An ETL tool can convert everything to a single unit, such as pounds.
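A format revision step for the recipe example could be sketched like this, together with a date normalization. The helper names are hypothetical, and the choice of pounds and ISO 8601 dates as target formats is only an assumption for illustration.

```python
from datetime import datetime

LB_PER_KG = 2.20462262  # pounds per kilogram

def to_pounds(quantity, unit):
    """Convert a quantity in kilograms or pounds to pounds."""
    if unit == "lb":
        return quantity
    if unit == "kg":
        return quantity * LB_PER_KG
    raise ValueError(f"unsupported unit: {unit}")

def to_iso_date(text, source_format):
    """Parse a date in a known source format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()

flour_lb = to_pounds(2, "kg")
order_date = to_iso_date("15/03/2024", "%d/%m/%Y")
```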
Advanced data transformation
Advanced transformations use business rules to optimize the data for easier analysis. Examples of these transformations follow.
Derivation applies business rules to your data to calculate new values from existing values. For example, you can convert revenue to profit by subtracting expenses, or calculate the total cost of a purchase by multiplying the price of each item by the number of items ordered.
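Both derivations are simple arithmetic; a sketch follows. The function names and figures are made up, and `Decimal` is used here only because it avoids binary floating-point rounding in currency values.

```python
from decimal import Decimal

def profit(revenue, expenses):
    """Derive profit from revenue and expenses."""
    return revenue - expenses

def total_cost(unit_price, quantity):
    """Derive the total cost of a line item from its price and quantity."""
    return unit_price * quantity

quarter_profit = profit(Decimal("1200.00"), Decimal("850.25"))
line_total = total_cost(Decimal("19.99"), 3)
```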
In data preparation, joining links the same data from different data sources. For example, you can find the total purchase cost of one item by adding the purchase value from different vendors and storing only the final total in the target system.
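The vendor example can be sketched as follows, assuming each vendor feed is a list of rows that share an `item` key. The feed layout and the `join_vendor_totals` helper are hypothetical.

```python
from collections import defaultdict

def join_vendor_totals(*vendor_feeds):
    """Link rows for the same item across feeds and keep only the total."""
    totals = defaultdict(int)
    for feed in vendor_feeds:
        for row in feed:
            totals[row["item"]] += row["purchase_value"]
    return dict(totals)

# Hypothetical feeds from two vendors that both supply "widget".
vendor_a = [
    {"item": "widget", "purchase_value": 120},
    {"item": "gear", "purchase_value": 30},
]
vendor_b = [{"item": "widget", "purchase_value": 80}]

item_totals = join_vendor_totals(vendor_a, vendor_b)
```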
In splitting, you divide a column or data attribute into multiple columns in the target system. For example, if the data source saves the customer name as “Jane John Doe,” you can split it into a first, middle, and last name.
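A naive version of the name split is sketched below. It assumes names are separated by whitespace, with the first token as the first name and the last token as the last name; real-world names often break that assumption, so treat it as an illustration only.

```python
def split_name(full_name):
    """Split a full name into first, middle, and last name columns."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),  # empty when there is no middle name
        "last": parts[-1] if len(parts) > 1 else "",
    }

customer = split_name("Jane John Doe")
```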
Summarization improves data quality by reducing a large number of data values into a smaller dataset. For example, customer order invoice values can have many different small amounts. You can summarize the data by adding them up over a given period to build a customer lifetime value (CLV) metric.
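Summing invoices per customer over a period might look like the sketch below. The invoice layout is an assumption, amounts are stored in cents to keep the arithmetic exact, and a real CLV metric would usually involve more than a plain sum.

```python
from collections import defaultdict
from datetime import date

def summarize_invoices(invoices, start, end):
    """Total the invoice amounts per customer within [start, end]."""
    totals = defaultdict(int)
    for invoice in invoices:
        if start <= invoice["date"] <= end:
            totals[invoice["customer_id"]] += invoice["amount_cents"]
    return dict(totals)

# Hypothetical invoices: many small amounts per customer.
invoices = [
    {"customer_id": "c1", "date": date(2024, 1, 10), "amount_cents": 1250},
    {"customer_id": "c1", "date": date(2024, 6, 2), "amount_cents": 799},
    {"customer_id": "c2", "date": date(2024, 3, 15), "amount_cents": 4000},
    {"customer_id": "c1", "date": date(2023, 12, 30), "amount_cents": 500},
]

yearly_totals = summarize_invoices(invoices, date(2024, 1, 1), date(2024, 12, 31))
```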
Encryption protects sensitive data so that you comply with data laws and data privacy requirements. You encrypt the data before it streams to the target database.
What is data loading?
In data loading, extract, transform, and load (ETL) tools move the transformed data from the staging area into the target data warehouse. For most organizations that use ETL, the process is automated, well defined, continual, and batch driven. Two methods for loading data follow.
In full load, all data from the source is transformed and moved to the data warehouse. The full load usually takes place the first time you load data from a source system into the data warehouse.
In incremental load, the ETL tool loads the delta (or difference) between target and source systems at regular intervals. It stores the last extract date so that only records added after this date are loaded. There are two ways to implement incremental load.
Streaming incremental load
If you have small data volumes, you can stream continual changes over data pipelines to the target data warehouse. When the speed of data increases to millions of events per second, you can use event stream processing to monitor and process the data streams to make more timely decisions.
Batch incremental load
If you have large data volumes, you can collect data changes into batches and load them periodically. During this set period of time, no actions can happen to either the source or target system as data is synchronized.
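The incremental load described above, where the tool stores the last extract date and loads only newer records, can be sketched as one batch run. This example uses an in-memory SQLite database as the target and assumes source rows carry an ISO 8601 `created_at` string, which sorts correctly as text. The table names and the `incremental_load` helper are hypothetical.

```python
import sqlite3

def incremental_load(target, source_rows):
    """Load only rows newer than the stored last extract date."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, created_at TEXT)"
    )
    target.execute("CREATE TABLE IF NOT EXISTS etl_state (last_extract TEXT)")

    # MAX() over an empty table yields NULL, so fall back to an empty string.
    row = target.execute("SELECT MAX(last_extract) FROM etl_state").fetchone()
    watermark = row[0] or ""

    delta = [r for r in source_rows if r[1] > watermark]
    target.executemany("INSERT INTO orders VALUES (?, ?)", delta)
    if delta:
        newest = max(r[1] for r in delta)
        target.execute("INSERT INTO etl_state VALUES (?)", (newest,))
    target.commit()
    return len(delta)

target = sqlite3.connect(":memory:")
source = [(1, "2024-01-05"), (2, "2024-02-10")]
first_run = incremental_load(target, source)   # initial run loads everything

source.append((3, "2024-03-01"))               # one new record arrives
second_run = incremental_load(target, source)  # only the delta is loaded
```

Storing the watermark in the target keeps the load state next to the data it describes. Note that a rule based on creation date picks up added records only, not later edits to existing ones.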
What is ELT?
Extract, load, and transform (ELT) is an extension of extract, transform, and load (ETL) that reverses the order of operations. You can load data directly into the target system before processing it. The intermediate staging area is not required because the target data warehouse has data mapping capabilities within it. ELT has become more popular with the adoption of cloud infrastructure, which gives target databases the processing power they need for transformations.
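To make the reversed order concrete, the sketch below loads raw rows into the target first and then transforms them with SQL inside the target. An in-memory SQLite database stands in for a cloud data warehouse, and the table names and sample rows are made up.

```python
import sqlite3

target = sqlite3.connect(":memory:")  # stand-in for the target warehouse

# Load: raw data goes into the target as is, with no staging area.
target.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
target.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("  jane doe ", "20.50"), ("JOHN ROE", ""), ("jane doe", "4.50")],
)

# Transform: the target's own SQL engine cleans and types the data.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount <> ''
    """
)

rows = target.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
```

The raw table stays in the target, so you can define further transformations later without extracting the data again.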
ETL compared to ELT
ELT works well for high-volume, unstructured datasets that require frequent loading. It is also ideal for big data because the planning for analytics can be done after data extraction and storage. It leaves the bulk of transformations for the analytics stage and focuses on loading minimally processed raw data into the data warehouse.
The ETL process requires more definition at the beginning. Analytics needs to be involved from the start to define target data types, structures, and relationships. Data scientists mainly use ETL to load legacy databases into the warehouse, while ELT has become the norm today.
What is data virtualization?
Data virtualization uses a software abstraction layer to create an integrated data view without physically extracting, transforming, or loading the data. Organizations use this functionality as a virtual unified data repository without the expense and complexity of building and managing separate platforms for source and target. While you can use data virtualization alongside extract, transform, and load (ETL), it is increasingly seen as an alternative to ETL and other physical data integration methods. For example, you can use AWS Glue Elastic Views to quickly create a virtual table—a materialized view—from multiple different source data stores.
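The idea of an integrated view without copying data can be illustrated on a very small scale with a SQL view that spans two separate databases. This sketch uses SQLite's `ATTACH DATABASE` and a temporary view as a stand-in for a virtualization layer; it does not show AWS Glue Elastic Views, and the database, table, and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two independent source databases, attached side by side.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Jane Doe')")
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)", [(1, 10.0), (1, 15.0)]
)

# The view stores only a query; no rows are copied out of either source.
conn.execute(
    """
    CREATE TEMP VIEW customer_spend AS
    SELECT c.name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.id, c.name
    """
)

result = conn.execute("SELECT name, total FROM customer_spend").fetchall()
```

Because the view is evaluated at query time, it always reflects the current state of both sources. That is the trade-off against a materialized copy, which is faster to read but has to be refreshed.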
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easier for analytics users to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development.
- You can discover and connect to 80+ diverse data stores.
- You can manage your data in a centralized data catalog.
- Data engineers, ETL developers, data analysts, and business users can use AWS Glue Studio to create, run, and monitor ETL pipelines to load data into data lakes.
- AWS Glue Studio offers Visual ETL, Notebook, and code editor interfaces, so users have tools appropriate to their skillsets.
- With Interactive Sessions, data engineers can explore data as well as author and test jobs using their preferred IDE or notebook.
- AWS Glue is serverless and automatically scales on demand, so you can focus on gaining insights from petabyte-scale data without managing infrastructure.
Get started with AWS Glue by creating an AWS account.