
What is a Data Integration Platform?

Modern organizations create and use data across anywhere from tens to thousands of systems and formats. Data integration is the process of combining data from these different systems and formats and normalizing it to make it more useful. With integrated data, you gain a single, unified view of all your data that you can analyze and report on, so your decisions are better informed.

Companies need consolidated data to support business analytics, customize machine learning models, and power other enterprise applications and business processes. Data integration involves collecting, transforming, and consolidating raw data so that businesses can benefit from its collective form. For example, a bank can combine transaction data, account numbers and types, and customer service records to create a unified view of each customer.

What is the data integration process?

Organizations recognize that data integration leads to more productive workflows. Defining a clear data integration process helps them produce reliable, repeatable outcomes.

1. Identify the different data sources

Identify, either automatically or manually, the multiple sources of data that will need integration. Organizations create and store data across many different types of systems and data formats. For example, an organization might use various kinds of SQL databases, memory caches, and document stores. Applications within the organization might store data in proprietary formats without straightforward external data access.

2. Determine the integration strategy

Examine the relevant data storage and formats, alongside your organizational requirements, to determine the best ways to extract and transform the data into a normalized format. Here are some common data integration strategies:

  • The Extract, Transform, Load (ETL) pattern extracts data from source systems, transforms the data, and loads it into the target system. ETL is a common pattern for data warehouse storage (see the sketch at the end of this step).
  • The Extract, Load, Transform (ELT) pattern extracts data from source systems, loads the data into the target system, and transforms it there. ELT allows the data to remain in raw, unstructured form until you need it for analytics. ELT is a common pattern for data lake storage.
  • Real-time streaming ingestion captures data from event streams as it is produced, for near-real-time data integration.
  • Change data capture (CDC) is the process of discovering changes to data and publishing these changes to an event stream for data ingestion.

In this stage, you will also need to determine the target storage system or data repository, for example, a data warehouse or data lake.
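
To make the ETL pattern concrete, here is a minimal sketch in Python. It uses pandas with an in-memory CSV as the source and a local SQLite database as a stand-in for the target warehouse; all names and values are made up, and in practice the source and target would be your own systems.

```python
import io
import sqlite3
import pandas as pd

# Extract: read raw transaction records from the "source" (an in-memory CSV here).
raw_csv = io.StringIO("account_id,amount,currency\n1,100.0,USD\n1,250.0,USD\n2,75.5,EUR\n")
transactions = pd.read_csv(raw_csv)

# Transform: normalize the data before it reaches the target (the "T" before the "L").
transactions["amount_usd"] = transactions.apply(
    lambda row: row["amount"] * (1.1 if row["currency"] == "EUR" else 1.0), axis=1
)

# Load: write the normalized table into the target store (SQLite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    transactions.to_sql("transactions", conn, if_exists="replace", index=False)
```

Under the ELT pattern, the raw rows would be loaded into the target first and the currency conversion would run later inside the target system.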

3. Design the schema

Outline the data schema, or the schema-less storage type, for the final state of the data. The schema must be extensible, versionable, and aligned with enterprise data storage standards. The new schema should preserve data quality and data accuracy, with corresponding data governance rules for future integration.

4. Extract the data

Determine the best methods for data extraction to minimize disruption to business operations. For example, many organizations use batch extraction after the close of business each day to integrate non-real-time data. Organizations might need to use APIs to extract data for proprietary application integration, or use services such as Amazon AppFlow to transfer data between software as a service (SaaS) applications and the cloud.
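
As an illustration of the SaaS extraction path, the following sketch triggers an existing Amazon AppFlow flow on demand with boto3. The flow name is a placeholder and assumes a flow that copies SaaS records into Amazon S3 has already been configured.

```python
import boto3

# Trigger an on-demand run of a pre-configured AppFlow flow.
# "salesforce-accounts-to-s3" is a hypothetical flow name.
appflow = boto3.client("appflow")
response = appflow.start_flow(flowName="salesforce-accounts-to-s3")
print(response["executionId"])  # identifier for this extraction run
```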

5. Move data to a centralized store

Transport the data to a centralized store. Sometimes the data source and its destination are in different locations, for example, when moving data from on-premises systems to the cloud. Moving data can require extra security measures, additional bandwidth, or attention to data residency requirements.
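
For example, a nightly batch extract might be copied into an Amazon S3-based central store with server-side encryption for data at rest, as in this sketch. The bucket, key, and file names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Copy a local extract into the centralized data lake bucket,
# requesting KMS-managed server-side encryption for data at rest.
s3.upload_file(
    Filename="daily_extract.parquet",
    Bucket="example-central-data-lake",
    Key="raw/sales/2025-01-01/daily_extract.parquet",
    ExtraArgs={"ServerSideEncryption": "aws:kms"},
)
```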

6. Transform the data

Data might need transformation into its final form in the centralized store. Transformation can involve more than a format change, for example, calculating an average from multiple data points.
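
As a small illustration, this sketch computes an average balance per account from several raw data points, a transformation that goes beyond reformatting. The column names and values are made up.

```python
import pandas as pd

# Raw data points collected from the source systems.
points = pd.DataFrame({
    "account_id": [1, 1, 2, 2, 2],
    "balance": [100.0, 120.0, 80.0, 95.0, 110.0],
})

# Transform: aggregate multiple data points into an average per account.
avg_balance = (
    points.groupby("account_id", as_index=False)["balance"]
    .mean()
    .rename(columns={"balance": "avg_balance"})
)
print(avg_balance)
```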

What is serverless data integration?

Businesses are shifting their data workflows from on-premises infrastructure to modern cloud data platforms. Cloud architectures help organizations overcome physical hardware constraints and offer advanced, integrated cloud data analytics services, such as business intelligence and AI.

Serverless is a cloud computing model that provides fully elastic, fault-tolerant cloud services and removes the complexity of server provisioning. Traditionally, when creating a data pipeline, you provision and maintain servers and write the services for data ingestion, transformation, and manipulation. With a serverless data integration product, you get full scalability without the management overhead. Jobs run to completion, and the service remains idle until it is needed again.

Serverless is useful for on-demand data integration jobs, with a pay-per-use model that can help reduce infrastructure costs for companies. 

For example, AWS Glue is a serverless data integration solution. AWS Glue allows you to discover and connect to more than 100 diverse data sources, manage your data in a centralized data catalog, and visually create, run, and monitor data pipelines to load data into your data lakes, data warehouses, and lakehouses. 

With AWS Glue, you can use the appropriate data integration engine for any workload, based on the characteristics of your workload and the preferences of your developers and analysts. AWS Glue jobs can be invoked on a schedule, on demand, or based on an event.
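
For instance, the sketch below starts a Glue job on demand and also registers a scheduled trigger for it with boto3. The job and trigger names are placeholders for resources you would have created already.

```python
import boto3

glue = boto3.client("glue")

# On demand: start a single run of an existing job.
run = glue.start_job_run(JobName="example-etl-job")
print(run["JobRunId"])

# On a schedule: run the same job every night at 02:00 UTC.
glue.create_trigger(
    Name="nightly-example-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "example-etl-job"}],
    StartOnCreation=True,
)
```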

Serverless ETL data integration with AWS Glue

To begin using AWS Glue, open the AWS Glue Studio console. Before you start, set up the necessary IAM policies and roles.

Step 1 - Add table definitions to AWS Glue Data Catalog

Navigate to the Data Catalog. Choose Add table using crawler, then choose the source data stores you want to crawl. The crawler maps your schemas and metadata and creates the corresponding data definitions and tables in the Data Catalog.
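
If you prefer the AWS Glue API over the console, the same step might look like the following sketch. The role ARN, database name, and S3 path are placeholders for your own resources.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans the raw zone of the data lake and writes
# the discovered table definitions into the Data Catalog.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-central-data-lake/raw/sales/"}]},
)

# Run the crawler to populate the Data Catalog.
glue.start_crawler(Name="sales-raw-crawler")
```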

Step 2 - Define your transformation job

Choose ETL jobs from the navigation pane and select Create job using Visual ETL. Add the data source and data target nodes in the visual editor and configure each node. AWS Glue Studio generates code in the Script tab that transforms the data in the source table to match the target table's schema.
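
The generated script is a PySpark job built on the AWS Glue libraries. A trimmed-down sketch of what it typically looks like is shown below; the catalog database, table names, column mappings, and output path are assumptions for illustration.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source node: read from the Data Catalog table created by the crawler.
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_catalog", table_name="raw_sales"
)

# Transform node: map source columns onto the target schema.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "sale_id", "string"), ("amt", "double", "amount", "double")],
)

# Target node: write the transformed data to the curated zone of the data lake.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-central-data-lake/curated/sales/"},
    format="parquet",
)
job.commit()
```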

Step 3 - Run the AWS Glue job

You can set the parameters for the job run, including data governance settings, in the Job details tab. When you have configured the parameters, select Save, and then select Run to start the transformation and integration processes.
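
If you want to drive the same step from code, this sketch starts the run and polls its status with boto3, reusing the placeholder job name from earlier.

```python
import time
import boto3

glue = boto3.client("glue")

# Start the job run, passing an example job parameter.
run_id = glue.start_job_run(
    JobName="example-etl-job",
    Arguments={"--TempDir": "s3://example-central-data-lake/tmp/"},
)["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="example-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
print(state)
```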

Step 4 - Examine the output

In the Visual tab, select the target node and review the data preview to confirm that the node contains the expected data.

For more information, see the AWS Glue User Guide.

What is zero-ETL integration?

Zero-ETL is a set of integrations that minimizes the need to build ETL data pipelines. Typically, you create, configure, and run an ETL pipeline when transferring data from data sources to destinations. With the zero-ETL data integration method, however, the ETL process is automated and handled behind the scenes by the integration itself.

After the initial load of data from the source to the destination, further replication happens automatically every time the data in the source is updated. This zero-ETL process enables near-real-time analytics pipelines.

AWS has multiple services that support zero-ETL, including Amazon Redshift, Amazon RDS for MySQL, Amazon DynamoDB, Amazon DocumentDB, Amazon SageMaker, Amazon CloudWatch, Amazon OpenSearch Service, Amazon Security Lake, and Amazon Aurora.

Zero-ETL integration with Amazon Redshift and Amazon Aurora

Amazon Redshift is a cloud data warehouse that allows businesses to scale their analytics workloads affordably. Meanwhile, Amazon Aurora is a high-performance relational database compatible with MySQL and PostgreSQL. 

Step 1 — Configure the integration source 

Verify that your Amazon Aurora database supports zero-ETL integration with Amazon Redshift. Amazon Redshift supports the following zero-ETL integrations with Amazon Aurora at the time of writing.

  • Amazon Aurora MySQL
  • Amazon Aurora PostgreSQL

For Aurora MySQL sources, configure binary logging to make sure that data changes are captured for replication. Choose encryption options for data at rest and in transit to meet security requirements. Finally, set up the necessary IAM policies and roles to grant permissions for integration with Amazon Redshift.

Amazon Redshift also supports zero-ETL integrations with Amazon RDS for MySQL, Amazon DynamoDB, and applications such as Salesforce, SAP, ServiceNow, and Zendesk.
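
For the Aurora MySQL case, the binary logging change described above is typically made through a DB cluster parameter group, as in the boto3 sketch below. The group name and parameter group family are placeholders, and your engine version may require additional zero-ETL-related parameters beyond binlog_format.

```python
import boto3

rds = boto3.client("rds")

# Create a cluster parameter group for the Aurora MySQL source (placeholder names).
rds.create_db_cluster_parameter_group(
    DBClusterParameterGroupName="aurora-zero-etl-params",
    DBParameterGroupFamily="aurora-mysql8.0",
    Description="Binary logging settings for zero-ETL integration",
)

# Enable row-based binary logging so data changes can be captured for replication.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="aurora-zero-etl-params",
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "ROW",
        "ApplyMethod": "pending-reboot",
    }],
)
```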

Step 2 — Configure the destination

If you don't already have one, launch a new Amazon Redshift cluster with appropriate storage and compute configurations. Make sure that the cluster has the necessary network access and encryption settings. Modify the security groups and VPC settings to allow connectivity between Aurora and Redshift.
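
A minimal sketch of launching an encrypted Amazon Redshift cluster with boto3 follows. The identifiers, node type, credentials, and security group ID are placeholders, and you would still review the network and encryption settings as described above.

```python
import boto3

redshift = boto3.client("redshift")

# Launch a small, encrypted cluster as the zero-ETL destination.
redshift.create_cluster(
    ClusterIdentifier="example-analytics-cluster",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="ExamplePassw0rd!",  # manage credentials with Secrets Manager in practice
    DBName="dev",
    Encrypted=True,
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],
)
```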

Step 3 — Validate the integration

Amazon Redshift performs an initial load of data from Amazon Aurora. Afterward, it automatically monitors the source and replicates updated data in near-real time. You can run queries in Amazon Redshift to verify that the data matches the source.
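
One way to automate such a check is with the Amazon Redshift Data API, as in this sketch that counts rows in a replicated table so you can compare the result with the source. The cluster, database, user, and table names are placeholders.

```python
import time
import boto3

redshift_data = boto3.client("redshift-data")

# Run a row-count query against a table replicated by the zero-ETL integration.
stmt = redshift_data.execute_statement(
    ClusterIdentifier="example-analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT COUNT(*) FROM example_zero_etl_db.public.transactions;",
)

# Wait for the statement to finish, then read the count.
while redshift_data.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
result = redshift_data.get_statement_result(Id=stmt["Id"])
print(result["Records"][0][0]["longValue"])
```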

How can AWS support your data integration needs?

Data integration is key to giving businesses a complete picture of their data from multiple sources, feeding visualizations and advanced analytics. Managing complex integration pipelines across growing unstructured, semi-structured, and structured data sources can be difficult. Cloud data integration helps simplify data management workflows with innovative data integration tools and services, such as serverless and zero-ETL offerings. Explore the AWS services that meet modern data integration needs to get started.