AWS Partner Network (APN) Blog

How ASCENDING Uses Data Schema Unification for Multi-Data Source Ingestion

By Yi Mu, Cloud Data Architect – ASCENDING
By Channa Basavaraja, Partner Solutions Architect – AWS
By Ryo Hang, Solutions Architect – ASCENDING
By Kelvin Yu and Celeste Shao, Data Engineers – ASCENDING


Data ingestion and unification play a key role in ensuring consistency, reliability, and performance of data utilized across any organization. It’s crucial for data teams to have well-designed architecture for the ingestion and unification process that can meet business requirements and provide ready-to-use data to the downstream systems.

There are a few common challenges in a data ingestion and unification solution:

  • Multiple source ingestion
  • Parallel processing
  • Data quality
  • Scalability
  • Streaming and real-time ingestion

Among these challenges, multi-source ingestion is the most common in today's interconnected organizations. Medium and large organizations tend to collaborate with various external data providers, consuming multi-dimensional data to achieve sophisticated analytical capabilities.

This creates a significant challenge for data solution teams: designing and implementing a unified ingestion process so downstream analytics workloads can easily consume the data and realize its value.

In this post, we will discuss how ASCENDING set up a unified data ingestion system on Amazon Web Services (AWS) based on a client’s use case and requirements, and how ASCENDING overcame the technical challenges.

ASCENDING is an AWS Advanced Tier Services Partner and AWS Marketplace Seller with Competencies in DevOps and Data and Analytics. ASCENDING provides cloud migration, DevOps, and application development services to enterprise customers.

Data Integration Options

When it comes to data ingestion and integration methodology, there are two primary models: pull and push.


Figure 1 – Data acquisition pull model.

In the pull model, the pull module is provisioned with configurations and resources that are required by the pull clients. Customized pull clients are implemented for different data providers to suit their needs (various data sources and formats). Each of the pull clients has a scheduled event that performs a periodic request to fetch the data from providers.


Figure 2 – Data acquisition push model.

In the push model, data providers can set up their own schedules to upload data through a push agent. The push agent can handle both batch data via a REST API and live data streams via Amazon Kinesis or Amazon Managed Streaming for Apache Kafka (Amazon MSK).

This model provides more flexibility and reduces compatibility issues on the providers' side. After assessing the client's environment and its connections with data providers, ASCENDING selected the push model, which also avoided the labor cost of third-party library integration.
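To make the push model concrete, here is a minimal sketch of what a provider-side push agent might do when submitting a batch file. The endpoint URL, provider ID, and API key are hypothetical placeholders, not details from the actual CAHSS deployment; the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical ingestion endpoint for illustration only.
API_ENDPOINT = "https://api.example.com/v1/upload"

def build_upload_request(provider_id, api_key, payload):
    """Build (but do not send) an authenticated batch-upload request.

    A real push agent would run this on the provider's own schedule
    and send the request with urllib.request.urlopen().
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_ENDPOINT}/{provider_id}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # e.g. an API Gateway usage-plan key
        },
    )

req = build_upload_request("lab-on", "demo-key", {"records": [{"specimen": "bovine"}]})
print(req.full_url)      # https://api.example.com/v1/upload/lab-on
print(req.get_method())  # POST
```

Because the agent only needs an HTTP client, providers can run it from any operating system or data infrastructure, which is the flexibility the push model is chosen for.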

Solving Data Management Challenges for CAHSS

The Canadian Animal Health Surveillance System (CAHSS) is an initiative of Animal Health Canada (AHC) with broad-based collaborative support from animal agricultural stakeholders and Canadian government bodies. It has been designed to fill the need for strengthened animal health surveillance in Canada.

The stated purpose of CAHSS is “A shared national vision leading to effective, responsive, integrated animal health surveillance in Canada.”

CAHSS embarked on building a data system that could collect laboratory and provincially-inspected abattoir data from different data providers. The new system needed to be cost-effective, secure, and scalable to intake data from various data sources derived from different data infrastructures. The dataset would then be transformed into a consumable format, according to a common schema and visualized as interactive dashboards.

Data Ingestion and Management

  • The client aimed to provide a centralized entry point for external data providers from different organizations to upload animal health surveillance data on a custom schedule with support for various data formats.
  • The client has information sharing agreements in place with multiple partners and needed to meet their security and compliance standards by supporting different authentication methods.
  • The client needed to achieve a cross-platform, self-service, or automatic upload experience and reduce the system maintenance cost.

Data ETL and Schema Unification

  • The client needed an easier, more streamlined way to process different schemas from multiple providers and transform them into a unified format for more efficient downstream analytics.
  • The client preferred an automated extract, transform, load (ETL) process to save both labor and infrastructure costs.

Building a Unified Data Ingestion and ETL Process

When choosing between push and pull models, ASCENDING made the call based on the client’s use case. Their external data providers have different data infrastructures and operating systems.

The pull model was impractical, as the providers don't expose an API or data endpoint that could be polled periodically. For the push model, ASCENDING chose a REST API and worked with AWS Solutions Architects to identify the best-fit solution for the client. The team arrived at a REST API implementation using Amazon API Gateway for ingestion and AWS Glue crawlers for ETL and schema unification.

REST is the most commonly used API style in the world of web services. It supports all data formats across different platforms and integrates seamlessly with the client's existing authentication methods. It provides a uniform interface that allows identification and manipulation of resources through representations.

Amazon API Gateway is a fully managed service that acts as the “front door” for applications to access data from backend services. It handles all of the tasks in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, cross-origin resource sharing (CORS) support, authorization and access control, throttling, monitoring, and API version management.

Amazon API Gateway can also collaborate and integrate with many other AWS services. In the case of CAHSS, ASCENDING implemented access control with AWS Identity and Access Management (IAM) and Amazon Cognito, AWS Glue, and data storage in Amazon Simple Storage Service (Amazon S3).

Using API Gateway and Glue is an AWS best practice that offers several other benefits:

  • Flexible security controls and scalability.
  • Easy monitoring of performance metrics.
  • Cost savings at scale.
  • A simplified data onboarding process.
  • Automatically generated code for transformation and loading.


Figure 3 – Amazon API Gateway overview.

ASCENDING Solution

ASCENDING first built a unified data ingestion process compatible with manual or automatic upload methods for all external partners’ systems. The team implemented REST API through Amazon API Gateway, AWS Lambda, and other services to secure the system.

The API Gateway provides several endpoints (hosted on Amazon CloudFront for low latency and high transfer speed) with different authentication methods for uploading data from each data system. It then saves the raw file into Amazon S3 buckets (using prefixes to create separate storage space for different providers) through a Lambda function.
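One way to implement the per-provider separation described above is to encode the provider and upload date into the S3 object key. The prefix layout below is an illustrative assumption, not the client's actual bucket structure.

```python
from datetime import datetime, timezone

def raw_object_key(provider, filename, when=None):
    """Build an S3 key that isolates each provider under its own prefix.

    Example layout (assumed, not the production scheme):
      raw/<provider>/<year>/<month>/<day>/<filename>
    """
    when = when or datetime.now(timezone.utc)
    return f"raw/{provider}/{when:%Y/%m/%d}/{filename}"

key = raw_object_key("provider-a", "results.csv",
                     datetime(2022, 5, 1, tzinfo=timezone.utc))
print(key)  # raw/provider-a/2022/05/01/results.csv
```

The Lambda function behind the endpoint would then call boto3's `s3.put_object(Bucket=..., Key=key, Body=...)`; date-partitioned prefixes also make the later Glue crawls and incremental loads cheaper to scope.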

Each data provider has its own authentication configuration (for example, API keys and JSON Web Tokens), but in the push model the team streamlined this with an AWS Lambda authorizer that identifies and grants access to client requests, in conjunction with Amazon Cognito user pools.
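A Lambda authorizer returns an IAM policy that tells API Gateway whether to allow the request. The sketch below shows the general shape of such a handler; the header name, key table, and principal IDs are illustrative stand-ins (a production authorizer would validate JWTs against Cognito or look keys up in a store, not use a hard-coded dict).

```python
# Illustrative key-to-provider table; real lookups would hit Cognito or a DB.
VALID_KEYS = {"demo-key-123": "provider-a"}

def authorizer_handler(event, context=None):
    """Return an IAM policy allowing or denying the incoming request."""
    api_key = event.get("headers", {}).get("x-api-key")
    provider = VALID_KEYS.get(api_key)
    effect = "Allow" if provider else "Deny"
    return {
        "principalId": provider or "anonymous",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event.get("methodArn", "*"),
            }],
        },
    }
```

Centralizing authorization in one function like this is what lets each provider keep its own credential scheme while the endpoints stay uniform.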

To walk through the structure, the following diagram shows the high-level architecture. Only two data providers are shown for illustration purposes; the full solution ingests from data providers across 10 Canadian provinces.


Figure 4 – Data ingestion from multiple data providers.

After data ingestion, AWS Glue runs the ETL jobs: crawlers register each provider's schema as tables in the Glue Data Catalog, along with the static lookup tables stored in S3. Different crawl schedules can be created based on providers' use cases.

The ETL job reads the lookup tables from the catalog and transforms the different schemas into a unified format. The processed data is stored in S3 buckets and passes through a second Glue crawler, producing a well-prepared data catalog for further analysis and visualization.
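The core of the unification step is a column-mapping lookup. Here is a minimal sketch of that logic in plain Python; the provider names and column mappings are invented for illustration. In the real solution this kind of mapping runs inside a Glue job (for example via `ApplyMapping` on a DynamicFrame) with the lookup table read from the Data Catalog rather than hard-coded.

```python
# Illustrative lookup table: provider-specific column -> unified schema column.
LOOKUP = {
    "provider-a": {"SampleID": "sample_id", "TestResult": "result"},
    "provider-b": {"specimen_no": "sample_id", "outcome": "result"},
}

def unify(provider, record):
    """Rename a record's columns into the unified schema.

    Columns not present in the mapping pass through unchanged.
    """
    mapping = LOOKUP[provider]
    return {mapping.get(col, col): value for col, value in record.items()}

row = unify("provider-b", {"specimen_no": "S-42", "outcome": "negative"})
print(row)  # {'sample_id': 'S-42', 'result': 'negative'}
```

Keeping the mapping in a catalog-managed lookup table, rather than in code, means onboarding a new provider only requires adding rows to the table.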


Figure 5 – Data ETL and schema unification.

With this approach, the team successfully unified the output schema of all input data and stored it in the S3 bucket in the unified format (Apache Parquet). This process makes the downstream data queries (using Amazon Athena), machine learning-based analytics (using Amazon SageMaker), and visualization (using Amazon QuickSight) more accessible.

In addition, the unified schema and columnar format take less storage space and are faster to read and write. The client also benefits from zero IT infrastructure overhead, as AWS Glue is a fully managed service.

Milestones

Prior to reaching the agreement on the REST API solution, ASCENDING experimented with multiple solutions, such as Amazon Kinesis and AWS Command Line Interface (CLI) on the client side.

Eventually, the team finalized the solution with the client and implemented the data ingestion process. It reduced compatibility issues on the client side and avoided the labor cost of having third parties integrate their systems with client libraries such as the Kinesis agent or CLI scripts.

The ETL and schema unification process ASCENDING implemented helped the client fulfill the goal of high-quality data for downstream analysis and visualization. It also optimized the data structure to save space and handling time, provided better data governance with fine-grained access control, and created greater scalability and extensibility with minimal maintenance cost.

Data Ingestion

There are several alternative options for data upload, such as Amazon Kinesis and CLI. After assessing the application environment of the client’s external data providers, ASCENDING picked a REST API solution.

  • It’s fully compatible with client-side systems and can be integrated with third-party libraries and authentication protocols.
  • AWS Lambda authorizers were used to control access to the API on all endpoints and methods.
  • For more complicated authentication scenarios, refer to this AWS blog post where ASCENDING shared how to implement object-based authorization using Amazon Cognito.

ETL Process

ASCENDING used AWS Glue for ETL, and there are two approaches for complex ETL activities. One option is creating workflows from a Glue blueprint, as different components can be added to the workflows and triggered with Amazon EventBridge.

Another option is using an AWS Glue crawler, which was adopted in this solution to support more customizations.

  • ASCENDING created a pipeline of Glue trigger > Glue crawler > Glue job > Glue crawler.
  • AWS CloudFormation templates were used to create and deploy Glue job with manual triggers.
  • In order to handle incremental data loads, the team used Glue bookmarks to load and process only the new data stream.
  • In the next phase, ASCENDING will enable Glue auto scaling to handle higher workload requirements from more data providers.
  • AWS recommends using AWS Glue DataBrew as it’s code free with over 250 pre-built transformations to automate data preparation tasks. However, in ASCENDING’s solution the choice was made to write a customized Glue crawler script to better suit the client’s needs.
  • Set up the necessary monitoring and alerts around the Glue jobs for cost optimization purposes. Monitoring and alerts can prevent you from paying for unnecessarily long-running Glue jobs, which can be expensive.
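The CloudFormation-deployed Glue job with bookmarks enabled can be sketched roughly as the template fragment below. The resource names, script path, and bucket are placeholders, not the client's actual template.

```yaml
# Illustrative fragment only -- names and S3 paths are placeholders.
Resources:
  UnifySchemaJob:
    Type: AWS::Glue::Job
    Properties:
      Name: unify-schema-job
      Role: !GetAtt GlueJobRole.Arn        # assumes an IAM role defined elsewhere
      GlueVersion: "3.0"
      Command:
        Name: glueetl
        ScriptLocation: s3://my-etl-bucket/scripts/unify_schema.py
      DefaultArguments:
        "--job-bookmark-option": job-bookmark-enable   # process only new data
```

The `--job-bookmark-option` default argument is what turns on the incremental-load behavior described in the bookmarks bullet above.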

Conclusion

In this post, we discussed the challenges of data ingestion and schema unification. We used one of ASCENDING's client use cases to walk through how the team tackled those challenges by combining AWS services such as Amazon API Gateway, AWS Lambda, and AWS Glue.

If you have a unique data analytics challenge in your organization, connect with ASCENDING.

You can also learn more about ASCENDING on AWS Marketplace.



ASCENDING – AWS Partner Spotlight

ASCENDING is an AWS Advanced Tier Services Partner that provides cloud migration, DevOps, and application development services to enterprise customers.

Contact ASCENDING | Partner Overview | AWS Marketplace