Migration & Modernization
Overcoming Barriers to Large-scale Data Modernization
Data modernization is the process of updating or improving an organization’s data infrastructure and systems to make data more accessible, usable, and valuable. It involves implementing modern technologies and approaches to optimize data management, storage, processing, and analysis. Data modernization accelerates innovation and decision-making by democratizing and simplifying data access across the enterprise. While organizations are creating, collecting, and storing more data than ever before, much of it remains underutilized or not used at all.
Large volumes of data (terabyte to petabyte scale) are often a barrier to data modernization. This blog focuses on patterns to address this barrier. We will cover topics such as modern data architecture, data ingestion, data processing, and data orchestration in depth to modernize large volumes of data. Beyond dealing with the sheer size of data, modernization also requires overcoming the people, technology, and process issues outlined below.
People: Modernizing data and databases requires expertise in emerging technologies like cloud databases and containerization. However, resistance from people due to familiarity with existing systems and a shortage of skilled personnel can impede the process.
Technology: Modernizing large databases faces challenges like outdated hardware, unsupported software, and complex legacy systems. Integrating modernized databases with existing systems can lead to interoperability issues and data inconsistencies.
Process: Mapping complex data structures and dependencies is crucial for designing a modern database architecture. Dealing with diverse data formats, data access techniques, and mixed workloads adds to the complexity. Ensuring data quality, implementing robust security measures, encryption, access controls, data governance, and compliance are essential when modernizing large sensitive datasets.
Best practices
Leverage AWS purpose-built databases and analytics services to migrate and modernize large data
There isn’t a one-size-fits-all solution. With AWS purpose-built database services, customers can select the most appropriate database for their specific workloads and application needs. The following diagram shows the high-level steps involved in modernizing data using AWS managed purpose-built databases and analytics services. This architecture uses AWS Snowball in combination with AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) to write a large database locally to the device, which helps optimize network bandwidth. The AWS Snowball device is then shipped to AWS, where the data is imported into the target cloud database for modernization.
Steps 1-3 in Figure 1 cover the migration of large-scale data volumes using AWS DMS/SCT and the AWS Snow Family (petabyte scale).
Data Migration
- Data discovery is carried out to identify the required data and the target databases before migration starts. The discovery covers structured, unstructured, historical, transactional, and log data, and data prioritization determines the importance and order of the data to be migrated. The AWS SCT replication agent is installed on the on-premises server to copy the large dataset to AWS Snowball Edge, which can transport data faster than sending it over the internet. The Snowball Edge device is then shipped to AWS, where the data is restored to an Amazon S3 bucket for staging. For database backup files larger than 10 TB, AWS Snowball Edge is recommended for transferring the backup files to the S3 bucket; this minimizes network bandwidth usage and speeds up the data transfer. The S3 bucket then serves as the staging area for the target database. If you are considering Amazon DynamoDB as your target, for example, store large objects such as BLOBs in Amazon S3 and keep only a pointer to them in DynamoDB. DynamoDB has a maximum item size of 400 KB, so objects larger than this limit are stored in S3 while their metadata is stored in DynamoDB tables (a minimal sketch of this pattern follows this list). Details of this architecture are explained in Large object storage strategies for Amazon DynamoDB.
- AWS DMS and AWS SCT are used for ongoing replication and schema conversion. Because your on-premises database remains online during the migration, AWS DMS provides ongoing replication of changes to the target database, while AWS SCT simplifies the migration process in a highly available manner by converting code and schema from one database engine to another (a sketch of creating such a replication task also follows this list). For more information on using Snowball with AWS DMS and AWS SCT for large on-premises database migrations, see Enable large-scale database migrations with AWS DMS and AWS Snowball.
- The database backup file is then imported into the chosen purpose-built database in AWS from the staging S3 bucket.
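The following is a minimal sketch of the large-object pointer pattern described above: the payload itself is stored in Amazon S3, and only a pointer plus lightweight metadata is written to the DynamoDB item. The bucket name, table name, and attribute names are hypothetical placeholders, not values from the reference architecture.

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name


def store_large_object(item_id: str, payload: bytes, bucket: str = "my-staging-bucket"):
    """Keep payloads larger than DynamoDB's 400 KB item limit in S3
    and store only a pointer to the object in the DynamoDB item."""
    s3_key = f"blobs/{item_id}.bin"
    s3.put_object(Bucket=bucket, Key=s3_key, Body=payload)

    table.put_item(
        Item={
            "pk": item_id,
            "payload_location": f"s3://{bucket}/{s3_key}",  # pointer, not the blob itself
            "payload_size_bytes": len(payload),
        }
    )
```

The item stays small enough to read cheaply from DynamoDB, and the application fetches the full object from S3 only when it is actually needed.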
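As a hedged illustration of the ongoing-replication step, the snippet below creates a CDC-only AWS DMS task with boto3, on the assumption that the bulk load arrived via Snowball and only changes need to flow afterwards. The endpoint and replication instance ARNs, task identifier, and table-mapping rule are hypothetical placeholders; in practice these resources are typically provisioned beforehand through the console or infrastructure as code.

```python
import json
import boto3

dms = boto3.client("dms")

# Hypothetical ARNs for resources created beforehand
# (source/target endpoints and a replication instance).
response = dms.create_replication_task(
    ReplicationTaskIdentifier="large-db-ongoing-replication",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="cdc",  # ongoing change data capture; the initial load came in on Snowball
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
print(response["ReplicationTask"]["Status"])
```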
Data Modernization
In the previous steps (1-3), we looked at patterns to migrate large datasets to AWS. Now let’s talk about modernizing the migrated data in the following steps 4-5.
- Modern data architecture involves the movement of data between purpose-built data services and scalable data lakes. AWS provides zero-ETL integrations and AWS Glue to securely create and manage end-to-end data pipelines that facilitate this transfer (a minimal Glue job sketch follows this list). Once the data is available, reporting can be enriched with AI/ML services such as Amazon Q in QuickSight, which can quickly create new visualizations and scenarios.
- Data is a key differentiator, and modernizing it unlocks the Machine Learning (ML), Artificial Intelligence (AI), and Generative AI (Gen AI) opportunities that drive business value.
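To make the pipeline step concrete, here is a minimal AWS Glue PySpark job sketch that reads raw CSV data from one S3 location and writes it to the data lake in a columnar format. The job arguments (`source_path`, `target_path`) are hypothetical, and this is only one illustrative option; where zero-ETL integrations are available, a job like this may not be needed at all.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Hypothetical job arguments passed when the Glue job is started
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV files from the staging location
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["source_path"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Write to the data lake in Parquet so analytics services can query it efficiently
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": args["target_path"]},
    format="parquet",
)

job.commit()
```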
Organizations need robust data ingestion and processing patterns such as extract, transform, load (ETL); extract, load, transform (ELT); and ETLT to handle diverse, high-volume data. The right infrastructure and a consistent data model are crucial to enable actionable insights from the processed data.
Data Processing
Data Orchestration
Traditional ETL infrastructure faces scalability, performance, and cost challenges for the following reasons:
- Traditional ETL systems are often built on a monolithic architecture, which can struggle to scale as the volume, velocity, and variety of data increase. Sequential pipelines can lead to latency issues and limited parallelization, while complex data transformations can impact overall system performance.
- Traditional ETL systems often require significant upfront investment in hardware and software licenses, as well as ongoing maintenance and operational costs.
Robust workflow management is critical for large-scale data processing. The workflow management component must focus on scheduling, dependency handling, and monitoring to ensure the consistency, accuracy, and freshness of transformed data. Consider a scenario where data in CSV and Excel formats, arriving from multiple sources, requires cleansing and masking of sensitive information, transformation into JSON format, and storage in an Amazon S3 data lake. Additionally, the workflow should process a historical petabyte-scale dataset for annual reporting. The following diagram (Figure 3: Data Orchestration) shows a high-level architecture for this scenario.
Figure 3: Data Orchestration
By combining the strengths of Amazon Managed Workflows for Apache Airflow (MWAA) and AWS Step Functions, the solution can efficiently orchestrate and manage the entire data processing ecosystem, from complex historical data processing to agile, real-time data transformation tasks. This comprehensive approach ensures the successful execution of data pipelines, regardless of their complexity or time sensitivity, and helps drive valuable insights from the data.
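A minimal Airflow DAG sketch for this orchestration pattern, assuming Airflow 2.4 or later on MWAA with the Amazon provider package installed, might look like the following. The DAG ID, Glue job name, and state machine ARN are hypothetical placeholders rather than values from the reference architecture.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)

with DAG(
    dag_id="daily_data_pipeline",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Cleanse and mask incoming CSV/Excel files, then write JSON to the S3 data lake
    transform_incremental = GlueJobOperator(
        task_id="transform_incremental",
        job_name="cleanse-mask-to-json",  # hypothetical, pre-created Glue job
    )

    # Hand off the petabyte-scale historical reprocessing to a Step Functions workflow
    run_historical_batch = StepFunctionStartExecutionOperator(
        task_id="run_historical_batch",
        state_machine_arn="arn:aws:states:us-east-1:111122223333:stateMachine:historical-report",  # placeholder
    )

    transform_incremental >> run_historical_batch
```

MWAA handles the scheduling, dependencies, and retries for the day-to-day transformations, while Step Functions coordinates the long-running, large-scale historical batch as a separate, independently scalable workflow.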
Conclusion
Data modernization helps customers drive business value and involves implementing modern technologies. However, it can present challenges from people, process, and technology perspectives. In this blog, we explored different patterns to address large-scale data modernization challenges. We also highlighted how AWS provides a comprehensive range of options, from pre-built ETL to generative AI augmentation. These options empower customers to extract value from data faster and make data available quicker to the right users and applications.
Resources
In this blog, we have described patterns for large-scale data migration and modernization. AWS provides playbooks that cover migration patterns for database engines from IBM, Microsoft, Oracle, and SAP, as well as open-source engines.