Migration & Modernization
Overcoming Barriers to Large-scale Data Modernization
Data modernization is the process of updating or improving an organization’s data infrastructure and systems to make data more accessible, usable, and valuable. It involves implementing modern technologies and approaches to optimize data management, storage, processing, and analysis. Data modernization accelerates innovation and decision-making by democratizing and simplifying data access across the enterprise. While organizations are creating, collecting, and storing more data than ever before, much of it remains underutilized or not used at all.
Large volumes of data (terabyte to petabyte scale) are often a barrier to data modernization. This blog focuses on patterns to address this barrier. We will cover topics such as modern data architecture, data ingestion, data processing, and data orchestration in depth to modernize large volumes of data. Beyond dealing with the sheer size of data, modernization also requires overcoming the people, technology, and process issues outlined below.
People: Modernizing data and databases requires expertise in emerging technologies like cloud databases and containerization. However, resistance from people due to familiarity with existing systems and a shortage of skilled personnel can impede the process.
Technology: Modernizing large databases faces challenges like outdated hardware, unsupported software, and complex legacy systems. Integrating modernized databases with existing systems can lead to interoperability issues and data inconsistencies.
Process: Mapping complex data structures and dependencies is crucial for designing a modern database architecture. Dealing with diverse data formats, data access techniques, and mixed workloads adds to the complexity. Ensuring data quality, implementing robust security measures, encryption, access controls, data governance, and compliance are essential when modernizing large sensitive datasets.
Best practices
Leverage AWS purpose-built databases and analytics services to migrate and modernize large data
There isn’t a one-size-fits-all solution. With AWS purpose-built database services, customers can select the most appropriate database for their specific workloads and application needs. The following diagram shows the high-level steps involved in modernizing data using AWS managed purpose-built databases and analytics services. This architecture uses AWS Snowball in combination with AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) to write a large database locally to the device, which helps optimize network bandwidth. The AWS Snowball device is then shipped to AWS, where the data is imported into the target cloud database for modernization.
Steps 1-3 in Figure 1 cover the migration of large-scale data volumes using AWS DMS/SCT and the AWS Snow Family (petabyte scale).
Data Migration
- Data discovery is carried out to identify the required data and the target databases before migration starts. The discovery covers structured, unstructured, historical, transactional, and log data, and data prioritization determines the importance and order of the data to be migrated. The AWS SCT replication agent is installed on the on-premises server to copy the large dataset to AWS Snowball Edge, which can transport data faster than sending it over the internet. The Snowball Edge device is then shipped to AWS, where the data is restored to an Amazon S3 bucket for staging. For database backup files larger than 10 TB, AWS Snowball Edge is recommended for transferring the backup files to the S3 bucket; this minimizes network bandwidth usage and speeds up the data transfer. The S3 bucket then serves as the staging area for the target database. If you are considering Amazon DynamoDB as your target, for example, store large objects such as BLOBs in Amazon S3 and keep only a pointer to them in DynamoDB. DynamoDB has a maximum item size of 400 KB, so objects larger than this limit are stored in S3 while their metadata is stored in DynamoDB tables (a minimal sketch of this pattern follows this list). Details of this architecture are explained in Large object storage strategies for Amazon DynamoDB.
- AWS DMS and AWS SCT are used for ongoing replication and schema conversion. Because your on-premises database remains online during the migration, AWS DMS provides ongoing replication of changes to the target database, while AWS SCT simplifies the migration process in a highly available manner by converting code and schema from one database engine to another (a sketch of creating such a replication task also follows this list). For more information on using Snowball with AWS DMS and AWS SCT for large on-premises database migrations, see Enable large-scale database migrations with AWS DMS and AWS Snowball.
- The database backup file is then imported into the chosen purpose-built database in AWS from the staging S3 bucket.
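The following is a minimal sketch of the large-object pointer pattern described above: the payload itself is stored in Amazon S3, and only a pointer plus lightweight metadata is written to the DynamoDB item. The bucket name, table name, and attribute names are hypothetical placeholders, not values from the reference architecture.

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name


def store_large_object(item_id: str, payload: bytes, bucket: str = "my-staging-bucket"):
    """Keep payloads larger than DynamoDB's 400 KB item limit in S3
    and store only a pointer to the object in the DynamoDB item."""
    s3_key = f"blobs/{item_id}.bin"
    s3.put_object(Bucket=bucket, Key=s3_key, Body=payload)

    table.put_item(
        Item={
            "pk": item_id,
            "payload_location": f"s3://{bucket}/{s3_key}",  # pointer, not the blob itself
            "payload_size_bytes": len(payload),
        }
    )
```

The item stays small enough to read cheaply from DynamoDB, and the application fetches the full object from S3 only when it is actually needed.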
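As a hedged illustration of the ongoing-replication step, the snippet below creates a CDC-only AWS DMS task with boto3, on the assumption that the bulk load arrived via Snowball and only changes need to flow afterwards. The endpoint and replication instance ARNs, task identifier, and table-mapping rule are hypothetical placeholders; in practice these resources are typically provisioned beforehand through the console or infrastructure as code.

```python
import json
import boto3

dms = boto3.client("dms")

# Hypothetical ARNs for resources created beforehand
# (source/target endpoints and a replication instance).
response = dms.create_replication_task(
    ReplicationTaskIdentifier="large-db-ongoing-replication",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="cdc",  # ongoing change data capture; the initial load came in on Snowball
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
print(response["ReplicationTask"]["Status"])
```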
Data Modernization
In the previous steps (1-3), we looked at patterns to migrate large datasets to AWS. Now let’s talk about modernizing the migrated data in the following steps 4-5.
- Modern data architecture involves the movement of data between purpose-built data services and scalable data lakes. AWS provides zero-ETL integrations and AWS Glue to securely create and manage end-to-end data pipelines that facilitate this transfer (a minimal Glue job sketch follows this list). Once the data is available, reporting can be enriched with AI/ML services such as Amazon Q in QuickSight, which can quickly create new visualizations and scenarios.
- Data is a key differentiator, and modernizing it unlocks the Machine Learning (ML), Artificial Intelligence (AI), and Generative AI (Gen AI) opportunities that drive business value.
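To make the pipeline step concrete, here is a minimal AWS Glue PySpark job sketch that reads raw CSV data from one S3 location and writes it to the data lake in a columnar format. The job arguments (`source_path`, `target_path`) are hypothetical, and this is only one illustrative option; where zero-ETL integrations are available, a job like this may not be needed at all.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Hypothetical job arguments passed when the Glue job is started
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV files from the staging location
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["source_path"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Write to the data lake in Parquet so analytics services can query it efficiently
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": args["target_path"]},
    format="parquet",
)

job.commit()
```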
Organizations need robust data ingestion and processing patterns such as extract, transform, load (ETL); extract, load, transform (ELT); and ETLT to handle diverse, high-volume data. The right infrastructure and a consistent data model are crucial to enable actionable insights from the processed data.
Data Processing
Data Orchestration
Traditional ETL infrastructure faces scalability, performance, and cost challenges for the following reasons:
- Traditional ETL systems are often built on a monolithic architecture, which can struggle to scale as the volume, velocity, and variety of data increase. Sequential pipelines can lead to latency issues and limited parallelization, while complex data transformations can impact overall system performance.
- Traditional ETL systems often require significant upfront investment in hardware and software licenses, as well as ongoing maintenance and operational costs.
Robust workflow management is critical for large-scale data processing. The workflow management component must focus on scheduling, dependency handling, and monitoring to ensure the consistency, accuracy, and freshness of transformed data. Consider a scenario where data in CSV and Excel formats, arriving from multiple sources, requires cleansing and masking of sensitive information, transformation into JSON format, and storage in an Amazon S3 data lake. Additionally, the workflow should process a historical petabyte-scale dataset for annual reporting. The following diagram (Figure 3: Data Orchestration) shows a high-level architecture for this scenario.
Figure 3: Data Orchestration
By combining the strengths of Amazon Managed Workflows for Apache Airflow (MWAA) and AWS Step Functions, the solution can efficiently orchestrate and manage the entire data processing ecosystem, from complex historical data processing to agile, real-time data transformation tasks. This comprehensive approach ensures the successful execution of data pipelines, regardless of their complexity or time sensitivity, and helps drive valuable insights from the data.
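A minimal Airflow DAG sketch for this orchestration pattern, assuming Airflow 2.4 or later on MWAA with the Amazon provider package installed, might look like the following. The DAG ID, Glue job name, and state machine ARN are hypothetical placeholders rather than values from the reference architecture.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)

with DAG(
    dag_id="daily_data_pipeline",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Cleanse and mask incoming CSV/Excel files, then write JSON to the S3 data lake
    transform_incremental = GlueJobOperator(
        task_id="transform_incremental",
        job_name="cleanse-mask-to-json",  # hypothetical, pre-created Glue job
    )

    # Hand off the petabyte-scale historical reprocessing to a Step Functions workflow
    run_historical_batch = StepFunctionStartExecutionOperator(
        task_id="run_historical_batch",
        state_machine_arn="arn:aws:states:us-east-1:111122223333:stateMachine:historical-report",  # placeholder
    )

    transform_incremental >> run_historical_batch
```

MWAA handles the scheduling, dependencies, and retries for the day-to-day transformations, while Step Functions coordinates the long-running, large-scale historical batch as a separate, independently scalable workflow.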
Conclusion
Data modernization helps customers drive business value and involves implementing modern technologies. However, it can present challenges from people, process, and technology perspectives. In this blog, we explored different patterns to address large-scale data modernization challenges. We also highlighted how AWS provides a comprehensive range of options, from pre-built ETL to generative AI augmentation. These options empower customers to extract value from data faster and make data available quicker to the right users and applications.
Resources
In this blog, we have described patterns for large-scale data migration and modernization. AWS provides playbooks that cover migration patterns for database engines from IBM, Microsoft, Oracle, and SAP, as well as open-source engines.