Amazon Migrates 50 PB of Analytics Data from Oracle to AWS
Amazon builds and operates thousands of microservices to serve millions of customers. These include catalog browsing, order placement, transaction processing, delivery scheduling, video services, and Prime registration. Each service publishes datasets to Amazon’s analytics infrastructure, which holds more than 50 petabytes of data across 75,000 tables and processes 600,000 user analytics jobs each day. More than 1,800 teams publish data, while more than 3,300 data consumer teams analyze it to drive insights, identify opportunities, prepare reports, and evaluate business performance.
The on-premises Oracle database infrastructure that supported this system was not designed for processing petabytes of data, and it resulted in a monolithic solution that was hard to maintain and operate due to a lack of separation from both a functional and a financial perspective. Operationally, transformations of tables with more than 100 million rows consistently failed. This limited business teams’ ability to generate insights or deploy large-scale machine-learning solutions. Many abandoned the monolithic Oracle data warehouse in favor of custom solutions using Amazon Web Services (AWS) technologies.
Database administration for the Oracle data warehouse was complicated, expensive, and error-prone, requiring engineers to spend hundreds of hours each month on software upgrades, replication of data across multiple Oracle clusters, OS patching, and performance monitoring. Inefficient hardware provisioning required labor-intensive demand forecasting and capacity planning. It was also financially inefficient, being statically sized for peak loads and lacking the ability to dynamically scale for hardware cost optimization, with ever-increasing Oracle licensing costs.
About Amazon
Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. Customer reviews, 1-Click shopping, personalized recommendations, Prime, Fulfillment by Amazon, AWS, Kindle Direct Publishing, Kindle, Fire tablets, Fire TV, Amazon Echo, and Alexa are some of the products and services pioneered by Amazon.
- Migrated system with 50 PB of analytics data
- Optimized scale and cost
- Eliminated Oracle licensing cost
- Expanded analytics toolset
Transforming Analytics at Amazon
To meet its growing needs, Amazon’s consumer business decided to migrate the Oracle data warehouse to an AWS-based solution. The new data lake solution uses a variety of AWS services to deliver performance and reliability at exabyte scale for data processing, streaming, and analytics.
The company used Amazon Simple Storage Service (Amazon S3) as a data lake to hold raw data in native format until required for analysis. Taking advantage of Amazon S3 gave Amazon the flexibility to manage a wide variety of data at scale with reduced costs, improved access control, and strong regulatory compliance. Beyond the natively supported governance and security features of Amazon S3, Amazon integrated its internal services for authentication, authorization, and data governance. It also developed a metadata service to simplify dataset discovery, which allows data consumers to easily search, sort, and identify datasets for analysis.
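The pattern described above, where publishing teams write raw datasets to a shared S3 data lake and a metadata service indexes them for discovery, can be sketched in a few lines. This is a minimal illustration: the bucket name, key layout, and tag names below are assumptions for the sketch, not Amazon's actual conventions.

```python
from datetime import date


def dataset_location(team: str, dataset: str, partition_date: date) -> str:
    """Build a date-partitioned S3 prefix for a published dataset.

    The bucket name and layout are illustrative placeholders.
    """
    return (f"s3://analytics-data-lake/{team}/{dataset}/"
            f"dt={partition_date.isoformat()}/")


def discovery_tags(team: str, dataset: str, fmt: str) -> dict:
    # Attributes a metadata service could index so consumers can
    # search, sort, and identify datasets for analysis.
    return {"owner-team": team, "dataset": dataset, "format": fmt}


# Example: where the retail team's orders dataset for one day would land.
prefix = dataset_location("retail", "orders", date(2019, 1, 5))
```

A publishing job would then upload objects under that prefix (for example with the AWS SDK) and register the tags with the metadata service.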
To enable self-service analytics for end users, Amazon developed a service that synchronizes data from the data lake with compute systems including Amazon Elastic MapReduce (Amazon EMR) and Amazon Redshift. Amazon EMR provides a managed Hadoop framework that can run Apache Spark, HBase, Presto, and Flink on Amazon Elastic Compute Cloud (Amazon EC2) instances and interact with data in Amazon S3. Amazon Redshift is the AWS data warehouse service that allows analytics end users to run complex queries and visualize results using tools such as Amazon QuickSight.
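One standard way such a synchronization service moves data from S3 into Amazon Redshift is the COPY command. The sketch below builds a COPY statement for Parquet data; `COPY ... IAM_ROLE ... FORMAT AS PARQUET` is real Redshift syntax, while the table, prefix, and role values are placeholders and the statement-building helper is our own illustration, not Amazon's internal service.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3.

    A sync service would run this against the target cluster whenever a
    new partition appears in the data lake.
    """
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;")


stmt = build_copy_statement(
    "orders",
    "s3://analytics-data-lake/retail/orders/dt=2019-01-05/",
    "arn:aws:iam::123456789012:role/lake-loader",  # placeholder role
)
```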
Further, Amazon integrated the data lake with the Amazon Redshift Spectrum feature, which allows users to query any dataset in the lake directly from Amazon Redshift without first synchronizing the data to their cluster. This accelerated ad hoc analysis across the consumer business and decoupled capacity planning for analytics from the need to store local copies of the largest datasets. It also enabled federated analytics and made the cost of these analyses visible, both of which were severely limited in the previous architecture.
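To make the Spectrum integration concrete, the statements below show the standard Redshift syntax for exposing a data lake through an external schema and querying it in place. The schema, database, table, and role names are placeholders; the `CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG` form itself is documented Redshift Spectrum syntax.

```python
# One-time setup: an external schema backed by the AWS Glue Data Catalog.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA spectrum_lake
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-access';
"""

# Ad hoc query over a dataset that stays in S3 and is never loaded
# into the user's cluster (table and column names are illustrative).
SPECTRUM_QUERY = """
SELECT dataset_id, COUNT(*) AS events
FROM spectrum_lake.click_events
GROUP BY dataset_id;
"""
```

Because the data is read from S3 at query time, each team's cluster only needs to be sized for its own local workloads.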
To aid migration from the Oracle solution to the federated data lake architecture, Amazon developed tools for bulk query migration using the AWS Schema Conversion Tool (AWS SCT). This tooling was used to automatically convert and validate more than 80 percent of the 200,000 queries from Oracle SQL to Amazon Redshift SQL, saving more than 1,000 person-months of manual effort. For queries that could not be automatically converted, engineers documented and shared best practices with end users to enable conversion of these queries.
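The kind of dialect rewriting involved can be illustrated with a toy converter. This is not how AWS SCT works internally; it is a minimal sketch of two well-known Oracle-to-Redshift differences: Redshift allows `SELECT` without a `FROM dual` clause, and row limiting uses `LIMIT` rather than `ROWNUM`.

```python
import re


def convert_oracle_to_redshift(sql: str) -> str:
    """Apply two common Oracle-to-Redshift rewrites.

    Illustrative only: it handles 'FROM dual' and a trailing
    'WHERE ROWNUM <= n', not the general cases a tool like SCT covers.
    """
    # Redshift permits SELECT without FROM, so 'FROM dual' can be dropped.
    out = re.sub(r"\s+FROM\s+dual\b", "", sql, flags=re.IGNORECASE)
    # Oracle limits rows with ROWNUM; Redshift uses LIMIT.
    m = re.search(r"\s+WHERE\s+ROWNUM\s*<=\s*(\d+)", out, flags=re.IGNORECASE)
    if m:
        out = out[:m.start()] + out[m.end():] + f" LIMIT {m.group(1)}"
    return out


# Example rewrites:
# "SELECT SYSDATE FROM dual"                 -> "SELECT SYSDATE"
# "SELECT * FROM orders WHERE ROWNUM <= 10"  -> "SELECT * FROM orders LIMIT 10"
```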
Shifting the Culture
The migration team educated users about the vision, mission, and goals of the migration through in-person training sessions, informal talks, webinars, and documentation. The move proceeded in waves, helping refine systems, tools, and processes as the project progressed. Each team submitted a project plan and allocated resources to migrate artifacts including ETL processes, business reports, stored procedures, and machine-learning algorithms.
The migration team seeded the data lake with active datasets from the Oracle data warehouse and created an automated system to keep the data sets updated in both systems. It provided migration tools, including AWS CloudFormation templates for provisioning AWS resources. Channels were created to enable data producers and consumers to monitor data availability, accuracy, and latency in the data lake, so they could raise issues directly. The central team established weekly, monthly, and quarterly reviews with each team to track and report progress, and it aggregated progress reports from both user groups for program status reporting.
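A provisioning template of the kind mentioned above might look like the following. The `AWS::Redshift::Cluster` resource type and its property names are real CloudFormation syntax; the specific values, and the idea that this resembles the migration team's actual templates, are assumptions for illustration.

```python
import json

# Minimal CloudFormation template for a single-node Redshift cluster.
# Values such as the node type and database name are placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "MasterPassword": {"Type": "String", "NoEcho": True},
    },
    "Resources": {
        "AnalyticsCluster": {
            "Type": "AWS::Redshift::Cluster",
            "Properties": {
                "ClusterType": "single-node",
                "NodeType": "dc2.large",
                "DBName": "analytics",
                "MasterUsername": "admin",
                "MasterUserPassword": {"Ref": "MasterPassword"},
            },
        },
    },
}

# The JSON body a team would hand to CloudFormation when creating a stack.
template_body = json.dumps(template, indent=2)
```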
Additionally, the migration to AWS redefined the career paths of legacy database engineers and managers. Their skills and expertise were turned to improving the performance of the Amazon Redshift and Amazon EMR solutions, which draw on the same database expertise in designing optimal query plans and monitoring performance. The central team enabled these career transitions through extensive training and education.
New Scale and Agility
The new analytics infrastructure has one data lake with more than 200 petabytes of data—almost four times the size of the previous Oracle data warehouse. Teams across Amazon are now using more than 3,000 Amazon Redshift or Amazon EMR clusters to process data from the data lake.
Despite its larger size, business units are finding the new system is more cost-effective. This is because the migration team retired 30 percent of the workload that was no longer used and optimized queries for better system utilization. Teams are now able to monitor the usage of the systems and eliminate waste sooner, leading to ongoing cost efficiencies.
Amazon’s consumer businesses have benefited from the separation of data storage from data processing in AWS. AWS storage services made it easy to store data securely in any format, at massive scale and low cost, and to move data quickly and easily. The data lake architecture allows each system to scale independently while reducing overall costs and broadening the range of technologies available. Users can easily discover high-quality data in optimized formats, and teams are reporting reduced latency for their analytics results.
Using AWS, each Amazon business team manages its own compute instances with full control over capacity and costs, unlike in the legacy environment, where centralized infrastructure caused inefficiencies. Teams now use Amazon EC2 Reserved Instances as part of their cost-optimization strategies. The central team continuously monitors AWS analytics accounts to assess usage and optimize costs.
By moving to the AWS Cloud, Amazon empowered engineers to focus on developing insights for their businesses by using or building advanced analytics tools rather than spending their time keeping their legacy system running. Most importantly, the migration makes it easier for engineers in Amazon’s consumer business units to continuously analyze and improve the services they provide to customers.