We migrated billions of rows from Oracle to Amazon DynamoDB and increased elasticity and reliability with no downtime for our global customer base.
Tim Kohn, Vice President of Technology, Prime Video
  • About Amazon.com

    Amazon.com is the world’s leading online retailer and the pioneer of customer reviews, 1-Click shopping, personalized recommendations, Prime, AWS, Kindle, Alexa, and many more products and services.

  • Benefits

    • Migrated with zero downtime
    • Improved latency by 30%
    • Achieved 100,000 transactions per second
    • Created a next-generation platform
  • AWS Services Used

    • Amazon DynamoDB
    • AWS Lambda
    • Amazon Simple Queue Service (Amazon SQS)

Tens of millions of people watch movies and TV shows on Prime Video, the video-streaming service from Amazon that also allows viewers to purchase content and download it for offline viewing. Customers want instant access to videos—whether an on-demand program or the latest films—and this requires data horsepower.

At first, Amazon managed its videos using Customer Queue Service (CQS), which it built on Oracle in 2007 to support the initial launch of the product that would become Prime Video. Over the years, CQS expanded to cover a wide range of functionality, including playback, ownership, downloads, offers, order fulfillment, library management, season passes, subscriptions, rentals, and content discovery. It is critical to the daily operation of one of the largest video platforms in the world and the repository of billions of dollars in customer rights. If CQS were to go down, Prime Video would be inaccessible.

Multiple workarounds kept CQS running with sufficient performance for many years, during which the business grew significantly. However, the system became complex to operate and could not support continuous-deployment pipelines, so each update took half a day of engineering time. CQS also lacked auto-rollback capabilities, so the impact of a faulty update persisted longer than necessary. Over time, only 15 of the 46 APIs defined in the service remained in active use, and deprecated features had never been removed from the system.

These issues created many potential points of failure for Prime Video's performance. Problems with CQS caused 35 service disruptions from 2010 to 2018. For Prime Video, disruptions and outages are a serious business problem because customers expect the service to work without issues.

In 2011, CQS exceeded the volume of read operations the Oracle database could handle. Given that these read operations make up more than 99 percent of requests to the system, increasing the volume of requests affected performance significantly. The team began replicating Oracle data to SABLE, an internal Amazon solution, to handle read requests. However, SABLE's most active internal users experienced some performance degradation during the replication. The added management burden made technicians’ jobs unnecessarily difficult and prevented them from working on strategic activities—they were too busy putting out operational fires.

As part of its strategy to build a platform that could scale to meet projected needs for at least 10 years, Amazon decided to migrate Prime Video to Amazon Web Services (AWS). The migration would replace CQS with a suite of 12 microservices, built using a range of AWS services that included Amazon DynamoDB, AWS Lambda, and Amazon Simple Queue Service (Amazon SQS). The IT team developed a thorough plan to complete the migration within two years, including assurances that customers would not be impacted at any point during the switchover to the new system. The plan was the result of estimation workshops, engineering reviews, and executive input. The team was measured on how fast it moved relative to expected schedules, and each API was rigorously tested before it was allowed to change data.

The first use case completed by the team was video downloading. This included provisioning a brand-new, Tier 1 service and migrating more than a billion download records from Oracle to DynamoDB. The team invested 32 months of engineering effort in the project and delivered a download service that could handle high throughput.

More than 30 other applications regularly used CQS APIs, and each had to be switched over to the new API endpoints. Finally, the team had to migrate all the ownership records, constituting billions of rows, that it stored in Oracle. APIs that wrote to the system were set to copy data simultaneously to Oracle and DynamoDB, so the team could validate the data and test API performance while the system was still fully operational. The new service exceeded the high scaling targets required to support the Prime Video business.
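The dual-write step described above can be illustrated with a minimal Python sketch. It is not the Prime Video implementation: the `InMemoryStore`, `DualWriter`, and `find_mismatches` names are hypothetical, the two stores are in-memory stand-ins for Oracle and DynamoDB, and the sketch assumes the legacy store stays the source of truth and is written first rather than truly in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a database client (Oracle or DynamoDB)."""
    rows: dict = field(default_factory=dict)

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class DualWriter:
    """Writes every record to the legacy store first, then to the new store."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.failed_keys = []  # keys whose write to the new store failed

    def put(self, key, value):
        # Assumption: the legacy store is the source of truth during the
        # migration, so an error here propagates to the caller.
        self.legacy.put(key, value)
        try:
            self.new.put(key, value)
        except Exception:
            # Record the key for later repair instead of failing the request.
            self.failed_keys.append(key)


def find_mismatches(legacy, new, keys):
    """Return the keys whose values differ between the two stores."""
    return [k for k in keys if legacy.get(k) != new.get(k)]


legacy_store = InMemoryStore()
new_store = InMemoryStore()
writer = DualWriter(legacy_store, new_store)

ownership_key = ("customer-1", "title-9")  # illustrative key shape
writer.put(ownership_key, {"owned": True})
mismatches = find_mismatches(legacy_store, new_store, [ownership_key])
```

Keeping a list of failed keys, rather than raising, is one way to let production traffic continue while the new store is still being validated.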

By applying the capabilities of AWS, the team was freed to innovate in ways that were not possible on the legacy system. Using Amazon DynamoDB Streams and AWS Lambda, the team built mechanisms to analyze discrepancies between the new and legacy systems in near-real time to ensure customers could not be affected when the cutover occurred. This analysis was performed without any impact on latency or availability of the service.
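As a rough sketch of how such a discrepancy check might look, the Python below processes a DynamoDB Streams event in an AWS Lambda handler and compares each changed item with the legacy record. The table attributes (`customer_id`, `title_id`, `owned`), the `LEGACY_ROWS` dictionary, and the `legacy_lookup` function are assumptions made for illustration. It also assumes the stream is configured to include the new item image.

```python
def _plain(image):
    """Flatten a DynamoDB attribute-value map such as {"owned": {"BOOL": True}}.

    Only scalar types are handled; numbers ("N") stay as strings.
    """
    return {name: next(iter(typed.values())) for name, typed in image.items()}


def find_discrepancies(event, legacy_lookup):
    """Compare each inserted or modified item with its legacy counterpart."""
    discrepancies = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions are ignored in this sketch
        change = record["dynamodb"]
        key = _plain(change["Keys"])
        new_item = _plain(change["NewImage"])
        legacy_item = legacy_lookup(key)
        if legacy_item != new_item:
            discrepancies.append(
                {"key": key, "new": new_item, "legacy": legacy_item}
            )
    return discrepancies


# Hypothetical legacy data; a real check would query the legacy database.
LEGACY_ROWS = {
    ("customer-1", "title-9"): {
        "customer_id": "customer-1",
        "title_id": "title-9",
        "owned": True,
    }
}


def legacy_lookup(key):
    return LEGACY_ROWS.get((key["customer_id"], key["title_id"]))


def handler(event, context):
    """Lambda entry point: report how many records were checked and any diffs."""
    discrepancies = find_discrepancies(event, legacy_lookup)
    return {
        "checked": len(event.get("Records", [])),
        "discrepancies": discrepancies,
    }


sample_event = {
    "Records": [
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {
                    "customer_id": {"S": "customer-1"},
                    "title_id": {"S": "title-9"},
                },
                "NewImage": {
                    "customer_id": {"S": "customer-1"},
                    "title_id": {"S": "title-9"},
                    "owned": {"BOOL": False},
                },
            },
        }
    ]
}
result = handler(sample_event, None)
```

Because the handler reads from the stream rather than sitting in the request path, a check of this kind runs after the write has completed, which is consistent with the analysis having no effect on service latency.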

Using Amazon SQS, the team implemented a service to synchronize the ownership state of customers across the systems when migration errors were detected. These tools have positioned the team to move toward an event-driven architecture, which unlocks previously impossible use cases around customer engagement and system performance.
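A synchronization worker of this kind could be sketched as a Lambda function that consumes an SQS batch. The message body format, the `resync` function, and the choice of the legacy store as the source of truth are all assumptions for illustration; the two dictionaries stand in for the real data stores. The return value uses the partial batch response shape, so only the failed messages are retried when that feature is enabled on the event source.

```python
import json


def resync(event, read_source_of_truth, write_target):
    """Copy the authoritative ownership state to the target for each message."""
    failures = []
    for message in event.get("Records", []):
        try:
            body = json.loads(message["body"])
            key = (body["customer_id"], body["title_id"])
            state = read_source_of_truth(key)
            write_target(key, state)
        except Exception:
            # Report this message as failed so SQS can redeliver it.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}


# Hypothetical in-memory stand-ins for the legacy and new stores.
legacy = {("customer-1", "title-9"): {"owned": True}}
target = {("customer-1", "title-9"): {"owned": False}}

sample_event = {
    "Records": [
        {
            "messageId": "m-1",
            "body": json.dumps(
                {"customer_id": "customer-1", "title_id": "title-9"}
            ),
        },
        {"messageId": "m-2", "body": "not json"},  # malformed on purpose
    ]
}
response = resync(sample_event, legacy.__getitem__, target.__setitem__)
```

Passing the read and write functions in as parameters keeps the sketch independent of any particular database client.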

The team achieved an average of 30 percent improvement in latency for key performance indicators critical to the video-playback experience. For example, errors in the Entitlements service were reduced by 90 percent and its latency was improved by 15–50 percent. The latency incurred in retrieving a customer’s TV On Demand library was reduced by 85 percent, from 800 milliseconds to just 120.
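The library-retrieval figure can be checked with a short calculation. The `percent_reduction` helper below is a hypothetical name used only for this worked example.

```python
def percent_reduction(before, after):
    """Return the reduction from `before` to `after` as a percentage."""
    return 100 * (before - after) / before


# 800 ms before the migration, 120 ms after, as stated above.
library_latency_reduction = percent_reduction(800, 120)
```

A drop of 680 milliseconds out of 800 corresponds to the stated 85 percent.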

The project resulted in a single, globally replicated data set with improved resiliency, scalability, latency, and operational efficiency, at a 55 percent lower cost than the old system. Tim Kohn, vice president of technology for Prime Video, says, “We migrated billions of rows from Oracle to Amazon DynamoDB, and we increased elasticity and reliability with no downtime for our global customer base.”

Learn more about Amazon DynamoDB, AWS Lambda, and Amazon Simple Queue Service.