Prime Video Boosts Scale and Resilience Using Amazon DynamoDB
Tens of millions of people watch movies and TV shows on Amazon Prime Video, the streaming-video service from Amazon. Prime Video lets viewers purchase content and download it for offline viewing. Customers expect easy, instant access to videos, whether on-demand programs or the latest films, and meeting that expectation demands substantial data-processing capacity.
At first, video management relied on Customer Queue Service (CQS). Amazon built CQS on Oracle in 2007 to support the initial launch of the product that would later become Prime Video. Over the years, CQS expanded to cover a wide range of functionality, including playback, ownership, downloads, offers, order fulfillment, library management, season passes, subscriptions, rentals, and content discovery. It is critical to the daily operation of one of the largest video platforms in the world and the repository of billions of dollars in customer rights. If CQS were to go down, Prime Video would be inaccessible.
“We migrated billions of rows from Oracle to Amazon DynamoDB, and we increased elasticity and reliability with no downtime for our global customer base.”
Tim Kohn, Vice President of Technology, Prime Video
AWS Services Used
Amazon.com is the world’s largest online retailer. Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. Customer reviews, 1-Click shopping, personalized recommendations, Prime, Fulfillment by Amazon, AWS, Kindle Direct Publishing, Kindle, Fire tablets, Fire TV, Amazon Echo, and Alexa are some of the products and services pioneered by Amazon.
- Migrated with zero downtime
- Improved latency by 30%
- Achieved 100,000 transactions per second
- Created a next-generation platform
A Legacy of Complexity
Multiple workarounds kept CQS running with sufficient performance for many years, during which the business grew significantly. However, the system became complex to operate. It could not support continuous deployment pipelines, so each update took a half day of engineering time, and it lacked auto-rollback capabilities, so the impact of a faulty update persisted longer than necessary. By the time of the migration, only 15 of the 46 APIs defined in the service were in active use, and deprecated features had never been removed from the system.
These issues created many potential points of failure for Prime Video performance. In fact, problems with CQS caused 35 service disruptions from 2010 to 2018. For Prime Video, disruptions and outages are a serious business problem because customers expect the service to work without issues.
In 2011, CQS exceeded the volume of read operations that the Oracle database could handle. Because read operations make up more than 99 percent of requests to the system, the excess volume affected performance significantly. The team began replicating Oracle data to SABLE, an internal Amazon solution, to handle read requests. However, because SABLE is the most used service at Amazon, performance degradation affected some of the service’s most active customers. Additionally, the added management burden made technicians’ jobs unnecessarily difficult and kept them from strategic work, because they were too busy putting out operational fires.
A Plan to Move Forward
As part of its strategy to build a platform that could scale to meet projected needs for at least 10 years, Amazon decided to migrate Prime Video to Amazon Web Services (AWS). The migration would replace CQS with a suite of 12 microservices, built using a range of AWS services that included Amazon DynamoDB, AWS Lambda, and Amazon Simple Queue Service (Amazon SQS). The IT team developed a thorough plan to get the migration done within two years, including assurances that customers would not be impacted at any point during the switchover to the new system.
The migration plan came together through planning and estimation workshops, engineering reviews, and executive input. Each API was rigorously tested before it was allowed to change data.
Migrating and Modernizing
The first use case completed by the team was video downloading. This included provisioning a brand-new, Tier 1 service and migrating more than a billion download records from Oracle to DynamoDB. The team invested 32 months of engineering effort in the project and delivered a download service that could handle high throughput.
More than 30 other applications regularly used CQS APIs, and each had to be switched over to the new API endpoints. Finally, the team had to migrate all the ownership records stored in Oracle, which amounted to billions of rows. During the migration, APIs that wrote to the system were set to copy data simultaneously to both Oracle and DynamoDB. This allowed the team to validate the data and test API performance while the system was still fully operational. The service exceeded the high scaling targets required to support the Prime Video business.
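The dual-write approach described above can be illustrated with a minimal sketch. It is not the Prime Video implementation: the `InMemoryStore` and `DualWriter` classes, the key format, and the choice to treat the legacy store as the source of truth are all assumptions made for illustration, and the two writes happen one after the other rather than truly simultaneously.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a database table keyed by record ID."""
    name: str
    rows: dict = field(default_factory=dict)

    def put(self, key, value):
        self.rows[key] = dict(value)

    def get(self, key):
        return self.rows.get(key)


class DualWriter:
    """Writes each record to the legacy store and to the new store."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.failed_keys = []  # keys whose write to the new store failed

    def write(self, key, value):
        # Assumption: the legacy store stays authoritative during migration.
        self.legacy.put(key, value)
        try:
            self.new.put(key, value)
        except Exception:
            # Record the key for later repair instead of failing the request.
            self.failed_keys.append(key)

    def mismatches(self):
        """Return the keys whose rows differ between the two stores."""
        keys = set(self.legacy.rows) | set(self.new.rows)
        return sorted(k for k in keys if self.legacy.get(k) != self.new.get(k))


legacy = InMemoryStore("oracle")
new = InMemoryStore("dynamodb")
writer = DualWriter(legacy, new)
writer.write("customer-1#title-9", {"owned": True})
print(writer.mismatches())  # prints [] because both stores hold the same row
```

Keeping a `mismatches` check alongside the writer mirrors the idea of validating data while the system stays live: any key that appears in the result needs attention before cutover.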
By applying the capabilities of AWS, the team was freed to innovate in ways that were not possible on the legacy system. Using Amazon DynamoDB Streams and AWS Lambda, the team built mechanisms to analyze discrepancies between the new and legacy systems in near-real time to ensure customers could not be affected when the cutover occurred. This analysis was performed without any impact on latency or availability of the service.
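As a rough sketch of how such a discrepancy check might look, the function below walks a batch of DynamoDB Streams records in the shape Lambda receives them and compares each new image with the legacy row. The function names, the `customer_id` and `title_id` attributes, and the `legacy_lookup` callable are hypothetical, and the sketch assumes the stream is configured to include new images. It is not the team's actual tooling.

```python
def plain(image):
    """Flatten a DynamoDB-typed image such as {"owned": {"BOOL": True}}.

    Only scalar attribute types are handled; numbers stay as strings,
    which is how DynamoDB encodes the "N" type.
    """
    return {name: next(iter(typed.values())) for name, typed in image.items()}


def find_discrepancies(event, legacy_lookup):
    """Compare each streamed change against the legacy row.

    `legacy_lookup` is a hypothetical callable that returns the legacy row
    as a plain dict for a given key, or None when the row does not exist.
    """
    discrepancies = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        key = plain(change["Keys"])
        # REMOVE events carry no NewImage, so the expected row is None.
        new_image = change.get("NewImage")
        expected = plain(new_image) if new_image is not None else None
        actual = legacy_lookup(key)
        if expected != actual:
            discrepancies.append({"key": key, "new": expected, "legacy": actual})
    return discrepancies


# Illustrative data: the legacy row has not yet picked up the ownership change.
legacy_rows = {
    ("c1", "t9"): {"customer_id": "c1", "title_id": "t9", "owned": False},
}


def lookup(key):
    return legacy_rows.get((key["customer_id"], key["title_id"]))


sample_event = {
    "Records": [{
        "eventName": "MODIFY",
        "dynamodb": {
            "Keys": {"customer_id": {"S": "c1"}, "title_id": {"S": "t9"}},
            "NewImage": {
                "customer_id": {"S": "c1"},
                "title_id": {"S": "t9"},
                "owned": {"BOOL": True},
            },
        },
    }]
}

report = find_discrepancies(sample_event, lookup)
print(len(report))  # prints 1: the legacy row still says owned=False
```

Because the comparison runs on the stream rather than in the request path, a check of this kind adds no work to the customer-facing read or write.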
Using Amazon SQS, the team implemented a service to synchronize the ownership state of customers across the systems when migration errors were detected. These tools have positioned the team to move toward an event-driven architecture, which unlocks previously impossible use cases around customer engagement and system performance.
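The queue-based synchronization idea can be sketched as follows. Python's standard `queue.Queue` stands in for an Amazon SQS queue so the example runs anywhere; the message format, the function names, and the direction of the repair (overwriting the row in one store with the expected value) are assumptions, not details from the Prime Video system.

```python
import json
import queue

# In-memory stand-in for an SQS queue (assumption for illustration only).
repair_queue = queue.Queue()


def enqueue_repair(key, expected_row):
    """Publish a repair message describing the row a store should hold."""
    repair_queue.put(json.dumps({"key": key, "row": expected_row}))


def drain_repairs(target_store):
    """Apply every queued repair to `target_store` (a plain dict here).

    Each repair overwrites the whole row, so applying the same message
    twice leaves the store in the same state.
    """
    applied = 0
    while not repair_queue.empty():
        message = json.loads(repair_queue.get())
        target_store[message["key"]] = message["row"]
        applied += 1
    return applied


new_store = {"c1#t9": {"owned": False}}
enqueue_repair("c1#t9", {"owned": True})
print(drain_repairs(new_store))  # prints 1
```

Writing repairs as whole-row overwrites keeps them safe to retry, which matters for a queue that may deliver a message more than once.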
The team achieved an average 30 percent improvement in latency for key performance indicators critical to the video playback experience. For example, errors in the “Entitlements” service were reduced by 90 percent and its latency improved by 15–50 percent. The latency incurred in retrieving a customer’s TV On Demand library was reduced by 85 percent, from 800 milliseconds to just 120 milliseconds.
The project resulted in a single, globally replicated data set with improved resiliency, scalability, latency, and operational efficiency, at a cost 55 percent lower than that of the old system. Tim Kohn, vice president of technology for Prime Video, says, “We migrated billions of rows from Oracle to Amazon DynamoDB, and we increased elasticity and reliability with no downtime for our global customer base.”