"Amazon Aurora was the easiest part of the migration. It never gave us the slightest problem."

Josh Gage, Senior Software Development Engineer, Amazon.com
  • About Amazon.com

    Amazon.com is the world’s leading online retailer and the pioneer of customer reviews, 1-Click shopping, personalized recommendations, Prime, AWS, Kindle, Alexa, and many more products and services.

     

  • Benefits

    • Moved 40 TB of data with only 1 hour of downtime
    • Migrated in only 6 months
    • Scaled to 900 transactions per second per shard with minimal CPU usage
    • Delivers performance similar to Oracle at less than half the cost

As Amazon.com grew from a one-person startup in 1994 to one of the leading e-commerce sites in the world today, the company overcame challenge after challenge. Success brings its own challenges, though, and now the company faces one that will only intensify the more successful Amazon becomes.

"As one of the world's largest online retailers, Amazon is also one of the world's largest targets for online fraud," says Balachandra Krishnamurthy, a software development manager on the Amazon Transaction Risk Management Services (TRMS) team. "Customers make hundreds of purchases per second on our website and mobile app, and every one of those transactions must be screened for fraud."

To do this, the TRMS team's Buyer Fraud Service (BFS) system collects more than 2,000 real-time and historical data points for each order and uses machine-learning algorithms to detect and prevent those with a high probability of being fraudulent. BFS prevents millions of dollars in fraudulent transactions every year.

"We put immense resources into this fight, not only to protect Amazon's bottom line but also to maintain the high trust our customers and sellers place in us," says Krishnamurthy. "Amazon has a reputation as a platform with very high security standards, and we are committed to upholding that reputation every second of every day."

With this commitment in mind, TRMS decided to migrate its more than 100 on-premises Oracle databases, which stored the 40 TB of data its machine-learning models use to identify fraudulent transactions, to Amazon Web Services (AWS).

Running on Oracle posed many challenges for TRMS, including complicated database administration that required the full-time attention of three engineers. Under peak loads, the team also saw latency too high for the service to operate effectively, and addressing these issues required complex, multiyear engineering projects. Finally, the team spent 100 hours provisioning hardware in 2017, not including installation and testing, time it hoped to allocate to more strategic work.

Because the Buyer Fraud Service is a critical application that must operate at 99.995 percent availability, TRMS decided to use PostgreSQL-compatible Amazon Aurora as the new platform to host its databases. Amazon Aurora, a cloud database service that also offers MySQL compatibility, combines the performance and availability of Oracle with the simplicity of open-source databases and is up to three times faster than standard PostgreSQL databases.
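To put the 99.995 percent availability target in concrete terms, the short calculation below converts an availability percentage into a yearly downtime budget. It is illustrative arithmetic only: the function name and the choice of a 365-day year are assumptions made for this sketch, not details from the TRMS team.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
# Assumption: a 365-day year; the function name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum minutes of downtime per year for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


budget = downtime_budget_minutes(99.995)
print(f"{budget:.1f} minutes per year")
```

At 99.995 percent, the service can be unavailable for roughly 26 minutes over a full year.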

As strong as the case was for moving to Amazon Aurora, the team knew that migrating a large-scale system that operates at such high throughput and availability would also pose significant challenges. "The daunting part of this migration was having to move such a large database, with the number of transactions it handles, with minimal downtime," says Josh Gage, a senior software development engineer on the TRMS team. "At 40 TB, we were the largest database migration to AWS in the history of the company."

To minimize the technical complexity of the migration, TRMS decided to re-platform the Buyer Fraud Service and postpone re-architecting it. "We decided we wanted to re-platform so as to accelerate the migration as much as possible while minimizing disruptions," says Krishnamurthy. "We will look at further optimizing the service design and database schemas at a later phase."

To accomplish the project quickly and securely, the team used a migration stack that included AWS Database Migration Service (AWS DMS), which supports migrations to and from leading commercial and open-source databases. During migrations, AWS DMS automatically replicates any changes in the source data to the target database, so the source database can remain operational until the final switchover.
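The replication pattern described above (an initial copy followed by ongoing replay of source changes) can be sketched with a toy in-memory model. This is not the AWS DMS API or the team's actual tooling: the `Change` type, the `full_load` and `apply_changes` helpers, and the sample records are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    """One captured change from the source (hypothetical structure)."""
    op: str                      # "insert", "update", or "delete"
    key: int
    value: Optional[str] = None  # unused for deletes


def full_load(source: Dict[int, str]) -> Dict[int, str]:
    """Initial bulk copy of the source into a new target."""
    return dict(source)


def apply_changes(table: Dict[int, str], changes: List[Change]) -> None:
    """Replay captured changes, in order, against a table."""
    for change in changes:
        if change.op == "delete":
            table.pop(change.key, None)
        else:  # inserts and updates are both upserts in this toy model
            table[change.key] = change.value


# The source stays live: it is copied once, then keeps accepting writes.
source = {1: "order-a", 2: "order-b"}
target = full_load(source)

changes = [
    Change("insert", 3, "order-c"),
    Change("update", 1, "order-a2"),
    Change("delete", 2),
]
apply_changes(source, changes)  # writes land on the still-operational source
apply_changes(target, changes)  # replication replays the same changes on the target

# Once the target has caught up, the two match and a switchover is possible.
print(target == source)
```

In this simplified model the final switchover only needs to wait for the last few changes to be replayed, which is why the source can keep serving traffic for most of the migration.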

Despite the massive amount of data being moved, the migration project took only six months to complete and required just one hour of downtime. Gage gives much of the credit for the successful project to the flexibility and ease of use of Amazon Aurora. For example, the ability to create Aurora Read Replicas was a big help during the migration.

"As we were migrating, we were able to spin up new instances of our database that were fully synced with the Oracle database in about an hour," says Gage. "That gave us the flexibility to experiment with new approaches and find just the right ones."

Krishnamurthy says he is more than happy with the performance and stability of the new solution. "Things have been running very smoothly on Aurora since the migration," he says. "There have been zero database outages, and we no longer have to worry about the execution plan flips we used to experience on Oracle."

Now that AWS is responsible for most management tasks—such as patching, maintenance, backups, and upgrades—engineers can turn their attention to more valuable work. "We used to need three database engineers to keep Oracle up to date and take care of performance improvement tasks like repartitioning and index tuning," says Gage. "Because Aurora reduced our administrative overhead by about 70 percent, we don't need those resources just to keep our heads above water and can shift them to more valuable tasks."

The migration also resulted in considerable cost savings. "Another big benefit of the migration is the lower cost of database hosting on AWS," says Krishnamurthy. "On Amazon Aurora, we see performance levels similar to what we saw on Oracle at less than half the cost."

And, using AWS, the team no longer has any concerns about scale. "After the migration, we load-tested the new and the old systems up to 900 transactions per second per shard," says Gage. "Aurora had no problem handling the load, with minimal CPU usage, while Oracle browned out. We also had no problems in any of our regions on Amazon Prime Day, showing that Aurora can handle our peak traffic with ease."

Gage says the project was much less challenging than some might have expected. "In the end, migrating from Oracle onto AWS turned out to be pretty simple. We hit obstacles, but we were able to overcome them either on our own or with the help of the AWS DMS team, who really provided exemplary support of their product. Amazon Aurora was the easiest part of the migration. It never gave us the slightest problem."