Dropbox Migrates 34 PB of Data to an Amazon S3 Data Lake for Analytics
In 2018, Dropbox identified a need to migrate away from its on-premises Hadoop clusters. Because it was responsible for the durability of the data stored in Apache Hadoop, Dropbox had to be conservative about which technologies it could experiment with. The clusters held petabytes of analytical data, including server logs, instrumentation, and metadata related to Dropbox’s more than 600 million global customers. To successfully and efficiently innovate and improve customer experience, the company looked to Amazon Web Services (AWS).
Dropbox has since moved 34 PB of analytics data to a data lake in Amazon Simple Storage Service (Amazon S3)—an object storage service that offers industry-leading scalability, data availability, security, and performance—and uses Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot Instances to power the compute for its Hadoop clusters. These AWS services enable Dropbox to cost-effectively scale storage and compute independently without planning for capacity and to test new technologies without fear of degrading its users’ experience, ultimately enabling Dropbox to innovate faster while saving money.
“Amazon S3 works independently of Amazon EC2, so we can scale the services without having to wait for hardware lead times or account for depreciation.”
Ashish Gandhi, Technical Lead for Data Infrastructure Teams, Dropbox
Migrating Petabytes of Data to Amazon S3
Beginning as a central hub for file storage in 2007, Dropbox has evolved to offer many business solutions in the collaboration space and is now a global company. It serves more than 500,000 business teams and has more than half a billion users. Though best known for its file-syncing product, Dropbox also offers tools for productivity, team management, data security, and more.
Before moving its analytics data to AWS, Dropbox stored this data in on-premises Hadoop Distributed File System clusters. Dropbox had to invest in custom patches for open-source Hadoop so that these systems could scale to its needs. At-scale testing of new versions of big data frameworks like Hadoop or Apache Hive—open-source frameworks for the distributed processing of large datasets—was an expensive and inherently risky process. In the worst-case scenario, software upgrades risked having to be rolled back, resulting in downtime. Over time, the company’s on-premises Hadoop clusters also required custom automation to operate, an additional burden for Dropbox’s team of eight engineers. The team had to plan in advance for capacity: “We had to predict our needs at least 3 years into the future,” says Ashish Gandhi, technical lead for data infrastructure teams at Dropbox. “Doing that is a very tedious process, and it means time spent not building. It also locks you into certain assumptions about your workload profile for a long period of time.”
Dropbox migrated all its analytics data from its on-premises Hadoop Distributed File System infrastructure to a data lake built on Amazon S3 in 2019, and this data is growing by more than 1 PB a month. Dropbox followed a lift-and-shift migration strategy, initially minimizing changes to its architecture. After setting up the AWS environment to shadow all production workloads, Dropbox operated two live environments for a period of 1 month and used a validation system to compare outputs generated in AWS against what was being produced on premises. Any unacceptable deviation triggered a custom-built incremental synchronization system for data and metadata, which reset the AWS environment to match the on-premises environment and restarted the month-long validation period. Now Dropbox teams run their analytics workloads on Hadoop clusters on AWS, powered by Amazon EC2 and Amazon EC2 Spot Instances, which lets them take advantage of unused Amazon EC2 capacity. The analytics environment on AWS enables Dropbox’s product owners and engineers to make decisions about what features to build and helps them monitor their success. It also enables the sales and marketing teams to provide critical business intelligence and the data scientists to build sophisticated machine learning functionality to serve Dropbox’s customers.
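Dropbox’s validation system is proprietary, but its core idea—fingerprinting the output of each shadow job and comparing it against the on-premises output—can be sketched as follows. The function names and the row-level checksum approach are illustrative assumptions, not Dropbox’s actual implementation:

```python
import hashlib


def output_fingerprint(rows):
    """Order-insensitive fingerprint of a job's output rows.

    Sorting before hashing means two environments that produce the
    same rows in a different order still match.
    """
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # delimit rows so "ab"+"c" != "a"+"bc"
    return digest.hexdigest()


def environments_match(onprem_rows, aws_rows):
    """True when the shadow (AWS) output matches the on-premises output."""
    return output_fingerprint(onprem_rows) == output_fingerprint(aws_rows)
```

In a setup like this, any job for which `environments_match` returns `False` would flag an unacceptable deviation and trigger resynchronization.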
Realizing Scalability, Flexibility, and Savings on AWS
Using an Amazon S3 data lake and Hadoop clusters on Amazon EC2 enabled Dropbox to separate storage and computing. “Now we don’t have to predict how much data we need to store,” says Gandhi. “With on-premises machines, if you suddenly need more compute than storage, all hardware planning goes to waste because changing the mix requires a different machine type to be cost efficient. Being in the cloud, Amazon S3—which handles our storage—works independently of Amazon EC2, which handles our compute needs. So we can scale the services completely independently without having to wait for hardware lead times or account for depreciation.” On AWS, Dropbox runs more than 100,000 analytics jobs and tens of thousands of one-time jobs daily.
Since the same Amazon S3 data can be accessed from multiple Hadoop environments on Amazon EC2, Dropbox can quickly iterate and try new technologies by launching experimental clusters, without risking damage to the data or negative impact on its users. “As long as we store data in standard formats on Amazon S3, we unlock the environment of tools and services that can operate on Amazon S3 directly,” explains Gandhi. For example, Dropbox tried cutting-edge versions of Hadoop, and when it saw problems, it rolled them back without affecting users. The company has also rolled out Apache Tez as an execution engine, which has improved performance by as much as six times.
Dropbox takes advantage of Amazon S3 versioning to achieve pain-free restores of earlier versions of objects stored in Amazon S3, enabling Dropbox to recover from unintended user actions. It also adds protection against software defects that can corrupt data at the lower levels of the stack. “We definitely have had users who have said, ‘Oh, I didn’t mean to delete this or write that!’ And we’ve been able to help them go back and look over their old stuff with ease because of Amazon S3 versioning,” says Gandhi.
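Restoring an earlier version with Amazon S3 versioning amounts to finding the most recent non-current version of an object and copying it back over the current one. A minimal sketch of that selection logic, using the dict shapes returned by S3’s `list_object_versions` and accepted by `copy_object` (the helper function names are hypothetical):

```python
def previous_version_id(versions):
    """Given 'Versions' entries from S3 list_object_versions (dicts with
    'VersionId', 'IsLatest', and 'LastModified'), return the version id
    of the most recent non-current version, or None if none exists."""
    older = [v for v in versions if not v["IsLatest"]]
    if not older:
        return None
    older.sort(key=lambda v: v["LastModified"], reverse=True)
    return older[0]["VersionId"]


def restore_copy_args(bucket, key, version_id):
    """Arguments for s3.copy_object that promote an earlier version of an
    object back to being its current version."""
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key, "VersionId": version_id},
    }
```

With boto3, passing the returned dict to `s3.copy_object(**restore_copy_args(...))` would write the older version as a new current version, leaving the full version history intact.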
The use of Amazon EC2 Spot Instances, which cost up to 90 percent less than On-Demand Instances, has resulted in significant savings for Dropbox. On average, Dropbox uses Spot Instances for 15 percent of its compute capacity, but it has increased that to 50 percent when it needed burst capacity for backfills, among other things. “We have doubled our footprint when necessary through Spot Instances,” explains Gandhi.
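Requesting Spot capacity is a matter of setting `InstanceMarketOptions` in the EC2 `run_instances` call. A minimal sketch of building those request parameters (the AMI id, instance type, and helper name below are placeholders, not details from Dropbox’s environment):

```python
def spot_launch_params(ami_id, instance_type, count):
    """Parameters for ec2.run_instances that request Spot capacity.

    Spot Instances draw on unused EC2 capacity at up to 90% below the
    On-Demand price, in exchange for being interruptible.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    }
```

With boto3, `ec2.run_instances(**spot_launch_params("ami-12345678", "r5.4xlarge", 20))` would launch 20 interruptible worker nodes; a burst workload such as a backfill could simply request a larger `count`.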
To further reduce costs, Dropbox began to migrate data that it infrequently or never used—such as older analytics data or data kept for just-in-case scenarios—to Amazon S3 Glacier Deep Archive, a secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup. So far, Dropbox has moved about 30 percent of its data. “Amazon S3 Glacier Deep Archive has enabled us to improve our data lifecycle management,” says Gandhi. “We always have the option to restore or delete data. It’s more cost effective than the Amazon S3 standard storage class, so for infrequently accessed data, it makes complete sense.”
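Moving cold data to S3 Glacier Deep Archive is typically automated with a bucket lifecycle rule rather than manual copies. A sketch of one such rule, in the shape accepted by S3’s `put_bucket_lifecycle_configuration` (the prefix and transition age are illustrative assumptions):

```python
def deep_archive_lifecycle_rule(prefix, days):
    """An S3 lifecycle rule that transitions objects under `prefix` to the
    DEEP_ARCHIVE storage class once they are `days` days old."""
    return {
        "ID": "archive-" + prefix.strip("/"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
    }
```

Passing a list of such rules as `{"Rules": [...]}` to `put_bucket_lifecycle_configuration` lets S3 move aging analytics data to the archive tier automatically, while the data remains restorable on demand.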
In September 2020, Dropbox began to evaluate the private beta of Amazon S3 strong consistency. With strong consistency, Amazon S3 delivers strong read-after-write consistency automatically for all applications for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost. With strong consistency for Amazon S3, Dropbox has been able to simplify the data lake architecture by eliminating the use of S3Guard, an open-source software that was previously used to manage consistency on Amazon S3. Amazon S3 strong consistency has increased the performance of the jobs running in the environment, and the company has cut the amount of time needed to delete hundreds of files from 30–40 minutes (in extreme cases) to just a few seconds. In one incident, when the S3Guard-based system was not scaling well, thereby increasing latencies for the storage operations and affecting job performance, the Dropbox team had to spend 4–5 days implementing a custom solution for that use case. This wouldn’t have been necessary with strong consistency for Amazon S3.
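Deleting hundreds of files quickly relies on S3’s batch `DeleteObjects` API, which accepts at most 1,000 keys per request. A sketch of the batching logic (the helper name is hypothetical; only the 1,000-key API limit is a real S3 constraint):

```python
def delete_batches(keys, batch_size=1000):
    """Split object keys into payloads for s3.delete_objects.

    The S3 DeleteObjects API accepts at most 1,000 keys per request, so
    larger deletions must be chunked into multiple calls.
    """
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {"Objects": [{"Key": k} for k in chunk], "Quiet": True}
```

With boto3, each yielded payload would be passed as `s3.delete_objects(Bucket=bucket, Delete=payload)`; with strong consistency, subsequent listings immediately reflect the deletions without any external bookkeeping layer such as S3Guard.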
Pursuing Even Better Performance on AWS
Looking forward, Dropbox anticipates additional performance improvements with Apache Tez and Apache Spark. “Being on AWS helps with rolling out new frameworks because a lot of open-source tools and third-party systems work really well alongside AWS services like Amazon S3 and the open-standard Amazon S3 APIs,” says Gandhi.
Dropbox is also building a data catalog management and data lineage solution on AWS. By the end of 2020, Dropbox expects to increase confidence in the metrics powered by the analytics infrastructure. Accomplishing this had been a major goal of the company, according to Gandhi: “For important company metrics, we want to have trustworthy data and pipelines. We have a small engineering team, but on AWS, we are actually making really good progress on that front.” Gandhi was also appreciative of the help AWS provided in building this environment: “The AWS team—both account managers and service team engineers—are highly responsive and accessible in their approach. Our collaboration has been a great experience.”
Using Amazon S3 and Amazon EC2 instances, Dropbox has decreased operational overhead, successfully implemented new technologies that its on-premises solution had blocked, and improved its customer experience. According to Gandhi, “On AWS, we have the freedom to move forward.”
Founded in 2007 and headquartered in San Francisco, Dropbox offers workspace management services like online file storage, file sharing, and syncing for more than 450,000 business teams and more than 500 million users across the world.
Benefits of AWS
- Hosts 40 PB of analytics data and supports 1 PB of data growth a month
- Optimizes costs by moving cold data to Amazon S3 Glacier Deep Archive
- Uses Amazon EC2 Spot Instances for 15–50% of compute capacity
- Doubles compute footprint using Amazon EC2 Spot Instances
- Enables the testing of new technologies without damaging data or affecting users
- Improved performance by six times for some job types
- Deletes hundreds of files in a few seconds compared to 30–40 minutes
- Runs more than 100,000 analytics jobs and tens of thousands of one-time jobs daily
AWS Services Used
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
S3 Glacier Deep Archive
Amazon S3 Glacier and S3 Glacier Deep Archive are secure, durable, and extremely low-cost Amazon S3 cloud storage classes for data archiving and long-term backup.