Artificial Intelligence
Facebook uses Amazon EC2 to evaluate the Deepfake Detection Challenge
In October 2019, AWS announced that it was working with Facebook, Microsoft, and the Partnership on AI on the first Deepfake Detection Challenge. Deepfake algorithms are the same as the underlying technology that has given us realistic animation effects in movies and video games. Unfortunately, those same algorithms have been used by bad actors to blur the distinction between reality and fiction. Deepfake videos result from using artificial intelligence to manipulate audio and video to make it appear as though someone did or said something they didn’t. For more information about deepfake content, see The Partnership on AI Steering Committee on AI and Media Integrity.
In machine learning (ML) terms, the Generative Adversarial Networks (GAN) algorithm has been the most popular algorithm to create deepfakes. GANs use a pair of neural networks: a generative network that produces candidates by adding noise to the original data, and a discriminative network that evaluates the data until it determines they aren’t synthesized. GANs matches one network against the other in an adversarial manner to generate new, synthetic instances of data that can pass for real data. This means the deepfake is indistinguishable from a normal dataset.
The goal of this challenge was to incentivize researchers around the world to build innovative methods that can help detect deepfakes and manipulated media. The competition, which ended on March 31, 2020, was popular amongst the Kaggle data science community. The deepfake project emphasized the benefits of scaling and optimizing the cost of deep learning batch inference. Once the competition was complete, the team at Facebook hosted the deepfake competition data on AWS and made it available to the world, encouraging researchers to keep fighting this problem.
There were over 4,200 total submissions from over 2,300 teams worldwide. The participating submissions are scored with the following log loss function, where a smaller score is better (for more information about scoring, see the contest rules):

Four groups of datasets were associated with the competition:
- Training – The participating teams used this set for training their model. It consisted of 470 GB of video files, with real and fake labels for each video.
- Public validation – Consisted of a sample of 400 videos from the test dataset.
- Public test – Used by the Kaggle platform to compute the public leaderboard.
- Private test – Held by the Facebook team, the host outside of the Kaggle competition platform for scoring the competition. The results from using the private test set were displayed on the competition’s private leaderboard. This set contains videos with a similar format and nature as the training and public validation and test sets, but contain real, organic videos as well as deepfakes.
After the competition deadline, Kaggle transferred the code for the two final submissions from each team to the competition host. The hosting team re-ran the submission code against this private dataset and returned prediction submissions to Kaggle to compute the final private leaderboard scores. The submissions were based on two types of compute virtual machines (VMs): GPU-based and CPU-based. Most of the submissions were GPU-based.
The competition hosting team at Facebook recognized several challenges in conducting an evaluation from the unexpectedly large number of participants. With over 4,200 total submissions and 9 GPU hours of runtime required for each using a p3.2xl Amazon Elastic Compute Cloud (Amazon EC2) P3 instance; they would need an estimated 42,000 GPU compute hours (or almost 5 years’ worth of compute hours) to complete the competition. To make the project even more challenging, they needed to do 5 years of GPU compute in 3 weeks.
Given the tight deadline, the host team had to address several constraints to complete the evaluation within the time and budget allotted.
Operational efficiency
To meet the tight timeframes for the competition and make the workload efficient due to the small team size, the solution must be low-code. To address the low-code requirement, they chose AWS Batch for scheduling and scaling out the compute workload. The following diagram illustrates the solution architecture.

AWS Batch was originally designed for developers, scientists, and engineers to easily and efficiently manage large numbers of batch computing jobs on AWS with little coding or cloud infrastructure deployment experience. There’s no need to install and manage batch computing software or server clusters, which allows you to focus on analyzing and solving problems. AWS Batch provides scheduling and scales out batch computing workloads across the full range of AWS compute services, such as Amazon EC2 and Spot Instances. Furthermore, AWS Batch has no additional charges for managing cluster resources. In this use case, the host simply submitted 4,200 compute jobs, which registered each Kaggle submission container, which ran for about 9 hours each. Using a cluster of instances, all jobs were complete in less than three weeks.
Elasticity
The tight timeframes for the competition, as well as requiring those instances for only a short period, speaks to the need for elasticity in compute. For example, the team estimated they would need a minimum of 85 Amazon EC2 P3 GPUs running in parallel around the clock to complete the evaluation. To account for restarts and other issues causing lost time, there was the potential for an additional 50% in capacity. Facebook was able to quickly scale up the number of GPUs and CPUs needed for the evaluation and scale them down when finished, only paying for what they used. This was much more efficient in terms of budget and operations effort than acquiring, installing, and configuring the compute on-premises.
Security
Security was another significant concern. Submissions from such a wide array of participants could contain viruses, malware, bots, or rootkits. Running these containers in a sandboxed, cloud environment avoided that risk. If the evaluation environment was exposed to various infectious agents, the environment could be terminated and easily rebuilt without exposing any production systems to downtime or data loss.
Privacy and confidentiality
Privacy and confidentiality are closely related to the security concerns. To address those concerns, all the submissions and data were held in a single, closely held AWS account with private virtual private clouds (VPCs) and restrictive permissions using AWS Identity and Access Management (IAM). To ensure privacy and confidentiality of the submitted models, and fairness in grading, a single, dedicated engineer was responsible for conducting the evaluation without looking into any of the Docker images submitted by the various teams.
Cost
Cost was another important constraint the team had to consider. A rough estimate of 42,000 hours of Amazon EC2 P3 instance runtime would cost about $125,000.
To lower the cost of GPU compute, the host team determined that the Amazon EC2 G4 (Nvida Tesla T4 GPUs) instance type was more cost-effective for this workload than the P3 instance (Volta 100 GPUs). Amongst the GPU instances in the cloud, Amazon EC2 G4 are cost-effective and versatile GPU instances for deploying ML models.
These instances are optimized for ML application deployments (inference), such as image classification, object detection, recommendation engines, automated speech recognition, and language translation, which push the boundary on AI innovation and latency.
The host team completed a few test runs with the G4 instance type. The test runtime for each submission resulted in a little over twice the comparative runtime of the P3 instances, resulting in the need for approximately 90,000 compute hours. The G4 instances cost up to 83% less per hour than the P3 instances. Even with longer runtimes per job with the G4 instances, the total compute cost decreased from $125,000 to just under $50,000. The following table illustrates the cost-effectiveness of the G4 instance type per inference.
| p3.2xl | g4dn.8xl | |
| Runtime (hours) | 90,000 | 25,000 | 
| Cost (USD) | $125,000 | $50,000 | 
| Cost per Inference | $30 | $12 | 
The host team shared that many of the submission runs completed with less compute time than originally projected. The initial projection was based upon early model submissions, which were larger than the average size for all models submitted. About 80% of the runs took advantage of the G4 instance type, while some had to be run on the P3 instances due to slight differences in available GPU memory between the two instance types. The final numbers were 25,000 G4 (GPU) compute hours, 5,000 C4 (CPU) compute hours, and 800 P3 (GPU) compute hours, totaling $20,000 in compute cost. After approximately two weeks of around-the-clock evaluation, the host team completed the challenging task of evaluating all the submissions early and consumed less than half of the $50,000 estimate.
Conclusion
The host team was able to complete a full evaluation of the over 4,200 submission evaluations in less time than was available, while meeting the grading fairness criteria and coming in under budget. The host team successfully replicated the evaluation environment with a success rate of 94%, which is high for a two-stage competition.
Software projects are often risk-prone due to technological uncertainties, and perhaps even more so due to inherent complexity and constraints. The breadth and depth of AWS services running on Amazon EC2 allow you to solve your unique challenges by reducing technology uncertainty. In this case, the Facebook team completed the deepfake evaluation challenge on time and under budget with only one software engineer. The engineer started by selecting a low-code solution, AWS Batch, which is a proven service for even larger-scale HPC workloads, and reduced the evaluation cost by 2/3 through the choice of the AI inference-optimized G4 EC2 instance type.
AWS believes there’s no one solution to a problem. Solutions often consist of multiple and flexible building blocks from which you can craft solutions that meet your needs and priorities.
About the Authors
 Wenming Ye is an AI and ML specialist architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming had a diverse R&D experience at Microsoft Research, SQL engineering team, and successful startups.
Wenming Ye is an AI and ML specialist architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming had a diverse R&D experience at Microsoft Research, SQL engineering team, and successful startups.
 Tim O’Brien is a Senior Solutions Architect at AWS focused on Machine Learning and Artificial Intelligence. He has over 30 years of experience in information technology, security, and accounting. In his spare time, he likes hiking, climbing, and skiing with his wife and two dogs.
Tim O’Brien is a Senior Solutions Architect at AWS focused on Machine Learning and Artificial Intelligence. He has over 30 years of experience in information technology, security, and accounting. In his spare time, he likes hiking, climbing, and skiing with his wife and two dogs.