AWS for M&E Blog

How GoPro optimized its media transcoding pipeline at scale with Amazon EC2 Spot Instances

This blog was co-authored by Vlad-Alexandru Voinea, DevOps Engineer at GoPro, and Zaven Boni, DevOps Manager at GoPro.

A photo example showing kayaking taken by GoPro camera

GoPro helps the world capture and share imagery in immersive and exciting ways. Since its founding in 2002, the brand has produced versatile cameras and software tools to help its customers get the most value and enjoyment from their content. Beyond GoPro’s core lineup of cameras, mounts, accessories, and lifestyle gear, the GoPro Subscription, which is supported by Amazon Web Services (AWS), ties together the GoPro experience—including unlimited cloud storage, automatic highlight videos, convenient editing capabilities synced across devices, a live streaming platform, and the ability to share content directly to social media platforms. Currently, GoPro handles massive amounts of video data each month, serving approximately 2.5 million subscribers.

Figure 1: GoPro media processing pipeline showing ingest, processing, and output.

GoPro uses a proprietary transcoding and media processing pipeline in which massive compute capacity supports a variety of devices, bitrates, codecs, and bandwidths. When content is ingested into Amazon Simple Storage Service (Amazon S3), preprocessing jobs run to collect metadata, resolution, and codec information and to create video thumbnails. Different classes of transcoding jobs run on different Amazon Elastic Container Service (Amazon ECS) clusters, using Docker containers and custom configuration builds of FFmpeg on Amazon EC2. Other proprietary tools on Amazon EC2 handle combining video and audio, GPMF (GoPro Metadata Format, or General Purpose Metadata Format) data, and demux and mux operations.

GoPro originally deployed its subscription service on AWS to enable end users to ingest, process, store, and share content. As subscriber numbers and user-generated content volumes grew over time, GoPro required a more adaptive content processing pipeline with flexible and scalable infrastructure.

For seasonal traffic with high peaks and low valleys, instance auto scaling provides the elasticity to handle fluctuations in compute demand during transcoding. Because the transcoding workflow is asynchronous and message- and event-driven, multiple jobs can run in parallel, and the workload can tolerate interruptions. GoPro deployed its Amazon ECS cluster on Amazon EC2 Spot Instances for more cost-effective compute, saving the company around 50%-70% compared with On-Demand spend. Amazon EC2 Spot Instances are available at up to a 90% discount compared to On-Demand prices.

Figure 2: GoPro media processing workflow in AWS.

GoPro implemented an automatic fallback to On-Demand Instances in case a Spot request fails. When auto scaling groups launch instances to meet demand during peak traffic and Spot Instances are not available, transcoding falls back to On-Demand Instances so that workflows can continue. GoPro also automated the draining of interrupted Spot Instances through a series of Terraform scripts.
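
The fallback idea can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not GoPro's implementation: the `SpotUnavailable` exception, the `request_capacity` function, and the two launcher callables are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


class SpotUnavailable(Exception):
    """Hypothetical error raised when a Spot request cannot be fulfilled."""


def request_capacity(
    count: int,
    launch_spot: Callable[[int], List[str]],
    launch_on_demand: Callable[[int], List[str]],
) -> Tuple[str, List[str]]:
    """Try Spot first; fall back to On-Demand so transcoding keeps running."""
    try:
        return "spot", launch_spot(count)
    except SpotUnavailable:
        # No Spot capacity: accept On-Demand pricing rather than stall the jobs.
        return "on-demand", launch_on_demand(count)
```

The caller receives both the purchase option that was used and the launched instance identifiers, so a fallback can be logged or counted separately from a normal Spot launch.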

GoPro uses attribute-based instance type selection for Amazon ECS clusters and uses both CPU and GPU instances in auto scaling groups.

instance_requirements = {
  memory_mib_min          = 32768
  memory_mib_max          = 131072
  vcpu_count_min          = 32
  vcpu_count_max          = 64
  cpu_manufacturers       = ["amd", "intel"]
  excluded_instance_types = (["a*", "t*", "m6g*", "c6g*", "c7g*", "d*", "h*", "i*", "g*", "p*", "inf1*", "dl*", "vt*", "x*", "r6g*", "z1*", "f1*"])
}
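
To show how such requirements narrow down the candidate instance types, here is a small Python sketch that applies the same limits to a hand-written catalog. The `CATALOG` entries, the `matches` and `eligible_types` helpers, and the use of `fnmatch` for the wildcard exclusions are assumptions made for illustration; the real selection is performed by Amazon EC2, not by user code.

```python
from fnmatch import fnmatch

# Requirements copied from the configuration above.
REQUIREMENTS = {
    "memory_mib_min": 32768,
    "memory_mib_max": 131072,
    "vcpu_count_min": 32,
    "vcpu_count_max": 64,
    "cpu_manufacturers": ["amd", "intel"],
    "excluded_instance_types": [
        "a*", "t*", "m6g*", "c6g*", "c7g*", "d*", "h*", "i*", "g*", "p*",
        "inf1*", "dl*", "vt*", "x*", "r6g*", "z1*", "f1*",
    ],
}

# Illustrative catalog (assumption): name -> (vCPUs, memory in MiB, CPU maker).
CATALOG = {
    "c5.9xlarge": (36, 73728, "intel"),
    "m5.8xlarge": (32, 131072, "intel"),
    "m5a.8xlarge": (32, 131072, "amd"),
    "r5.4xlarge": (16, 131072, "intel"),                # too few vCPUs
    "c5.18xlarge": (72, 147456, "intel"),               # above both maximums
    "c6g.8xlarge": (32, 65536, "amazon-web-services"),  # excluded by "c6g*"
    "g4dn.8xlarge": (32, 131072, "intel"),              # excluded by "g*"
}


def matches(name, spec, req=REQUIREMENTS):
    """Return True if one instance type satisfies every requirement."""
    vcpus, memory_mib, manufacturer = spec
    if any(fnmatch(name, pattern) for pattern in req["excluded_instance_types"]):
        return False
    return (
        req["vcpu_count_min"] <= vcpus <= req["vcpu_count_max"]
        and req["memory_mib_min"] <= memory_mib <= req["memory_mib_max"]
        and manufacturer in req["cpu_manufacturers"]
    )


def eligible_types(catalog=CATALOG):
    """List the catalog entries that pass the filter, sorted by name."""
    return sorted(name for name, spec in catalog.items() if matches(name, spec))
```

With this catalog, `eligible_types()` keeps only `c5.9xlarge`, `m5.8xlarge`, and `m5a.8xlarge`. The practical benefit of attribute-based selection is that newly released instance types meeting the same limits become eligible without editing a list.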

All other auto scaling groups have instance lists as shown in the following example.

spot-instance-type = [
  { instance_type = "r5b.4xlarge" },
  { instance_type = "r4.4xlarge" },
  ...
]

spot-instance-type = [{ instance_type = "g4dn.xlarge" }]

spot-instance-type = [
  { instance_type = "c3.4xlarge" },
  { instance_type = "c3.8xlarge" },
  { instance_type = "c4.4xlarge" },
  { instance_type = "c4.8xlarge" },
  ...
]

Auto scaling groups are configured with the capacity-optimized allocation strategy, which launches Spot Instances from the pools with the most available capacity for the number of instances being requested, reducing the likelihood of interruption.

spot-allocation-strategy     = "capacity-optimized"
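
For readers who work with the AWS API directly rather than through Terraform variables, the same setting appears in the `MixedInstancesPolicy` structure of an Auto Scaling group. The sketch below only builds that structure as a Python dictionary; the helper name `mixed_instances_policy` and the launch template name used in the usage note are hypothetical, and the post does not say which other distribution settings GoPro uses.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_percentage=0):
    """Build a MixedInstancesPolicy dict that prefers capacity-optimized Spot."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Share of capacity above the base that must be On-Demand.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }
```

A dictionary such as `mixed_instances_policy("transcode-template", ["c4.4xlarge", "c4.8xlarge"])` could then be passed as the `MixedInstancesPolicy` argument when creating or updating an Auto Scaling group.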

The workflow resiliency is handled through a series of AWS Lambda functions:

  1. Auto Scaling group scale-in CloudWatch events trigger a Lambda function (called General Drain) that drains the terminating instance.
  2. Another Lambda function (called Spot Drain) is specific to Spot Auto Scaling groups (ASGs) and is triggered by an incoming EC2 Spot Instance Interruption Warning.
  3. A third Lambda function collects Spot metrics and is triggered on InstanceTerminated, RunInstances, and EC2 Spot Interruption events.
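
The Spot Drain step in the list above might look roughly like the following Python Lambda handler. It is a sketch under assumptions, not GoPro's code: the `ECS_CLUSTER` environment variable, its default value, and the optional `ecs` parameter (added so the logic can be exercised without AWS access) are hypothetical. The event fields and the two Amazon ECS calls follow the public EC2 Spot interruption event format and the boto3 ECS client.

```python
import os

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def drain_instance(ecs, cluster, instance_id):
    """Set the ECS container instance backed by this EC2 instance to DRAINING."""
    listed = ecs.list_container_instances(
        cluster=cluster,
        filter=f"ec2InstanceId == {instance_id}",
    )
    arns = listed.get("containerInstanceArns", [])
    if arns:
        # DRAINING stops new task placement so running work can move elsewhere.
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=arns,
            status="DRAINING",
        )
    return arns


def handler(event, context, ecs=None):
    """Lambda entry point: react only to Spot interruption warnings."""
    if event.get("detail-type") != SPOT_WARNING:
        return []
    if ecs is None:
        import boto3  # imported lazily; available in the Lambda Python runtime
        ecs = boto3.client("ecs")
    cluster = os.environ.get("ECS_CLUSTER", "transcode-cluster")  # assumed name
    return drain_instance(ecs, cluster, event["detail"]["instance-id"])
```

Because the interruption warning arrives about two minutes before the instance is reclaimed, draining at that point gives the asynchronous transcoding jobs a chance to be rescheduled on other instances.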

Figure 3: Transcoding workflow with Amazon EC2 scaling

The entire workflow is monitored on a custom-built GoPro dashboard powered by AWS Lambda. This tooling provides visibility into Amazon EC2 Spot Instances running Amazon ECS workloads, showing Spot interruption rates, instance types, and other Spot-related statistics (Figure 4).
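
As an illustration of the kind of statistic such a dashboard could show, the following Python sketch computes an interruption rate per instance type from a stream of launch and interruption records. The record format and the `interrupt_rates` function are assumptions introduced here; the post does not describe how GoPro's metrics are actually stored or calculated.

```python
from collections import Counter


def interrupt_rates(events):
    """Return interruptions divided by launches for each instance type.

    `events` is an iterable of (kind, instance_type) pairs, where kind is
    either "launch" or "interruption" (a hypothetical record format).
    """
    launches, interruptions = Counter(), Counter()
    for kind, instance_type in events:
        if kind == "launch":
            launches[instance_type] += 1
        elif kind == "interruption":
            interruptions[instance_type] += 1
    # Types with no recorded launches are skipped to avoid dividing by zero.
    return {t: interruptions[t] / n for t, n in launches.items() if n}
```

Comparing these rates across instance types is one way to decide which types to keep in or remove from an auto scaling group's instance list.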

Figure 4: An example of a dashboard showing Spot-related statistics.

After adding Amazon EC2 Spot Instances and architecting for potential interruptions, GoPro is running over 70% of all containerized workloads on Spot Instances, driving significant savings.

Conclusion

In this blog post, we described how Amazon EC2 Spot Instances enabled GoPro to use compute resources more cost-effectively and to improve its transcoding process in the media supply chain at scale. More details about Spot Instances are available in the Best practices for EC2 Spot documentation.

Jenny Oshima

Jenny Oshima is a Technical Account Manager for AWS in Northern California. She works with AWS customers to help them build highly reliable, resilient, and cost effective systems and achieve operational excellence for their workloads on AWS. She enjoys writing about various innovative technologies around AWS.