How GoPro optimized its media transcoding pipeline at scale with Amazon EC2 Spot Instances

This blog was co-authored by Vlad-Alexandru Voinea, DevOps Engineer at GoPro, and Zaven Boni, DevOps Manager at GoPro.

A photo example showing kayaking taken by GoPro camera

GoPro helps the world capture and share imagery in immersive and exciting ways. Since its founding in 2002, the brand has produced versatile cameras and software tools to help its customers get the most value and enjoyment from their content. Beyond GoPro’s core lineup of cameras, mounts, accessories, and lifestyle gear, the GoPro Subscription, which is supported by Amazon Web Services (AWS), ties together the GoPro experience—including unlimited cloud storage, automatic highlight videos, convenient editing capabilities synced across devices, a live streaming platform, and the ability to share content directly to social media platforms. Currently, GoPro handles massive amounts of video data each month, serving approximately 2.5 million subscribers.

Figure 1: GoPro media processing.

GoPro uses a proprietary transcoding and media processing pipeline in which massive compute capacity supports a variety of devices, bitrates, codecs, and bandwidths. When content is ingested into Amazon Simple Storage Service (Amazon S3), preprocessing jobs run to collect metadata, resolution, and codec information and to create video thumbnails. Different classes of transcoding jobs run on different Amazon Elastic Container Service (Amazon ECS) clusters, using Docker containers and custom-configured FFmpeg builds on Amazon EC2. Other proprietary tools, also running on Amazon EC2, combine video and audio, handle GPMF (GoPro Metadata Format or General Purpose Metadata Format), and demux and mux streams during transcoding.

GoPro originally deployed its subscription service using AWS to enable end users to ingest, process, store, and share content. As subscriber numbers and user-generated content volumes grew over time, GoPro required a more adaptive content processing pipeline with flexible and scalable infrastructure.

For seasonal traffic with high peaks and low valleys, instance auto scaling provides the elasticity to absorb fluctuations in the compute capacity that transcoding needs. Because the transcoding workflow is asynchronous and message- and event-driven, multiple jobs can run in parallel and the workload can tolerate interruptions. GoPro deployed its Amazon ECS cluster on Amazon EC2 Spot Instances for more cost-effective compute, saving the company around 50% to 70% on its On-Demand spend. Amazon EC2 Spot Instances are available at up to a 90% discount compared to On-Demand prices.

Figure 2: AWS media processing overview.

GoPro follows Amazon EC2 Spot Instance best practices. By diversifying across as many capacity pools as possible, Amazon ECS can replace terminated Spot Instances with other Spot Instances and still meet demand during peak traffic. To minimize the frequency of Spot interruptions, GoPro uses the capacity-optimized allocation strategy. GoPro also automated the draining of interrupted Spot Instances, with the automation deployed through a series of Terraform scripts.

GoPro uses attribute-based instance type selection for Amazon ECS clusters and runs both CPU and GPU instances in its Auto Scaling groups.

instance_requirements = {
  memory_mib_min          = 32768
  memory_mib_max          = 131072
  vcpu_count_min          = 32
  vcpu_count_max          = 64
  cpu_manufacturers       = ["amd", "intel"]
  excluded_instance_types = (["a*", "t*", "m6g*", "c6g*", "c7g*", "d*", "h*", "i*", "g*", "p*", "inf1*", "dl*", "vt*", "x*", "r6g*", "z1*", "f1*"])
}

All other Auto Scaling groups use explicit instance type lists, as shown in the following examples.

spot-instance-type = [
  { instance_type = "r5b.4xlarge" },
  { instance_type = "r4.4xlarge" },
  .
  .
  .
]

spot-instance-type = [{ instance_type = "g4dn.xlarge" }]

spot-instance-type = [
  { instance_type = "c3.4xlarge" },
  { instance_type = "c3.8xlarge" },
  { instance_type = "c4.4xlarge" },
  { instance_type = "c4.8xlarge" },
  .
  .
  .
]

Auto Scaling groups are configured with the capacity-optimized allocation strategy, which launches Spot Instances from the pools with the most available capacity rather than selecting on price.

spot-allocation-strategy = "capacity-optimized"
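To show how an instance type list and the allocation strategy fit together in one Auto Scaling group, here is a minimal sketch in Python rather than the Terraform GoPro uses. It builds the `MixedInstancesPolicy` structure that the boto3 `create_auto_scaling_group` call accepts. The function name, the launch template ID, and the choice of running 100% Spot above base capacity are assumptions for illustration, not details from GoPro's configuration.

```python
def build_mixed_instances_policy(launch_template_id, instance_types):
    """Return a MixedInstancesPolicy dict for an all-Spot, capacity-optimized group.

    launch_template_id and instance_types are supplied by the caller;
    nothing here is taken from GoPro's actual setup.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            # One override per instance type diversifies across capacity pools.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Assumption: no On-Demand instances above the base capacity.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


if __name__ == "__main__":
    policy = build_mixed_instances_policy(
        "lt-0123456789abcdef0",  # hypothetical launch template ID
        ["c4.4xlarge", "c4.8xlarge"],
    )
    print(policy["InstancesDistribution"]["SpotAllocationStrategy"])
```

The more instance types the overrides list contains, the more capacity pools the group can draw from when a Spot Instance is reclaimed.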

The workflow resiliency is handled through a series of AWS Lambda functions:

Auto Scaling group scale-in CloudWatch events trigger a Lambda function (called General Drain) that drains the terminating instance.
Another Lambda function (called Spot Drain) is specific to Spot Auto Scaling groups and is triggered by an incoming EC2 Spot Instance Interruption Warning.
A third Lambda function collects Spot metrics and is triggered on InstanceTerminated, RunInstances, and EC2 Spot Interruption events.
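The Spot Drain step can be sketched as follows. This is an illustration under stated assumptions, not GoPro's actual function: it assumes a Python Lambda using boto3, an event rule that matches the "EC2 Spot Instance Interruption Warning" event, and a hypothetical cluster name read from an `ECS_CLUSTER` environment variable. The ECS client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
import os


def drain_instance(ecs, cluster, instance_id):
    """Set the ECS container instance backed by an EC2 instance to DRAINING.

    Returns the list of container instance ARNs that were drained.
    """
    # Find the container instance registered for this EC2 instance.
    found = ecs.list_container_instances(
        cluster=cluster,
        filter=f"ec2InstanceId in ['{instance_id}']",
    )
    arns = found.get("containerInstanceArns", [])
    if not arns:
        return []  # the instance is not registered with this cluster
    # DRAINING stops new task placement; service tasks are rescheduled elsewhere.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=arns,
        status="DRAINING",
    )
    return arns


def handler(event, context, ecs=None):
    """Lambda entry point for a Spot interruption warning event."""
    if ecs is None:
        import boto3  # imported lazily so the module loads without boto3

        ecs = boto3.client("ecs")
    # ECS_CLUSTER and its default value are hypothetical names.
    cluster = os.environ.get("ECS_CLUSTER", "transcoding")
    instance_id = event["detail"]["instance-id"]
    return drain_instance(ecs, cluster, instance_id)
```

Because the interruption warning arrives about two minutes before the instance is reclaimed, draining at this point gives running tasks a short window to stop cleanly while the asynchronous job queue lets the work be picked up elsewhere.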

Figure 3: Transcoding workflow with Amazon EC2 scaling

The entire workflow is monitored on a dashboard custom-built for GoPro and powered by AWS Lambda. This tooling provides visibility into Amazon EC2 Spot Instances running Amazon ECS workloads, showing Spot interruption rates, instance types, and other Spot-related statistics (Figure 4).

Figure 4: Dashboard display of spot-related statistics.
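One statistic such a dashboard could show is the interruption rate per instance type. The sketch below is hypothetical: the record shape (`kind` and `instance_type` keys) and the function name are assumptions for illustration, and the sample data is made up. It is not the format of GoPro's metrics.

```python
from collections import Counter


def interruption_rates(records):
    """Return interruptions divided by launches for each instance type.

    Each record is a dict with a "kind" of "launch" or "interruption"
    and an "instance_type". Types with no launches are omitted.
    """
    launched = Counter()
    interrupted = Counter()
    for record in records:
        if record["kind"] == "launch":
            launched[record["instance_type"]] += 1
        elif record["kind"] == "interruption":
            interrupted[record["instance_type"]] += 1
    # A Counter returns 0 for missing keys, so uninterrupted types get 0.0.
    return {t: interrupted[t] / n for t, n in launched.items()}


if __name__ == "__main__":
    sample = (
        [{"kind": "launch", "instance_type": "c4.4xlarge"}] * 4
        + [{"kind": "interruption", "instance_type": "c4.4xlarge"}]
        + [{"kind": "launch", "instance_type": "r5b.4xlarge"}] * 2
    )
    print(interruption_rates(sample))
```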

After adding Amazon EC2 Spot Instances and architecting for potential interrupts, GoPro is running over 70% of all containerized workloads on Spot Instances, driving significant savings.

Conclusion

In this blog post, we described how Amazon EC2 Spot Instances, used with Amazon ECS, enabled GoPro to use compute resources more cost-effectively and to improve its transcoding process in the media supply chain at scale. More details about Spot Instances are available in the Best practices for EC2 Spot documentation.
