Upload and transcode video with AWS Elemental MediaConvert and Magine Pro

Authored by Mateusz Herczka, Software Engineer at Magine Pro. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

API design from a UX perspective

File-based video workloads can be highly complex. Media customers want a solution that simplifies this complexity, yet retains the flexibility to handle a wide range of use cases. Magine Pro provides a solution that combines transcoding and DRM in a single easy-to-use service. This solution supports a practical range of use cases in order to bring content to market faster with less effort.

In this post, we present three customer use cases that require an intermediate transcoding step before submission to an automated transcoding pipeline. We discuss the design of a simple but effective API and its mapping to an AWS Elemental MediaConvert job. The API eliminates the need for intermediate transcoding through careful consideration of expected input and commonality of structure between a media collection and the CreateJob API.

AWS Elemental MediaConvert API

AWS Elemental MediaConvert is a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand (VOD) content for broadcast and multiscreen delivery at scale. To transcode an asset, we make a CreateJob call to the MediaConvert application programming interface (API), either directly with HTTP, or through a library like boto3. This API exposes a complex, low-level interface covering all aspects of creating a transcoding job.

A CreateJob request can contain hundreds of parameters grouped into several dozen objects with dozens of lists. Our goal is to simplify this job operation for our customers.
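For orientation, here is a minimal sketch of what a CreateJob call looks like from Python. It assumes a client object exposing the same `create_job(Role=..., Settings=...)` method as the boto3 MediaConvert client. The role ARN, the bucket path, the helper names, and the `RecordingClient` stand-in are placeholders for illustration, and a real job needs fully specified output groups.

```python
def build_settings(file_input, output_groups):
    """Assemble the smallest Settings skeleton: one input plus output groups."""
    return {
        "Inputs": [{"FileInput": file_input}],
        "OutputGroups": output_groups,
    }


def submit_job(client, role_arn, settings):
    """Call CreateJob through any client exposing boto3's create_job method."""
    return client.create_job(Role=role_arn, Settings=settings)


class RecordingClient:
    """Hypothetical stand-in so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def create_job(self, **kwargs):
        # Record the request instead of sending it anywhere.
        self.calls.append(kwargs)
        return {"Job": {"Id": "placeholder-id"}}


client = RecordingClient()
settings = build_settings("s3://example-bucket/source.mxf", output_groups=[])
response = submit_job(client, "arn:aws:iam::123456789012:role/ExampleRole", settings)
print(response["Job"]["Id"])
```

With boto3 installed and credentials configured, a client created with `boto3.client("mediaconvert")` could take the place of `RecordingClient`.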

Simplified file-based workflow using Magine Pro and MediaConvert

A program tree

We are in the midst of an explosion in media technologies, including Virtual Reality (VR), Augmented Reality (AR), multiple angles, live graphics, and social media integration. However, many movies, shows, and sports events are still played as a single video track with multiple selectable audio tracks and subtitles. A program can thus be organized conceptually as a tree with leaf parameters, for example:

- Program
   - Video (file, size, FPS, interlace)
   - Audio
     - Track (file, language, layout)
     ...
   - Subtitles
     - Subtitle (file, language, kind)
     ...

Simple Program Tree

Audio tracks have a spatial layout with mono, stereo, or surround, and popular programs might be overdubbed into several languages. Subtitles come in different languages as well, but sometimes one language has more than one kind of subtitles. In the U.S., for example, there are both captions and Subtitles for the Deaf and Hard of Hearing (SDH). SDH provide important non-dialog information such as speaker identification and sound effects.

For video, we define the important format parameters. The resolution of the source determines the maximum possible resolution of the transcoded stream. Frame rate and interlace are the main stumbling blocks for smooth playback. Given this information, we can infer the correct transcoder settings for all outputs that are required for adaptive bitrate (ABR) streaming.

This structure covers a wide range of use cases and, as it turns out, bears some similarity to the MediaConvert CreateJob API.

In the following sections, we show keywords that belong to MediaConvert in blue text, for example, KeyWord. All other keywords are shown in red text.

A MediaConvert job tree

A MediaConvert JobInput is organized into selectors. Exposing a few relevant parameters, the hierarchy is as follows:

- JobInput (FileInput)
   - VideoSelector
   - AudioSelectors
     - AudioSelector (ExternalAudioFileInput, Tracks, RemixSettings)
     ...
   - CaptionSelectors
     - CaptionSelector (SourceFile, SourceSettings)
     ...

A MediaConvert Job Tree

A JobInput is a node in a larger tree, the MediaConvert job. FileInput, ExternalAudioFileInput and SourceFile are source file references, typically Amazon Simple Storage Service (Amazon S3) Uniform Resource Identifiers (URIs). There is exactly one VideoSelector tied to FileInput, that is, an input supports exactly one video track.

In MediaConvert terminology, subtitles are “captions.” CaptionSelectors and AudioSelectors are collections with each selector associated with either FileInput or an external file. For the captions, SourceSettings is a large object describing the submitted caption format, and these files can vary greatly.

The Tracks parameter specifies which tracks in the container file are used in the audio selector. RemixSettings allows the mapping of input channels to output channels. We use this feature to arrange mono tracks into stereo or surround layouts, and to correct a flipped stereo image, as needed in the three use cases presented later in this post.

We are now ready to design an API that allows mapping of our conceptual program to a JobInput.

An API tree

Given some API input X, we apply transformation T such that T(X) -> JobInput. Considering necessary data, we formulate dependencies and create keys in plain English. To describe the submitted asset, we say:

- a source is a reference to a file
- an input is a named leaf holding a source and parameters describing the storage
- a track is a named leaf holding a source and parameters describing the associated media

We also want to operate on the asset, for example remap the audio tracks.

- an output is a named leaf holding tracks and parameters describing the operation

The general hierarchy of a submitted asset is outlined:

- asset
   - inputs
   - tracks
     - audio
     - video
     - subtitles
   - outputs

We define each leaf as a JSON object with a unique name:

"inputs": [
    {
        "name": "myInput",
        "source": "s3://bucket/key.suf",
        ...
    }
]

Other JSON objects use this name as a reference:

"tracks": {
    "audio": {
        "name": "myAudioTrack",
        "input": "myInput",
        "containerIndex": 1,
        ...
    }
},
"outputs": [
    {
        "name": "myOutput",
        "tracks": [
            "myAudioTrack"
        ],
        ...
    }
]
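Because leaves refer to each other by name, a submitted asset can be checked for dangling references before any job is built. The following is a minimal sketch of such a check, assuming the fragment shapes shown above (with the elided `...` fields left out). The function names and error messages are hypothetical and not part of the Magine Pro API.

```python
def as_list(value):
    """Treat a single track object and a list of tracks the same way."""
    return value if isinstance(value, list) else [value]


def check_references(asset):
    """Return a list of error strings for names that do not resolve."""
    errors = []
    input_names = {item["name"] for item in asset.get("inputs", [])}

    track_names = set()
    for kind, tracks in asset.get("tracks", {}).items():
        for track in as_list(tracks):
            track_names.add(track["name"])
            # Every track must point at a declared input.
            if track["input"] not in input_names:
                errors.append(
                    f"{kind} track {track['name']!r} refers to unknown input {track['input']!r}"
                )

    for output in asset.get("outputs", []):
        for name in output.get("tracks", []):
            # Every output must point at declared tracks.
            if name not in track_names:
                errors.append(
                    f"output {output['name']!r} refers to unknown track {name!r}"
                )
    return errors


asset = {
    "inputs": [{"name": "myInput", "source": "s3://bucket/key.suf"}],
    "tracks": {
        "audio": {"name": "myAudioTrack", "input": "myInput", "containerIndex": 1}
    },
    "outputs": [{"name": "myOutput", "tracks": ["myAudioTrack"]}],
}
errors = check_references(asset)
print(errors)
```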

The containerIndex parameter is defined to map 1:1 to the way MediaConvert indexes audio tracks, which is analogous to probing the container file with the free utility MediaInfo:

Let L be the ordered list of audio tracks as listed by MediaInfo;
then containerIndex = {1,2,...} is the index of a track in L.
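The definition above amounts to a 1-based lookup into the probed track list. Here is a small sketch, assuming the probe result is already available as an ordered Python list. The track descriptions are invented for illustration and `select_track` is a hypothetical helper.

```python
def select_track(probed_audio_tracks, container_index):
    """Return the track at the 1-based containerIndex within the ordered list L."""
    if not 1 <= container_index <= len(probed_audio_tracks):
        raise ValueError(
            f"containerIndex {container_index} is outside 1..{len(probed_audio_tracks)}"
        )
    # containerIndex counts from 1, Python lists count from 0.
    return probed_audio_tracks[container_index - 1]


# Hypothetical ordered list L, as a probing tool might report it.
probed = ["downmix left (mono)", "downmix right (mono)", "music/effects", "foley"]
selected = [select_track(probed, index) for index in (1, 2)]
print(selected)
```

Rejecting an index of 0 early is deliberate: it catches submissions that count tracks from zero.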

We now have all necessary information about the location of submitted media. To enable audio channel operations, we add channelOrder in the output. A simple and human readable way to describe channel order is a list of symbols, like so:

- stereo: ["l", "r"]
- stereo where channels have been flipped: ["r", "l"]
- surround 5.1: ["fl", "fr", "fc", "lfe", "bl", "br"]
- surround 5.1 where front and back channels have been swapped: ["bl", "br", "fc", "lfe", "fl", "fr"]

For surround, we are assuming a standard Dolby 5.1 channel layout as the target channel map. Given the swapped channelOrder of

["bl", "br", "fc", "lfe", "fl", "fr"]

we configure the RemixSettings to produce a surround track ordered

["fl", "fr", "fc", "lfe", "bl", "br"]

This principle requires only one channelOrder to be submitted (a description of the input), and we infer the best RemixSettings. The same idea is applied to stereo and can be extended to other scenarios.
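One way to picture the inference is as a permutation from the submitted channelOrder to the standard order. The sketch below is an illustration under stated assumptions, not the production algorithm: it assumes the RemixSettings shape of ChannelsIn, ChannelsOut, and ChannelMapping.OutputChannels[].InputChannels, with gains in dB where 0 passes a channel through and -60 mutes it. The `STANDARD_ORDER` table and the function name are hypothetical.

```python
# Target layouts taken from the channel lists above, keyed by channel count.
STANDARD_ORDER = {
    2: ["l", "r"],
    6: ["fl", "fr", "fc", "lfe", "bl", "br"],
}
MUTE_DB = -60   # assumed gain value that silences an input channel
UNITY_DB = 0    # assumed gain value that passes an input channel unchanged


def infer_remix_settings(channel_order):
    """Build a remix that reorders channel_order into the standard layout."""
    target = STANDARD_ORDER.get(len(channel_order))
    if target is None or sorted(target) != sorted(channel_order):
        raise ValueError(f"unsupported channelOrder: {channel_order}")

    output_channels = []
    for symbol in target:
        # Find where the wanted channel sits in the submitted source.
        source_index = channel_order.index(symbol)
        gains = [MUTE_DB] * len(channel_order)
        gains[source_index] = UNITY_DB
        output_channels.append({"InputChannels": gains})

    return {
        "ChannelsIn": len(channel_order),
        "ChannelsOut": len(target),
        "ChannelMapping": {"OutputChannels": output_channels},
    }


flipped = infer_remix_settings(["r", "l"])
print(flipped["ChannelMapping"]["OutputChannels"])
```

For the flipped stereo input, the first output channel (left) draws from the second input channel and the second output channel (right) draws from the first.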

Frame rates, interlace, and more

Some of the most noticeable transcoding mistakes are due to mismatched interlace fields or incorrect drop/non-drop frame rates. These parameters can cause a wide range of motion artifacts when set incorrectly, and the human eye is an expert motion detector.

Our experience is that interlacing and drop-frame can cause playback problems for certain legacy devices. One solution that looks great in most cases is to upsample drop-frame video to integer frame rates and deinterlace when necessary using a high-quality algorithm.

Drop-frame is commonly written as a decimal such as 23.98, but this is a rounded floating-point number. In a JSON context, it is better to express such FPS exactly as a fraction:

{
  "frameRate": {
    "numerator": 24000,
    "denominator": 1001
  },
  "interlaced": true,
  ...
}
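To illustrate why the fraction matters, the sketch below reads such an object with Python's `fractions.Fraction` and derives a simple video plan. The rule of rounding to the nearest integer rate and deinterlacing whenever the source is flagged as interlaced is our reading of the approach described above. The function names and the plan keys are hypothetical.

```python
import json
from fractions import Fraction


def exact_frame_rate(track):
    """Read the frame rate as an exact fraction instead of a rounded decimal."""
    rate = track["frameRate"]
    return Fraction(rate["numerator"], rate["denominator"])


def plan_video(track):
    """Derive the target frame rate and whether deinterlacing is needed."""
    source = exact_frame_rate(track)
    return {
        "sourceFps": source,
        "targetFps": round(source),  # nearest integer frame rate
        "deinterlace": bool(track.get("interlaced", False)),
    }


track = json.loads(
    '{"frameRate": {"numerator": 24000, "denominator": 1001}, "interlaced": true}'
)
plan = plan_video(track)
print(plan["sourceFps"], plan["targetFps"], plan["deinterlace"])
```

The exact value 24000/1001 differs from the decimal shorthand 23.98, which is why the fraction is the safer representation to pass between systems.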

We are ready to start solving real customer use cases.

Three use cases

For a number of years, we have offered an in-house transcoding workflow similar to the Video on Demand on AWS solution. A user uploads assets to Amazon S3 and we probe the assets and formulate a transcoding job based on submitted media and subtitle files.

Sample file-based workflow to produce a mezzanine file

This approach is successful if the assets are well formed and, given the large space of possible media file permutations, conformant to our specifications for a mezzanine file. In practice, customers often approach us with more complicated assets and requirements. The following three examples are some of these unique use cases:

- A single container file contains several audio tracks. The file includes final downmixed stereo audio as two mono tracks. Other audio tracks are also present, for example, “music/effects” and “foley.” The customer wants to only keep the main downmixed audio and ignore the rest.
- A sports event is edited live with a multi-camera setup and the resulting stream is recorded with a hardware device. The output is an external reference MOV file. Several media container files are stored in a separate folder. Again, some of the audio tracks are extra commentary and are not used. The customer wants to select one video and one stereo audio file.
- A single container file contains one video and one audio track. The left and right stereo channels are flipped in the audio track. The customer wants to correct the stereo image.

Nonconformant assets require an extra transcoding step to create a mezzanine file within our specifications while minimizing quality loss. This effort requires detailed knowledge about transcoder settings, extra CPU and storage resources, as well as work-hours and transcoding time. Some of our customers choose to outsource mezzanine transcoding to a third party at additional cost, but would prefer to submit assets directly to us. Our goal is to accommodate a wide range of possible assets in a single automated process.

Use case 1: A single container file containing multiple audio tracks

We specify the single container file and the two relevant mono audio tracks. The video is interlaced with a frame rate of 29.97 FPS drop-frame. A single audio output refers to the mono tracks, and channelOrder describes their spatial layout.

For best results, our algorithm configures the MediaConvert job to deinterlace the video and upsample it to 30 FPS. RemixSettings are configured to produce a stereo track from the submitted mono tracks. The language code propagates to the GUI in our players.

The output is an ABR stream with only the desired audio.

{
  "inputs": [
    {
      "name": "media-container",
      "source": "s3://partner-bucket/use-case-1.mxf"
    }
  ],
  "tracks": {
    "video": {
      "frameRate": {
        "denominator": 1001,
        "numerator": 30000
      },
      "input": "media-container",
      "interlaced": true,
      "name": "video-track",
      "resolution": {
        "height": 1080,
        "width": 1920
      }
    },
    "audio": [
      {
        "channels": 1,
        "containerIndex": 1,
        "input": "media-container",
        "language": "eng",
        "name": "left-audio"
      },
      {
        "channels": 1,
        "containerIndex": 2,
        "input": "media-container",
        "language": "eng",
        "name": "right-audio"
      }
    ]
  },
  "outputs": {
    "audio": [
      {
        "channelOrder": [
          "l",
          "r"
        ],
        "tracks": [
          "left-audio",
          "right-audio"
        ]
      }
    ]
  },

  "drm": true
}
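As a rough illustration of the mapping for this use case, the sketch below turns the audio part of the submission into a JobInput with a single audio selector that picks only the two mono tracks. It assumes that all referenced tracks live in the same container as the video. The selector name is an arbitrary key, the function name is hypothetical, and RemixSettings and video settings are left out for brevity.

```python
def build_job_input(asset):
    """Map a single-container asset to a JobInput with one selector per audio output."""
    sources = {item["name"]: item["source"] for item in asset["inputs"]}
    audio_tracks = {track["name"]: track for track in asset["tracks"]["audio"]}
    video = asset["tracks"]["video"]

    audio_selectors = {}
    for number, output in enumerate(asset["outputs"]["audio"], start=1):
        members = [audio_tracks[name] for name in output["tracks"]]
        audio_selectors[f"Audio Selector {number}"] = {
            # Only the tracks named by the output are selected; the rest are ignored.
            "Tracks": [track["containerIndex"] for track in members],
        }

    return {
        "FileInput": sources[video["input"]],
        "VideoSelector": {},
        "AudioSelectors": audio_selectors,
    }


# Trimmed copy of the use case 1 submission, keeping only the fields used here.
asset = {
    "inputs": [
        {"name": "media-container", "source": "s3://partner-bucket/use-case-1.mxf"}
    ],
    "tracks": {
        "video": {"name": "video-track", "input": "media-container"},
        "audio": [
            {"name": "left-audio", "input": "media-container", "containerIndex": 1},
            {"name": "right-audio", "input": "media-container", "containerIndex": 2},
        ],
    },
    "outputs": {
        "audio": [{"channelOrder": ["l", "r"], "tracks": ["left-audio", "right-audio"]}]
    },
}
job_input = build_job_input(asset)
print(job_input["AudioSelectors"])
```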

Use case 2: Live multi-camera sports event

We ignore the main container file and specify the external media files directly. Audio and video were captured in real time by a hardware device and the tracks are already in sync. The video will be deinterlaced with no changes to the 25 FPS frame rate.

For this asset, there are several WAV stereo audio files, and we select the relevant one. It’s a correct stereo track so channelOrder is omitted in the output, which signals a standard channel order of ["l", "r"]. One subtitle file for SDH is also present.

The resulting stream includes video, the chosen stereo audio and the SDH subtitles.

{
  "inputs": [
    {
      "name": "video-container",
      "source": "s3://partner-bucket/asset/media.dir/video.m2v"
    },
    {
      "name": "audio-container",
      "source": "s3://partner-bucket/asset/media.dir/audio-42.wav"
    },
    {
      "name": "subtitle-container",
      "source": "s3://partner-bucket/asset/subs-eng.srt"
    }
  ],
  "tracks": {
    "video": {
      "frameRate": {
        "denominator": 1,
        "numerator": 25
      },
      "input": "video-container",
      "interlaced": true,
      "name": "video-track",
      "resolution": {
        "height": 1080,
        "width": 1920
      }
    },
    "audio": [
      {
        "channels": 2,
        "containerIndex": 1,
        "input": "audio-container",
        "language": "eng",
        "name": "main-audio-track"
      }
    ],
    "subtitles": [
        {
          "name": "english-subtitle",
          "input": "subtitle-container",
          "language": "eng",
          "kind": "hearing-impaired"
        }
    ]
  },

  "drm": true
}
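For this use case the interesting part of the mapping is that audio and subtitles come from files other than the video. The sketch below shows one possible mapping, assuming the MediaConvert field names ExternalAudioFileInput and SourceSettings.FileSourceSettings.SourceFile, and assuming every subtitle file is SRT as in this example. Selector names are arbitrary keys and the function name is hypothetical.

```python
def build_job_input(asset):
    """Map an asset with external audio and subtitle files to a JobInput."""
    sources = {item["name"]: item["source"] for item in asset["inputs"]}
    tracks = asset["tracks"]
    video_source = sources[tracks["video"]["input"]]

    audio_selectors = {}
    for number, track in enumerate(tracks.get("audio", []), start=1):
        selector = {"Tracks": [track["containerIndex"]]}
        source = sources[track["input"]]
        if source != video_source:
            # Audio stored outside the video file is referenced explicitly.
            selector["ExternalAudioFileInput"] = source
        audio_selectors[f"Audio Selector {number}"] = selector

    caption_selectors = {}
    for number, subtitle in enumerate(tracks.get("subtitles", []), start=1):
        caption_selectors[f"Captions Selector {number}"] = {
            "SourceSettings": {
                "SourceType": "SRT",  # assumption: all submitted subtitles are SRT
                "FileSourceSettings": {"SourceFile": sources[subtitle["input"]]},
            }
        }

    return {
        "FileInput": video_source,
        "VideoSelector": {},
        "AudioSelectors": audio_selectors,
        "CaptionSelectors": caption_selectors,
    }


# Trimmed copy of the use case 2 submission, keeping only the fields used here.
asset = {
    "inputs": [
        {"name": "video-container", "source": "s3://partner-bucket/asset/media.dir/video.m2v"},
        {"name": "audio-container", "source": "s3://partner-bucket/asset/media.dir/audio-42.wav"},
        {"name": "subtitle-container", "source": "s3://partner-bucket/asset/subs-eng.srt"},
    ],
    "tracks": {
        "video": {"name": "video-track", "input": "video-container"},
        "audio": [
            {"name": "main-audio-track", "input": "audio-container", "containerIndex": 1}
        ],
        "subtitles": [
            {"name": "english-subtitle", "input": "subtitle-container", "language": "eng"}
        ],
    },
}
job_input = build_job_input(asset)
print(job_input["AudioSelectors"], job_input["CaptionSelectors"])
```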

Use case 3: Flipped right and left stereo channels

Specifying the flipped channelOrder of the source as ["r", "l"] will configure RemixSettings to produce a stereo track with channels in the correct order ["l", "r"]. The 23.98 FPS will be upsampled to 24 FPS without deinterlacing.

The result is a stream with a corrected stereo image.

{
  "inputs": [
    {
      "name": "matroska-container",
      "source": "s3://partner-bucket/asset/use-case-3.mkv"
    }
  ],
  "tracks": {
    "video": {
      "frameRate": {
        "denominator": 1001,
        "numerator": 24000
      },
      "input": "matroska-container",
      "interlaced": false,
      "name": "video-track",
      "resolution": {
        "height": 1080,
        "width": 1920
      }
    },
    "audio": [
      {
        "channels": 2,
        "containerIndex": 1,
        "input": "matroska-container",
        "language": "eng",
        "name": "flipped-audio"
      }
    ]
  },
  "outputs": {
    "audio": [
      {
        "tracks": ["flipped-audio"],
        "channelOrder": ["r", "l"]
      }
    ]
  },

  "drm": true
}

Conclusion

By considering the structure of media assets, we applied a UX mindset to the design of a simple API that uses the capabilities of MediaConvert to address three unique use cases. Using a tree as the structural model, we have demonstrated how such an API can map onto a MediaConvert job. The solution allows the customer to submit a wide range of file combinations, select relevant media, and specify correct audio channel layouts. It also handles common frame rate and interlacing issues automatically.

To learn more about the latest OTT video delivery trends and solutions, join Magine Pro CEO Matthew Wilkinson and a panel of industry experts for an on-demand webinar, “OTT Video Delivery: Industry experts share experience & best practices.”

About Magine Pro

Over-the-top (OTT) platform operator Magine Pro enables global content owners, broadcasters, and telcos to build their own OTT businesses with live events, linear TV and Video-On-Demand streaming platforms on any device. Magine Pro’s customers are located in Europe and the United States, as well as in emerging markets. Contact us to speak with a member of our team and demo our solutions.

AWS for M&E Blog