Content Preparation

Preparing your media for OTT is important to enable efficient delivery and a good end-user experience.

Also, it is key for processing your content later for purposes like archiving, clipping and replay.

Correctly formatted source content makes processing by Unified Packager and Unified Origin easier and improves performance.


This section is about the format of the content at the media source. For dynamic delivery of VOD or Live, Unified Origin will repackage this source on-the-fly (i.e., 'just-in-time') for delivery in the requested output format (to support different end user devices). For static delivery of VOD, this source may be repackaged with Unified Packager into the intended delivery format.

Required: source content is stored as (f)MP4 (preferably CMAF)

In general, source content must be stored as MP4. More specifically:

  • For dynamic delivery of VOD (Unified Origin), source content should be fragmented MP4, preferably CMAF. If it is stored as progressive MP4, a dref MP4 should be created as an intermediary format to mitigate any negative impact on performance.
  • For dynamic delivery of Live (Unified Origin), the ingested livestream must be fragmented MP4, preferably CMAF. This ingest must be compliant with either Interface 1 of the DASH-IF Live Media Ingest specification (i.e., CMAF ingest) or Azure Media Services fragmented MP4 live ingest specification.
  • For static delivery of VOD (Unified Packager), source content must be fragmented or progressive MP4, which can be repackaged in any of the supported delivery formats. CMAF is preferred.

A fragmented MP4 starts with a FileTypeBox ('ftyp') and a MovieBox ('moov') that contains the track metadata. In CMAF, these two boxes are referred to as the CMAF Header. The header is followed by interleaved pairs of MovieFragmentBoxes ('moof') and MediaDataBoxes ('mdat'), of which the latter contain the actual media data, one fragment per pair.
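As a minimal sketch of this box layout, the following bash script (coreutils only) walks the top-level boxes of an fMP4: each box starts with a 4-byte big-endian size and a 4-byte ASCII type, so a CMAF track file shows 'ftyp' and 'moov' (the CMAF Header) followed by repeating 'moof'/'mdat' pairs. The input file is synthetic, generated just for the demo; 64-bit 'largesize' boxes (size == 1) are not handled here:

```shell
# Build a synthetic file with the expected box layout, for the demo only:
printf '\x00\x00\x00\x10ftypcmfc\x00\x00\x00\x00' >  demo.cmfv  # ftyp (16 bytes)
printf '\x00\x00\x00\x08moov'                     >> demo.cmfv  # moov (8 bytes)
printf '\x00\x00\x00\x08moof'                     >> demo.cmfv  # moof (8 bytes)
printf '\x00\x00\x00\x10mdat01234567'             >> demo.cmfv  # mdat (16 bytes)

# Walk the top-level boxes: read 4-byte size and 4-byte type per box
offset=0
filesize=$(wc -c < demo.cmfv)
boxes=""
while [ "$offset" -lt "$filesize" ]; do
  size_hex=$(dd if=demo.cmfv bs=1 skip="$offset" count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
  size=$((16#$size_hex))
  type=$(dd if=demo.cmfv bs=1 skip=$((offset + 4)) count=4 2>/dev/null)
  echo "offset=$offset type=$type size=$size"
  boxes="$boxes $type"
  offset=$((offset + size))
done
```

Running the same loop against real fMP4 output of your encoder is a quick way to confirm it follows the ftyp/moov + moof/mdat structure described above.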


Unified Packager may be used to repackage progressive or fragmented MP4s to CMAF (which is only relevant for fragmented MP4s if they are not yet CMAF compliant). However, do note that some encoder-specific aspects (e.g., Video Usability Information (VUI) and aspect ratio) cannot be changed easily and may require re-encoding with different settings.

Required: a suitable bitrate ladder (content dependent)

One should prepare video content as a set of different bitrate tracks, with each of those tracks representing a different quality level. The selection of different bitrates is called a bitrate ladder.

A potential ladder could be the following (this should only be regarded as an example of what a ladder might look like, not as a recommendation to use this particular one):

Resolution (16:9)    Bitrate (H264)
416x234              145 kb/s
640x360              365 kb/s
768x432              730 kb/s
768x432              1100 kb/s
960x540              2000 kb/s
1280x720             4500 kb/s
1920x1080            6000 kb/s
1920x1080            7800 kb/s
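As a sketch of how such a ladder could drive per-rung encodes, the loop below only prints one encoder invocation per rung; 'source.mp4', the output names, and the libx264 rate-control settings are placeholders to adapt to your own content and encoder:

```shell
# Ladder rungs as resolution:bitrate pairs (example values from the table above)
ladder="416x234:145k 640x360:365k 768x432:730k 768x432:1100k"
ladder="$ladder 960x540:2000k 1280x720:4500k 1920x1080:6000k 1920x1080:7800k"

for rung in $ladder; do
  res=${rung%%:*}    # e.g. 416x234
  rate=${rung##*:}   # e.g. 145k
  # Print (not run) a hypothetical ffmpeg command for this rung
  echo "ffmpeg -i source.mp4 -c:v libx264 -vf scale=${res%x*}:${res#*x}" \
       "-b:v $rate -maxrate $rate -bufsize $rate video-$res-$rate.mp4"
done
```

In practice each rung would also share the same keyframe settings, so that the resulting tracks stay GOP aligned (a separate requirement covered below).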

Choosing a good ladder is important for quality and efficient delivery. What is 'good' depends on the capabilities of the end users' devices, network capacity, the codec that is used, and the content itself. Ideally, you adjust the bitrate ladder per asset in your library (as some content requires higher bitrates to achieve the same quality, and other content can be encoded more efficiently).

Choosing a bitrate ladder is subject to different opinions and research. For example, the ladder above is taken from the Apple HLS Authoring Specification. A more in-depth look at bitrate ladders can be found in Optimal Design of Encoding Profiles for ABR Streaming, a paper that Yuri Reznik presented at the Packet Video Workshop during ACM MMSys 2018.

A short summary of Reznik's paper is that the first thing to explore would be the quality versus bitrate curve of your content, which, as noted earlier, will differ per asset. That is, for some assets introducing a higher bitrate won't offer significant gains in quality (as measured by PSNR and SSIM).

The second thing to determine is the steps between the bitrates that you want to offer. These steps will often be bigger for Live content than for VOD.

Finally, you can optimize a bitrate ladder for the characteristics of your network. This step is a bit more challenging, as you'll need a model for your network bandwidth and client behavior. In Reznik's paper, a simple approach based on an LTE model is used, with a client that will always choose the highest possible bitrate within the constraints of estimated available bandwidth.
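The client behavior described above can be sketched in a few lines: pick the highest ladder bitrate that fits within the estimated available bandwidth (both in kb/s; the bandwidth estimate is an arbitrary example, and the rungs are the example ladder from earlier):

```shell
# Simple client model: highest rung within the estimated bandwidth
estimated_bandwidth=1500   # kb/s, hypothetical estimate
selected=0
for bitrate in 145 365 730 1100 2000 4500 6000 7800; do
  [ "$bitrate" -le "$estimated_bandwidth" ] && selected=$bitrate
done
echo "client selects the ${selected} kb/s rung"
```

Real players use more elaborate heuristics (buffer levels, bandwidth smoothing), but this is the model the paper's optimization builds on.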


The aspect ratio must remain the same across your entire bitrate ladder.


The need for an audio-specific bitrate ladder is less obvious, since audio can be encoded at high quality using relatively low bitrates (compared to video). That is, the difference between a 128 kb/s encoded AAC stereo track and a 64 kb/s version of that same track might not justify complicating your setup when streaming video.

This is different in a setup where you expand the audio that you offer beyond stereo to include surround sound. You may even use a variety of codecs for your surround sound offerings (e.g., Dolby EC-3 and DTS:X). In such cases, it is recommended to follow the Apple HLS Authoring Specification and offer one bitrate per combination of codec and audio offering, while always including at least one stereo track.


When multiple languages are made available, all the audio profiles (codec and bitrate combinations) must be present for each language for Origin's HLS output to be compliant with the Apple HLS Authoring Specification.

Exception: radio (with audio-only streams)

When audio-only streams are offered, it may become worthwhile to offer different bitrates for stereo tracks, so that end users can enjoy these streams even on very limited connections. In such cases, HE-AAC might be used for lower bitrates and AAC-LC for higher ones, perhaps even using different sample rates:

Bitrate (kb/s)    Sample rate (kHz)    Audio codec
24                24                   HE-AAC
64                32                   HE-AAC
96                48                   HE-AAC
128               48                   AAC-LC
320               48                   AAC-LC
384               48                   Dolby AC-3

Required: alignment of Groups of Pictures (GOPs) across bitrates

Each bitrate needs to be GOP aligned. This enables the player to switch between adaptive bitrate video components without significant degradation of the rendered video.
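The relation between frame rate, segment duration and GOP size can be sketched as follows: with a fixed GOP whose length divides the segment duration evenly, every segment boundary falls on a keyframe in every rung. The numbers and the ffmpeg/libx264 options shown are illustrative:

```shell
# Derive a fixed GOP size from frame rate and segment duration (examples)
fps=25
segment_duration=2   # seconds
gop=$((fps * segment_duration))
echo "fixed GOP of $gop frames"
# With ffmpeg/libx264, a fixed GOP is typically forced like this:
echo "e.g.: -g $gop -keyint_min $gop -sc_threshold 0"
```

Disabling scene-cut detection (sc_threshold 0 for libx264) is what keeps the GOP length truly fixed, so all rungs place their keyframes at the same timestamps.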

Verifying your configuration

Essentially, extract the timestamps and sync samples of all adaptive bitrate components and ensure there is no difference between them. The example below compares three adaptive bitrate tracks:

# Use 'inputs' array to specify multiple input files
# All files should have an equal number of tracks and contain video only
inputs=("tears-of-steel-hvc1-1500k.mp4" "tears-of-steel-avc1-1500k.mp4" "tears-of-steel-hvc1-2200k.mp4")

for i in "${inputs[@]}"; do
  mp4box -dtsx "${i}" && \
  awk -F $'\t' '{ print $1, $5 }' < "${i%.*}_ts.txt" > "${i%.*}_awk.txt" && \
  rm "${i%.*}_ts.txt"
done
diff --to-file="${inputs[0]%.*}_awk.txt" $(for i in "${inputs[@]}"; do echo "${i%.*}_awk.txt"; done) \
  && echo "Alignment of sync samples across all tracks :-)" \
  || echo "No alignment of sync samples across all tracks :-("
rm $(for i in "${inputs[@]}"; do echo "${i%.*}_awk.txt"; done)

Required: each video segment starts with an IDR frame

Each segment must start with an Instantaneous Decoder Refresh (IDR) frame that is signaled as being a sync-sample, so that the segment can be considered discrete from a decoding perspective. This enables the player to switch between adaptive bitrate video components without significant degradation of the rendered video.

Verifying your configuration

The example below uses ffprobe to display the video frame type for the first 15 frames of a given segment. Importantly, the first frame must be of type pict_type=I. Do note that this does not check whether the frame is signaled as being a sync-sample, which, in the case of MP4s, is a requirement too:


# Replace <SEGMENT_URL> with the URL of the segment you want to check
curl -s <SEGMENT_URL> > tmp.ts && \
  ffprobe -hide_banner -show_frames tmp.ts | \
  grep "pict_type" | \
  awk 'FNR <= 15' && rm tmp.ts

Required: track metadata is signaled in "moov" box / .init segment

The most important parts of this metadata are:

  • For audio the language must be set in the MediaHeaderBox ("moov/trak/mdia/mdhd")
  • For audio the bitrate must be set in the MP4AudioSampleEntry ("moov/trak/mdia/minf/stbl/stsd/esds")
  • For video the average and maximum bitrate must be set in the AVCSampleEntry/HEVCSampleEntry as BitRateBox

How this metadata should be signaled is specified in the "ISO base media file format" specification (ISO/IEC 14496-12).
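As an illustration of how the mdhd language is stored: ISO/IEC 14496-12 packs it into a 16-bit field, one pad bit followed by three 5-bit values, each an ISO 639-2 letter minus 0x60. This sketch decodes the common value 0x55C4:

```shell
# Decode a packed mdhd language field (0x55C4) into its three letters
packed=$((16#55C4))
c1=$(( (packed >> 10) & 31 ))   # first 5-bit value
c2=$(( (packed >> 5) & 31 ))    # second 5-bit value
c3=$(( packed & 31 ))           # third 5-bit value
# Each value plus 0x60 (96) is an ASCII lowercase letter
lang=$(printf "\\$(printf '%03o' $((c1 + 96)))\\$(printf '%03o' $((c2 + 96)))\\$(printf '%03o' $((c3 + 96)))")
echo "mdhd language: $lang"   # und (undetermined)
```

Seeing 'und' here is a common sign that the encoder did not set the language, which is exactly the kind of problem best fixed at the source.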

Furthermore, to be able to calculate and signal the framerate of the video, the following must be signaled in the elementary AVC / HEVC video stream:

  • The timescale and number of units in a tick must be set in the respective VUI parameters ('time_scale' and 'num_units_in_tick')
  • In addition, the Boolean values of the VUI parameters 'timing_info_present_flag' and 'fixed_frame_rate_flag' must be set to 'true', to signal that the timing info is present and that the framerate is fixed.
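As a worked example of how these parameters yield the frame rate: in the H.264 VUI timing model a frame spans two ticks, so the frame rate is time_scale / (2 × num_units_in_tick); for HEVC the factor 2 is dropped. The values below are illustrative:

```shell
# Example VUI timing values for a 25 fps H.264 stream
time_scale=50000
num_units_in_tick=1000
fps=$(( time_scale / (2 * num_units_in_tick) ))
echo "$fps fps"
```

If these VUI parameters are absent, the packager cannot derive and signal the frame rate from the elementary stream.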


Unified Packager or Unified Origin may be used to add or override metadata like track language or bitrate. However, do note that we strongly recommend fixing these problems at the source instead.

Required: additional IDR frames are present at splice points (SCTE 35 use cases only)

A splice point is a specific timestamp in a stream, signaled by a SCTE 35 marker. In the stream, this timestamp must correspond to an IDR frame (which needs to be signaled as a sync-sample). If this is the case, the splice point offers the opportunity to seamlessly switch the stream to a different clip. Splice points can be used to cue:

  • Content replacement and insertion opportunities (e.g., ads)
  • Start and end point of a program

In Live streaming use cases the Live encoder is responsible for adding the additional IDR frames at splice points. For VOD use cases this task can either be fulfilled by the encoder itself, or by relying on the transcoding functionality that is part of Remix AVOD.
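The check this requirement implies can be sketched very simply: the splice time carried by the SCTE 35 marker must coincide exactly with an IDR frame. The timestamps below are hypothetical:

```shell
# Hypothetical splice time (from a SCTE 35 marker) and IDR frame times
splice_point="6.000"
idr_times="0.000 2.000 4.000 6.000 8.000"

# Exact membership test: the splice time must be one of the IDR times
case " $idr_times " in
  *" $splice_point "*) result="splice point lands on an IDR frame" ;;
  *)                   result="no IDR frame at the splice point" ;;
esac
echo "$result"
```

A near miss (e.g., the closest IDR frame at 6.040) is not good enough; the encoder must insert an additional IDR frame exactly at the splice time.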

Required: subtitle cues follow a sequential timeline aligned with other tracks

Subtitle cues, whether formatted as WebVTT or TTML, must be sequential and their timing must be aligned with all other tracks. This probably sounds like common sense, but the requirement is especially relevant for fragmented TTML subtitles, as these signal timestamps both on a sample (MP4) and a cue (TTML) level, where a sample can contain multiple cues.

Possible problems are erroneous cues that signal a time range that does not align with the timeline of media in other tracks, a later cue whose time range predates earlier cues, or cues that end earlier than they start.
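The last two problems can be checked mechanically. The sketch below validates WebVTT cue timing with awk: each cue must end after it starts and must not predate the preceding cue. The input file is a hypothetical, generated for the demo:

```shell
# Generate a small WebVTT file for the demo
cat > cues.vtt <<'EOF'
WEBVTT

00:00:00.000 --> 00:00:02.000
First cue

00:00:02.000 --> 00:00:04.500
Second cue
EOF

# Check that every cue ends after it starts and does not predate the previous one
result=$(awk -F' --> ' '
  function sec(t,  p) { split(t, p, /[:.]/); return (p[1] * 60 + p[2]) * 60 + p[3] + p[4] / 1000 }
  /-->/ {
    s = sec($1); e = sec($2)
    if (e <= s)   { print "cue ends before it starts"; bad = 1 }
    if (s < prev) { print "cue predates an earlier cue"; bad = 1 }
    prev = s
  }
  END { if (!bad) print "cue timing OK" }
' cues.vtt)
echo "$result"
```

Checking alignment with the other tracks additionally requires comparing the cue times against the media timeline of the audio and video tracks, which this per-file sketch does not cover.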