Content Preparation

Preparing your media correctly for OTT is essential for efficient delivery and a good end-user experience.

It also makes it easier to process your content later on, for purposes like archiving, clipping and replay.

Correctly formatted source content simplifies processing by Unified Packager and Unified Origin and improves performance.

Note

This section is about the format of the content at the media source. For dynamic delivery of VOD or Live, Unified Origin will repackage this source on-the-fly (i.e., 'just-in-time') for delivery in the requested output format (to support different end user devices). For static delivery of VOD, this source may be repackaged with Unified Packager into the intended delivery format.

Req.: track metadata is signaled in "moov" box / .init segment

The most important pieces of this metadata are:

  • For audio, the language must be set in the MediaHeaderBox ("moov/trak/mdia/mdhd")
  • For audio, the bitrate must be set in the MP4AudioSampleEntry ("moov/trak/mdia/minf/stbl/stsd/esds")
  • For video, the average and maximum bitrate must be set in the AVCSampleEntry/HEVCSampleEntry as a BitRateBox ("btrt")

How this metadata should be signaled is specified in the "ISO base media file format" specification (ISO/IEC 14496-12).
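
As an illustration only (the file name below is a placeholder and options may differ per tool version), general-purpose tools such as MP4Box and ffprobe can be used to inspect this metadata, and to set a missing track language at the source:

# Illustrative check of track metadata stored in the 'moov' box
input=tears-of-steel-aac-64k.mp4

# Show general track info, including the language signaled in 'mdhd'
MP4Box -info "${input}"

# Alternatively, list language and bitrate per stream with ffprobe
ffprobe -v error -show_entries stream=index,codec_type,bit_rate:stream_tags=language "${input}"

# If the track language is missing or wrong, it can be fixed at the source,
# e.g., setting the language of track 1 to English:
MP4Box -lang 1=eng "${input}"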

Furthermore, to be able to calculate and signal the framerate of the video, the following must be signaled in the elementary AVC / HEVC video stream:

  • The timescale and number of units in a tick must be set in the respective VUI parameters ('time_scale' and 'num_units_in_tick')
  • In addition, the Boolean values of the VUI parameters 'timing_info_present_flag' and 'fixed_frame_rate_flag' must be set to 'true', to signal that the timing info is present and that the framerate is fixed.
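
As a rough, illustrative sanity check (not a full VUI parser), ffprobe can be used to compare the nominal and average frame rates it derives from the stream; when the timing info is signaled and the framerate is fixed, these will normally match:

# Illustrative check: with correct, fixed timing info 'r_frame_rate' and
# 'avg_frame_rate' will normally report the same value
input=tears-of-steel-avc1-1000k.mp4

ffprobe -v error -select_streams v:0 \
  -show_entries stream=r_frame_rate,avg_frame_rate \
  -of default=noprint_wrappers=1 "${input}"

# For AVC, the VUI parameters relate to the framerate as:
#   framerate = time_scale / (2 * num_units_in_tick)
# e.g., time_scale=48000 and num_units_in_tick=1000 signal 24 fps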

Note

Unified Packager or Unified Origin may be used to add or override metadata like track language or bitrate. However, we recommend fixing this at the source instead.

Rec.: in case of B-frames, use negative composition time offsets (and no edit lists)

The order in which frames need to be decoded is not always equal to the order in which they should be presented. That is why each frame has a decode timestamp (DTS) and a presentation timestamp (PTS). B-frames are an important reason for this, as they increase encoding efficiency by relying not only on data from prior frames, but also on frames that are presented after the B-frame.

The PTS of a frame (or sample, which in this context means essentially the same thing, but is the more commonly used term) is calculated based on its DTS. If a sample's PTS is not equal to its DTS, there is an offset. Offsetting PTS relative to DTS can be done using two instruments:

  • A track level edit list in 'moov.trak.edts.elst'
  • A sample level composition time offset (CTO) in 'moof.traf.trun'

To get a better understanding of this, first take a look at the start of a track without B-frames, where there is no need for CTOs or an edit list and the PTS of each sample is equal to its DTS:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ P ][ P ][ P ][ P ][ P ][IDR][ P ][ P ][ P ][ P ][ P ][ P ]
PTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13

The start of a track with B-frames looks very different, as the decode and presentation order of the samples can no longer be the same. The track below includes positive CTOs to account for this. They ensure that the P-frame that is to be presented fourth is decoded second, because the B-frames that are to be presented second and third rely on information from this P-frame:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ B ][ B ][ P ][ B ][ B ][IDR][ P ][ B ][ B ][ P ][ B ][ B ]
CTO   1    3    0    0    3    0    0    1    3    0    0    3    0    0
PTS   1    4    2    3    7    5    6    8    11   9    10   14   12   13

As you can see, introducing these positive CTOs necessitates that the PTS of the very first frame is no longer '0', but '1' instead. To make sure the track still starts at '0', an edit list is present as well, which in this case signals that media_time=1, or, in other words, that PTS '1' should actually be considered '0'.

The problem is that this can lead to sync issues: certain packaging workflows may remove edit lists, which results in misalignment when tracks that originally contained different edit lists are bundled together in a stream.

Furthermore, it remains open to interpretation whether the fragment start times signaled in the fMP4's index ('mfra', or 'sidx' in the case of CMAF) refer to DTS or to PTS. As long as DTS does not equal PTS at the start of each fragment, this ambiguity is a problem.

Fortunately, these issues can be solved by introducing negative CTOs. This approach can guarantee that PTS equals DTS for the first sample of each fragment without the need for an edit list. This also makes sure that the PTS of the first sample of each track aligns across tracks that are encoded according to different video profiles (with and without B-frames).

When we take the earlier example that used B-frames with positive CTOs and an edit list, but now introduce negative CTOs so that the edit list can be left out, it looks like this:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ B ][ B ][ P ][ B ][ B ][IDR][ P ][ B ][ B ][ P ][ B ][ B ]
CTO   0    2   -1   -1    2   -1   -1    0    2   -1   -1    2   -1   -1
PTS   0    3    1    2    6    4    5    7    10   8    9    13   11   12
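
In both tables the relation between the rows is simply PTS = DTS + CTO. As a purely illustrative check, the snippet below recomputes the PTS row of the negative CTO example from its DTS and CTO rows:

# Illustrative only: recompute PTS = DTS + CTO for the negative CTO example
dts=(0 1 2 3 4 5 6 7 8 9 10 11 12 13)
cto=(0 2 -1 -1 2 -1 -1 0 2 -1 -1 2 -1 -1)

for i in "${!dts[@]}"; do
  printf '%s ' "$(( dts[i] + cto[i] ))"
done
echo
# Prints: 0 3 1 2 6 4 5 7 10 8 9 13 11 12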

In practice, this recommendation means that you should use version 1 "trun" boxes (which allow for signed composition time offsets) and that the DTS of the first keyframe in a fragment should be equal to its PTS (i.e., no CTO and no edit list). Any samples that follow should use a CTO where applicable, negative or positive.

DASH, Smooth and HLS all support the use of negative CTOs.

Verifying your configuration

#!/bin/bash

# Use the 'input' variable to specify the input file
# The input file may have multiple tracks, but should contain video only
# The command below checks whether the sync samples (IDR frames) in the
# input track(s) have a composition time offset (CTO)
input=tears-of-steel-avc1-1000k.mp4

awk '{ CTO = $6 - $4 } ; \
  $8 == "1" { print "#"$2": ", $6, "(PTS) -", $4,"(DTS) =", CTO, "(CTO)" } ; \
  CTO != 0 && $8 == "1" { yes++ } ; CTO == 0 && $8 == "1" { no++ } \
  END { print "Found", yes+0, "sync samples with composition time offset, and", no+0, "without" }' \
  <(MP4Box -dtsx -std -quiet "${input}")
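
In addition to checking the composition time offsets of the sync samples, it can be useful to verify whether the source contains any edit lists at all. One illustrative way to do this (box naming as used in MP4Box's XML dump) is:

# Illustrative: dump the box structure as XML and count edit lists ('elst')
input=tears-of-steel-avc1-1000k.mp4

MP4Box -diso -std -quiet "${input}" | grep -c "EditListBox"
# A count of 0 means the file does not contain any edit lists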

Note

From input with negative CTOs, Origin (and Packager) will produce HLS TS output where the PTS of some frames is smaller than their DTS. This may result in errors when trying to verify the stream with certain transport stream specific tooling. However, PTS < DTS should not cause any issues for OTT delivered transport streams, as content is not streamed continuously but in self-contained segments. In fact, some of Apple's HLS example streams have PTS < DTS: https://developer.apple.com/streaming/examples/.

However, if you want to avoid PTS < DTS and rely on edit lists instead, it is possible to instruct Packager to do so using --positive_composition_offsets (but note that we do not recommend this).

Edit lists for audio tracks

Edit lists do have a clear use for audio tracks, as these often contain priming samples (needed to initialize the decoder) that must not be rendered. Using an edit list, the PTS of these samples can be shifted such that the PTS of the first sample that should be rendered aligns with the PTS of the first sample in each of the video tracks.
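
As a purely illustrative example (the actual number of priming samples depends on the audio codec and encoder used), an AAC track with 1024 priming samples and a timescale of 48000 could carry an edit list whose media_time skips exactly those samples:

# Illustrative arithmetic only: an edit list entry that skips priming samples
priming_samples=1024   # hypothetical encoder delay, in audio samples
timescale=48000        # hypothetical audio track timescale (samples per second)

# The 'elst' entry uses media_time equal to the number of priming samples,
# so that presentation starts at the first sample that should be rendered
echo "elst media_time = ${priming_samples}"
awk -v p="${priming_samples}" -v t="${timescale}" \
  'BEGIN { printf "presentation shift = %.4f seconds\n", p / t }'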