Encoding Requirements

Introduction

This guide describes how encoders must be configured in order to guarantee correct operation with Unified Origin Live. It has been created for customers and partners who are configuring encoders to publish to Unified Origin Live.

Background

This guide proactively addresses issues by providing a thorough checklist to ensure your encoder is configured correctly. You will also find the technical reasons for those configurations, as well as some scripts to validate that your settings are indeed correct.

Unified Streaming have worked with several encoder technology vendors to qualify their solutions with our technology platform; see the Factsheet for supported encoders. However, any encoder output that conforms to the requirements detailed on this page will work. If you are an encoder vendor and wish to schedule an interoperability exercise with our product team, please contact sales@unified-streaming.com.

Definitions

  • REQUIRED: if this characteristic is not satisfied, your Unified Origin Live setup will not work; these are MUST-have configuration options.
  • RECOMMENDED: without this characteristic your Unified Origin Live setup will lack features and/or may not work; these are SHOULD-have configuration options.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Verifying Configuration

This page includes several examples to assist in the verification of your configuration. We have chosen to use commonly available tools and Unified Streaming products to assist with this where practicable.

References

Unified Origin Live supports the ingest of fragmented MPEG-4 (fMP4), also known as ISOBMFF (ISO/IEC 14496-12), streams following the fragmented MP4 live ingest specification by Microsoft Azure.

UTC Timestamps SHOULD be used

If you need seamless encoder failover, please see Dual ingest setup (Failover). This requires that the segment contents (including timing information) generated by each encoder are identical for a given segment towards a given publish point. For this reason, we recommend that encoders generate all stream components with a UTC timestamp.

Unified Origin accepts timestamps in either the “tfxd” or the “tfdt” box.

Verifying your configuration

You can verify your configuration through a GET request to:

http(s)://example/path/<channel-name>.isml/archive

The end time attribute in each stream should be (close to) the actual current UTC time.
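
For example, the following sketch prints the current UTC time and then fetches the archive overview for comparison (the publishing point URL is a placeholder; substitute your own channel and path):

#!/bin/bash

# Print the current UTC time, then fetch the archive overview for comparison.
# The publishing point URL below is a placeholder.
date -u
curl -s "http://example/live/channel1.isml/archive"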

Manifest information MUST be sent at the start of a stream

At the start of a Live streaming event the encoder MUST include manifest information. The information in the manifest enables the server to interpret the incoming live stream and assign semantic meaning to the stream’s tracks.

For Smooth Streaming ingest, the “ftyp”, LiveServerManifestBox, and “moov” boxes MUST be sent at the start of each POST request. For DASH ingest, the DASH MPD MUST be sent at the start of each POST request.

Verifying your configuration

While the encoder is running, query the REST API of the Live Publishing point and verify that the state is either STARTING or STARTED, and that the archive contains the expected number of tracks. See Retrieving additional publishing point information for more information.
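
A minimal sketch using curl (again, the publishing point URL is a placeholder):

#!/bin/bash

# While ingest is running, the reported state should be STARTING or STARTED.
# The publishing point URL below is a placeholder.
curl -s "http://example/live/channel1.isml/state"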

End-Of-Stream (EOS) signal SHOULD be sent

At the end of a Live streaming event the encoder SHOULD send an EOS signal for all tracks contained in a live stream. This properly signifies the end of a Live broadcast and makes the publishing point switch to a “stopped” state. An EOS signal consists of an empty “mfra” box, with no embedded sample entries in the “tfra” box and no “mfro” box following, as specified by ISO/IEC 14496-12.

The empty mfra box is the following 8-byte sequence:

00 00 00 08 6d 66 72 61
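
As an illustrative sketch, this sequence can be generated and sent with standard shell tools (the publishing point URL and stream identifier are placeholders):

#!/bin/bash

# Emit the 8-byte empty "mfra" box and POST it to the ingest URL to signal
# end-of-stream for a track. URL and stream identifier are placeholders.
printf '\x00\x00\x00\x08mfra' | \
  curl -s -X POST --data-binary @- \
    "http://example/live/channel1.isml/Streams(stream1)"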

Verifying your configuration

Stop your encoder, then query the REST API of the Live Publishing point; it should be in the “STOPPED” state. See Retrieving additional publishing point information for more information.

Group of pictures MUST be aligned across bitrates

Each bitrate needs to be GOP aligned. This enables the player to switch between adaptive bitrate video components without significant degradation of the rendered video.

Verifying your configuration

Essentially, extract the frame types of all adaptive bitrate components and ensure there is no difference between them. The example below compares two adaptive bitrate segments:

#!/bin/bash

source1="http://demo.unified-streaming.com/video/tears-of-steel/tears-of-steel.ism/tears-of-steel-audio_eng=64013-video_eng=407000-10.ts"
source2="http://demo.unified-streaming.com/video/tears-of-steel/tears-of-steel.ism/tears-of-steel-audio_eng=134998-video_eng=755000-10.ts"

# Retrieve the first item you are comparing
echo "Retrieving adaptive source1"
curl -s $source1 > tmp.ts && \
  ffprobe -hide_banner -show_frames tmp.ts | grep "pict_type" > source1

# Retrieve the second item you are comparing
echo "Retrieving adaptive source1"
curl -s $source2 > tmp.ts && \
  ffprobe -hide_banner -show_frames tmp.ts | grep "pict_type" > source2

# Compare the two sources
echo "Comparing sources"
if [ $(diff source1 source2 | wc -l) -lt 1 ]; then
  echo -e "\033[92mNO GOP ALIGNMENT DISCREPANCY\033[39m"
else
  echo -e "\033[91mGOP ALIGNMENT DISCREPANCY\033[39m"
fi

# Clean up the temporary files
rm -f tmp.ts source1 source2

Note

In addition to the requirement that all bitrates are aligned, as described above, it is advisable that all segments have an equal duration. This shortens your client manifests considerably, because it allows the timeline of each track to be represented on a single line, using the set duration in combination with a number of repeats (one for each additional segment), as illustrated below.
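
For instance, in a DASH client manifest a run of equal-length segments collapses into a single SegmentTimeline entry; the duration, timescale and counts below are illustrative only:

<SegmentTimeline>
  <!-- 150 equal segments: one entry of duration d plus 149 repeats -->
  <S t="0" d="38400" r="149"/>
</SegmentTimeline>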

Video segments MUST start with an IDR frame

Each segment must start with an IDR frame so that the segment can be considered discrete from a decoding perspective. This enables the player to switch between adaptive bitrate video components without significant degradation of the rendered video.

Verifying your configuration

The example below uses ffprobe to display the video frame type for the first 15 frames of a given segment. Importantly, the first frame must be of type pict_type=I.

#!/bin/bash

curl -s http://demo.unified-streaming.com/video/ateam/ateam.ism/ateam-audio=128000-video=400000-12.ts > tmp.ts && \
  ffprobe -hide_banner -show_frames tmp.ts | \
  grep "pict_type" | \
  awk 'FNR <= 15' && rm tmp.ts

Audio tracks SHOULD have a timescale that matches their sample rate

To avoid potential timing issues, audio tracks should use a timescale that matches the sample rate. If the sample rate and timescale do not match (i.e., are not equal or integer multiples of each other), some samples will not be accurately addressable, which may cause discontinuities.

Verifying your configuration

This can be verified with ffprobe:

#!/bin/bash

ffprobe -hide_banner -show_streams main.mp4 | grep codec_time_base

This will give you output like the following, where the denominator of codec_time_base should be equal to (or an integer multiple of) the sample rate listed as part of the stream’s properties:

codec_time_base=1/44100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'main.mp4':
  Metadata:
    major_brand     : iso6
    minor_version   : 0
    compatible_brands: iso6
  Duration: 00:00:46.58, start: 0.000000, bitrate: 281 kb/s
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, 5.1, fltp, 280 kb/s (default)
    Metadata:
      handler_name    : USP Sound Handler

HTTP POST to a Publishing Point MUST be used

The encoder MUST send the MP4 payload using HTTP POST. Segment payloads MAY be posted either individually or as a long-running (chunked) POST. It is recommended that the encoder sends an initial HTTP/1.1 handshake (“Expect: 100-Continue” header plus an empty body) before sending the (chunked) payload.

The encoder MUST post to a publishing point using the following URL syntax:

http(s)://example/path/<channel-name>.isml/Streams(<stream-identifier>)

The live server manifest and the directory it resides in MUST be writable by the webserver.

The <stream-identifier> SHOULD be unique for each stream.

The main advantage of posting multiple (multiplexed) tracks in a single stream is that they are implicitly synchronized: either all bitrates are present for a given interval, or none are.

The track bitrate MAY be signalled (as a last-resort hack) in the stream identifier, where a decimal number following a dash is interpreted as the bitrate in kilobits per second of the corresponding (first, second, etc.) track in the stream, for instance:

<channel-name>.isml/Streams(stream-3000k-128k)

This brittle setup is not recommended and is intended for testing purposes only. It SHOULD NOT be used in any production setup.
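
As an illustrative sketch, an ingest POST can be emulated with curl (the local file, publishing point URL, and stream identifier are placeholders):

#!/bin/bash

# POST a locally captured fMP4 file to a publishing point as a chunked
# transfer. stream.ismv, the URL, and the stream identifier are placeholders.
curl -s -X POST \
  -H "Expect: 100-continue" \
  -H "Transfer-Encoding: chunked" \
  --data-binary @stream.ismv \
  "http://example/live/channel1.isml/Streams(stream1)"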

Verifying your configuration

To verify that a publishing point is ready to ingest from an encoder, check that the webserver has write access to the live server manifest and the directory it resides in. The live server manifest should not contain any tracks, and the referenced or implied database (db3) file should not exist.

A GET request to:

http(s)://example/path/<channel-name>.isml/state

should indicate that the state is “idle”.

To verify that the ingest was started successfully, a GET request to:

http(s)://example/path/<channel-name>.isml/state

should indicate that the state is “started”. A database file (db3) will be created, the live server manifest will list the tracks, and the ingested media will be stored in one or more ismv files. More detailed information about individual tracks can be obtained through a GET request to:

http(s)://example/path/<channel-name>.isml/statistics
http(s)://example/path/<channel-name>.isml/archive

Metadata Signalling

All metadata MUST be signalled in the “moov” box / initialization segment.

The most important of these are:

  • For audio the language MUST be set in the MediaHeaderBox (“moov/trak/mdia/mdhd”)
  • For audio the bitrate MUST be set in the MP4AudioSampleEntry (“moov/trak/mdia/minf/stbl/stsd/esds”)
  • For video the average and maximum bitrate MUST be set in the AVCSampleEntry/HEVCSampleEntry as BitRateBox

How this metadata should be signalled is specified in the “ISO base media file format” specification (ISO/IEC 14496-12).

Furthermore, to be able to calculate and signal the framerate of the video, the following MUST be signalled in the elementary AVC / HEVC video stream:

  • The timescale and number of units in a tick MUST be set in the respective VUI parameters (‘time_scale’ and ‘num_units_in_tick’)
  • In addition, the boolean VUI parameters ‘timing_info_present_flag’ and ‘fixed_frame_rate_flag’ MUST be set to ‘true’, to signal that the timing info is present and that the framerate is fixed (a rough check follows below).
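
As a rough check (main.mp4 is a placeholder, as in the earlier examples): when timing info is present and the framerate is fixed, ffprobe will typically report a constant rate for the video stream, i.e. r_frame_rate and avg_frame_rate agree:

#!/bin/bash

# For a fixed-framerate stream, r_frame_rate and avg_frame_rate should agree.
ffprobe -hide_banner -show_streams main.mp4 | grep -E "r_frame_rate|avg_frame_rate"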

Encoding Profiles

Suggested profiles for Video and Audio

The profiles below are examples of good encoding strategies that are known to work well with ABR streaming and target many different devices effectively.

They were designed to be encoded with a 3.84-second chunk size. This enables video and audio access units to be aligned for HTTP Live Streaming (HLS), where audio and video frames are multiplexed together within a single MPEG-2 transport stream, helping decoders make a clean switch between profiles.

It is not expected that all profiles are presented to all devices. Rather, it is usually the case that different devices are presented their own ideal set of profiles, be that through manifest filtering or by giving them their own device manifest. Examples of this can be found at Playout Control and Using different manifests.

Research done by the BBC, for example, has found that large-screen devices such as TVs tend to benefit more from higher frame rates than small-screen devices such as tablets, where spatial resolution is of greater importance.

Video profiles

Rate (kbps)   FPS    Resolution   H.264 Profile
31            6.25   192x108p     Baseline
86            25     192x108p     Baseline
156           25     256x144p     Baseline
281           25     384x216p     Baseline
437           25     448x252p     Baseline
437           25     512x288p     Main
688           25     640x360p     Baseline
827           25     704x396p     Main
929           25     544x576p     High
1570          50     704x396p     Main
1570          25     704x576i     High
1374          25     896x504p     High
2812          50     960x540p     High
5070          50     1280x720p    High

Audio profiles

The audio that goes with a given video format depends on the device capabilities.

Rate (kbps)   Samplerate (Hz)   Audio Codec
24            24000             HE-AAC
64            32000             HE-AAC
96            48000             HE-AAC
128           48000             AAC-LC
320           48000             AAC-LC
384           48000             Dolby AC-3

Radio (Mixing audio profiles)

For Radio streams, it may be necessary to provide the client with multiple audio profiles, as shown above.

In order to achieve consistent fragment durations and timestamp boundaries the media fragments should contain an integer number of AAC access units (frames).

To achieve this, you can use the fragment length settings that the origin provides, e.g. --minimum_fragment_length=.

However, without fractional notation the lowest value that satisfies both AAC-LC and HE-AAC is 16 seconds. By using fractional notation you can set this to a lower value like 3.2 (153600/48000) or 6.4 (307200/48000) seconds and have perfect alignment, which will both prevent errors and reduce latency.

For specific publishing point settings, please see below. The minimum_fragment_length settings for each format are expressed using fractional notation, which suits a mixed profile scenario like the one shown in the table above.

#!/bin/bash

mp4split -o radio.isml \
  --archiving=1 --archive_length=28800 --archive_segment_length=120 \
  --dvr_window_length=7200 \
  --restart_on_encoder_reconnect \
  --iss.minimum_fragment_length=307200/48000 \
  --hls.minimum_fragment_length=307200/48000 \
  --mpd.minimum_fragment_length=307200/48000 \
  --hds.minimum_fragment_length=307200/48000
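
With this setting, each fragment of 307200/48000 = 6.4 seconds contains exactly 300 AAC-LC frames (1024 samples each) and 150 HE-AAC frames (2048 samples each), so fragment boundaries always fall on frame boundaries for both codecs.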

Stream alignment

To ensure maximum compatibility, audio and video tracks should be aligned as much as possible. This is important when multiplexing tracks within a single MPEG-2 transport stream as described above, but also when creating Virtual subclips and when using Capture without frame accuracy (i.e., without transcoding), because perfect alignment within a stream eliminates the possibility of an audio track offset at the start of the resulting clips.

Ideally, alignment means that every start of a GOP is also the start of an audio frame. This is the case when the length of a GOP fits an exact number of audio frames, each of which is 1024 audio samples long (in the case of AAC).

In an ideal scenario, the length of a GOP is equal to, or a multiple of, the lowest common multiple of the length of a video frame and the length of an audio frame.

Attaining such a degree of alignment across all of a stream’s audio and video tracks may not be possible. In such cases, it is advisable to make sure that the start of an audio frame at least aligns with the start of a GOP once every few GOPs.

For example, with a video track of 25 FPS and an audio track that has a sample rate of 32 KHz, the length of a video frame is 0.04 seconds and the length of an audio frame is (1024 / 32000) = 0.032 seconds. The lowest common multiple of both numbers is 4/25, or 0.16 seconds. The GOP should be a multiple of this, e.g., 2.4 seconds. A GOP of 2.4 seconds would mean that each GOP contains 60 video frames and 75 AAC frames exactly.
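
The following sketch reproduces this calculation for arbitrary rates; the values below are illustrative, so adjust fps, samplerate, and samples_per_frame to your own configuration:

#!/bin/bash

# Compute the smallest duration (in seconds) at which video frames and AAC
# audio frames realign, working in ticks of 1/(fps * samplerate) seconds.
fps=25
samplerate=32000
samples_per_frame=1024  # 2048 for HE-AAC

gcd() {
  local a=$1 b=$2
  while [ "$b" -ne 0 ]; do
    local t=$b
    b=$((a % b))
    a=$t
  done
  echo "$a"
}

video_ticks=$samplerate                   # duration of one video frame, in ticks
audio_ticks=$((samples_per_frame * fps))  # duration of one audio frame, in ticks
g=$(gcd "$video_ticks" "$audio_ticks")
lcm=$((video_ticks * audio_ticks / g))

# For 25 fps and 32 kHz this prints .1600, i.e. 0.16 seconds.
echo "scale=4; $lcm / ($fps * $samplerate)" | bc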

Sync Issues

A sync issue can be caused by the video samples having a composition time offset. To mitigate the problems of composition time offsets, alignment, and a possibly missing edit list, it is recommended to use negative composition times in the “trun” boxes. That is, you should use version 1 “trun” boxes, and the DTS should equal the PTS for the first keyframe in the media segment. Subsequent samples should use a zero or negative offset where applicable.

Playout of DASH as well as Smooth Streaming supports negative composition times. HLS (e.g., the Apple example stream) also handles PTS < DTS values correctly.

When using negative composition times it is easier for an encoder to synchronize audio and video, since it can simply use the DTS/PTS for the start of each media segment (as they are equal) and compensate for B-frames using negative offsets. This also allows alignment across different video profiles (with and without B-frames), since the DTS equals the PTS at the beginning of a media segment, and a possible composition time offset only signals the ordering of the frames.