Ideal profile settings
Suggested profiles for Video and Audio
The profiles below are examples of good encoding strategies that are known to work well with ABR streaming and target many different devices effectively.
They were designed to be encoded with a 3.84 second chunk size. This allows video and audio access units to be aligned for HTTP Live Streaming (HLS), where audio and video frames are multiplexed together within a single MPEG-2 transport stream, which helps some decoders switch cleanly between profiles.
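As a rough check of the 3.84 second figure, the sketch below counts how many frames fit exactly into one chunk, using integer arithmetic only. The 48 kHz audio sample rate, the 25 FPS frame rate and the helper name `whole_units` are assumptions made for illustration; they are not taken from the profile tables.

```bash
#!/bin/bash
# whole_units DUR_NUM DUR_DEN UNIT_NUM UNIT_DEN
# Prints how many units of UNIT_NUM/UNIT_DEN seconds fit exactly into a chunk
# of DUR_NUM/DUR_DEN seconds. Prints nothing and returns 1 if the chunk does
# not hold a whole number of units.
whole_units() {
  local dur_num=$1 dur_den=$2 unit_num=$3 unit_den=$4
  local top=$(( dur_num * unit_den ))
  local bottom=$(( dur_den * unit_num ))
  [ $(( top % bottom )) -eq 0 ] || return 1
  echo $(( top / bottom ))
}

# 3.84 s is written as the fraction 384/100.
# Assumption: AAC at 48 kHz, so one audio frame lasts 1024/48000 s.
echo "AAC frames per chunk:   $(whole_units 384 100 1024 48000)"
# Assumption: 25 FPS video, so one video frame lasts 1/25 s.
echo "Video frames per chunk: $(whole_units 384 100 1 25)"
```

Under these assumptions a chunk holds 180 AAC frames and 96 video frames. A 4 second chunk at 48 kHz, by contrast, would hold 187.5 AAC frames, so the function returns 1 for `whole_units 4 1 1024 48000`.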
It is not expected that all profiles are presented to all devices. Rather, different devices are usually presented with their own ideal set of profiles, either through manifest filtering or by giving them their own device manifest. Examples of this can be found at Playout Control and Using different manifests.
Research done by the BBC, for example, has found that large-screen devices such as TVs tend to benefit more from higher frame rates than small-screen devices such as tablets, where spatial resolution is of greater importance.
|Rate (kbps)||FPS||Resolution||H.264 Profile|
The audio that goes with a given video format depends on the device capabilities.
|Rate (kbps)||Samplerate (Hz)||Audio Codec|
Radio (Mixing audio profiles)
For Radio streams, it may be necessary to provide multiple audio profiles as shown above to the client.
In order to achieve consistent fragment durations and timestamp boundaries, the media fragments should contain an integer number of AAC access units (frames).
To achieve this you can use the fragment length settings that the origin provides.
However, without fractional notation the lowest value that satisfies both AAC-LC and HE-AAC is 16 seconds. With fractional notation you can set a lower value such as 3.2 (153600/48000) or 6.4 (307200/48000) seconds and still have perfect alignment, which both prevents errors and reduces latency.
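The arithmetic behind these values can be sketched as follows. The sketch assumes a 48 kHz output sample rate, 1024 samples per AAC-LC frame and 2048 output samples per HE-AAC frame; `gcd`, `lcm` and `is_aligned` are illustrative helpers and not part of mp4split.

```bash
#!/bin/bash
# Greatest common divisor (Euclid's algorithm).
gcd() {
  local a=$1 b=$2 t
  while [ "$b" -ne 0 ]; do
    t=$(( a % b )); a=$b; b=$t
  done
  echo "$a"
}

# Lowest common multiple of two positive integers.
lcm() { echo $(( $1 / $(gcd "$1" "$2") * $2 )); }

# is_aligned SAMPLES FRAME: true if SAMPLES is a whole number of FRAME-sized frames.
is_aligned() { [ $(( $1 % $2 )) -eq 0 ]; }

RATE=48000    # assumed output sample rate in Hz
AAC_LC=1024   # samples per AAC-LC frame
HE_AAC=2048   # assumed output samples per HE-AAC frame

# A fragment must be a multiple of both frame sizes.
frame=$(lcm "$AAC_LC" "$HE_AAC")
# Shortest fragment that is also a whole number of seconds.
whole=$(lcm "$frame" "$RATE")
echo "Shortest whole-second fragment: $(( whole / RATE )) s"

# Check the fractional values 153600/48000 (3.2 s) and 307200/48000 (6.4 s).
for samples in 153600 307200; do
  if is_aligned "$samples" "$frame"; then
    echo "$samples/$RATE: $(( samples / AAC_LC )) AAC-LC frames, $(( samples / HE_AAC )) HE-AAC frames"
  fi
done
```

With these assumptions the shortest whole-second fragment comes out at 16 seconds, while 153600/48000 and 307200/48000 each contain a whole number of frames for both codecs.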
For specific publishing point settings, please see below. The
minimum_fragment_length settings for each format are expressed using
fractional notation, which is well suited to a mixed profile scenario like the table above.
#!/bin/bash

mp4split -o radio.isml \
  --archiving=1 --archive_length=28800 --archive_segment_length=120 \
  --dvr_window_length=7200 \
  --restart_on_encoder_reconnect \
  --iss.minimum_fragment_length=307200/48000 \
  --hls.minimum_fragment_length=307200/48000 \
  --mpd.minimum_fragment_length=307200/48000 \
  --hds.minimum_fragment_length=307200/48000
To ensure maximum compatibility, audio and video tracks should be aligned as much as possible. This is important when multiplexing tracks within a single MPEG-2 transport stream as described above, but also when creating Virtual subclips and when using Capture without frame accuracy (i.e., without transcoding), because perfect alignment within a stream eliminates the possibility of an audio track offset at the start of the resulting clips.
Ideally, alignment means that every start of a GOP is also the start of an audio frame. This is the case when a GOP spans a whole number of audio frames, each of which is 1024 audio samples long (in the case of AAC).
In an ideal scenario the length of a GOP is equal to, or a multiple of, the lowest common multiple of the length of a video frame and the length of an audio frame.
Attaining such a degree of alignment across all of a stream's audio and video tracks may not be possible. In such cases, it is advisable to make sure that the start of an audio frame at least aligns with the start of a GOP once every few GOPs.
For example, with a video track of 25 FPS and an audio track that has a sample rate of 32 kHz, the length of a video frame is 0.04 seconds and the length of an audio frame is (1024 / 32000) = 0.032 seconds. The lowest common multiple of both numbers is 4/25, or 0.16 seconds. The GOP length should be a multiple of this, e.g., 2.4 seconds. A GOP of 2.4 seconds means that each GOP contains exactly 60 video frames and 75 AAC frames.
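The same calculation can be sketched with exact fractions rather than decimals. The helper names `gcd`, `lcm` and `frac_lcm` are hypothetical and only illustrate the arithmetic; frame lengths are passed as numerator and denominator in seconds.

```bash
#!/bin/bash
# Greatest common divisor (Euclid's algorithm).
gcd() {
  local a=$1 b=$2 t
  while [ "$b" -ne 0 ]; do
    t=$(( a % b )); a=$b; b=$t
  done
  echo "$a"
}

# Lowest common multiple of two positive integers.
lcm() { echo $(( $1 / $(gcd "$1" "$2") * $2 )); }

# frac_lcm A B C D: lowest common multiple of the fractions A/B and C/D,
# printed as "numerator/denominator". Both fractions are reduced first;
# the result is then lcm(numerators) / gcd(denominators).
frac_lcm() {
  local a=$1 b=$2 c=$3 d=$4 g
  g=$(gcd "$a" "$b"); a=$(( a / g )); b=$(( b / g ))
  g=$(gcd "$c" "$d"); c=$(( c / g )); d=$(( d / g ))
  echo "$(lcm "$a" "$c")/$(gcd "$b" "$d")"
}

# Video frame at 25 FPS: 1/25 s. AAC frame at 32 kHz: 1024/32000 s.
echo "Alignment interval: $(frac_lcm 1 25 1024 32000) s"

# A 2.4 s GOP (12/5 s) is 15 of these intervals; count the frames it holds.
echo "Video frames per GOP: $(( 12 * 25 / 5 ))"
echo "AAC frames per GOP:   $(( 12 * 32000 / (5 * 1024) ))"
```

For other combinations the interval changes accordingly; for instance, `frac_lcm 1 30 1024 48000` gives 8/15, so a 30 FPS track with 48 kHz AAC would align every 16 video frames.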
A sync issue can be caused by the video samples having a composition-time-offset. To mitigate the problems of composition time offsets, alignment, and a possibly missing edit list, it is recommended to use negative-composition-times in the "trun" boxes. That is, you should use version 1 "trun" boxes, and the DTS should equal the PTS for the first keyframe in the media segment. Following samples should use a zero or negative offset where applicable.
Playout of both DASH and Smooth Streaming supports negative-composition-times. HLS (e.g. the Apple example stream) is also okay with PTS < DTS values.
When using negative-composition-times it is easier for an encoder to synchronize audio and video, since it can simply use the DTS/PTS for the start of each media segment (since they are equal) and compensate for B-frames using negative offsets. This also allows alignment across different video profiles (with and without B-frames), since the DTS equals the PTS at the beginning of a media segment, and a possible composition-time-offset only signals the ordering of the frames.