Packaging Subtitles

There are many subtitle and caption formats to choose from. HTTP Smooth Streaming supports the XML-based Timed Text Markup Language (TTML) format and the SMPTE-TT, EBU-TT and DFXP profiles that are derived from it. HTTP Live Streaming supports Web Video Text Tracks (WebVTT), a plain-text format that is based on SubRip Text (SRT). MPEG-DASH supports both WebVTT and TTML (as well as the aforementioned derivatives).

Unified Packager can fragment and package TTML and WebVTT into (fragmented) MP4, using the stpp (or dfxp) and wvtt codecs, respectively. It can also convert plain-text SubRip Text (.srt) and WebVTT into TTML. When packaging DASH offline, an additional option is to add TTML or WebVTT as a sidecar file (but do note that sidecar subtitles are not supported by Unified Origin).

The preferred format for subtitles is DFXP (TTML), as it is the most constrained format and therefore the least likely to cause issues when Creating the media files (.ismt).

Converting SubRip Text to Timed Text Markup Language (TTML)

MP4Split supports reading SubRip Text files. It assumes the input file is encoded in ASCII unless the file starts with a Byte Order Mark (BOM), which indicates how the input should be transformed to Unicode.
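
To illustrate what BOM-based detection means in practice, here is a minimal Python sketch. It is not mp4split's implementation: the function name `decode_srt_bytes` is hypothetical, and the set of recognised BOMs is an assumption. It can be handy for checking how an .srt file would be interpreted before converting it.

```python
import codecs

# Longer BOMs are listed first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_srt_bytes(raw: bytes) -> str:
    """Decode SRT input: honour a leading BOM, otherwise require ASCII."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode the rest with the encoding it announces.
            return raw[len(bom):].decode(encoding)
    # No BOM: non-ASCII bytes raise UnicodeDecodeError here.
    return raw.decode("ascii")
```

If this raises `UnicodeDecodeError` for a file without a BOM, the file contains non-ASCII bytes and is likely to need re-encoding (with a BOM) before conversion.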

The output TTML has a default styling and layout, which in general will work well. You are free to change this, but keep in mind that most players have limited capabilities and Unified Packager expects valid TTML.

Example:

#!/bin/bash

mp4split -o video.ttml \
  video.srt --track_language="nld"

When you create a TTML file from an SRT file without specifying a language, the language defaults to English, i.e. the TTML file contains the XML attribute xml:lang="en".

Note

Make sure that you use the correct extension for the TTML files (.ttml or .dfxp).

Converting WebVTT to Timed Text Markup Language (TTML)

Generating a TTML file from WebVTT is similar to the conversion from SubRip files, except that the input is always interpreted as Unicode (regardless of any BOM), because WebVTT is UTF-8 by definition.

#!/bin/bash

mp4split -o video.ttml \
  video.webvtt --track_language=spa

A limited set of markup components in WebVTT’s cue payloads can be converted to their TTML equivalents. This allows for portability across formats regarding the basic styling that is supported by most devices and players.

Supported WebVTT cue components

Name Description
<b></b> Bolds the textual content
<i></i> Italicises the text
<u></u> Underlines the textual content
<s></s> Specifies a line strike through on the text

Here is an example of a regular WebVTT file with some cue component elements:

WebVTT cue example:

WEBVTT

1
00:00:15.000 --> 00:00:18.000
At the <u>left</u> we can see...

2
00:00:18.167 --> 00:00:20.083 position:35% line:20 align:left
At the <u>right</u> we can see the...

3
00:00:20.083 --> 00:00:22.000
...the <c.highlight>head-snarlers</c>

4
00:00:22.000 --> 00:00:24.417
Everything is safe.
<i>Perfectly</i> safe.

Result after converting to TTML:

<?xml version="1.0" encoding="utf-8"?>
<tt xmlns="..." xml:lang="en">
  <head>...</head>
  <body>
    <div style="default" xml:lang="en">
      <p begin="00:00:15.000" end="00:00:18.000" region="speaker">
        At the <span tts:textDecoration="underline">left</span> we can see...
      </p>
      <p begin="00:00:18.167" end="00:00:20.083" region="speaker">
        At the <span tts:textDecoration="underline">right</span> we can see the...
      </p>
      <p begin="00:00:20.083" end="00:00:22.000" region="speaker">
        ...the &lt;c.highlight&gt;head-snarlers&lt;/c&gt;
      </p>
      <p begin="00:00:22.000" end="00:00:24.417" region="speaker">
        Everything is safe.<br />
        <span tts:fontStyle="italic">Perfectly</span> safe.
      </p>
    </div>
  </body>
</tt>

Note

The cue settings (cue 2) are ignored when converting to TTML, and unrecognized markup in the payload is escaped (cue 3).
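
The conversion shown above can be approximated with a few lines of Python. This is only a sketch of the idea, not mp4split's code: the function name `cue_payload_to_ttml` is hypothetical, and the TTML attributes used for `<b>` (tts:fontWeight="bold") and `<s>` (tts:textDecoration="lineThrough") are assumptions, since the example only shows the output for `<u>` and `<i>`.

```python
import re
from xml.sax.saxutils import escape

# Supported cue components and the TTML styling attribute used for each.
# The entries for "b" and "s" are assumptions (see the text above).
_TAG_STYLE = {
    "b": 'tts:fontWeight="bold"',
    "i": 'tts:fontStyle="italic"',
    "u": 'tts:textDecoration="underline"',
    "s": 'tts:textDecoration="lineThrough"',
}
_TAG_RE = re.compile(r"<(/?)([bius])>")


def cue_payload_to_ttml(payload: str) -> str:
    """Turn a WebVTT cue payload into the content of a TTML <p> element."""
    parts = []
    pos = 0
    for match in _TAG_RE.finditer(payload):
        # Text between recognised tags is XML-escaped, which also
        # escapes any unrecognised markup such as <c.highlight>.
        parts.append(escape(payload[pos:match.start()]))
        closing, name = match.groups()
        parts.append("</span>" if closing else f"<span {_TAG_STYLE[name]}>")
        pos = match.end()
    parts.append(escape(payload[pos:]))
    # Line breaks inside a cue become explicit <br /> elements.
    return "".join(parts).replace("\n", "<br />")
```

The sketch does not check that tags are balanced or properly nested, so malformed payloads would produce malformed TTML.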

Creating the media files (.ismt)

Fragmented MP4 (.ismt) files that contain captions or subtitles can be created from WebVTT or TTML files.

WVTT

New in version 1.7.31.

(Web)VTT is packaged as specified by ISO/IEC 14496-30:2014 - Web Video Text Tracks, using the WVTTSampleEntry (wvtt). This format allows WebVTT-specific cue settings to define individual subtitle positioning, region and styling information. Playout only works for HLS and DASH, and only in players that support the wvtt codec.

#!/bin/bash

mp4split -o subtitles.ismt --fragment_duration=10000 \
  subtitles.webvtt --track_language=spa

When packaging WebVTT subtitles, the --track_language option is required because (unlike TTML) WebVTT files do not define a language attribute. The --fragment_duration option specifies the fragment length in milliseconds.

Besides packaging WebVTT as a fragmented MP4, packaging it as a progressive MP4 is possible as well:

#!/bin/bash

mp4split -o subtitles.mp4 \
  subtitles.webvtt --track_language=spa

TTML

TTML samples (either DFXP, EBU-TT, SMPTE-TT or CFF-TT) are stored in a subtitle track that uses the XMLSubtitleSampleEntry (stpp) with timing (@begin and @end attributes) relative to the start of the track [1]. Packaging them as fragmented MP4 (.ismt) files is done like so:

#!/bin/bash

mp4split -o video.ismt \
  video.ttml

This command creates a file with a single track, which is why the TTML input file should contain only one language. If you have a single TTML file that contains multiple languages then you will have to extract separate TTML files for each language first.
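
How the languages are separated inside a multi-language TTML file depends on the tool that produced it. As one possible approach, the Python sketch below assumes that each language lives in its own tt/body/div element carrying an xml:lang attribute, and produces one TTML document per language. The function name `split_ttml_by_language` is hypothetical, and files that mark languages differently will need a different strategy.

```python
import copy
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize TTML elements without a generated "ns0:" prefix.
ET.register_namespace("", TTML_NS)
ET.register_namespace("tts", TTML_NS + "#styling")


def split_ttml_by_language(ttml_text):
    """Return {language: TTML text}, one document per language found."""
    root = ET.fromstring(ttml_text)
    body = root.find(f"{{{TTML_NS}}}body")
    if body is None:
        raise ValueError("TTML document has no <body>")
    # Assumption: each language sits in its own <div xml:lang="..."> under <body>.
    languages = sorted({div.get(XML_LANG) for div in body if div.get(XML_LANG)})
    documents = {}
    for lang in languages:
        doc = copy.deepcopy(root)
        doc.set(XML_LANG, lang)
        doc_body = doc.find(f"{{{TTML_NS}}}body")
        for div in list(doc_body):
            # Divs without xml:lang are kept; divs for other languages are dropped.
            if div.get(XML_LANG) not in (None, lang):
                doc_body.remove(div)
        documents[lang] = ET.tostring(doc, encoding="unicode")
    return documents
```

Each returned document can then be written to its own .ttml file and packaged separately with the command shown above.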

MPEG-DASH players (e.g. DASH-JS 1.4 reference player) support this format for both VOD and LIVE playback. Other players (e.g. Google Shaka pre-release) that support only WebVTT may benefit from Adding TTML or WebVTT sidecar subtitles for MPEG-DASH or using the wvtt codec.

The ISO 14496-30 format is the preferred format. When Packaging for HTTP Smooth Streaming (HSS) the TTML is stored in a similar, but incompatible way in a fragmented MP4 (.ismt) container. The samples are stored in a text track and use the PlainTextSampleEntry(dfxp) as their format. The timing of the @begin and @end attributes is relative to the start of the sample. If you want to write this older format, then you have to add --brand=piff to the command line.

Note that Unified Origin supports both formats and adjusts the timing when necessary.
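
The two storage formats differ in the reference point of their timestamps: the start of the track for stpp, the start of the sample for dfxp. The Python sketch below only illustrates that arithmetic and is not a description of how Unified Origin performs the adjustment. All function names are hypothetical, and timestamps are assumed to use the HH:MM:SS.mmm form.

```python
def parse_clock_time(value):
    """Convert 'HH:MM:SS.mmm' to milliseconds."""
    hours, minutes, seconds = value.split(":")
    whole, millis = seconds.split(".")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + int(millis)


def format_clock_time(total_ms):
    """Convert milliseconds back to 'HH:MM:SS.mmm'."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def track_to_sample_relative(begin, end, sample_start_ms):
    """Rebase track-relative cue times onto the start of their sample."""
    return tuple(
        format_clock_time(parse_clock_time(t) - sample_start_ms) for t in (begin, end)
    )


def sample_to_track_relative(begin, end, sample_start_ms):
    """Rebase sample-relative cue times onto the start of the track."""
    return tuple(
        format_clock_time(parse_clock_time(t) + sample_start_ms) for t in (begin, end)
    )
```

For a sample that starts 10 seconds into the track, `track_to_sample_relative("00:00:15.000", "00:00:18.000", 10_000)` returns `('00:00:05.000', '00:00:08.000')`. Cues that start before the sample they are stored in are not handled by this sketch.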

Adding TTML or WebVTT sidecar subtitles for MPEG-DASH

New in version 1.7.12.

While ISMT is the format of choice for streaming subtitles, occasionally it may be desirable or necessary to expose raw unsegmented subtitles to the player. In these cases, a WebVTT or TTML sidecar file can be added to the MPD.

For instance, to cater to the (pre-release) Google Shaka player (which supports WebVTT rather than fragmented TTML), the following commands expose German subtitles as WebVTT (as well as fragmented TTML):

#!/bin/bash

mp4split -o subtitles.ttml \
  subtitles_deu.webvtt --track_language=ger

mp4split --package-mpd -o subtitles.ismt subtitles.ttml

mp4split --package-mpd -o movie.mpd \
  [audio/video] \
  subtitles.ismt \
  subtitles_deu.webvtt --track_language=ger

This adds an adaptation set with mime type text/vtt (or application/ttml+xml for TTML):

<AdaptationSet contentType="text" lang="de" mimeType="text/vtt">
  <Representation id="textstream_ger=0" bandwidth="0">
     <BaseURL>subtitles_deu.webvtt</BaseURL>
  </Representation>
</AdaptationSet>

Important

When you add sidecar subtitles, they are added as-is. That is, mp4split won’t read any metadata from the file. This means that any metadata that matters for the file (the track’s language, for example) should be passed on the command line.

Footnote

[1] To create text samples, it is important that Unified Packager can derive correct timing information from the TTML source. While the TTML spec is liberal (and sometimes ambiguous) in this respect, Packager assumes timing in HH:MM:SS.mmm format in the @begin and @end attributes of the tt/body/div/p element. Timing at a different element under tt/body is allowed, but only at the same level. For instance, SMPTE-TT encoders may choose either tt/body/div or tt/body/div/div but should not use both in one file.
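
As a rough pre-flight check for the timing assumptions described in this footnote, a TTML file can be inspected before packaging. The Python sketch below is not part of Unified Packager and only approximates the rules stated above: the function name `check_ttml_timing` is hypothetical, and it reports any element under tt/body whose @begin or @end is missing or not in HH:MM:SS.mmm form, as well as timing that appears at more than one nesting level.

```python
import re
import xml.etree.ElementTree as ET

# HH:MM:SS.mmm, for example 00:00:18.167
_CLOCK_TIME = re.compile(r"\d{2}:[0-5]\d:[0-5]\d\.\d{3}")


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_ttml_timing(ttml_text):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(ttml_text)
    body = next((c for c in root if _local_name(c.tag) == "body"), None)
    if body is None:
        return ["no <body> element found"]
    problems = []
    timed_depths = set()

    def walk(elem, depth):
        begin, end = elem.get("begin"), elem.get("end")
        if begin is not None or end is not None:
            timed_depths.add(depth)
            for name, value in (("begin", begin), ("end", end)):
                if value is None or not _CLOCK_TIME.fullmatch(value):
                    problems.append(
                        f"<{_local_name(elem.tag)}> has unexpected {name}={value!r}"
                    )
        for child in elem:
            walk(child, depth + 1)

    walk(body, 0)
    if len(timed_depths) > 1:
        problems.append("timing attributes appear at more than one level under <body>")
    return problems
```

An empty result does not guarantee that packaging will succeed; it only means that none of the issues this sketch looks for were found.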