Using Cloud Storage

Introduction

Using some form of cloud storage has become a common use case, with services provided by external parties such as Amazon S3, Google Cloud Storage and Azure Storage.

Letting our command-line tools read from cloud storage is done by simply providing the URLs to the content as input. Authentication can be added as outlined in Authenticate requests to AWS S3.
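
For example, packaging an object stored on S3 could look like the sketch below; the bucket, region and object name are hypothetical, and the object must be readable by the request (e.g. public or pre-signed):

#!/bin/bash

# Read the input directly from S3 over HTTPS and write a local fragmented MP4
mp4split -o test.ismv https://mybucket.s3.eu-west-1.amazonaws.com/test.mp4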

However, writing to cloud storage in an optimal way is a little more complex. This guide will cover writing directly to AWS S3, Google Cloud Storage and Azure Blob Storage. The approach described here pipes the output of our command-line tools to the native tools provided by these platforms, which guarantees the best interoperability and offers several other benefits:

  • Consistent across cloud vendors (it works the same everywhere)

  • Agnostic to chosen architecture (it works anywhere, from on-prem hardware to cloud to hybrid/multi cloud)

  • Streaming: no arbitrary file size limits, such as those of a single PUT request

  • Scalable: adding or removing processing entities based on load

  • Separation of concerns: each cloud vendor supports their own tooling (and can be asked about it)

  • Failsafe: able to detect and handle error situations

  • Not implementing any proprietary protocols, but using standard practice

Amazon S3 (Simple Storage Service)

To write the output of mp4split directly to S3 we use the AWS CLI.

Linux distributions typically have this available as a prebuilt package, although these are often outdated versions. Updating to the latest recommended AWS CLI version 2 is relatively simple, as shown on the installation page.
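
For reference, on a Linux x86_64 machine the installation roughly comes down to the following; the installation page remains the authoritative, up-to-date source:

#!/bin/bash

# Download, unpack and install AWS CLI version 2 (Linux x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip awscliv2.zip
sudo ./aws/install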

To manipulate resources on S3, you use the aws s3 sub-command, which itself has many sub-commands. To download or upload files, you use the aws s3 cp sub-command.
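
For example, a plain upload and download with aws s3 cp could look like this (the bucket name is hypothetical):

#!/bin/bash

# Upload a local file to the bucket, then download it again
aws s3 cp test.mp4 s3://mybucket/test.mp4
aws s3 cp s3://mybucket/test.mp4 test-copy.mp4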

Why we use Amazon S3 Multipart Uploads

We prefer to upload in "streaming" mode, which means that our tools produce chunks of data, and these chunks are uploaded as they become available. This means that no temporary files are required, and data that has already been uploaded can be discarded on the client side.

Amazon S3 supports this use case through their so-called Multipart Upload method. This is a proprietary protocol designed by Amazon, broadly consisting of three steps:

  • Initiating the multipart upload, via a special HTTP POST request.

  • Uploading the actual data in multiple parts, via separate HTTP PUT requests.

  • Completing the multipart upload, via another special HTTP POST request.
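
The aws s3 cp command used below performs all of these steps automatically. Purely as an illustration, the same steps performed manually with the low-level aws s3api sub-command look roughly like this; the bucket, key, part file and the UploadId/ETag placeholders are hypothetical:

#!/bin/bash

# 1. Initiate the multipart upload; note the UploadId in the response
aws s3api create-multipart-upload --bucket mybucket --key test.ismv

# 2. Upload each part with a separate PUT; note the ETag of every part
aws s3api upload-part --bucket mybucket --key test.ismv \
  --part-number 1 --upload-id "<UploadId>" --body part1.bin

# 3. Complete the upload by listing all uploaded parts and their ETags
aws s3api complete-multipart-upload --bucket mybucket --key test.ismv \
  --upload-id "<UploadId>" \
  --multipart-upload '{"Parts":[{"PartNumber":1,"ETag":"<ETag>"}]}'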

How to write directly to Amazon S3

To accept input from another program, the aws s3 cp command accepts - as the local file argument, to signify the standard input. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.

For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to an S3 bucket named "mybucket", you run:

#!/bin/bash

mp4split -o stdout:.ismv test.mp4 | aws s3 cp - s3://mybucket/test.ismv

Unified Packager will write its output in 4MB chunks, and the AWS CLI will read as much as it needs to perform its multipart uploads, if necessary. (By default, the AWS CLI S3 sub-command uses 8MB chunks.)
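
If needed, this part size can be tuned via the AWS CLI configuration; a minimal sketch, where the 16MB value is only an example:

#!/bin/bash

# Increase the part size the AWS CLI uses for multipart uploads to 16MB
aws configure set default.s3.multipart_chunksize 16MB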

Error handling while writing to Amazon S3

Both our CLI tools and the AWS CLI can encounter errors during their operation. For example, our tools could find a problem in the input file(s) and be unable to continue processing, or the AWS CLI could encounter a connection failure.

Typically, if our CLI tools encounter a fatal error, they print an error message and exit with a non-zero exit code. The AWS CLI will only notice that its standard input has reached end-of-file (EOF), but cannot distinguish between a successful and a failed run. The uploaded file will therefore likely be cut short and should not be used.

When using shell pipelines to connect commands, the return status of the whole pipeline is normally the exit status of the last command. Therefore, when running mp4split | aws s3 cp, only a failure of the AWS CLI can be detected.

As this is a fundamental design problem in shell pipelines, most shells, such as bash and zsh, offer a pipefail option. If this option is enabled, the pipeline's return status becomes the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully.

We can use this to detect failures during uploads, and in case of such failure, get rid of the uploaded file. For example:

#!/bin/bash

set -o pipefail; \
  mp4split -o stdout:.ismv test.mp4 | \
  aws s3 cp - s3://mybucket/test.ismv || \
  aws s3 rm s3://mybucket/test.ismv

In this example, if either mp4split throws an error while processing "test.mp4", or aws s3 cp throws an error while uploading the result, the aws s3 rm command after the || will be run, deleting the partial output on S3.

Another possible approach is to use the bash-specific internal array variable PIPESTATUS, which contains the exit status values from the processes in the most-recently-executed foreground pipeline. For example:

#!/bin/bash

mp4split -o stdout:.ismv test.mp4 | aws s3 cp - s3://mybucket/test.ismv
# Capture both exit statuses right away; the next executed command overwrites PIPESTATUS
statuses=("${PIPESTATUS[@]}")
if [ "${statuses[0]}" -ne 0 ] || [ "${statuses[1]}" -ne 0 ]; then
  aws s3 rm s3://mybucket/test.ismv
fi

Azure Blob Storage

Microsoft's Azure Storage has multiple features, one of which is Azure Blob Storage. Although this service is comparable to Amazon S3 and Google Cloud Storage, it differs slightly in that it distinguishes between several types of 'blobs', namely:

  • Block blobs: similar to S3 objects, being individual 'files'.

  • Append blobs: optimized for append operations, such as logs.

  • Page blobs: optimized for random read/write access, such as virtual machine disks, or database engines.

For our use cases, only the 'block blobs' are relevant, and we will write to them using the AzCopy tool, azcopy. This tool is written in Go and distributed as a single executable file that can be downloaded from the AzCopy download page. Put the executable in any directory in your PATH, and it is ready to run.
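
As a rough sketch, installing AzCopy on a Linux x86_64 machine could look like this; the download link and archive layout are assumptions based on the AzCopy download page, so adjust them for your platform:

#!/bin/bash

# Download and unpack the AzCopy archive, then put the executable on the PATH
curl -L https://aka.ms/downloadazcopy-v10-linux -o azcopy.tar.gz
tar -xzf azcopy.tar.gz --strip-components=1
sudo mv azcopy /usr/local/bin/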

After installing AzCopy and logging into Azure (using azcopy login), the azcopy tool can be used to list, download, upload and otherwise manage Azure Storage blobs.

Note

When running azcopy login, make sure to use the --tenant-id option, otherwise it might choose the wrong one, and show an incomprehensible (and non-actionable) error message. See the Azure documentation on how to find the correct tenant ID.
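
For example (the tenant ID below is a placeholder):

#!/bin/bash

# Log in, explicitly selecting the Azure AD tenant to use
azcopy login --tenant-id "00000000-0000-0000-0000-000000000000"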

How to write directly to Azure Blob Storage

To accept input from another program, and upload it to Azure Blob Storage, the azcopy cp command uses the --from-to PipeBlob option. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.

For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to an Azure Storage container named "mycontainer", under storage account "myaccount", you run:

#!/bin/bash

mp4split -o stdout:.ismv test.mp4 | azcopy cp --from-to PipeBlob \
  https://myaccount.blob.core.windows.net/mycontainer/test.ismv

Unified Packager will write its output in 4MB chunks, and azcopy will upload the data to Azure Storage.

Error handling while writing to Azure Blob Storage

Similar to the error handling method used for uploading to Amazon S3 and Google Cloud Storage, the set -o pipefail shell command can be used to detect that errors occurred in any of the commands used in a shell pipeline. For example:

#!/bin/bash

set -o pipefail; \
  mp4split -o stdout:.ismv test.mp4 | \
  azcopy cp --from-to PipeBlob \
    https://myaccount.blob.core.windows.net/mycontainer/test.ismv || \
  azcopy rm https://myaccount.blob.core.windows.net/mycontainer/test.ismv

In this example, if either mp4split throws an error while processing "test.mp4", or azcopy cp throws an error while uploading the result, the azcopy rm command after the || will be run, deleting the partial output on Azure Blob Storage.

Google Cloud Storage

Google's Cloud Storage service is fairly similar to Amazon S3, using similar terminology, such as "buckets". As with Amazon S3, the most straightforward way of uploading data to Google Cloud Storage is to use the gsutil tool, which is part of the Google Cloud SDK. The SDK can be obtained from the Google Cloud SDK site, which has installation instructions for several Linux distributions (both apt-based and rpm-based), macOS and Windows.

After installing the Google Cloud SDK and authorizing it (using gcloud init or gcloud auth login), the gsutil tool can be used to list, download, upload and otherwise manage Google Cloud Storage files.
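
For example, a quick way to verify that authorization works could look like this (the bucket name is hypothetical):

#!/bin/bash

# Authorize the SDK with your Google account, then verify access
# by listing the contents of the bucket "mybucket"
gcloud auth login
gsutil ls gs://mybucket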

How to write directly to Google Cloud Storage

To accept input from another program, the gsutil cp command accepts - as the local file argument, to signify the standard input. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.

For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to a Google Cloud Storage bucket named "mybucket", you run:

#!/bin/bash

mp4split -o stdout:.ismv test.mp4 | gsutil cp - gs://mybucket/test.ismv

Unified Packager will write its output in 4MB chunks, and gsutil will read as much as it needs to perform its multiple chunk uploads, if necessary.

Error handling while writing to Google Cloud Storage

Similar to the error handling method used for uploading to S3, the set -o pipefail shell command can be used to detect that errors occurred in any of the commands used in a shell pipeline. For example:

#!/bin/bash

set -o pipefail; \
  mp4split -o stdout:.ismv test.mp4 | \
  gsutil cp - gs://mybucket/test.ismv || \
  gsutil rm gs://mybucket/test.ismv

In this example, if either mp4split throws an error while processing "test.mp4", or gsutil cp throws an error while uploading the result, the gsutil rm command after the || will be run, deleting the partial output on Google Cloud Storage.