Storage Proxy

New in version 1.10.27.

Attention

Eventually ProxyPass will replace IsmProxyPass, meaning that the full Apache 'Proxy' functionality will be utilised. An intermediate workflow is outlined below, which uses Proxy with subrequests to better handle media requests.

[Figure: storage proxy overview (storage_proxy.svg)]

Introduction

Content requested by HTTP clients is often stored remotely and must be accessed through HTTP(S) requests. Historically, the Origin has relied on cURL to handle these requests. While this is a robust solution, it has two distinct disadvantages:

  • Each request requires establishing a new connection between webservers. This is the most performance-limiting step, in terms of both network latency and request processing.
  • cURL doesn't pass through client headers.

These are resolved by using Apache subrequests to handle all upstream HTTP(s) requests.

For backwards compatibility, the cURL functionality remains unchanged and is still available.

See Configuration below on how to set up the use of subrequests.

Benefits

The key benefits include performance gains, greater flexibility and opportunities for setup optimization.

Performance Gains

Using Apache's subrequests

With subrequests, Apache establishes and maintains a pool of connections between webservers, 'intelligently' reusing or culling them as required.

To achieve performance gains, subrequests must be enabled and an individual <Proxy> section must be defined for each backend storage server. See Configuration below.

Throughput Performance Gains with Remote Storage

A fairly optimal setup may achieve performance improvements of around 10-20% while less optimized setups will see even greater gains. These improvements are largely due to caching of DNS lookups.

Adding Caching Layers to Improve Performance

Adding a further caching layer between the Origin and storage, populated with index files of stored content, significantly reduces the number of requests. This leads to a reduction in fragment latency and an increase in throughput.

Using subrequests rather than cURL requests will always be more efficient because of Apache's internal caching mechanisms (for DNS and other TCP managerial processes). The storage cache will be even more efficient if it supports HTTP keepalive [1] and the Origin is correctly configured for this, as connections between the Origin and storage cache can then be pooled and re-used. See Object Storage Reducing Latency for further information.
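As a minimal sketch of the keepalive side of this, assuming the storage cache is itself an Apache server, the core directives referenced in [1] could be set in its configuration as shown here. The values are illustrative assumptions, not tuned recommendations.

```apache
# In the storage cache's server or <VirtualHost> configuration
# (assumption: the cache is an Apache httpd instance).

# Allow clients, here the Origin, to send multiple requests over one connection.
KeepAlive On

# 0 places no limit on the number of requests per connection.
MaxKeepAliveRequests 0

# Seconds to wait for the next request before closing an idle connection
# (illustrative value).
KeepAliveTimeout 60
```

The idle timeout on the cache and the time the Origin keeps pooled connections are best chosen together: if the cache closes idle connections much sooner than the Origin expects, fewer connections are available for reuse.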

Flexibility and Optimization

Adding Custom Headers

When utilizing subrequests, our webserver modules propagate request headers transparently: headers added at web frontends are passed through Origin or Remix and arrive at the storage backends. Custom headers can be added, removed or modified as required using Apache's mod_headers module (see Adding custom HTTP headers below). They can provide additional information to aid troubleshooting, or implement server-side logic for setup optimization, for example:

  • Informing users whether an asset has been delivered between servers
  • Rate limiting bandwidth on your origin server
  • Restricting CDN traffic
  • Collecting statistics
  • Using headers to include/exclude proxy requests into a billing system
  • Controlling the routing of the mp4 proxy requests according to routing policy rules

Trace-ID Headers

Multiple proprietary and standardized methods are available for tracing requests through web services. Unified Streaming references the W3C Trace Context standard [2].

Amazon X-Amzn-Trace-Id Header
Amazon's Application Load Balancer [3] defines an X-Amzn-Trace-Id header to identify when many similar requests are received from the same client within a short time. If there are many layers in the Amazon stack, the header can also be used to track a unique request across all the layers.
Google X-Cloud-Trace-Context Header
Google's Cloud Trace [4] is a distributed tracing system for Google Cloud that collects latency data from applications and displays it near real-time in the Google Cloud Console.
Microsoft Request-Id Header
Microsoft Azure has supported the Request-Id and Correlation-Context headers for some time; however, these will be deprecated in favor of the upcoming Trace Context standard.
W3C Trace Context
W3C has recently published a draft of their Trace Context standard, which is co-authored by several Google, Dynatrace and Microsoft employees. It is intended as a replacement for Microsoft's Request-Id and Correlation-Context headers (see HTTP Correlation Protocol [5]).
Forwarded: header (RFC 7239)
RFC 7239 [6] defines the Forwarded header, which allows proxy components to disclose information lost in the proxying process.

To manage (e.g. add, remove or modify) tracing headers used by Apache, it is recommended to use subrequests alongside <Proxy> sections, and directives from mod_headers.

Amazon S3 Authentication Using Headers

Authentication is sometimes required when accessing Amazon S3 buckets.

To simplify workflows, provide greater flexibility and improve user setups, the Amazon S3 API has been integrated into Origin. This enables authentication to be handled by the Apache Proxy. A separate module, mod_unified_s3_auth, handles the configuration and the signing logic for authentication.

This has two consequences. First, AWS authentication parameters can be placed at the more logical point, where the S3 bucket is defined. Second, the signing method has changed: signing is now performed using the header approach rather than the query parameter approach, which is a better fit for the use of headers described on this page.

origin --> storage-proxy+cache --> mod_unified_s3_auth --> storage (s3)
             (ism & drefs)

See AWS S3 with Authentication for further details.

Requirements

To use subrequests in Origin, you require:

  • Apache 2.4, with the following modules enabled:
    • mod_proxy
    • mod_proxy_http
    • mod_ssl
  • mod_smooth_streaming 1.10.22, or later
  • mod_unified_s3_auth 1.10.22, or later (for AWS S3 authentication)

Installation

Install Apache and Unified Origin as usual (see How to Configure (Unified Origin) for more information).

Ensure mod_proxy, mod_proxy_http and mod_smooth_streaming are enabled, and that apachectl configtest shows no errors.

If you require Amazon S3 authentication install and enable mod_unified_s3_auth.
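As a rough sketch, the required modules might be loaded as shown here. The module file paths and the identifiers for the Unified Streaming modules are assumptions that depend on your distribution and installation. On many systems the modules are enabled through packaging tools rather than by hand.

```apache
# Paths below are assumptions; they differ per distribution.
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule ssl_module modules/mod_ssl.so

# Unified Streaming modules (identifiers and paths assumed; check your installation).
LoadModule smooth_streaming_module modules/mod_smooth_streaming.so

# Only needed for Amazon S3 authentication.
LoadModule unified_s3_auth_module modules/mod_unified_s3_auth.so
```

After reloading, apachectl -M lists the loaded modules and apachectl configtest reports any syntax errors.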

Note

For further instructions see How to Configure (Unified Origin).

Tip

apachectl can be used to test configurations

Configuration

To configure subrequests:

  • Add a UspEnableSubreq on directive
  • Add <Proxy> sections for target URLs

Custom HTTP headers can also be added, if mod_headers is enabled.

Adding UspEnableSubreq directives

UspEnableSubreq on directs Origin to use subrequests. It should be placed in a <Location> section. (See Location in the Apache documentation for further details.)

This should be combined with the directives enabling the use of the Unified Streaming module.

For example:

<Location "/">
  UspHandleIsm on
  UspEnableSubreq on
</Location>

To enable remote storage access, IsmProxyPass must also be added.

This can be done in either a <Location> or a <Directory> section, where <Location> is preferred as of 1.10.28:

<Location "/your-bucket">
  IsmProxyPass http://your-bucket.s3.amazonaws.com/
</Location>

or:

<Directory "/var/www/test/your-bucket">
  IsmProxyPass http://your-bucket.s3.amazonaws.com/
</Directory>

Note

Traditionally <Directory> has been used to access remote storage (as can be seen in the Dynamic Manifests section) with the path being a virtual path: it should not actually exist on disk for the mapping to remote storage to work.

However, according to the Apache documentation, <Location> is a better fit: remote storage does not relate to the local filesystem, which is what <Directory> implies. With <Location> there is no longer a need for a 'virtual' path.

Alternatively, the directives can be combined into a single <Location> when all content is stored remotely, for instance in S3, which is the most common use case:

<Location "/">
  UspHandleIsm on
  UspEnableSubreq on
  IsmProxyPass http://your-bucket.s3.amazonaws.com/
</Location>

For locations and directories where UspEnableSubreq is enabled, Origin issues HTTP requests to remote storage objects by building internal subrequests, and dispatching these directly into Apache's proxy handler.

Adding <Proxy> sections for target URLs

When the rewrite rules send a subrequest internally as a proxy request, it is handled by a worker in Apache. There are two built-in workers, the default forward proxy worker and the default reverse proxy worker. These are not configurable.

Additional workers can be configured explicitly, using <Proxy> sections with ProxySet directives. One should be defined for each of your remote storage servers. This enables connection reuse and HTTP keep-alive for the defined remote storage servers.

For example, for a remote storage server at http://storage.example.com/, add the following to the <VirtualHost>:

<Proxy "http://storage.example.com/">
  ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
</Proxy>

Individual settings are explained below.

If the server is reachable via both http and https, you must add a separate <Proxy> section for each.

Note

Where a <Proxy> section refers to https, you must also add the SSLProxyEngine on directive to your <VirtualHost> section.
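A minimal sketch of a storage server that is reachable over both schemes is shown here. The virtual host address is an assumption, and the ProxySet values are simply copied from the example above.

```apache
<VirtualHost *:80>
  # Required because one of the <Proxy> sections below refers to https.
  SSLProxyEngine on

  # One worker per scheme: each URL prefix gets its own connection pool.
  <Proxy "http://storage.example.com/">
    ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  </Proxy>

  <Proxy "https://storage.example.com/">
    ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  </Proxy>
</VirtualHost>
```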

The ProxySet parameters above are customized. The most important is enablereuse=on, which enables connection reuse and gives the greatest performance improvement.

For more information about the ProxySet directive, see proxyset in the Apache documentation.
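Putting the pieces together, a complete virtual host could look roughly like the sketch here. The ServerName is a hypothetical value, and the example assumes that all content lives on a single storage server. Note that the URL given to IsmProxyPass and the URL of the <Proxy> section refer to the same storage server, so that subrequests to it are handled by the explicitly configured worker.

```apache
<VirtualHost *:80>
  # Hypothetical host name for the Origin.
  ServerName origin.example.com

  # Enable Origin, switch on subrequests and map all paths to remote storage.
  <Location "/">
    UspHandleIsm on
    UspEnableSubreq on
    IsmProxyPass http://storage.example.com/
  </Location>

  # Explicit worker for the storage server, enabling pooled, reusable connections.
  <Proxy "http://storage.example.com/">
    ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  </Proxy>
</VirtualHost>
```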

Description of ProxySet key=value parameters

connectiontimeout (default: timeout)

We recommend 5 seconds, which should be more than enough for most cases, including when connecting to far away Amazon S3 buckets.

If you know your storage is "close" in network terms, this setting can be lowered. However, setting it too low can lead to an increase in errors when establishing connections.

disablereuse (default: Off)
We recommend keeping this off, see below.
enablereuse (default: On)
We recommend keeping this on (or not setting it at all), as reusing connections greatly improves performance.
keepalive (default: Off)
We recommend setting this to on, unless you know that TCP connections are kept open indefinitely by the network between your origin and storage.
retry (default: 60)
We recommend 0. This means errors are reported to the subrequest handler immediately, instead of keeping the pool workers occupied.
timeout (default: ProxyTimeout)

We recommend 30 seconds, as the upstream default is 60 seconds, which is a long time to wait for data to be retrieved.

If you know that the connection to your storage is fast, this setting can be lowered. However, setting it too low can lead to more errors when downloading storage content.

ttl (default: n/a)
We recommend 300 seconds as the upstream default does not keep inactive connections. Keeping inactive connections means they can be reused for HTTP Keep-Alive, which improves performance.
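To illustrate how these parameters might be adjusted, this sketch assumes a storage server that is "close" in network terms and has a fast connection. The host name and the lowered connectiontimeout and timeout values are purely illustrative assumptions; suitable values depend on your own measurements.

```apache
# Hypothetical storage server on the same local network as the Origin.
<Proxy "http://storage.internal.example.com/">
  # Shorter connection and transfer timeouts than the general recommendation,
  # on the assumption that the storage is nearby and fast.
  ProxySet connectiontimeout=2 enablereuse=on keepalive=on retry=0 timeout=10 ttl=300
</Proxy>
```

If errors increase after lowering these values, raising them back towards the recommended 5 and 30 seconds is the first thing to try.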

Adding custom HTTP headers

You can add custom HTTP headers to subrequests, using Apache's mod_headers. Use the RequestHeader directive inside the appropriate <Proxy> section.

For example:

<Proxy "http://storage.example.com/">
  ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  RequestHeader set MyHeader1 "%D %t"
  RequestHeader set MyHeader2 "Hello"
</Proxy>

This will add two custom headers to requests for http://storage.example.com/:

  • MyHeader1 which contains the duration and the time of the request
  • MyHeader2 which contains the fixed string Hello

Trace-ID Headers can be set similarly.
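As a hedged sketch of what this could look like, this example adds a request identifier only when none was passed through, and removes a header before it reaches storage. The header names X-Request-Id and X-Debug-Token are hypothetical. The example also assumes that mod_unique_id is loaded, so that the UNIQUE_ID environment variable is available, and that Apache is version 2.4.7 or later, which is required for setifempty.

```apache
<Proxy "http://storage.example.com/">
  ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300

  # Keep an identifier supplied upstream; otherwise generate one.
  # UNIQUE_ID is provided by mod_unique_id (assumed to be loaded).
  RequestHeader setifempty X-Request-Id "%{UNIQUE_ID}e"

  # Remove a header that storage should not see (hypothetical name).
  RequestHeader unset X-Debug-Token
</Proxy>
```

Using setifempty rather than set leaves an identifier that was already propagated from a frontend untouched, so a request can still be followed across all layers.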

For more information about the possible uses of the RequestHeader directive, see RequestHeader in the Apache documentation.

In the above case, headers are only added to requests for media fragments, as <Proxy> is only used for media fragments. If headers are required on manifest requests, they may be added in a proxy, for instance as outlined in Header Authorization. Alternatively, if local caching is used as outlined in Object Storage Reducing Latency, the headers may be set in the caching virtual host, so that they are added when the request is proxied to the remote storage.

Notes

[1] https://httpd.apache.org/docs/2.4/mod/core.html#keepalive
[2] https://www.w3.org/TR/trace-context/
[3] https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-request-tracing.html
[4] https://cloud.google.com/trace/docs
[5] https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.DiagnosticSource/src/HttpCorrelationProtocol.md
[6] https://tools.ietf.org/html/rfc7239