HTTP/2 for Apache httpd
Copyright (C) 2015 greenbytes GmbH
Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. See LICENSE for details.
This is a look at the internals of the mod_h2 implementation and its interfaces with Apache httpd. I try to share experiences and observations made during the implementation of mod_h2, without any guarantee of completeness or particular order. All mistakes made are my own.
The nature of HTTP/2 places new demands on a server. In HTTP/1, the client's only expectation after sending a request is to get an answer to that request as soon as possible. Even if it pipelines a second request on the same connection, it will expect the answer to the first one to arrive before the second. The server can process a HTTP/1 connection with a single thread, since it has only one thing to do at a time (I exclude sub-requests from this discussion).
And this was the model for the early httpd. It was later refined by different multi-processing modules, the current star being mpm_event, which can reuse threads during times when a request is waiting. But while the thread may change during the lifetime of a connection, there is only ever one at a time. And there is only ever one request worked on per connection at a time. In gaming, one would say this is a "1-1-1" build.
HTTP/2 is built for handling multiple requests at once, expecting high bandwidth utilization, interleaving of responses and even on-the-fly prioritization. But not only that: both endpoints of a HTTP/2 connection are frequently exchanging meta information, adjusting window sizes, updating settings or simply answering a ping request.
A HTTP/2 server that only serves static files may handle all of this in a single thread, using some sort of async I/O or event handling. A server like httpd, which allows configurable filters/handlers and foreign request processing modules, cannot do that. Instead, request processing must be shifted into separate threads, while the "main" thread serves the connection and collects and combines results from request processing. This would then be called a "1-n-n" processing model.
But that still is too simple, since threads are a valuable and limited server resource. Guaranteeing a thread for each client request is not possible unless the number of parallel requests is kept small. But that would limit the client and, especially on high latency connections, potentially lower performance. A better model is to allow queuing of requests up to a certain amount if not enough processing threads are available. The server then has a "1-n-m" processing model.
mod_h2 implements that model with one thread serving the connection, allowing up to a configurable number of parallel requests which are then served by a number of workers. The worker output, the response data, is collected again in the connection thread for sending out to the client. Worker threads are blocked when buffered output reaches a certain memory size (this is the model adopted from mod_spdy).
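To make this back-pressure concrete, here is a minimal sketch in C, using APR primitives, of how a worker could be blocked on a per-stream output buffer. The names and the exact mechanism are illustrative assumptions, not mod_h2's actual code:

    #include <apr_thread_mutex.h>
    #include <apr_thread_cond.h>

    /* Illustrative sketch, not mod_h2 code: a worker blocks when the
     * buffered output for its stream exceeds a configured memory limit. */
    typedef struct {
        apr_thread_mutex_t *lock;
        apr_thread_cond_t  *has_room;  /* signaled when output was drained */
        apr_size_t          buffered;  /* bytes queued for the connection */
        apr_size_t          max_mem;   /* the configured per-stream limit */
        int                 aborted;
    } out_buffer;

    /* worker thread: queue response data, blocking while the buffer is full */
    static apr_status_t out_write(out_buffer *ob, apr_size_t len)
    {
        apr_thread_mutex_lock(ob->lock);
        while (!ob->aborted && ob->buffered + len > ob->max_mem) {
            apr_thread_cond_wait(ob->has_room, ob->lock);
        }
        if (!ob->aborted) {
            ob->buffered += len;       /* real code would append data here */
        }
        apr_thread_mutex_unlock(ob->lock);
        return ob->aborted? APR_ECONNABORTED : APR_SUCCESS;
    }

    /* connection thread: after sending bytes to the client, make room */
    static void out_drained(out_buffer *ob, apr_size_t len)
    {
        apr_thread_mutex_lock(ob->lock);
        ob->buffered -= len;
        apr_thread_cond_broadcast(ob->has_room);
        apr_thread_mutex_unlock(ob->lock);
    }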
In its internal architecture, mod_h2 started with the basic mod_spdy architecture and then tweaked it some more. For those not familiar with mod_spdy, I try to give a short summary of what Google did there:
mod_spdy architecture

The architecture is most easily understood by following what happens on a new connection and the first request:
- In a pre_connection hook, running after mod_ssl, the module registers NPN callbacks for new TLS connections.
- In a connection hook, also after mod_ssl, the whole connection processing is taken over if a spdy protocol was negotiated via NPN. This disables any further processing of the connection by the httpd core. However, the mod_ssl in-/output filters stay in place.
- mod_spdy now talks the selected spdy dialect with the client, negotiates some parameters and reads the first request.
- For each request, a pseudo connection is created and passed to ap_process_connection(), which runs all the usual connection hooks. The pre-connection hooks of the module detect the nature of the connection and disable mod_ssl for it, among other things.
- mod_spdy has a filter pair to read/write data on such a pseudo connection, which runs the httpd HTTP/1 processing engine in another thread.
- The request is serialized into HTTP/1.1 format onto that pseudo connection, where httpd core will parse it, create a request_rec and process that. The response eventually arrives in HTTP/1.1 at spdy's output filter on that connection, is converted to internal format and passed out in spdy protocol packages on the main connection.
This is how mod_spdy works, in a nutshell. There are many details involved in getting things right and keeping everyone convinced that there is an ordinary HTTP/1.1 request on a connection; just business as usual, move on please and disregard the man behind the curtain.
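As an illustration of that serialization step, here is a hedged sketch of how request headers might be written in HTTP/1.1 wire format onto a brigade for the pseudo connection. The function and parameter names are made up for this example:

    #include <apr_buckets.h>
    #include <apr_tables.h>

    /* helper for apr_table_do: emit one "Name: value" header line */
    static int add_header(void *ctx, const char *name, const char *value)
    {
        apr_brigade_printf((apr_bucket_brigade *)ctx, NULL, NULL,
                           "%s: %s\r\n", name, value);
        return 1; /* keep iterating */
    }

    /* write the decoded h2/spdy request as an HTTP/1.1 head onto bb, where
     * httpd's parser will pick it up and create a request_rec from it */
    static void serialize_request(apr_bucket_brigade *bb, const char *method,
                                  const char *path, const char *authority,
                                  apr_table_t *headers)
    {
        apr_brigade_printf(bb, NULL, NULL, "%s %s HTTP/1.1\r\n", method, path);
        apr_brigade_printf(bb, NULL, NULL, "Host: %s\r\n", authority);
        apr_table_do(add_header, bb, headers, NULL);
        apr_brigade_puts(bb, NULL, NULL, "\r\n");
    }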
And it is a very good approach to bringing the spdy/h2 processing model into httpd, as it allows most of httpd's infrastructure, such as hooks, filters and other modules, to keep on processing requests, even though they originally arrived via a totally different network protocol.
But disadvantages are also there:

- Pseudo connection setup could not use ap_run_create_connection at that time (mod_spdy was created on 2.2 in 2012?). And even if it could have used it, it still would not have worked with mpm_event. There is simply something missing in the core API that allows for a spdy/h2 like processing model. Hacks can be made, but may soon be outdated by httpd development.
- The mod_spdy implementation was held a bit generic and does not use the APR to its fullest extent, probably because the spdy engine is also being used inside other Google code. Data passing involves more copying than all the carefully crafted bucket code in Apache deserves.

mod_h2 architecture
mod_h2 was written from scratch, but took the ideas from spdy. The main structures/names connected
to the concepts introduced by spdy are:
- h2_session: the instance handling the main connection; it keeps the nghttp2 instance and all other state information.
- h2_stream: a HTTP/2 stream, the equivalent of a request/response pair.
- h2_task: the processor for a h2_stream that gets executed in another thread.
- h2_worker: a specialized thread for executing h2_tasks.
- h2_mplx: a multiplexing instance, one per main connection, that does the talking/synchronization between h2_session and h2_tasks.
- h2_conn: sets up pseudo connections.
- h2_request + h2_to_h1: request headers and conversion into httpd HTTP/1 format.
- h2_response + h2_from_h1: response headers and conversion from httpd HTTP/1 format.
- h2_h2: hooks and filters for handling the "h2" protocol on TLS connections.
- h2_h2c: hooks and filters for handling "h2c" protocol upgrades on clear text requests.
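To visualize how these pieces hang together, here is a greatly simplified sketch in C. The real structs carry many more fields; these declarations are illustrations, not the actual ones:

    #include <httpd.h>
    #include <nghttp2/nghttp2.h>

    struct h2_mplx;                   /* shared, lock-protected in/out data */

    /* simplified, not the real declaration */
    typedef struct h2_session {
        conn_rec        *c;           /* the real client connection */
        nghttp2_session *ngh2;        /* the HTTP/2 protocol engine */
        struct h2_mplx  *mplx;        /* bridge to the worker side */
        /* ... the set of open h2_stream instances, settings, state ... */
    } h2_session;

    /* simplified, not the real declaration */
    typedef struct h2_task {
        struct h2_mplx  *mplx;        /* input comes from, output goes here */
        conn_rec        *pseudo_c;    /* pseudo connection run by httpd core */
        int              stream_id;   /* the h2_stream this task serves */
    } h2_task;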
So, connection setup is handled by h2_h2, upgrades in clear text by h2_h2c. On success, a h2_session is created. Any newly opened HTTP/2 stream results in a h2_stream. When all headers have been received in a h2_request, a h2_task is created and added to the task queue.
A h2_worker eventually takes the h2_task, sets up the environment (pools, bucket allocators, filters) and converts the h2_request into a httpd request_rec. This is processed by the httpd core. The output filters then transform the output into a h2_response, which is passed to the h2_mplx.
h2_session regularly polls h2_mplx for new responses and submits those to the client.
Request and response bodies are passed via h2_mplx from h2_session to h2_task and vice versa. When the response body has been fully passed, the h2_task ends, its resources are reclaimed and the h2_worker moves on to other tasks.
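A hedged sketch of that handoff through h2_mplx: the task side appends response bytes under a lock, the session side copies them out later. The real h2_mplx does much more (request bodies, stream suspension, polling), and these functions are invented for the example:

    #include <apr_thread_mutex.h>
    #include <apr_buckets.h>

    typedef struct {
        apr_thread_mutex_t *lock;     /* serializes all bucket operations */
        apr_bucket_brigade *out;      /* buffered response data of a stream */
    } mplx_out;

    /* task/worker side: hand response data over (as plain bytes) */
    static apr_status_t mplx_out_write(mplx_out *m, const char *data,
                                       apr_size_t len)
    {
        apr_status_t rv;
        apr_thread_mutex_lock(m->lock);
        rv = apr_brigade_write(m->out, NULL, NULL, data, len);
        apr_thread_mutex_unlock(m->lock);
        return rv;
    }

    /* session side: copy out up to *plen buffered bytes for sending;
     * real code would also remove what was consumed from the brigade */
    static apr_status_t mplx_out_read(mplx_out *m, char *buffer,
                                      apr_size_t *plen)
    {
        apr_status_t rv;
        apr_thread_mutex_lock(m->lock);
        rv = apr_brigade_flatten(m->out, buffer, plen);
        apr_thread_mutex_unlock(m->lock);
        return rv;
    }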
Once all data for a stream has been processed (or when the stream has been aborted), the h2_stream is destroyed and all its resources (memory pools, buckets) are reclaimed. This can happen before the h2_task for the stream is done. The cleanup then needs to be delayed, and such streams are kept in a special zombie set.
The closing of a connection (graceful or not) triggers the destruction of the h2_session, which again frees resources. There, it needs to remove all pending h2_tasks from the queue and join all pending zombie streams before shutting itself down.
mod_h2 hacks
The creation of pseudo connections uses ap_run_create_connection instead of doing it manually. This works for mpm_worker, but mpm_event is not happy with it and a special hack needed to be added, mainly for setting up a satisfactory connection state structure. Other mpm modules have not been tested yet. It would be preferable to extend ap_run_create_connection to set up a connection regardless of which mpm is configured.
mod_h2 has code to serialize requests in HTTP/1.1 format so that httpd core may ap_read_request() it from the pseudo connection and then ap_process_request() it. This is basically how ap_process_connection() works, plus some mpm state handling and connection close/keepalive handling.
In v0.6.0, an alternate implementation was added that creates a request_rec directly and invokes ap_process_request() on it. This saves the serialization and parsing exercise and gives better performance at the cost of compatibility. Setting up the request_rec currently copies two pages of code from the httpd core, something which could easily be mitigated by enhancing the core API.
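The gist of that copied setup code, heavily abbreviated; the real code initializes many more fields, tables and filter chains, and this sketch is an assumption about its shape, not the actual implementation:

    #include <httpd.h>
    #include <http_protocol.h>
    #include <http_request.h>
    #include <apr_strings.h>

    /* Abbreviated sketch of the non-serialized path: build a request_rec
     * by hand on the pseudo connection and hand it to httpd's processing. */
    static void process_h2_request(conn_rec *pseudo_c, apr_pool_t *pool,
                                   const char *method, const char *uri)
    {
        request_rec *r   = apr_pcalloc(pool, sizeof(*r));
        r->pool          = pool;
        r->connection    = pseudo_c;
        r->method        = method;
        r->method_number = ap_method_number_of(method);
        r->uri           = apr_pstrdup(pool, uri);
        r->protocol      = apr_pstrdup(pool, "HTTP/1.1"); /* presented as 1.1 */
        r->headers_in    = apr_table_make(pool, 10);
        /* ... roughly two pages of further setup copied from httpd core ... */
        ap_process_request(r);
    }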
The HTTP/1.1 serialization can be configured on/off. It is disabled by default.
Similar to the request handling, the response was initially parsed from the pseudo connection. That code is still there when serialization is configured on. By default, however, an optimization is in place that replaces the core HTTP_HEADER filter with its own variation.

HTTP_HEADER is a quite large filter that does the following tasks:

- apply various fixups based on the request_rec, add headers to and remove headers from the response, and initialize potentially missing fields such as status_line
- serialize the response into HTTP/1.1 wire format

The filter in mod_h2 is a copy of that filter, stripped of all the unwanted parts.
HTTP_HEADER should be split into two filters: HTTP_HEADER and HTTP_SERIALIZATION. Then
the first one could be kept and code duplication avoided.
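Such a replacement hangs in like any other output filter. A small illustrative sketch follows; the filter name and the callback body are assumptions for the example, not mod_h2's actual registration:

    #include <httpd.h>
    #include <http_protocol.h>
    #include <util_filter.h>

    /* Illustrative variant filter: gather status and headers from f->r into
     * the module's own response representation, but skip the HTTP/1.1
     * serialization that the core HTTP_HEADER filter would perform. */
    static apr_status_t h2_response_out_filter(ap_filter_t *f,
                                               apr_bucket_brigade *bb)
    {
        /* ... collect f->r->status and f->r->headers_out here ... */
        ap_remove_output_filter(f);        /* run only once per response */
        return ap_pass_brigade(f->next, bb);
    }

    static void h2_register_filters(void)
    {
        ap_register_output_filter("H2_RESPONSE", h2_response_out_filter,
                                  NULL, AP_FTYPE_PROTOCOL);
    }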
Processing a HTTP/2 connection means that data from the client may arrive at any time and that stream data from responses needs to be sent as soon as it becomes available. In the current mod_h2 this is done in a polling loop that works like this:

- check if h2_mplx has new responses to be submitted, and send what is there
- let nghttp2 write, if it wants; this may pull stream response data and suspend streams if no data is available at h2_mplx
- read from the main connection and feed incoming data into the nghttp2 instance for this connection
- resume any suspended streams for which new data has arrived at h2_mplx
In common web page processing, however, there will be bursts of streams interleaved with long periods of nothingness, and in such periods mod_h2 can do a blocking read on the main connection.
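In terms of the nghttp2 API, the loop has roughly the following shape, reusing the simplified h2_session sketch from earlier. The helper functions are hypothetical stand-ins for the steps above, and error handling is left out:

    #include <nghttp2/nghttp2.h>

    /* hypothetical helpers standing in for the steps described above */
    static void submit_new_responses(h2_session *session);
    static void read_and_feed(h2_session *session);
    static void resume_suspended_streams(h2_session *session);

    static void session_loop(h2_session *session, int *done)
    {
        while (!*done) {
            /* 1. submit freshly arrived responses from h2_mplx */
            submit_new_responses(session);
            /* 2. let nghttp2 serialize frames; its data callbacks pull
             *    stream data and suspend streams that have none buffered */
            if (nghttp2_session_want_write(session->ngh2)) {
                nghttp2_session_send(session->ngh2);
            }
            /* 3. read from the main connection (blocking when idle) and
             *    feed the bytes to nghttp2, which fires its callbacks */
            read_and_feed(session);
            /* 4. resume streams for which h2_mplx has data again */
            resume_suspended_streams(session);
        }
    }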
Due to the processing model, request and response data needs to traverse threads. In the httpd infrastructure, this means data in apr_buckets that are passed around in apr_bucket_brigades.
An apr_bucket has no link to an apr_bucket_brigade. It can move freely from one brigade to the next. However, it cannot move freely from one thread to another. This is due to the fact that almost all interesting operations on a bucket involve the apr_bucket_alloc_t it was created with. The job of the apr_bucket_alloc_t is to manage a free list of suitable memory chunks for fast bucket creation/split/transform operations. And it is not thread-safe.
This requires all apr_buckets managed by the same apr_bucket_alloc_t to stay in the same thread. (There are even more turtles down there, as the apr_bucket_alloc_t itself uses an apr_allocator_t, but that one can be configured to be thread-safe, if needed.)
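In practice this means every thread that touches buckets sets up its own allocator, roughly like this (a sketch of the APR calls involved, not mod_h2's code):

    #include <apr_buckets.h>

    /* Each thread creates its own allocator; buckets made with it must not
     * be manipulated by another thread, as the free list is not locked. */
    static apr_bucket_brigade *make_thread_local_brigade(apr_pool_t *pool,
                                                         const char *data,
                                                         apr_size_t len)
    {
        apr_bucket_alloc_t *ba = apr_bucket_alloc_create(pool);
        apr_bucket_brigade *bb = apr_brigade_create(pool, ba);
        /* NULL free function: the heap bucket takes a copy of the data */
        apr_bucket *b = apr_bucket_heap_create(data, len, NULL, ba);
        APR_BRIGADE_INSERT_TAIL(bb, b);
        return bb;
    }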
So, while a worker thread is writing to its output apr_bucket_brigade, buckets from this brigade cannot be transferred to and manipulated in another thread. Which means mod_h2 cannot simply transfer the data from the workers to the main thread and out on the connection's output apr_bucket_brigade.
A closer look at the apr_bucket reveals that bucket data is not supposed to leave its
apr_bucket_alloc_t instance. Which is no surprise as that was never necessary in the 1-1-1
processing model.
That means mod_h2 needs to read from one brigade and write to another when it wants data to cross thread boundaries. Which basically means memcpy'ing the request and response data. Which is not smart.
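The copy itself looks roughly like this: read every bucket of the source brigade and write the plain bytes into the target brigade. A sketch under the assumption that access to both brigades is serialized, e.g. by the h2_mplx lock:

    #include <apr_buckets.h>

    static apr_status_t copy_brigade_data(apr_bucket_brigade *from,
                                          apr_bucket_brigade *to)
    {
        apr_bucket *b;
        for (b = APR_BRIGADE_FIRST(from);
             b != APR_BRIGADE_SENTINEL(from);
             b = APR_BUCKET_NEXT(b)) {
            const char *data;
            apr_size_t len;
            apr_status_t rv = apr_bucket_read(b, &data, &len, APR_BLOCK_READ);
            if (rv != APR_SUCCESS) {
                return rv;
            }
            /* this is the memcpy the text talks about */
            rv = apr_brigade_write(to, NULL, NULL, data, len);
            if (rv != APR_SUCCESS) {
                return rv;
            }
        }
        return APR_SUCCESS;
    }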
For resources in static files, httpd has a very efficient implementation. A file descriptor is placed into a bucket and that bucket is passed on to the part handling the connection output streaming. Only there will it be read and written to the connection, preferably using the most efficient way that the host operating system offers.
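In code, that zero-copy path is just a file bucket travelling down the filter chain; only the core output filter reads it (or hands it to sendfile). A minimal sketch of the APR calls:

    #include <apr_buckets.h>

    /* place an open file in a bucket: no bytes are read here, only at the
     * very end when the connection output filter consumes the bucket */
    static void pass_file(apr_bucket_brigade *bb, apr_file_t *fd,
                          apr_off_t offset, apr_size_t len, apr_pool_t *pool)
    {
        apr_bucket *b = apr_bucket_file_create(fd, offset, len, pool,
                                               bb->bucket_alloc);
        APR_BRIGADE_INSERT_TAIL(bb, b);
    }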
Any improvement to mod_h2's output handling would transfer the file handle exactly that way. But that
poses another challenge, as described in "resource allocation".
If mod_h2 handled output of static file resources similarly to httpd, it could easily run out of open file descriptors under load, especially when processing many parallel requests with the same priority. 10 requests for files could then hold 10 file descriptors open, since the output of streams with the same priority is interleaved. In HTTP/1 processing, the 1-1-1 model, 10 file requests on the same connection would only allocate 1 file descriptor at a time.
And the HTTP/2 spec says "It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism." So, in the worst case, a straightforward implementation would open 100 file descriptors at the same time for each connection. This certainly worsens the problem, especially since HTTP/2 connections are expected to stay open for a much longer duration.
The current implementation in mod_h2 never passes file buckets from h2_task to
h2_mplx, it just passes data. That means that file handles are only open while a h2_task
is being processed. And a h2_task being processed means that it is allocated to a h2_worker
thread.
As a result of all this, the number of simultaneously open file descriptors is on the order of the number of h2_workers. And those are limited (and configurable). This gives a behaviour that is stable under load, but not as efficient as it could be. Maybe there is a good way to introduce resource management that allows passing of file handles as long as a configurable limit has not been exceeded.
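One conceivable shape for such a mechanism is a shared handle budget: pass the file bucket through only when a slot is free, otherwise fall back to copying the data. This is purely a hypothetical sketch, not anything mod_h2 implements:

    #include <apr_thread_mutex.h>

    typedef struct {
        apr_thread_mutex_t *lock;
        int in_use;                  /* file handles currently in flight */
        int limit;                   /* the configurable maximum */
    } fd_budget;

    /* returns 1 if a slot was acquired; 0 means: copy the data instead */
    static int fd_budget_try_acquire(fd_budget *fb)
    {
        int ok;
        apr_thread_mutex_lock(fb->lock);
        ok = (fb->in_use < fb->limit);
        if (ok) {
            fb->in_use++;
        }
        apr_thread_mutex_unlock(fb->lock);
        return ok;
    }

    /* called when the file bucket has been consumed on the connection */
    static void fd_budget_release(fd_budget *fb)
    {
        apr_thread_mutex_lock(fb->lock);
        fb->in_use--;
        apr_thread_mutex_unlock(fb->lock);
    }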
Since httpd is such a rich environment, deployed on many platforms and host to many, many modules, there certainly will be more incompatibilities discovered in connection with mod_h2. A few issues have been reported on github, but I am certain many more await. The main cause, I suspect, will be the pseudo connection handling, especially the setup. Modules that add their data to a connection and then later expect to find it there again during request processing are the most vulnerable (one of the reported issues was that SSL variables no longer worked).
It would be nice to get rid of these pseudo connections (or replace them with some other concept).