HTTP/2 module for Apache httpd
Copyright (C) 2015 greenbytes GmbH
Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. See LICENSE for details.
I did some measurements between two machines linked with 1 Gbps Ethernet to see what effect the requested resource size had in combination with the number of parallel streams. There were some surprises.
A 1 Gbps Ethernet link can theoretically carry 81000 full 1518-byte frames per second (source), which gives a maximum throughput of 122958000 bytes, or about 117 MB/s.
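The link-layer arithmetic can be checked in a few lines (a sketch; the 81000 frames/s figure is the one quoted from the source above):

```python
# Sanity check of the quoted 1 Gbps Ethernet numbers.
FRAME_SIZE = 1518        # bytes, maximum standard Ethernet frame
FRAMES_PER_SEC = 81000   # full-size frames per second on 1 Gbps

throughput_bytes = FRAME_SIZE * FRAMES_PER_SEC
throughput_mb = throughput_bytes / (1024 * 1024)

print(throughput_bytes)          # 122958000
print(round(throughput_mb, 1))   # 117.3 (MB/s)
```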
When transferring a resource of 10844 bytes 500000 times, this amounts to 5422000000 bytes, or 5170 MB of payload. h2load reports that an additional 25063966 bytes were transferred, so about 24 MB, or just under 0.5% overhead. This gives a theoretical minimum of 44.3 seconds to transfer this amount of data. Or, broken down by the 500000 requests, a maximum of 11286 requests/second.
Similarly, for the resource with 7526 bytes, h2load transferred 3787560677 bytes in total, which gives a theoretical maximum of 16231 requests/second.
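These two ceilings follow directly from the link throughput computed above:

```python
# Theoretical request/s ceilings for the two measured transfers,
# using the 1 Gbps byte throughput derived earlier.
LINK_BYTES_PER_SEC = 122958000
REQUESTS = 500000

def max_reqs_per_sec(total_bytes):
    seconds = total_bytes / LINK_BYTES_PER_SEC  # minimum transfer time
    return int(REQUESTS / seconds)

# 10844-byte resource: 5422000000 payload bytes + 25063966 overhead bytes
print(max_reqs_per_sec(5422000000 + 25063966))  # 11286
# 7526-byte resource: 3787560677 bytes in total
print(max_reqs_per_sec(3787560677))             # 16231
```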
For the 10k resource, mod_h2 comes close to the maximum throughput. For 7k resources, not so much. Compared to the HTTP/1.1 measurements using wrk, the module looks fine (though there is suspicion that wrk and h2load are not directly comparable).
As resource sizes become smaller and smaller, there is a peculiar wave effect: when using more than 4-5 requests in parallel, performance gets worse and only recovers slowly as parallelism increases further. What is going on here?
This became clearer when I swapped the server test machine, an iMac i5 (2010), for my PowerBook i7 (2012):
This leads me to the following observations:

- h2 performance for small resources is superior to h1, even if no parallel requests are made. The binary format and header compression pay off.
- h2 parallelism burns server CPU cycles and is not for free. This - currently - is visible when requesting very small resources, where performance decreases with more parallelism. (This is the current theory on the measured 'wave'; it may be an implementation issue.)
- For a h2 client, the tradeoff/benefit in number of connections used vs. number of parallel requests is not easily answered. Putting all actions into a single connection will run into multi-threading overhead for any implementation out there. The "at least 100" parallel streams a server should offer needs more thought.
- Compared to h1, there are several areas in mod_h2 where optimizations are still possible. More effort is needed (time & brains).
With release v0.5.5, mod_h2 addresses write performance. Early tests in April showed that transfers of large resources achieved only 50-60% of the throughput that was possible with HTTP/1.1, i.e. httpd without mod_h2 involved (on my machine, all over https). That seemed excessively slow.
At first I suspected that the code just did not shovel the data fast enough from worker threads to the main connection. But changes there resulted in only marginal improvements.
Then I looked at how mod_h2 actually passed the raw HTTP/2 frames down the httpd filter chain for writing to the connection. I
experimented with some simple write buffering and immediately got much better results. So I looked at how connection output
filters work with the data they are given, especially the output filter from mod_ssl
.
For those unfamiliar with the httpd internals: the server uses a very nice mechanism called bucket brigades for input and output handling. Basically, a very smart list of data chunks that also allows for meta data buckets such as flush, end-of-stream or resource maintenance indicators. The main purpose of brigades is to make copying of data chunks unnecessary for most operations on the overall data stream. This way, code can manage a brigade for a 10 MB file without having the full 10 MB in memory, read the first 16 KB of it, insert a flush bucket at any time, etc., all without copying data from one buffer to another.
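The concept can be sketched in a few lines of Python. This is a toy model for illustration only, with made-up names; it is not Apache's apr_bucket API:

```python
# Toy model of a bucket brigade: a list of buckets where data buckets
# only *reference* their backing store, and meta buckets mark events.

class Bucket:
    def __init__(self, kind, data=None, offset=0, length=0):
        self.kind = kind      # "file", "flush", "eos", ...
        self.data = data      # reference to the backing store, never a copy
        self.offset = offset  # where this bucket starts in the store
        self.length = length  # how many bytes it covers

brigade = [
    Bucket("file", data="/tmp/big.bin", length=10 * 1024 * 1024),
    Bucket("flush"),          # meta: "write everything out now"
    Bucket("eos"),            # meta: end of stream
]

# "Reading" the first 16 KB splits the file bucket in two: only offsets
# and lengths change, none of the 10 MB is loaded or copied.
head = brigade[0]
first = Bucket("file", head.data, head.offset, 16 * 1024)
rest = Bucket("file", head.data, head.offset + 16 * 1024,
              head.length - 16 * 1024)
brigade[0:1] = [first, rest]

print([b.kind for b in brigade])  # ['file', 'file', 'flush', 'eos']
```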
mod_h2 gets complete frames from its nghttp2 engine to be transferred to the client. Before v0.5.5, it placed each of them into a bucket and passed that down the connection output filter chain, where it eventually reached mod_ssl's filter, got encrypted and was then passed to the socket. While DATA frames are mostly 8-16 KB in size, depending on the amount delivered by the worker threads, there are also many other session management frames that are quite small.
By just passing these small frames as buckets in the output brigade, mod_ssl was doing an SSL_write() on each of them, including all the yadayada that is required by TLS. And this was causing the slow performance.
v0.5.5
uses apr_brigade_write()
instead, which contains some very smart code. If possible, it collects
small data chunks into 8 KB buckets and, given a proper flush callback, directly writes large data chunks without
copying. That gives the following measurements on my Ubuntu image, transferring a 10 MB file 1000 times via 8 connections:
Scenario | Metric | /005.txt |
---|---|---|
wrk (http/1.1) | MB/s | 1029 |
mod_h2(0.5.4) | MB/s | 601 |
mod_h2(0.5.5) | MB/s | 950 |
With this change, mod_h2 transfers data about as fast as HTTP/1.1, and there is no downside to enabling HTTP/2 in an httpd that needs to transfer large resources*).
*) In my test scenarios. If you have proof to the contrary, please submit a test case!
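The coalescing behaviour described above can be sketched as a toy model (illustrative Python only; the real logic lives in apr_brigade_write() and mod_ssl):

```python
# Toy model of write coalescing: small chunks are gathered into an 8 KB
# buffer; large chunks bypass the buffer and are flushed directly.
BUFFER_SIZE = 8 * 1024

class CoalescingWriter:
    def __init__(self, sink):
        self.sink = sink          # one sink(...) call ~ one TLS write
        self.buf = bytearray()

    def write(self, chunk):
        if len(chunk) >= BUFFER_SIZE:
            self.flush()              # preserve ordering
            self.sink(bytes(chunk))   # large chunk: write through
            return
        self.buf += chunk             # small chunk: collect
        if len(self.buf) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(bytes(self.buf))
            self.buf.clear()

writes = []
w = CoalescingWriter(writes.append)
for _ in range(100):              # 100 tiny session-management frames
    w.write(b"\x00" * 9)
w.write(b"\x01" * 16384)          # one 16 KB DATA frame
w.flush()

# 101 frames, but only 2 expensive writes instead of 101.
print(len(writes))                # 2
```

Without coalescing, each of the 101 frames would have triggered its own expensive TLS record write, which is the behaviour that made v0.5.4 slow.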
With release v0.5.0, I made some internal improvements that result in less memory consumption and better performance. I also took a detailed look at the effect that parallel stream numbers have on the overall results.
The tests were again done on my trustworthy MacBook with a Parallels (hah!) Ubuntu 14.04 image. All tests ran in the sandbox. The numbers are samples from several runs, not really averaged, with variation and all the other stuff that I, as a mathematician, should know and handle properly... but I just want to give a feel for it.
Scenario | Parallelism | Metric | /index.html (653 bytes)
---|---|---|---
wrk (httpd 2.4.12) | - | req/s | 26026
nghttpd (0.7.11) | -m 1 | req/s | 23691
 | -m 10 | req/s | 71078
 | -m 20 | req/s | 84392
 | -m 40 | req/s | 89725
 | -m 100 | req/s | ~30% failures
mod_h2 (0.5.0) | -m 1 | req/s | 19587
 | -m 2 | req/s | 23702
 | -m 5 | req/s | 28582
 | -m 10 | req/s | 28723
 | -m 20 | req/s | 29535
 | -m 40 | req/s | 29189
 | -m 100 | req/s | 29498
Interpretation? I think it is safe to say the following:

- The -m 1 numbers show that the cost of a single request is still 20% higher in mod_h2 compared to nghttpd.
- nghttpd is fabulous. However, the -m 100 case shows that it keeps files open until streams are done and runs out of file handles rather soon. (I am sure, now that I mentioned it, that the next nghttp2 release will fix this and keep the performance, too!)
- mod_h2 hits its ceiling very soon, at 10-20 parallel streams. But the good news is that it stays stable with increased stream numbers.

mod_h2 stays stable with increasing parallel stream numbers because files for static content get converted into byte buffers before worker threads turn to other streams. That limits the number of open files to the number of worker threads.
This is a compromise in the current processing model. The HTTP/2 connection terminates in a specific worker process, and that process has a limit on the number of open files. Spawning new processes will not really help the connection. If HTTP/2 connections are to have long lifetimes and bursty, parallel streams, resources need to be allocated carefully.
The v0.5.0 release allocates everything for stream handling per worker. The worker has a memory pool, bucket allocators and a pseudo connection socket, and rents those out to the stream that it processes. A stream is done when all its output data has been sent or sits in the size-limited output buffers of the HTTP/2 session. So, opening a new stream allocates only a few bytes. Only when processing of the stream actually starts will more resources be allocated, most of them on hot standby in the worker itself. Hence the stable performance with increasing parallel stream numbers.
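As a rough sketch of that rental model (hypothetical names, Python instead of the module's C; not mod_h2's actual structures):

```python
# Toy model of per-worker resource rental: opening a stream costs almost
# nothing; the expensive pieces (pool, allocator) live in the worker and
# are lent to whatever stream the worker currently processes.

class Worker:
    def __init__(self, wid):
        # allocated once per worker, kept on hot standby
        self.resources = {"pool": f"pool-{wid}", "allocator": f"alloc-{wid}"}

    def process(self, stream):
        stream.resources = self.resources   # rent out, no new allocation
        try:
            return f"stream {stream.sid} served via {stream.resources['pool']}"
        finally:
            stream.resources = None         # hand back when the stream is done

class Stream:
    def __init__(self, sid):
        self.sid = sid        # opening a stream allocates only this
        self.resources = None

worker = Worker(1)
results = [worker.process(Stream(sid)) for sid in range(3)]
print(results[0])   # stream 0 served via pool-1
```

Many cheap streams can thus share one worker's pre-allocated resources, which is why increasing stream counts do not blow up memory or file-handle usage.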
Important update below!
I did two tests in three combinations, using the mod_h2
sandbox setup on an Ubuntu Parallels Image.
I used h2load
for the HTTP/2 test cases and the nice wrk
(see https://github.com/wg/wrk) for the HTTP/1.1 numbers.
The performance tests invoked were:
    wrk -t10 -c100 -d30s https://test.example.org:12346/<file>
    h2load -c 100 -t 10 -n 740000 https://test.example.org:12346/<file>
where wrk was run against Apache httpd 2.4.12 with TLS+http/1.1, and h2load was run against both nghttpd, the server that comes with nghttp2, and mod_h2 in the same Apache httpd 2.4.12 setup.
test.example.org
was mapped to 127.0.0.1
.
The numbers:
Scenario | Metric | /index.html (653 bytes) | /002.jpg (90364 bytes)
---|---|---|---
wrk + apache (2.4.12) | req/s | 25139 | 7941
 | MB/s | 23.1 | 686.8
h2load + nghttpd (0.7.9) | req/s | 25084 | 4022
 | MB/s | 16.3 | 347.0
h2load + mod_h2 (0.4.3) | req/s | 16093 | 4272
 | MB/s | 10.7 | 368.7
How to interpret this? Like every benchmark: with care.
First of all, wrk and h2load are different programs, and it is dangerous to compare absolute numbers between them. But assuming that they are equally efficient at generating load, one can see that the number of requests generated per second is very similar between wrk and h2load+nghttpd. The MB/s difference shows either the effect of header compression, or that both tools measure throughput differently. My bet is on header compression: index.html is very small, so compression plays a larger role.
The mod_h2 performance is at about two thirds now (coming from a good 50% start in February) of that of nghttpd or HTTP/1.1 Apache. This is the penalty that mod_h2 currently has to pay for processing individual requests via the Apache httpd 2.4.x runtime: internally, it has to fake HTTP/1.1 requests and parse HTTP/1.1 responses, and that costs time and resources. It is the price of the advantage of adding HTTP/2 support without changing the Apache core itself.
That mod_h2's implementation is not inherently stupid becomes visible when looking at requests for the larger resource. In this scenario, the I/O-optimized Apache httpd can really shine. Since most of the power goes into shoveling a large file onto a socket and ramming it down TCP's flow control throat, we see almost twice the performance of nghttpd/mod_h2.
Some people might suspect that the lower performance is an inherent disadvantage of HTTP/2 flow control. However, if one looks at performance measurements from the h2o server, one sees that HTTP/2 throughput and requests/s can match and even outpace HTTP/1 implementations.
So, the most likely suspect (and this needs to be investigated more) is the way that nghttp2 and mod_h2 handle response DATA. There are two points to make:

- libnghttp2 uses a data provider callback API that needs to return a memory buffer with the data. It then shuffles this data around internally until it is ready to send it out (properly framed). This means that static files (mmap'ed) will be copied at least twice. In comparison, the Apache bucket brigade architecture does this only once and in a very efficient manner.
- The copying mod_h2 does when collecting response DATA from different threads and feeding it to nghttp2 does not limit performance (in this case). This seems to be the benefit of having its own data bucket structures that are passed between threads without copying.

Anyway, those are the two areas to work on: data copying and Apache core integration. There's probably a lot of fun ahead. Feedback always welcome.
And feedback I got: Tatsuhiro Tsujikawa (the author of nghttp2) pointed out that h2load
without the
-m number
option will only send one request at a time per connection. Doh! How could I miss that?
In the light of this, my numbers above are a comparison of what you get if you use HTTP/2 exactly like HTTP/1.1. But if you really start to send requests in parallel, I get much nicer numbers for the small requests:
Scenario | Parallelism | Metric | /index.html (653 bytes)
---|---|---|---
h2load + nghttpd (0.7.9) | -m 10 | req/s (MB/s) | 67157 (43.7)
h2load + nghttpd (0.7.9) | -m 20 | req/s (MB/s) | 81847 (53.2)
h2load + nghttpd (0.7.9) | -m 30 | req/s (MB/s) | 85494 (55.6)
h2load + nghttpd (0.7.9) | -m 40 | req/s (MB/s) | 87312 (56.8)
h2load + nghttpd (0.7.9) | -m 50 | req/s (MB/s) | 88216 (57.4)
h2load + mod_h2 (0.4.3) | -m 10 | req/s (MB/s) | 25575 (17.1)
h2load + mod_h2 (0.4.3) | -m 20 | req/s (MB/s) | 26553 (17.8)
h2load + mod_h2 (0.4.3) | -m 30 | req/s (MB/s) | 24952 (16.7)
Parallel requests bring only small gains for mod_h2, while nghttp2 blazes ahead!
My apologies to Tatsuhiro for the earlier numbers and the wrong impression reported!
Well, next week I need to look into why mod_h2 does not scale better, it seems.
Münster, 02.04.2015,
Stefan Eissing, greenbytes GmbH