Copyright (C) 2015 greenbytes GmbH
Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. See LICENSE for details.
I did some measurements between two machines linked with 1 Gbps ethernet to see what effect the requested resource size has in combination with the number of parallel streams. There were some surprises.
A 1 Gbps ethernet can theoretically carry 81000 full 1518-byte frames per second (source), which gives a maximum throughput of 122958000 bytes, or 117 MB/s.
Transferring a resource of 10844 bytes 500000 times amounts to 5422000000 bytes, or 5170 MB, of payload. h2load reports that 25063966 bytes were transferred additionally, so about 24 MB, just under 0.5% overhead. This gives a theoretical best case of 44.3 seconds to transfer this amount of data, or, broken down by the 500000 requests, a maximum of 11286 requests/second.
Similarly, for the resource of 7526 bytes, h2load transferred 3787560677 bytes in total, which gives a theoretical maximum of 16231 requests/second.
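As a quick sanity check on the arithmetic, here is the same calculation in a few lines of Python (the frame rate and byte counts come from the measurements above; the function name is my own):

```python
# Sanity-checking the theoretical maxima quoted above.
# Wire rate: 81000 full 1518-byte frames/s on 1 Gbps ethernet.
WIRE_RATE = 81_000 * 1_518            # 122,958,000 bytes/s, ~117 MB/s

def max_req_per_sec(total_bytes, requests):
    """Best-case requests/second if the wire is the only bottleneck."""
    seconds = total_bytes / WIRE_RATE
    return requests / seconds

# 10844-byte resource, 500000 times, plus the ~24 MB h2load overhead:
total_10k = 10_844 * 500_000 + 25_063_966
print(int(max_req_per_sec(total_10k, 500_000)))        # 11286

# 7526-byte resource: h2load reported 3787560677 bytes in total:
print(int(max_req_per_sec(3_787_560_677, 500_000)))    # 16231
```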
For the 10k resource, mod_h2 comes close to the maximum throughput. For the 7k resource, not so much. Compared to HTTP/1.1 measurements using wrk, the module looks fine (there is suspicion that wrk and h2load are not directly comparable, though).
As resource sizes become smaller and smaller, a peculiar wave effect appears: with more than 4-5 requests in parallel, performance gets worse and only slowly recovers as parallelism increases further. What is going on here?
This became clearer when I swapped the server test machine, an iMac i5 (2010), for my PowerBook i7 (2012):
This leads me to the following observations:
- h2 performance for small resources is superior to h1, even if no parallel requests are made. The binary format and header compression pay off.
- h2 parallelism burns server CPU cycles and is not free. This is - currently - visible when requesting very small resources, where performance decreases with more parallelism. (This is the current theory on the measured 'wave'; it may be an implementation issue.)
- For the h2 client, the tradeoff between the number of connections used and the number of parallel requests is not easily answered. Putting all actions into a single connection will run into multi-threading overhead for any implementation out there. The "at least 100" parallel streams a server should offer need more thought.
- In mod_h2 there are several areas where optimizations are still possible. More effort is needed (time & brains).
mod_h2 now addresses the performance of writes. Early tests in April showed that transfers of large resources achieved only 50-60% of the throughput possible with HTTP/1.1, i.e. httpd without mod_h2 involved (on my machine, all over https). That seemed excessively slow.
At first I suspected that the code just did not shovel the data fast enough from worker threads to the main connection. But changes there resulted in only marginal improvements.
Then I looked at how mod_h2 actually passes the raw HTTP/2 frames down the httpd filter chain for writing to the connection. I experimented with some simple write buffering and immediately got much better results. So I looked at how connection output filters work with the data they are given, especially the output filter of mod_ssl.
For those unfamiliar with the httpd internals: the server uses a very nice mechanism called bucket brigades for input and output handling. A brigade is, basically, a very smart list of data chunks that also allows for meta data buckets such as flush, end-of-stream or resource maintenance indicators. The main purpose of brigades is to make copying of data chunks unnecessary for most operations on the overall data stream. This way, code can manage a brigade holding a 10 MB file without having the full 10 MB in memory, read the first 16 KB of it, insert a flush bucket at any time, etc., without copying data from one buffer to another.
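To make the idea concrete, here is a toy model in Python. These are NOT the real APR types or names, just the concept: buckets either reference data or carry metadata, and a brigade is a smart list of them, so stream operations never copy the payload.

```python
# Toy model of a bucket brigade -- invented names, not the APR API.
class Bucket:
    def __init__(self, kind, data=None, length=0):
        self.kind = kind        # "file", "heap", "flush", "eos", ...
        self.data = data        # a reference to the data, never a copy
        self.length = length

class Brigade:
    def __init__(self):
        self.buckets = []

    def insert_file(self, path, length):
        # References the file: a 10 MB file never has to sit in memory.
        self.buckets.append(Bucket("file", path, length))

    def insert_flush(self):
        # Metadata bucket: tells downstream filters to write out now.
        self.buckets.append(Bucket("flush"))

    def insert_eos(self):
        # Metadata bucket: end of stream.
        self.buckets.append(Bucket("eos"))

bb = Brigade()
bb.insert_file("/var/www/big.bin", 10 * 1024 * 1024)
bb.insert_flush()
bb.insert_eos()
print([b.kind for b in bb.buckets])    # ['file', 'flush', 'eos']
```

The real APR buckets additionally support splitting and morphing (e.g. reading the first 16 KB of a file bucket splits it in two), which this sketch omits.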
mod_h2 gets complete frames from its nghttp2 engine to be transferred to the client. Before v0.5.5 it placed each of them into a bucket and passed that down the connection output filter chain, where it eventually reached the mod_ssl filter, got encrypted and was then passed to the socket. While DATA frames are mostly 8-16 KB in size, depending on the amount delivered by the worker threads, there are also many other session management frames that are quite small.
By passing each of these small frames as its own bucket in the output brigade, mod_ssl ended up doing a separate TLS write for each of them, including all the overhead that TLS requires. And this was causing the slow performance.
Since v0.5.5, mod_h2 uses apr_brigade_write() instead, which contains some very smart code: if possible, it collects small data chunks into 8 KB buckets and, given a proper flush callback, directly writes large data chunks without copying. That gives the following measurements on my Ubuntu image, transferring a 10 MB file 1000 times via 8 connections:
With this change, mod_h2 transfers data about as fast as HTTP/1.1, and there is no downside to enabling HTTP/2 in an httpd that needs to transfer large resources*).
*)In my test scenarios. If you have proof to the contrary, please submit a test case!
With v0.5.0 I made some internal improvements that show in less memory consumption and better performance. I also took a detailed look at the effect that the number of parallel streams has on the overall results.
The tests were again done on my trustworthy MacBook with a Parallels (hah!) Ubuntu 14.04 image. All tests ran in the sandbox. The numbers are samples from several runs, not properly averaged, with variation and all that stuff that I, as a mathematician, should know and handle... but I just want to give a feel for it.
| wrk (httpd 2.4.12) | req/s | 26026 |
Interpretation? I think it is safe to say the following:
- The -m 1 numbers show that the cost of a single request is still 20% higher in mod_h2.
- nghttpd is fabulous; however, the -m 100 case shows that it keeps files open until streams are done and runs out of file handles rather soon. (I am sure, now that I mentioned it, that the next nghttp2 release will fix this and keep the performance, too!)
- mod_h2 hits its ceiling with 10-20 parallel streams. But the good news is that it stays stable with increased stream numbers.
- mod_h2 stays stable with increasing parallel stream numbers because files for static content get converted into byte buffers before worker threads turn to other streams. That limits the number of open files to the number of worker threads.
This is a compromise in the current processing model. The HTTP/2 connection terminates in a specific worker process, and that process has a limit on the number of open files. Spawning new processes will not really help the connection. If HTTP/2 connections are to have long lifetimes and bursty, parallel streams, resources need to be allocated carefully.
The v0.5.0 release allocates everything for stream handling per worker. A worker has a memory pool, bucket allocators and a pseudo connection socket, and rents those out to the stream it processes. A stream is done when all its output data has been sent or sits in the size-limited output buffers of the HTTP/2 session. So opening a new stream allocates only a few bytes; only when processing of the stream actually starts are more resources allocated, most of them on hot standby in the worker itself. Hence the stable performance with increasing parallel stream numbers.
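The renting scheme might be sketched like this (all names invented for illustration; the real code deals in APR pools and pseudo connections): the worker owns the heavy resources once, and an open-but-unprocessed stream costs almost nothing.

```python
class Worker:
    """Owns its heavy resources; rents them to the stream it processes."""
    def __init__(self):
        self.pool = bytearray(16 * 1024)   # allocated once, on hot standby
        self.streams_served = 0

    def process(self, stream):
        stream.resources = self.pool       # rented, not freshly allocated
        try:
            stream.run()
            self.streams_served += 1
        finally:
            stream.resources = None        # returned to the worker

class Stream:
    def __init__(self, stream_id):
        self.id = stream_id                # opening a stream: a few bytes
        self.resources = None
        self.done = False

    def run(self):
        self.done = True

worker = Worker()
streams = [Stream(i) for i in range(1, 101, 2)]   # 50 cheap, open streams
for s in streams:
    worker.process(s)      # resources are committed only at this point
print(worker.streams_served)   # 50
```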
Important update below!
I did two tests in three combinations, using the mod_h2 sandbox setup on an Ubuntu Parallels image: h2load for the HTTP/2 test cases and the nice wrk (see https://github.com/wg/wrk) for the HTTP/1.1 numbers.
The performance tests invoked were:

wrk -t10 -c100 -d30s https://test.example.org:12346/<file>
h2load -c 100 -t 10 -n 740000 https://test.example.org:12346/<file>

wrk was tested against Apache httpd 2.4.12 with TLS+http/1.1; h2load was tested against nghttpd, the server that comes with nghttp2, and against mod_h2 in the Apache httpd 2.4.12 setup.
test.example.org was mapped to the local machine.
|                          |       | 653 bytes | 90364 bytes |
| wrk + apache (2.4.12)    | req/s | 25139     | 7941        |
| h2load + nghttpd (0.7.9) | req/s | 25084     | 4022        |
| h2load + mod_h2 (0.4.3)  | req/s | 16093     | 4272        |
How to interpret this? Like every benchmark: with care.
First of all, wrk and h2load are different programs, and it is dangerous to compare absolute numbers between them. But assuming that both are equally efficient in generating load, one can see that the number of requests generated per second is very similar between wrk+apache and h2load+nghttpd. The MB/s difference shows either the effect of header compression, or that both tools measure throughput differently. My bet is on header compression: index.html is very small, so compression plays a larger role.
mod_h2 performance is now at about two thirds (coming from a good 50% start in February) of that of nghttpd or http/1.1 Apache. This is the penalty that mod_h2 currently pays for processing individual requests via the Apache httpd 2.4.x runtime: internally it has to fake HTTP/1.1 requests and parse HTTP/1.1 responses, and that costs time and resources. The advantage is that HTTP/2 support is added without changing the Apache core itself.
That mod_h2's implementation is not inherently stupid becomes visible when looking at requests for the larger resource. In this scenario, the i/o optimized Apache httpd can really shine: since most of the power goes into shoveling a large file onto a socket and ramming it down TCP's flow control throat, we see almost twice the performance of nghttpd.
Some people might suspect that the lower performance is an inherent disadvantage of HTTP/2 flow control. However, looking at performance measurements from the h2o server, one sees that HTTP/2 throughput and requests/s can match and even outpace HTTP/1 implementations.
So, the most likely suspect (and that needs to be investigated more) is the way that mod_h2 handles response DATA. There are two points to make:

- nghttp2 offers a data provider callback API that needs to return a memory buffer with the data. It then shuffles this data around internally until it is ready to be sent out (properly framed). This means that static files (mmap'ed) will be copied at least twice. In comparison, the Apache bucket brigade architecture does this only once and in a very efficient manner.
- What mod_h2 does when collecting response DATA from different threads and feeding it to nghttp2 does not limit performance (in this case). This seems to be the benefit of having its own data bucket structures that are passed between threads without copying.
Anyway, those are the two areas to work on: data copying and Apache core integration. There's probably a lot of fun ahead. Feedback always welcome.
And feedback I got: Tatsuhiro Tsujikawa (the author of nghttp2) pointed out that h2load without the -m <number> option will only send one request at a time per connection. Doh! How could I miss that? In light of this, my numbers above are a comparison of what you get if you use HTTP/2 exactly like HTTP/1.1. When requests are really sent in parallel, I get much nicer numbers for the small requests:
| h2load + nghttpd (0.7.9) | req/s (MB/s) | 67157 (43.7) |
| h2load + nghttpd (0.7.9) | req/s (MB/s) | 81847 (53.2) |
| h2load + nghttpd (0.7.9) | req/s (MB/s) | 85494 (55.6) |
| h2load + nghttpd (0.7.9) | req/s (MB/s) | 87312 (56.8) |
| h2load + nghttpd (0.7.9) | req/s (MB/s) | 88216 (57.4) |
| h2load + mod_h2 (0.4.3)  | req/s (MB/s) | 25575 (17.1) |
| h2load + mod_h2 (0.4.3)  | req/s (MB/s) | 26553 (17.8) |
| h2load + mod_h2 (0.4.3)  | req/s (MB/s) | 24952 (16.7) |
nghttp2 blazes ahead! My apologies to Tatsuhiro for the earlier numbers and the wrong impression they gave! Well, next week I need to look into why mod_h2 does not scale better, it seems.
Stefan Eissing, greenbytes GmbH