mod_h2, a look at performance

Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. See LICENSE for details.

Parallelism and Power
2015-06-19

I did some measurements between two machines linked by 1 Gbps Ethernet to see how the size of the requested resource interacts with the number of parallel streams. Some surprises.

A 1 Gbps ethernet can theoretically carry 81000 full 1518 byte frames per second (source) which gives a max throughput of 122958000 bytes or 117 MB/sec.
Transferring a resource of 10844 bytes 500000 times amounts to 5422000000 bytes, or 5170 MB, of payload. h2load reports that an additional 25063966 bytes were transferred, so about 24 MB, just short of 0.5%. This gives a theoretical minimum of 44.3 seconds to transfer this amount of data or, broken down by the 500000 requests, a maximum of 11286 requests/second.

Similarly, for the resource with 7526 bytes, h2load transferred 3787560677 bytes in total, which gives a theoretical maximum of 16231 requests/second.
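The arithmetic behind these limits can be reproduced in a few lines of Python. This is only a sketch: the function and variable names are made up for illustration, and it assumes, as the text does, a line rate of 81000 frames of 1518 bytes per second with nothing but the link limiting throughput.

```python
# Line rate assumed in the text: 81000 full frames of 1518 bytes per second.
FRAMES_PER_SECOND = 81000
FRAME_BYTES = 1518
LINE_RATE = FRAMES_PER_SECOND * FRAME_BYTES  # bytes per second

REQUESTS = 500000


def min_transfer_seconds(total_bytes):
    """Shortest possible time to push total_bytes through the link."""
    return total_bytes / LINE_RATE


def max_requests_per_second(total_bytes, requests=REQUESTS):
    """Upper bound on requests/second when the link is the only limit."""
    return requests / min_transfer_seconds(total_bytes)


# 10844 byte resource: payload plus the extra bytes reported by h2load.
payload_10k = 10844 * REQUESTS
overhead_10k = 25063966
total_10k = payload_10k + overhead_10k

# 7526 byte resource: h2load reported the total directly.
total_7k = 3787560677

if __name__ == "__main__":
    print(f"line rate:        {LINE_RATE} bytes/s")
    print(f"10k min time:     {min_transfer_seconds(total_10k):.1f} s")
    print(f"10k max requests: {int(max_requests_per_second(total_10k))} req/s")
    print(f"7k max requests:  {int(max_requests_per_second(total_7k))} req/s")
    print(f"10k overhead:     {100 * overhead_10k / payload_10k:.2f} %")
```

Changing `FRAMES_PER_SECOND` or `FRAME_BYTES` shows how sensitive the upper bound is to the assumed framing.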

For the 10k resource, mod_h2 comes close to the maximum throughput. For 7k resources, not so much. Compared to the HTTP/1.1 measurements using wrk, the module looks fine (there is suspicion that wrk and h2load are not directly comparable, though).

As resource sizes become smaller and smaller, there is a peculiar wave effect. When using more than 4-5 requests in parallel, performance gets worse and recovers only slowly as parallelism increases further. What is going on here?

This became clearer when I swapped the server test machine, an iMac i5 2010, with my Powerbook i7 2012:

This leads me to the following observations:

h2 performance for small resources is superior to h1, even if no parallel requests are made. The binary format and header compression pays off.
h2 parallelism burns server CPU cycles and does not come for free. This - currently - is visible when requesting very small resources, where performance decreases with more parallelism. (This is the current theory on the measured 'wave'; it may be an implementation issue.)
With the dual core i5 from 2010, we can fill up the 1 Gbps pipe with 10k resources. The larger the resources, the easier it will be. With a quad core i7, we can achieve the same for 7.5k sized resources (all on 100 simultaneous, busy connections). It works.
If you consider implementing a large h2 client, the tradeoff/benefit in number of connections used vs. number of parallel requests is not easily answered. Putting all actions into a single connection will run into multi-threading overhead for any implementation out there. The "at least 100" parallel streams a server should offer needs more thought.
The fewer CPU cycles a server burns per request, the better. But this is old news, since it was already true for h1. For mod_h2 there are several areas where optimizations are still possible. More effort is needed (Time & Brains).

Large Transfers
2015-05-12

With release v0.5.5 mod_h2 addresses performance of writes. Early tests in April showed that transfers of large resources achieved only 50-60% of the throughput that was possible with HTTP/1.1, i.e. httpd not involving mod_h2 (on my machine, all over https). That seemed excessively slow.

At first I suspected that the code just did not shovel the data fast enough from worker threads to the main connection. But changes there resulted in only marginal improvements.

Then I looked at how mod_h2 actually passed the raw HTTP/2 frames down the httpd filter chain for writing to the connection. I experimented with some simple write buffering and immediately got much better results. So I looked at how connection output filters work with the data they are given, especially the output filter from mod_ssl.

For those unfamiliar with the httpd internals: the server uses a very nice mechanism called bucket brigades for input and output handling. A very smart list of data chunks, basically, that also allows for meta data buckets like flushing, end-of-stream or resource maintenance indicators. The main purpose of brigades is to make copying of data chunks unnecessary for most operations on the overall data stream. This way, code can manage a brigade of a 10 MB file without having the full 10 MB in memory, read the first 16 KB of it, insert a flush bucket at any time etc. without copying data from one buffer to another.
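To make the idea concrete, here is a toy model of a brigade in Python. It is not the APR API: the class names and `take_bytes` are invented for illustration, and a `memoryview` stands in for a data bucket that refers to memory without owning a copy of it.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class DataBucket:
    view: memoryview  # refers to the underlying bytes, does not copy them


@dataclass
class FlushBucket:
    pass  # meta data: "write out everything before me now"


@dataclass
class EosBucket:
    pass  # meta data: end of stream


Bucket = Union[DataBucket, FlushBucket, EosBucket]


def take_bytes(brigade: List[Bucket], n: int) -> Tuple[List[Bucket], List[Bucket]]:
    """Split off the first n data bytes as a new brigade, without copying data."""
    head: List[Bucket] = []
    rest = list(brigade)
    while rest and n > 0:
        bucket = rest.pop(0)
        if isinstance(bucket, DataBucket) and len(bucket.view) > n:
            # Split one bucket into two views onto the same memory.
            head.append(DataBucket(bucket.view[:n]))
            rest.insert(0, DataBucket(bucket.view[n:]))
            n = 0
        else:
            head.append(bucket)
            if isinstance(bucket, DataBucket):
                n -= len(bucket.view)
    return head, rest


# A 32 KB payload followed by end-of-stream.
payload = bytes(32 * 1024)
brigade = [DataBucket(memoryview(payload)), EosBucket()]

# Take the first 16 KB and ask for it to be flushed right away.
head, rest = take_bytes(brigade, 16 * 1024)
head.append(FlushBucket())
```

Both `head` and `rest` still point into the original `payload`; only the small bucket objects were created.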

mod_h2 gets complete frames from its nghttp2 engine to be transferred to the client. Before v0.5.5 it placed them into a bucket and passed that down the connection output filter chain where it eventually reached mod_ssl's filter, got encrypted and then passed to the socket. While DATA frames are mostly 8-16 KB in size, depending on the amount delivered by the worker threads, there are also many other session management frames that are quite small.

By just passing these small frames as buckets in the output brigade, mod_ssl was doing an SSL_write() on each of them, including all the yadayada that is required by TLS. And this was causing the slow performance.

v0.5.5 uses apr_brigade_write() instead, which contains some very smart code. If possible, it collects small data chunks into 8 KB buckets and, given a proper flush callback, directly writes large data chunks without copying. That gives the following measurements on my Ubuntu image, transferring a 10 MB file 1000 times via 8 connections:

Scenario	Metric	/005.txt
wrk (http/1.1)	MB/s	1029
mod_h2(0.5.4)	MB/s	601
mod_h2(0.5.5)	MB/s	950

With this change, mod_h2 is transferring data about as fast as in HTTP/1.1 and there is no downside to enabling HTTP/2 in an httpd that needs to transfer large resources^*).

^*)In my test scenarios. If you have proof to the contrary, please submit a test case!
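The effect of the write buffering described in this section can be illustrated with a small Python model. This is a sketch of the general technique only, not the code of apr_brigade_write(): the class `CoalescingWriter`, the 8192 byte limit and the frame mix are assumptions chosen for illustration. Each call to the sink stands for one expensive TLS write.

```python
class CoalescingWriter:
    """Collect small chunks into one buffer; pass large chunks through."""

    def __init__(self, sink, limit=8192):
        self.sink = sink        # callable that performs the expensive write
        self.limit = limit      # assumed buffer size, 8 KB as in the text
        self.pending = []       # small chunks waiting to be written
        self.pending_bytes = 0

    def write(self, chunk):
        if len(chunk) >= self.limit:
            # Large chunk: flush what is buffered, then hand it on as-is.
            self.flush()
            self.sink(chunk)
            return
        if self.pending_bytes + len(chunk) > self.limit:
            self.flush()
        self.pending.append(chunk)
        self.pending_bytes += len(chunk)

    def flush(self):
        if self.pending:
            self.sink(b"".join(self.pending))
            self.pending = []
            self.pending_bytes = 0


def count_writes(frames, coalesce):
    """Return (number of sink calls, bytes written) for a list of frames."""
    calls = []
    if coalesce:
        writer = CoalescingWriter(calls.append)
        for frame in frames:
            writer.write(frame)
        writer.flush()
    else:
        for frame in frames:
            calls.append(frame)
    return len(calls), sum(len(c) for c in calls)


# Hypothetical frame mix: small control frames around 16 KB DATA frames.
frames = [b"s" * 9, b"s" * 13, b"d" * 16384, b"s" * 9, b"s" * 9, b"d" * 16384]

if __name__ == "__main__":
    print("unbuffered:", count_writes(frames, coalesce=False))
    print("coalesced: ", count_writes(frames, coalesce=True))
```

The same bytes reach the sink in both cases; only the number of writes differs, and the gap grows with the share of small frames.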

Parallelism
2015-04-15

With release v0.5.0 I made some internal improvements that result in less memory consumption and better performance. I also took a detailed look at the effect that parallel stream numbers have on the overall results.

The tests were again done on my trustworthy MacBook with a Parallels (hah!) Ubuntu 14.04 image. All tests ran in the sandbox. The numbers are samples from several runs, not really averaged and with variation and all that stuff that I should know and do as a mathematician...but I want just to give a feel for it.

Scenario	Parallelism	Metric	/index.html
			653 bytes
wrk (httpd 2.4.12)	`-`	req/s	26026
nghttpd(0.7.11)	`-m 1`	req/s	23691
	`-m 10`	req/s	71078
	`-m 20`	req/s	84392
	`-m 40`	req/s	89725
	`-m 100`	req/s	~30% failures
mod_h2(0.5.0)	`-m 1`	req/s	19587
	`-m 2`	req/s	23702
	`-m 5`	req/s	28582
	`-m 10`	req/s	28723
	`-m 20`	req/s	29535
	`-m 40`	req/s	29189
	`-m 100`	req/s	29498

Interpretation? I think it is safe to say the following:

The -m 1 numbers show that the cost of a single request is still 20% higher in mod_h2 compared to nghttpd.
The overall performance of HTTP/2 is better than HTTP/1 if the number of parallel requests exceeds 2-5, depending on implementation.
The scaling of nghttpd is fabulous; however, the -m 100 case shows that it keeps files open until streams are done and runs out of file handles rather soon. (I am sure, now that I have mentioned it, that the next nghttp2 release will fix this and keep the performance, too!)
mod_h2 hits its ceiling very soon, at 10-20 parallel streams. But the good news is it stays stable with increased stream numbers.
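The ratios behind these statements can be recomputed from the table. The following Python sketch uses only the req/s values listed above; the helper names are invented, and "cost" is taken to mean time per request, which is an assumption about how to read the -m 1 numbers.

```python
# req/s for /index.html (653 bytes), copied from the table above.
nghttpd = {1: 23691, 10: 71078, 20: 84392, 40: 89725}
mod_h2 = {1: 19587, 2: 23702, 5: 28582, 10: 28723, 20: 29535, 40: 29189, 100: 29498}
wrk_http1 = 26026


def relative_cost(baseline_rps, candidate_rps):
    """How much more time per request the candidate needs than the baseline."""
    return baseline_rps / candidate_rps - 1.0


def speedup(results):
    """Best observed req/s relative to the -m 1 result."""
    return max(results.values()) / results[1]


def first_to_beat(results, target_rps):
    """Smallest measured -m value whose req/s exceeds target_rps, or None."""
    for m in sorted(results):
        if results[m] > target_rps:
            return m
    return None


if __name__ == "__main__":
    print(f"mod_h2 cost vs nghttpd at -m 1: +{100 * relative_cost(nghttpd[1], mod_h2[1]):.0f} %")
    print(f"nghttpd best vs -m 1: {speedup(nghttpd):.2f}x")
    print(f"mod_h2 best vs -m 1:  {speedup(mod_h2):.2f}x")
    print(f"mod_h2 beats wrk from -m {first_to_beat(mod_h2, wrk_http1)}")
```

Note that `first_to_beat` can only report parallelism levels that were actually measured, so the real crossover may lie between two table rows.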

mod_h2 stays stable with increasing parallel stream numbers because files for static content get converted into byte buffers before worker threads turn to other streams. That limits the number of open files to the number of worker threads.

This is a compromise in the current processing model. The HTTP/2 connection terminates in a specific worker process and that process has a limit on the number of open files. No spawning of new processes will really help the connection. If HTTP/2 connections should expose long lifetimes and bursty, parallel streams, resources need to be allocated carefully.

The v0.5.0 release allocates everything for stream handling per worker. The worker has a memory pool, bucket allocators and a pseudo connection socket and rents those out to the stream that it processes. The stream is done when all its output data has been sent or sits in the size-limited output buffers of the HTTP/2 session. So, opening a new stream will allocate only a few bytes. Only when processing of the stream actually starts will more resources be allocated, most of them on hot standby in the worker itself. Hence the stable performance with increasing parallel stream numbers.
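The bound on open files mentioned above (a file is read into a byte buffer and closed before the worker turns to another stream) can be demonstrated with a small thread pool model. This is a sketch in Python, not mod_h2's C implementation; `OpenFileCounter`, `serve_streams` and `demo` are hypothetical names, and the 653 byte file merely mimics /index.html.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class OpenFileCounter:
    """Tracks how many files are open at once and remembers the peak."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1
        return False


def read_into_buffer(path, counter):
    """Convert the file into bytes; the file is closed before returning."""
    with counter, open(path, "rb") as f:
        return f.read()


def serve_streams(path, streams, workers):
    """Serve many streams with few workers; return bodies and peak open files."""
    counter = OpenFileCounter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(lambda _: read_into_buffer(path, counter), range(streams)))
    return bodies, counter.peak


def demo(streams=100, workers=4):
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * 653)
        bodies, peak = serve_streams(path, streams, workers)
    finally:
        os.unlink(path)
    return len(bodies), peak


if __name__ == "__main__":
    served, peak = demo()
    print(f"served {served} streams, peak open files: {peak}")
```

However many streams are requested, the peak never exceeds the number of workers, at the price of holding each response body in memory.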

Tests
2015-04-02

Important update below!

I did two tests in three combinations, using the mod_h2 sandbox setup on an Ubuntu Parallels Image. I used h2load for the HTTP/2 test cases and the nice wrk (see https://github.com/wg/wrk) for the HTTP/1.1 numbers.

The performance tests invoked were:


                    wrk -t10 -c100 -d30s https://test.example.org:12346/<file>
                    h2load -c 100 -t 10 -n 740000 https://test.example.org:12346/<file>

where wrk was tested against Apache httpd 2.4.12 with TLS+http/1.1 and h2load was tested with nghttpd, the server that comes with nghttp2, and mod_h2 in the Apache httpd 2.4.12 setup. test.example.org was mapped to 127.0.0.1.

The numbers:

Scenario	Metric	/index.html	/002.jpg
		653 bytes	90364 bytes
wrk + apache(2.4.12)	req/s	25139	7941
	MB/s	23.1	686.8
h2load + nghttpd(0.7.9)	req/s	25084	4022
	MB/s	16.3	347.0
h2load + mod_h2(0.4.3)	req/s	16093	4272
	MB/s	10.7	368.7

Discussion

How to interpret this? Like every benchmark: with care.

First of all, wrk and h2load are different programs and it's dangerous to compare the absolute numbers between them. But assuming that they are both equally efficient in generating the load, one can see that the number of requests generated per second is very similar between wrk and h2load+nghttpd. The MB/s shows either the effect of header compression, or that both tools measure throughput differently. But my bet is on header compression. index.html is very small and compression will play a larger role.
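One way to probe the header compression guess is to divide throughput by request rate and look at the bytes moved per request. The Python sketch below does this for the /index.html column; it assumes that both tools report MB as 2**20 bytes and count the same kind of traffic, which is exactly the uncertainty mentioned above.

```python
MB = 2**20  # assumption: both tools report MB as 2**20 bytes

BODY_BYTES = 653  # size of /index.html

# (req/s, MB/s) for /index.html, copied from the table above.
results = {
    "wrk + apache(2.4.12)": (25139, 23.1),
    "h2load + nghttpd(0.7.9)": (25084, 16.3),
    "h2load + mod_h2(0.4.3)": (16093, 10.7),
}


def bytes_per_request(req_per_s, mb_per_s):
    """Average number of bytes transferred per request."""
    return mb_per_s * MB / req_per_s


def overhead_per_request(req_per_s, mb_per_s, body=BODY_BYTES):
    """Bytes per request beyond the response body itself."""
    return bytes_per_request(req_per_s, mb_per_s) - body


if __name__ == "__main__":
    for name, (rps, mbps) in results.items():
        total = bytes_per_request(rps, mbps)
        extra = overhead_per_request(rps, mbps)
        print(f"{name:26s} {total:7.1f} bytes/request, {extra:6.1f} beyond the body")
```

Under these assumptions the HTTP/1.1 run moves roughly 960 bytes per request and the two HTTP/2 runs roughly 680 to 700, which would be consistent with much smaller headers on the wire.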

The mod_h2 performance is at about two thirds now (coming from a good 50% start in February) of that of nghttpd or http/1.1 Apache. This is the penalty that mod_h2 currently has to pay for processing individual requests via the Apache httpd 2.4.x runtime. It internally has to fake HTTP/1.1 requests and parse HTTP/1.1 responses and that costs time and resources. The advantage is that HTTP/2 support is added without changing the Apache core itself.

That mod_h2's implementation is not inherently stupid becomes visible when looking at requests for the larger resource. In this scenario, the i/o optimized Apache httpd can really shine. Since most of the power goes into shoveling a large file onto a socket and ramming it down TCP's flow control throat, we see almost twice the performance of nghttpd/mod_h2.

Some people might suspect that the lower performance is an inherent disadvantage of HTTP/2 flow control. However, if one looks at performance measurements from the h2o server, one sees that HTTP/2 performance in throughput and requests/s can match and even outpace HTTP/1 implementations.

So, the most likely suspect (and that needs to be investigated more) is the way that nghttp2 and mod_h2 handle response DATA. There are two points to make:

libnghttp2 uses a data provider callback API that needs to return a memory buffer with the data. It then shuffles this data around internally until it is ready to send it out (properly framed). This means that static files (mmap'ed) will be copied at least twice. In comparison, the Apache bucket brigade architecture will do this only once and in a very efficient manner.
The additional work that mod_h2 does when collecting response DATA from different threads and feeding it to nghttp2 does not limit performance (in this case). This seems to be the benefit of having its own data bucket structures that are passed between threads without copying.

Anyway, those are the two areas to work on: data copying and Apache core integration. There's probably a lot of fun ahead. Feedback always welcome.

Update

And feedback I got: Tatsuhiro Tsujikawa (the author of nghttp2) pointed out that h2load without the -m number option will only send one request at a time per connection. Doh! How could I miss that?

In the light of this, my numbers above are a comparison of what you get if you use HTTP/2 exactly like HTTP/1.1. But if you really start to send requests in parallel, I get much nicer numbers for the small requests:

Scenario	Parallelism	Metric	/index.html
			653 bytes
h2load + nghttpd(0.7.9)	`-m 10`	req/s (MB/s)	67157 (43.7)
h2load + nghttpd(0.7.9)	`-m 20`	req/s (MB/s)	81847 (53.2)
h2load + nghttpd(0.7.9)	`-m 30`	req/s (MB/s)	85494 (55.6)
h2load + nghttpd(0.7.9)	`-m 40`	req/s (MB/s)	87312 (56.8)
h2load + nghttpd(0.7.9)	`-m 50`	req/s (MB/s)	88216 (57.4)
h2load + mod_h2(0.4.3)	`-m 10`	req/s (MB/s)	25575 (17.1)
h2load + mod_h2(0.4.3)	`-m 20`	req/s (MB/s)	26553 (17.8)
h2load + mod_h2(0.4.3)	`-m 30`	req/s (MB/s)	24952 (16.7)

And there it seems to peak quite early with parallelism in mod_h2, while nghttp2 blazes ahead! My apologies to Tatsuhiro for the earlier numbers and the wrong impression reported!

Well, next week I need to look at why mod_h2 does not scale better, it seems.

Münster, 02.04.2015,

Stefan Eissing, greenbytes GmbH
