There are bugs and there are bugs. Most are just annoying or tedious to analyze. Sometimes, however, you can learn something new.
This particular bug was in the Akamai CDN caching infrastructure; specifically, in the part handing out OCSP responses (disclaimer: I do not know if the bug was restricted to this) for a variety of Certificate Authorities, such as Let's Encrypt (their bug report).
I had implemented new OCSP Stapling support as part of Apache's domain management module. I tested it against a local boulder installation, which is the server software Let's Encrypt runs, so it's very much as close to the real thing as you can get. Or so I thought.
However, before I shipped the new feature, I deployed it on my own site. I was reasonably sure that it would work, since it uses curl as the library for its communication. This weathered and well-maintained piece of software is invaluable.
The module started up, saw that it needed an OCSP response from http://ocsp.int-x3.letsencrypt.org, constructed the request data and POSTed it. It got a proper OCSP response back, and openssl parsed it fine. But when asked about the certificate status report in it, openssl waved its hand and said: "These aren't the status responses you are looking for!"
Since I program in C, well known for its memory corruption capabilities, I looked for mistakes in my own code first. That part looked all right, so I made hex dumps of the responses (yes, you still need to be able to do that in programming) and saw that the response was a valid OCSP response reporting the certificate to be in GOOD status. But it was for another certificate!
What?
Ok, eliminate the impossible, Watson! I saved the request data constructed by my module and sent the request using the curl (did I mention how fantastic this thing is?) command line version:

> curl -H 'Content-Type: application/ocsp-request' --data-binary @request.bin http://ocsp.int-x3.letsencrypt.org

And it gave me another response. A better response! The correct response!
Back to my code. More logging statements. Did not find the mistake. Still got wrong answers. "Well, it was a long day - maybe I see it tomorrow", I thought.
Checking the server again on the next day, I saw that it had received the correct answer sometime during the night (it keeps on trying, as it is designed to)! Yay! But after throwing away the cached one, it tried again and... got a wrong answer again, just like the day before.
Testing the curl command line: it still retrieves the correct answer, all the time.
Ok, time to go deeper, raising log levels to the extreme and recording everything my module and libcurl do. The request from my module:
POST / HTTP/1.1
Host: ocsp.int-x3.letsencrypt.org
User-Agent: Apache/mod_md
Accept: */*
Expect: 100-continue
Content-Type: application/ocsp-request
Content-Length: 85

HTTP/1.1 100 Continue
Cache-Control: max-age=1
Expires: Tue, 20 Aug 2019 13:43:58 GMT

HTTP/1.1 200 OK
Content-Type: application/ocsp-response
Content-Length: 527
...

...and from the command line:
POST / HTTP/1.1
Host: ocsp.int-x3.letsencrypt.org
User-Agent: curl/7.54.0
Accept: */*
Content-Type: application/ocsp-request
Content-Length: 85

HTTP/1.1 200 OK
Content-Type: application/ocsp-response
Content-Length: 527
...

Reproducing the first version in the command line via
> curl -H 'Expect: 100-continue' -H 'Content-Type: application/ocsp-request' --data-binary @request.bin http://ocsp.int-x3.letsencrypt.org

confirmed it: the wrong response came back for the 100-continue variant on the command line as well. No bug in my module. I suppressed 100-continue in the module, the correct response was received and validated, and everyone was happy.
Oops.
I contacted @cpu at Let's Encrypt to tell him that their responders are acting a bit weird and gave him the data to reproduce.
But he could not reproduce.
I still could.
Hmm.
He then escalated this to their Akamai contacts, as Akamai was fronting their OCSP responder. What IP addresses was I seeing? etc.
A few hours later I was informed that the bug had been reproduced at Akamai, and was asked if I would keep this to myself until a fix was rolled out. Sure, not a problem.
The rollout is now done, so I feel free to speculate about what happened.
The most peculiar thing about this bug is that a valid response was returned. Usually when something goes wrong, you get a garbled response, or an incomplete one, or the connection just dies, or the server catches fire. But not here; everything behaved as if no one had noticed anything wrong.
The fact that the wrong, valid responses were all different and varied in Last-Modified and ETag data leads me to believe that I was seeing the responses to requests from other clients.
The fact that sometime during the night, probably during low traffic hours, my module got the correct response once, makes it very likely that the response to my request was delivered, but mostly to others.
Caveat: the following is pure speculation on my part.
Imagine a restaurant where you order a meal at the counter and then stand in line at the dispenser where the trays with food are handed out. Every customer takes her tray and leaves. Since the trays are produced in the sequence ordered, everything is fine.
Now, a customer comes in and places an order for a meal that produces 2 trays. No one has done this before. But it's a meal on the menu, nothing wrong with that.
So, this special customer stands in line as well, and gets handed his tray. He steps away, realizes it's only one tray of two, and queues again at the dispenser to retrieve the other.
This is fine as long as he is the only customer at the time. Imagine there being two. Customer 2 will get the other tray of customer 1. And customer 1 will get the tray of customer 2 as his second.
If there are 10 people in the dispenser queue behind our special customer, they will all get the wrong tray.
Now replace "tray" with "http/1.1 response" and it describes - what I guess - was going wrong at the OCSP responder.
Side note: HTTP/2 is preferable for server-to-server communication. HTTP/2 would place numbers on orders and trays, so customers know which one to pick up. End side note.
This could have been very annoying for people all over the world. The worst-case scenario probably would have been to ship this in Apache for various distributions, with many people starting to use it at about the same time from different places, retrying and retrying.
What can we learn from this? Besides the obvious things, such as testing, variety in implementations, and careful and responsive handling of security issues (props to LE and Akamai for handling this very professionally):
First, personally, it was fun to be on the other side of a security incident for once. Very relaxing.
Second, when you are using CDNs for your services - and this holds true for all of them, Akamai, Cloudflare, Fastly, Amazon, Azure etc. - you need to add geo-location to your test dimensions.
If I had been a developer in California, I would not have found this bug. I would have shipped my module in Apache and all US users would have been happy.
Third, there is no bug bounty for any of this. I am old-skooling this for the greater good, earning a little fame with this blog, maybe. That's it.
And it worked this time, and that is ok. But I believe infrastructure companies, as CDNs are, should offer bug bounty programs. It would make everyone safer. Amen.
Fourth, I believe there are still many bugs of this kind in the world. Mistakes in coding are made everywhere. Some lead to crashes or use-after-frees, and for those there are excellent tools nowadays to find them. It's just a matter of invested CPU cycles.
For bugs such as this one, there is no automated search. A fuzzer does not verify that the OCSP response matches the fuzzed request data. Nor was there any drop in responsiveness or peaks in CPU/memory usage to be observed.
These bugs will mostly remain hidden, unless computers become way smarter.
Münster, 28.08.2019,
Stefan Eissing
Copyright (C) 2019 Stefan Eissing