There are bugs and there are bugs. Most are just annoying or tedious to analyze. Sometimes, however, you can learn something new.
This particular bug was in the Akamai CDN caching infrastructure; specifically, in the part handing out OCSP responses (disclaimer: I do not know if the bug was restricted to this) for a variety of Certificate Authorities, such as Let's Encrypt (their bug report).
I had implemented new OCSP Stapling support as part of Apache's domain management module. I tested it against a local boulder installation, which is the server software Let's Encrypt runs, so it's very much as close to the real thing as you can get. Or so I thought.
However, before I shipped the new feature, I deployed it on my own site. I was reasonably sure that it would work, since it uses curl as the library for its communication. This weathered and well-maintained piece of software is invaluable.
The module started up, saw that it needed an OCSP response from http://ocsp.int-x3.letsencrypt.org, constructed the request data and POSTed it. It got a proper OCSP response back, and openssl parsed it fine. But when asked about the certificate status report in it, openssl waved its hand and said: "These aren't the status responses you are looking for!"
Since I program in C, well known for its memory corruption capabilities, I looked for mistakes in my own code first. That part looked all right, so I made hex dumps of the responses (yes, you still need to be able to do that in programming) and saw that the response was a valid OCSP response reporting the certificate to be in GOOD status. But it was for another certificate!
What?
Ok, eliminate the impossible, Watson! I saved the request data constructed by my module and sent the request using the curl (did I mention how fantastic this thing is?) command line version:

> curl -H 'Content-Type: application/ocsp-request' --data-binary @request.bin http://ocsp.int-x3.letsencrypt.org

And it gave me another response. A better response! The correct response!
Back to my code. More logging statements. Did not find the mistake. Still got wrong answers. "Well, it was a long day - maybe I see it tomorrow", I thought.
Checking the server again on the next day, I saw that it had received the correct answer sometime during the night (it keeps on trying, as it is designed to)! Yay! But after throwing away the cached one, it tried again and... got a wrong answer again, just like the day before.
Testing the curl command line: it still retrieves the correct answer, all the time.
Ok, time to go deeper, raising log levels to the extreme and recording everything my module and libcurl do. The request from my module:
POST / HTTP/1.1
Host: ocsp.int-x3.letsencrypt.org
User-Agent: Apache/mod_md
Accept: */*
Expect: 100-continue
Content-Type: application/ocsp-request
Content-Length: 85

HTTP/1.1 100 Continue
Cache-Control: max-age=1
Expires: Tue, 20 Aug 2019 13:43:58 GMT

HTTP/1.1 200 OK
Content-Type: application/ocsp-response
Content-Length: 527
...

...and from the command line:
POST / HTTP/1.1
Host: ocsp.int-x3.letsencrypt.org
User-Agent: curl/7.54.0
Accept: */*
Content-Type: application/ocsp-request
Content-Length: 85

HTTP/1.1 200 OK
Content-Type: application/ocsp-response
Content-Length: 527
...

Reproducing the first version in the command line via
> curl -H 'Expect: 100-continue' -H 'Content-Type: application/ocsp-request' --data-binary @request.bin http://ocsp.int-x3.letsencrypt.org

confirmed it: the wrong response came back for the 100-continue variant on the command line as well. No bug in my module. I suppressed 100-continue in the module, the correct response was received and validated, and everyone was happy.
Oops.
I contacted @cpu at Let's Encrypt to tell him that their responders are acting a bit weird and gave him the data to reproduce.
But he could not reproduce.
I still could.
Hmm.
He then escalated this to their Akamai contacts, as Akamai was fronting their OCSP responder. What IP addresses was I seeing? etc.
A few hours later I was informed that the bug had been reproduced at Akamai, and was asked if I would keep this to myself until a fix was rolled out. Sure, not a problem.
The rollout is now done, so I feel free to speculate about what happened.
The most peculiar thing about this bug is that a valid response was returned. Usually when something goes wrong, you get a garbled response, or an incomplete one, or the connection just dies, or the server catches fire. But not here; everything behaved as if no one had noticed anything wrong.
The fact that the wrong, valid responses were all different and varied in Last-Modified and ETag data leads me to believe that I was seeing the responses to requests from other clients.
The fact that sometime during the night, probably during low traffic hours, my module got the correct response once, makes it very likely that the response to my request was delivered, but mostly to others.
Caveat: the following is pure speculation on my part.
Imagine a restaurant where you order a meal at the counter and then stand in line at the dispenser where the trays with food are handed out. Every customer takes her tray and leaves. Since the trays are produced in the sequence ordered, everything is fine.
Now, a customer comes in and places an order for a meal that produces 2 trays. No one has done this before. But it's a meal on the menu, nothing wrong with that.
So, this special customer stands in line as well, and gets handed his tray. He steps away, realizes it's only one tray of two, and queues again at the dispenser to retrieve the other.
This is fine as long as he is the only customer at the time. Imagine there being two. Customer 2 will get the other tray of customer 1. And customer 1 will get the tray of customer 2 as his second.
If there are 10 people in the dispenser queue behind our special customer, they will all get the wrong tray.
Now replace "tray" with "http/1.1 response" and it describes - what I guess - was going wrong at the OCSP responder.
Side note: HTTP/2 is preferable for server-to-server communication. HTTP/2 would place numbers on orders and trays, so customers know which one to pick up. End side note.
This could have been very annoying for people all over the world. The worst-case scenario probably would have been to ship this in Apache for various distributions, with many people starting to use it at about the same time from different places, retrying and retrying.
What can we learn from this? Besides the obvious things, such as testing, variety in implementations, and careful and responsive handling of security issues (props to LE and Akamai for handling this very professionally):
First, personally, it was fun to be on the other side of a security incident for once. Very relaxing.
Second, when you are using CDNs for your services - and this holds true for all of them, Akamai, Cloudflare, Fastly, Amazon, Azure etc. - you need to add geo-location to your test dimensions.
If I had been a developer in California, I would not have found this bug. I would have shipped my module in Apache and all US users would have been happy.
Third, there is no bug bounty for any of this. I am old-skooling this for the greater good, earning a little fame with this blog, maybe. That's it.
And it worked this time, and that is ok. But I believe infrastructure companies, as CDNs are, should offer bug bounty programs. It would make everyone safer. Amen.
Fourth, I believe there are still many bugs of this kind in the world. Mistakes in coding are made everywhere. Some lead to crashes or use-after-frees, and for those there are excellent tools nowadays to find them. It's just a matter of invested CPU cycles.
For bugs such as this one, there is no automated search. A fuzzer does not verify that the OCSP response matches the fuzzed request data. Nor was there any drop in responsiveness or peaks in CPU/memory usage to be observed.
These bugs will mostly remain hidden, unless computers become way smarter.
Münster, 28.08.2019,
Stefan Eissing
Copyright (C) 2019 Stefan Eissing