The Grumpy Troll

Ramblings of a grumpy troll.

OCSP Oops!

Conceptual Background

OCSP (the Online Certificate Status Protocol) provides a means for a TLS client to check that a certificate issued to a server is still valid, by asking for a “current proof”. In its original form, it’s a disaster: clients need to talk to the TLS server (typically a secure web server), find out who issued the certificate and where on the Internet that issuer’s OCSP server can be reached, go off and talk to that OCSP server, get a current proof, then resume talking to the original server.

Thus this original form was a major privacy problem, as Certificate Authorities find out and can track everyone visiting the web-server of one of their customers; it was a performance problem; and it was an availability problem, because your secure server’s availability is hurt every time the OCSP servers are unavailable (whether “down down” or “appearing down for people on that bit of network over there”, for whatever reason). If you ran a huge site, the CA would need to run a much larger service to handle the load. But because there were so many issues, clients would never hard-require OCSP, so it was mostly placebo security: inability to check had to be treated as “sure, probably secure”.

But someone came up with a rather bright idea: instead of the clients talking to the CA’s OCSP server, how about the webserver side talks to the OCSP server periodically to get a proof, holds onto it, and hands it to the clients within TLS? This is called OCSP Stapling.

OCSP Stapling is A Good Thing: the CA’s OCSP server is only talked to by the CA’s existing customers, with whom it already has a relationship and whose identity it already knows; the burden on the OCSP server scales with the number of the CA’s customers, not the number of the customers’ customers. And if the stapling party starts trying to fetch a new proof before the current one expires, then it can ride out glitches and problems with the CA’s OCSP server. As long as it can succeed somewhere in the window between “when it starts trying” and when the current staple expires, all is good. This timer-based renew system is old hat in networking: DHCP does it with leases. It’s a well-understood solution.
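The renew-before-expiry window can be sketched in a few lines. This is a minimal illustration, not how any particular server does it: the function names and the `start_fraction` parameter are made up for the example, and the two timestamps stand in for the `thisUpdate` and `nextUpdate` fields of an OCSP response.

```python
from datetime import datetime, timedelta, timezone


def refresh_due(now, this_update, next_update, start_fraction=0.5):
    """Return True once `now` has reached the point where renewal attempts should begin.

    `start_fraction` is a hypothetical tuning knob: 0.5 means "start trying
    once half of the staple's lifetime has elapsed".
    """
    lifetime = next_update - this_update
    refresh_at = this_update + lifetime * start_fraction
    return now >= refresh_at


def staple_usable(now, this_update, next_update):
    """Return True while the current staple is still inside its validity window."""
    return this_update <= now < next_update


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
    expires = issued + timedelta(days=7)
    for day in range(8):
        now = issued + timedelta(days=day)
        print(day, refresh_due(now, issued, expires), staple_usable(now, issued, expires))
```

Everything between the first moment `refresh_due` returns true and the last moment `staple_usable` returns true is slack for riding out an unreachable OCSP server.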

And so we’re at the stage where certificates are slowly starting to be issued with flags saying “this is only valid if accompanied by an OCSP proof”, along with other such ratchets.

Grumpy Troll Deployment Background

This Grumpy Troll is one of the Exim maintainers, and Exim is a mail-server which supports providing OCSP proofs. Whether any mail-clients will use it is an open question, but we break the chicken/egg deadlock by moving first and offering it. This is probably most useful for SMTP Submission clients.

Most of my certificates these days are issued by Let’s Encrypt.

I have OCSP proof renewal happen every two days; each proof lasts for one week. I don’t have anything more sophisticated at present than “I’ll be notified if a renewal fails and can go look”. But I also have OCSP monitoring, for SMTP, via smtpdane; that’s early-alpha software and not yet entirely trustworthy, but it works well enough to alert me to real issues.
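As a back-of-the-envelope check on those numbers, here is a small sketch of how much slack that schedule leaves. It assumes a simplified model which is mine, not a description of the actual setup: the staple is fetched at time zero, is valid for exactly the stated lifetime from that moment, and renewal attempts fire at fixed multiples of the interval. The function names are invented for the example.

```python
import math


def attempts_before_expiry(validity_days, interval_days):
    """Count scheduled renewal attempts that fall strictly before the staple expires.

    Model (an assumption): fetch at t=0, attempts at interval, 2*interval, ...
    and the staple stops being valid at t=validity_days.
    """
    return math.ceil(validity_days / interval_days) - 1


def tolerable_consecutive_failures(validity_days, interval_days):
    """Failed attempts in a row that can be absorbed before the staple lapses."""
    return max(attempts_before_expiry(validity_days, interval_days) - 1, 0)


if __name__ == "__main__":
    print(attempts_before_expiry(7, 2))
    print(tolerable_consecutive_failures(7, 2))
```

With a one-week proof and a two-day interval, attempts land on days 2, 4 and 6, so under this model two failures in a row are survivable provided the third attempt succeeds.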

I also have monthly Let’s Encrypt renewals, via calendar reminder; for other systems, the renewal is entirely automated, but these ones, for Boring Reasons, are stuck with a manual invocation which I watch over. I invoke one tool, it does the renewal.

I mostly only have OCSP here to prove that it works, and to find problems early. Since almost nothing checks OCSP in SMTP, failures are unfortunately harmless … which makes it safe to experiment.

The Exim server-side stapling is “specify a filename; if that file exists, we’ll try to load it for use as the staple to send to clients, if not we won’t; update is your responsibility”. In our current design, that’s the most reasonable approach; it’s brutal in its elegant simplicity.

Incident

After a couple of days of not checking email, I checked this evening and … there were a bunch of alerts from my smtpdane monitoring, showing that OCSP validation had failed:

OCSP: response invalid for mx.spodhuis.org from Let's Encrypt Authority X3:
        unsupported issuer hash algorithm

The first line of the error message is under my control and correctly points to where the problem is; the second line comes from golang.org/x/crypto/ocsp, the Golang OCSP library. The golang.org/x/ bit means that it’s not standard library and there are no API guarantees, but that it is official and may make its way into the standard library.

My first reaction was “I thought I was pulling in all the hash algorithm libraries, because I’ve been bitten by that before”. I check, and sure enough I am.

So I look more closely at the slew of alerts and eventually realize that they started after my last certificate renewal and stopped two days later, that they’re not current because there would be one more if there were, and that they stopped when OCSP renewal happened.

Cue dawning horror: my certificate renewal process was not also renewing the OCSP staple at the same time, so I spent two days serving mismatched staples.

Cleanup

  1. Monitoring caught this; monitoring is good, my tool is working; life stuff kept me from paying attention for a while, but I found out because monitoring told me. This is a good state of affairs.
  2. The error message sucks. I’ll file a bug report against the OCSP library.
  3. My renewal tool now deletes the existing staple, before renewing and getting a new staple. This has been tested. I feel it’s better to very briefly be without a staple and have paranoid clients defer and retry later, than to be serving up a bad staple.
  4. Better tooling needs to strongly associate staples with the generation of the cert and step both together atomically. That’s not going to happen here, but it is something to consider in server application design.
  5. I need to go check the tooling for exim.org too, where things are fully automated.
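
The ordering in item 3 can be sketched as follows. This is an illustration of the idea, not the actual renewal tool: `renew_cert` and `fetch_staple` are hypothetical callables standing in for whatever really talks to the CA, and the file-based layout assumes a server that, like Exim, serves the staple file if it exists and skips stapling otherwise.

```python
import os
import tempfile
from pathlib import Path


def renew_with_staple(staple_path, renew_cert, fetch_staple):
    """Renew a certificate without ever serving the old staple next to the new cert.

    renew_cert:   hypothetical callable that installs the new certificate.
    fetch_staple: hypothetical callable that returns the new OCSP response as bytes.
    """
    staple_path = Path(staple_path)

    # 1. Drop the old staple first: no staple beats a mismatched staple.
    staple_path.unlink(missing_ok=True)  # missing_ok needs Python 3.8+

    # 2. Renew the certificate. If this raises, the server keeps running
    #    with whatever certificate it has and no staple.
    renew_cert()

    # 3. Fetch a proof for the new certificate.
    staple = fetch_staple()

    # 4. Write to a temporary file in the same directory, then rename, so
    #    the server never sees a partially written staple.
    fd, tmp_name = tempfile.mkstemp(dir=staple_path.parent,
                                    prefix=staple_path.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(staple)
        # mkstemp creates the file as 0600; the staple is sent to every
        # client anyway, so making it world-readable is assumed to be fine.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, staple_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This only closes the window where an old staple sits beside a new certificate. It does not give the atomic cert-plus-staple generation step described in item 4, since the certificate and the staple are still updated as two separate operations.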
Categories: TLS PKIX OCSP