The Grumpy Troll

Ramblings of a grumpy troll.

The path to Exim 4.80

This is long, detailed and rambling. If you don't want the editorial, then
just peruse the git tree, including:

A while back, The Exim Maintainers held a mini-conf where we sorted out policy for issues such as "when do we release". We decided that if not forced sooner by something urgent or the wrap up of major new features, we'd release about every six months with whatever we have. Our previous release was 4.77 on October 10th, 2011, so we're about a month and a half late.

We're entirely a volunteer project, with nobody paid to work on Exim; as it happens, I'm currently between jobs and have used the time to get this release out. There were quite a few new features in 4.80; and, once again, some changes which affect backwards compatibility. As is usual, these are limited to security-impacting changes, but this time the reasons are cryptographic and library based. Once it was clear that there were enough little issues building up, I bit the bullet and pushed in a rewrite which made things even worse, but got everything over and done with. Thus the version bump to 4.80.

The theme of this release is really security and authentication. There are no architectural changes, merely implementation details and improvements. I've sweetened the potion with build ease improvements.

We also welcome two new maintainers, Jeremy Harris and Todd Lyons.

It started when I found that I could no longer authenticate to Exim with GSSAPI. I had been doing this with Cyrus SASL, exporting $KRB5_KEYTAB into Exim's startup environment. After some investigation, I discovered that Heimdal had changed their code so that if a program had changed permissions then the environment variable would no longer be honoured, and there was no longer a way for Exim to use a keytab other than the default. Since Exim runs as non-root when receiving connections, this is distinctly non-optimal.

Initially I tinkered with the cyrus_sasl authentication driver, improving the debugging, letting it set the external Security Strength Factor (SSF) from the number of bits of encryption provided by TLS, confirming what could be seen, as I worked to verify that the problem was inside Heimdal.

Then in a couple of evenings after work I wrote a GSASL authentication driver, to “compete” with Cyrus SASL. There's nothing unusual in this, as Exim supports choosing either OpenSSL or GnuTLS, multiple database libraries, all sorts of backends. The only unusual item in this list is the OpenSSL or GnuTLS choice, which is an either-or selection at compile time, whereas all of the others can cohabit in a single Exim binary. Some time later, after some of the stuff below, I added a new lookup type of dbmjz, which takes a list of items as the key, joins them on ASCII NUL and uses that for looking up data in a DBM file. Yes, while Exim handles embedded NULs just fine in an email, it doesn't like them much in configuration rules, which use C strings. Well, with this lookup type, I had Exim able to use GSASL authentication, with passwords being looked up via dbmjz from a sasldb2 DB file. After asking for feedback, one sysadmin switched his production system to using it, as he appeared to feel that it was simpler and better than the alternative. Well, competition is healthy.
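As a rough illustration of the dbmjz idea (in Python, not Exim's C), the key is simply the list items joined on ASCII NUL. The dict standing in for the DBM file, the sample record, and the user/realm/property key layout are assumptions made for this sketch.

```python
# Sketch of the dbmjz idea: a list of items becomes one NUL-joined DBM key.
# The sample record and its key layout are invented for illustration.


def dbmjz_key(items):
    """Join the list items on ASCII NUL to form the lookup key (bytes)."""
    return b"\0".join(item.encode("utf-8") for item in items)


def dbmjz_lookup(db, items):
    """Look the NUL-joined key up in db; return None when it is absent."""
    return db.get(dbmjz_key(items))


# A plain dict stands in for the DBM file here.
fake_sasldb = {
    b"alice\0mail.example.org\0userPassword": b"s3cret",
}
```

Calling dbmjz_lookup(fake_sasldb, ["alice", "mail.example.org", "userPassword"]) finds the record, which is the part a C-string configuration rule cannot express directly.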

But by the time that authentication driver was written, I had confirmed that there was no API in GSASL's GSSAPI support for setting the server keytab. That left the two main SASL framework implementations both unusable for servers accessing non-default keytabs, which servers should be doing if they're not granting system login access. So a week and a half later, again having some spare time in the evenings, I bit the bullet and wrote a second new authentication driver, heimdal_gssapi, to use the Heimdal libraries directly, and taking a server_keytab option. Goodbye environment passthrough, goodbye fragility, hello configuration all in one place, hello working Kerberos authentication for SMTP submission.

Both of these new authenticators are currently server-only. The GSASL one could stand to support client side too, it's a simple matter of programming. I'm not seeing a real use-case for Exim as a client to implement GSSAPI, but if one ever comes up, we can look into supporting it.

Along the way, rather than require users to embed all of the library rules for Heimdal into their Local/Makefile build config, I started adding pkg-config support, so that Exim's build system could ask a central tool used on most modern systems "hey, how do I find the include files and libraries for this component?". This is working rather well. It supports various lookup types and authentication types, both SSL libraries, and PCRE, since that is now no longer bundled with Exim. (Note: PCRE and Exim were written by the same author and PCRE typically wasn't available on systems when Exim was first distributed, so bundling was the sane option back then. It was removed from Exim for the 4.70 release.)
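The question the build system puts to pkg-config can be sketched in a few lines of Python. This is only an illustration of the idea, not Exim's Makefile logic; the helper names are hypothetical, and the split into compiler and linker flags is deliberately simplistic.

```python
import shlex
import subprocess


def split_flags(pkg_config_output):
    """Sort pkg-config output into compiler flags and linker flags."""
    cflags, libs = [], []
    for flag in shlex.split(pkg_config_output):
        # -l and -L belong to the link step; everything else is
        # treated as a compiler flag in this simplified sketch.
        if flag.startswith(("-l", "-L")):
            libs.append(flag)
        else:
            cflags.append(flag)
    return cflags, libs


def query_pkg_config(package):
    """Ask pkg-config for a package's flags; raises if the tool or package is missing."""
    result = subprocess.run(
        ["pkg-config", "--cflags", "--libs", package],
        check=True, capture_output=True, text=True,
    )
    return split_flags(result.stdout)
```

The point is that the include and library paths live with the installed library rather than being copied by hand into every build configuration.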

The coding died down for a while, as I got very busy job-hunting (with a friend, who decided to leave on the same day). Along the way, I nominated two occasional contributors to Exim for commit bits, which happened. So we're now up another two committers, with one of them having code in this release, as Jeremy has beaten the test suite into submission, so that he could refactor the ACL verification logic safely.

During this period, I noted to the other maintainers that we were coming up on the six month slot for a new release, but none of the other folks who can cut a release had time either.

And then OpenSSL 1.0.1 came out, with support for TLSv1.1 and TLSv1.2. Awesome news, except that Exim failed to work with them. After some prodding, I came up with three issues. Two of them were fixed by removing a call Exim made to SSL_clear() on a newly created SSL handle. For some reason, that causes protocol negotiation failures, where it didn't use to, and the OpenSSL devs have been silent as to why that might be. But since the call was superfluous, I removed it and Exim once more worked. Along the way, I refreshed the options list. I made Exim handle renegotiation, which had led to the third issue, that TLSv1.1/1.2 renegotiation was broken, with the server sending a TLSv1.0 handshake in a TLSv1.1/1.2 session. That has been fixed upstream for OpenSSL 1.0.1d.

And I made the first real backwards compatibility break. Long ago, the OpenSSL developers created a work-around for the BEAST attack, before the BEAST attack existed as something practical and workable. It went into OpenSSL 0.9.6d. It broke the Eudora mail-client, so Exim was setting SSL_OP_DONT_INSERT_EMPTY_FRAGMENTS. Some time later, we had a request to add a second option to be set by Exim, and rather than do so, I'd added the ‘openssl_options’ configuration option to Exim (4.73), and grandfathered in ‘+dont_insert_empty_fragments’ as the default. It was time for this to go. I cleared the defaults, so that now if you still have Eudora clients and if they're still broken then (1) you're in trouble security-wise anyway and (2) you need to change Exim's configuration to restore compatibility. Sooner or later the compatibility vs security trade-off has to become “if it's still not compatible, you have bigger problems, it's the admin's job now to keep compatibility only if needed, rather than penalising everyone else's security” and place pressure on vendors that way. This security vs compatibility issue reared its head more than once.

Then I did something potentially useful: I added TLS Server Name Indication (SNI) support to Exim. TLS SNI is what permits webservers to perform name-based virtual hosting of SSL/TLS on just one IP. In this, as part of the initial handshake, the client sends the expected server-name to the server. The downside is, this is sent in the clear, so someone capturing traffic can now see which site on a given server you're requesting data from. This traffic analysis exposure is comparatively minor and probably something that could be determined anyway by someone determined to find out, from statistical modelling of the traffic sizes and knowing what different sites are hosted on a given server.
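To see the cleartext exposure concretely, here is a small Python sketch using the standard ssl module's in-memory BIOs. It starts a client handshake without any network connection and captures the first bytes the client would send; the function name is hypothetical.

```python
import ssl


def client_hello_bytes(server_name):
    """Return the first flight a TLS client would send for server_name."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and now waits
        # for a server reply that never arrives in this sketch.
        pass
    return outgoing.read()
```

Searching the returned bytes for the hostname finds it verbatim, which is exactly what someone capturing traffic would see.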

I changed Exim so that if the client presents the TLS SNI option, it is available in the variable ‘$tls_sni’ and, importantly, if the main configuration version (not the smtp client transport variant) of the ‘tls_certificate’ option contains the string ‘tls_sni’, then it and a few other options will be re-expanded at that time, permitting the certificate and the key used for the session to be changed.
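The decision the re-expansion makes possible can be sketched as follows. This is Python for illustration rather than an Exim configuration; the directory layout, file names and host list are all hypothetical.

```python
# Hypothetical layout: one PEM file per hostname in a single directory.
CERT_DIR = "/etc/exim/certs"
DEFAULT_CERT = CERT_DIR + "/default.pem"
KNOWN_HOSTS = {"mail.example.org", "smtp.example.net"}


def certificate_for(sni):
    """Pick a certificate path for the name the client sent (may be empty)."""
    name = sni.lower()
    # The allow-list check also keeps untrusted client input out of the
    # file path: anything unknown, or no SNI at all, gets the default.
    if name not in KNOWN_HOSTS:
        return DEFAULT_CERT
    return f"{CERT_DIR}/{name}.pem"
```

Whatever form the real configuration takes, the two cases to keep in mind are the same: a client that sends no SNI, and a client that sends a name chosen to be hostile.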

For MX delivery, this doesn't matter at this time, as even without SNI nobody can settle on what identity should be checked for in a server certificate, so if there is any verification happening it's by prior mutual agreement out-of-band between particular sites as to what identity they expect. The only hostnames that can be derived from an email address are obtained via DNS, which is typically not DNSSEC, so those are untrustworthy as identifiers for identity assertions. Mail servers often handle many domains and without SNI, there was no way to have the domain be the identifier without the server also telling anyone who connected “here is a list of all the domains I handle” and changing their master certificate with every arriving or departing customer. (The lack of a defined type for mail-domain in an X.509v3 certificate is trivial to solve: the first group that pick something and get folks using it will have them using their OID, and it will stick, whether from IANA or elsewhere.)

For Submission, this matters a lot more. The connection from the Message User Agent (MUA) to the Message Handling Service (MHS, more commonly "SMTP submission servers") often does have a trusted hostname configured, and will make a connection to one identified name. Changing that name involves changing what users have configured in their software, so transitions are expensive and possibly never quite complete. So having the server able to handle multiple names and present different certificates is a win.

I'm tempted to expand the list of re-expanded options in Exim to include ‘tls_require_ciphers’ and also ‘openssl_options’, so that administrators will be able to move all of their current clients up to modern security and provide a different hostname for use by folks sticking with software that can't handle the new features (TLS1.3?). I haven't yet done so, because SNI support is currently patchy enough that trying to support folks doing this will cause more headaches for us than it solves for them. Use different IP addresses. But yes, if SNI isn't sent then $tls_sni will be empty, which has to be handled anyway in configuring tls_certificate. So this should be doable. We'll consider it for Exim 4.81.

I then added ‘tls_sni’ as an option to the SMTP Transport driver, so that Exim can send SNI too, not just receive it. That's much simpler: the admin specifies a string, it is expanded, it is sent. No complexity.

This next item will upset some folks who believe that anything written in a published RFC is sacred. I turned on ‘accept_8bitmime’ by default in Exim, following Dan Bernstein's advice. Exim has long been 8bit clean, but did not advertise the option by default, because it does none of the content conversions which the specification mandates. Today, if you have a mail-server that is not 8bit clean on the Internet, then your server is broken, whether it technically complies with the RFCs or not. Continuing to not advertise 8BITMIME in the EHLO capability response causes more conversion work by other servers and generally creates more interoperability problems than it solves. We've turned it on, and if you need to receive mail which you will pass on to a server that is not 8bit clean, then it is up to you to turn this off in your configuration.

Up until I saw Philip Hazel agree with this approach in the tracking bug, I had always lacked some confidence in how much I touched Exim. It was strangely liberating to know that ignoring the RFC and doing what's right had the backing of the retired author of Exim and this, more than anything, led to my feeling much more free to perform more radical work on Exim and feeling more like a member of a team of owners than a team of caretakers. That happened some time ago, but it shows in this release.

Those who do believe that RFCs are sacred will be pleased to note that Exim complies with RFC 6176 and disables SSLv2 by default in OpenSSL now. GnuTLS has never supported SSLv2. I surveyed a mailing-list of mail operators and got back several sets of statistics all showing a little SSLv3 and no SSLv2 whatsoever. As usual, those who really need SSLv2 compatibility can get it back, using the openssl_options option in Exim.
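The way an option string layers on top of a default set can be sketched in Python. The bit values are placeholders rather than OpenSSL's real constants, and the parser only models the "+name" and "-name" forms mentioned above; it is not Exim's implementation.

```python
import enum


class Opt(enum.IntFlag):
    # Bit values are placeholders for this sketch, not OpenSSL's real constants.
    NO_SSLV2 = 1
    NO_SSLV3 = 2
    DONT_INSERT_EMPTY_FRAGMENTS = 4


OLD_DEFAULT = Opt.DONT_INSERT_EMPTY_FRAGMENTS  # the grandfathered default
NEW_DEFAULT = Opt.NO_SSLV2                     # SSLv2 now off by default


def apply_options(default, spec):
    """Apply a '+name -name' style option string on top of a default flag set."""
    flags = default
    for word in spec.split():
        sign, name = word[0], word[1:].upper()
        if sign not in "+-" or name not in Opt.__members__:
            raise ValueError(f"bad option: {word!r}")
        if sign == "+":
            flags |= Opt[name]
        else:
            flags &= ~Opt[name]
    return flags
```

An admin with broken legacy clients adds one flag back; one who truly needs SSLv2 removes one. Everyone else gets the safer default without touching anything.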

The current hotness for how servers should start is something called “Socket Activation”. It's an old concept: one simple server listens on a socket, and when a connection is received, instead of passing off the socket for the received connection to the handler program, the listening socket is passed and the program, once started, can handle multiple connections. In the venerable inetd(8) software, this is the “stream wait” mode of operation. It was rarely used, but makes sense for software that is a little too heavyweight to want to start up for each new connection but which you don't necessarily want to start at system boot, or hang around when not needed at all.

I added the -bw option to Exim's command-line, which implements wait-mode. Do I think folks will use inetd wait-mode much with Exim? Not at all. What's far more significant is that this re-worked some of the start-up and daemonisation logic to handle the received listening socket and that work is all done. So if someone else wants to support some newer variant, such as systemd, upstart or launchd, then it should be almost trivial to do and if the system is not portable to multiple Unix variants then it can be kept safely in OS/os.c-variant.
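For readers who have never met wait-mode, a minimal Python sketch of the pattern follows. It is not Exim's code: the function names and the greeting text are made up, and it assumes the launcher hands over the listening socket as file descriptor 0, as inetd does.

```python
import socket


def adopt_listener(fd=0):
    """Wrap an inherited, already-listening descriptor (inetd passes it as fd 0)."""
    return socket.socket(fileno=fd)


def serve(listener, max_connections):
    """Accept several connections on the one inherited listening socket."""
    served = 0
    while served < max_connections:
        conn, _peer = listener.accept()
        with conn:
            conn.sendall(b"220 sketch ESMTP\r\n")
        served += 1
    return served
```

The difference from the usual inetd mode is that the program calls accept() itself, so one start-up cost is spread over many connections.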

Jeremy re-worked the arithmetic type system to be 64-bit on platforms which support it. As expected, it is causing some minor portability issues, but he did good work in abstracting it out so that if any issues come up for a given platform then the fix has so far been trivial.

Around this point, I wanted to get out a new release. But I didn't want the OpenSSL and GnuTLS feature sets to be so divergent. Also, building GnuTLS was causing deprecation warnings. Pushing out a new release which uses code already deprecated is just asking for problems in the future, but changing to new APIs can also cause backporting issues. Ultimately, we rely upon the library vendor to have the good judgement to decide when to start causing compiler warnings, and we provide multiple backends so that if there's a problem with one, the other may well still work. Since this release was already not-fully-compatible, I decided to update the GnuTLS support.

I opted to start with a clean file and copy code where needed but often rewrite it. Some more error checks, more cases handled, and a clean slate. We're also now in a situation where we have folks asking to be able to use TLS in an SMTP callout, which happens in ACL logic while still receiving a connection, which means that the previous use of globals shared for both client and server modes was problematic. I abstracted all this state out to be in a client or server structure instance, paving the way for a future release to implement TLS callouts. GnuTLS is now more ready for this than OpenSSL, despite OpenSSL having better APIs taking user void *callback_data, so that I don't have to still be using globals to identify correct context.

This turned out to be opening a can of worms. The portability notes in the GnuTLS documentation are inconsistent in noting when a feature is new, which led to false confidence on my part. The feature in GnuTLS 2.12.x which offers to tell you how many bits are ‘normal’ for Diffie-Hellman data follows ECRYPT guidelines, which led to failures with Thunderbird, spotted by Janne Snabb. We both picked at things but he did the better work and spotted that it was caused by a hardcoded limit in NSS, the TLS library used by Mozilla products, on the number of bits it will accept in DH negotiation. The NSS limit is lower than the limit which GnuTLS recommends as normal.

So I added a new option ‘tls_dh_max_bits’, used in GnuTLS mode to clamp the number suggested by GnuTLS. It is also used in OpenSSL mode to ignore a tls_dhparams file if the loaded struct provides more bits than this, on the theory that it's better to drop negotiation of a particular subset of ciphersuites than to cause client failures to negotiate, popping up uninformative error messages to end-users.
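The two uses of the ceiling differ slightly, which a tiny Python sketch makes plain. The function names are hypothetical and any numbers fed to them are the caller's; nothing here is Exim's actual code.

```python
def clamp_dh_bits(suggested_bits, max_bits):
    """GnuTLS mode: never ask for more DH bits than the configured ceiling."""
    return min(suggested_bits, max_bits)


def usable_dh_params(loaded_bits, max_bits):
    """OpenSSL mode: ignore loaded parameters that exceed the ceiling."""
    return loaded_bits <= max_bits
```

In the first case the request is quietly reduced; in the second the oversized parameters are dropped altogether, which removes the DH ciphersuites from negotiation rather than breaking clients.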

Then after RC5 I tried lowering the clamp a bit, because GnuTLS generates a prime with roughly the number of bits requested, but often goes over. After deciding that there really was no API to get the actual number of bits, I both lowered the requested number by a number which might work, and got rid of hope as a strategy. I hard-coded a bunch of DH primes from three RFCs into Exim and picked one 2048-bit prime from RFC 5114 as the default. I redefined tls_dhparams a little to be able to take a non-path and made it apply to GnuTLS too. Problem avoided. People who care about attacks against well-known primes can still set tls_dhparams to contain the path to a file with the PKCS#3 DH parameters of their choosing. GnuTLS users might argue that they're slightly worse off than they were, except that unless they generated a prime themselves but ignored the old value and used a prime larger than 1024, they're still way better off by default.

As part of the API re-work, I ended up removing three options previously used by the GnuTLS support and letting the ‘tls_require_ciphers’ string now be parsed directly by GnuTLS as a Priority String, instead of being picked apart by Exim and used for more manual tuning. Hopefully, very few people ever used these rather arcane options, but we have no way of knowing. For now, they can still be set but are silently ignored. The priority string feature of GnuTLS is more powerful, more consistent with other products using GnuTLS, more consistent with how tls_require_ciphers is used in OpenSSL, and lets us remove hard-coded lists of ciphers, MACs, signing algorithms and more from the Exim code base. We leave the crypto almost entirely to GnuTLS, letting the admin configure it via a string which is opaque to Exim.

This is the other major backwards compatibility gotcha from me.

This, and spotting the ineffective string, led me to think that the string should be checked at startup with an invalid string being a configuration error, much as happens for many other strings in Exim. Combined with ongoing problems of segfaults from folks building Exim with headers for one version of OpenSSL but a different version of the library, or binary linkage trying to pull in two different versions of the OpenSSL library, and this only being visible during TLS negotiation, this made a strong case. The argument against is that sending mail is pretty essential and just being able to send something, anything, cleartext may be more important than the TLS negotiation. Some folks might prefer to leave the TLS configuration broken with STARTTLS returning 4xx errors. But we almost certainly don't want segfaults in code currently receiving untrusted data from over the Internet, even if it's probably not exploitable and certainly not our fault.

So Exim startup is now even heavier, making the wait-mode an even more useful alternative for those currently using inetd to launch “exim -bs”. We now fork, drop privs if root, and call into the TLS library support, initialising things and cleaning up, along the way checking tls_require_ciphers if it has been set. If that fails, Exim panics and the parent refuses to continue into start-up. So segfaults should now be much more readily apparent and should cause broken updates to not make it all the way into production before being spotted.
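The shape of that check can be sketched in Python on a Unix system. This is an illustration of the fork-and-verify pattern only, using Python's OpenSSL-backed ssl module in place of Exim's TLS layer; the function name is hypothetical and the privilege drop is left out.

```python
import os
import ssl


def tls_config_ok(cipher_string):
    """Check a cipher string in a child process so a crash cannot take us down."""
    pid = os.fork()
    if pid == 0:
        # Child: initialise the TLS library and try the configured string.
        # (Exim also drops root privileges here; omitted from this sketch.)
        try:
            ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
            ctx.set_ciphers(cipher_string)
        except Exception:
            os._exit(1)
        os._exit(0)
    # Parent: only a clean exit counts; death by a signal such as
    # SIGSEGV is treated as a failure too.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Because the parent only inspects the child's exit status, a mismatched library that crashes during initialisation is caught at start-up instead of during a live TLS negotiation.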

End result: GnuTLS supports TLS SNI, same as OpenSSL; Exim doesn't inherently limit the negotiated protocol details so much, in either case, and has one consistent way to pass through the opaque-to-Exim tuning string. We support TLSv1.2 with both, with SNI support in both. I failed to remove Exim from TLS policy decisions, but did stop Exim arbitrarily limiting the available ciphersuites (which has caused a little pain in the test suite).

There's one other non-backwards compatible change, which one of the other devs put in: LDAP lookup results now differ for multi-value results and results with embedded commas; that one has caused a user problem. It too is a security improvement, though.

There are some other changes which made it in, but this post focuses on things from my perspective. My blog posts recently have been thin, so the length of this should make up for it.

-The Grumpy Troll
Categories: Kerberos TLS exim GnuTLS security OpenSSL BEAST NSS SASL rfc crypto SNI programming