Hacking last night on a Python (2.6) conversion of a Perl tool I wrote, I ran into a stumbling block. I couldn't figure it out then. I tried again tonight and, after a couple of hours, finally tracked it down.
The tool verifies x509 certs from a TLS connection. While working on that, I quickly discovered that Python's ssl module is horribly insecure until Python 3.2, not providing cert verification. pyOpenSSL's OpenSSL package doesn't support using a ca_path directory-of-certs on MacOS. So, I ended up with M2Crypto.
The tool needs to be able to upgrade a connection to TLS on an existing socket, so I abstracted things out and carefully created the SSL context, modified it, created an SSL connection on an existing socket, linked in the BIO stuff using the accessor routines and kicked off the handshake.
For this initial work, I'm speaking HTTPS, so I send the HEAD request, and junk the results. For debugging, I choose to see the results just before junking. Strangely, I get an error page. Checking the server side, Apache logs that the request was just "H". “Invalid method in request H”.
I pick through the crypto code. I confirm things work without crypto in the way. I insert proxy IO layers which dump data to stderr. I disable verification. I experiment with blocking modes, write timeouts, etc. I check results from all the SSL calls.
I write a simple short script, to confirm the problem and … things work. I want to howl in frustration.
Finally, I log the length of the written data. 101 octets with plain HTTP, 202 octets with HTTPS. That's strange: why does the write, at the layer visible to the caller of the write, consider the data to be different?
Print the repr() of the request, instead of just printing the request. See that the working confirmation script has a string; my broken tool has a unicode string.
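A minimal sketch of that check, for anyone wanting to reproduce it. The JSON snippet and hostname are made up for illustration. Under Python 2, as described above, the value read from JSON shows up as a unicode string while the literal is a plain str; under Python 3 the same split appears as str versus bytes.

```python
import json

# Hypothetical config content; the real file and hostname differed.
config = json.loads('{"hostname": "www.example.org"}')
from_config = config["hostname"]
literal = b"www.example.org"

# Plain print() hides the difference; repr() and the type name expose it.
for value in (from_config, literal):
    print("%r is of type %s" % (value, type(value).__name__))
```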
I have a config file which is JSON. The hostname is read from that. The HTTP request header is constructed using the hostname for the Host: header. When written over a plain socket, it's sent as a normal string and I haven't seen a conversion error yet. When sent over SSL, the types are detected and it's sent as UTF-16 data, so that every other octet is, from the perspective of the web-server, a NUL character.
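The doubling can be reproduced in isolation. This is only a sketch: the request text is invented, and little-endian UTF-16 without a byte-order mark is an assumption standing in for whatever in-memory layout actually went down the wire.

```python
# Hypothetical request; the real tool's header and hostname differed.
request = u"HEAD / HTTP/1.0\r\nHost: www.example.org\r\n\r\n"

as_ascii = request.encode("ascii")
as_utf16 = request.encode("utf-16-le")  # assumed byte order, no BOM

print("%d octets as ASCII, %d octets as UTF-16" % (len(as_ascii), len(as_utf16)))
print(repr(as_utf16[:8]))

# A parser that treats NUL as the end of a token sees only the first letter.
first_token = as_utf16.split(b"\x00", 1)[0]
print(repr(first_token))
```

For ASCII-only text, every second octet of the UTF-16 form is NUL, which matches both the doubled length and a request method of just "H".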
Thus I *was* sending more than one character and my decision not to double-check that, last night, was infelicitous.
The automatic codec conversions bit me simply because the type behaviour of write()/send() on a file-descriptor changed when I put SSL in as the target. This is the opposite problem to that of Tying oneself in Pythonic knots, where data about the character types was available and not being inferred: here the metadata was available and being silently used on write to change behaviour.
Clearly, I was in part wrong in my belief of what The Right Thing to do is. But there needs to be a middle ground. Convert to the right type on reading, declare the codec to be used when writing to an fd and auto-convert, and propagate that up through SSL layers? More thought is needed on my part.
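One possible shape for such a middle ground, as a sketch only: encode explicitly at the point of writing, so that the plain and SSL paths both receive the same octets. The helper name to_wire, the ASCII default and the config content are all hypothetical, not anything the tool actually uses.

```python
import json

def to_wire(data, encoding="ascii"):
    """Return bytes ready for write()/send(); hypothetical helper.

    Byte strings pass through untouched; text is encoded explicitly,
    so a character outside the codec raises instead of being mangled.
    """
    if isinstance(data, bytes):
        return data
    return data.encode(encoding)

# Hypothetical config content standing in for the real JSON file.
config = json.loads('{"hostname": "www.example.org"}')
request = "HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n" % config["hostname"]

payload = to_wire(request)
print("%d characters in, %d octets out" % (len(request), len(payload)))
```

With this in place the payload is a byte string before it reaches either kind of socket, and a hostname the codec cannot represent fails with a UnicodeEncodeError rather than silently changing what goes over the wire.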
-A Troll Who Needs To Stop Grinding His Teeth