The Grumpy Troll

Ramblings of a grumpy troll.

IDN, Python, Perl and my woes

With the help of a friend who lives in Japan, I have registered an IDN domain, for working on improving the non-ascii support of some software I use; in particular, Exim.

The domain is “グランピートロル.jp”, “grumpy troll .jp”.

To do much with this, I need the ASCII form of the name, the one which actually goes into DNS. Yet in testing basic Perl and Python to get that form, I was seeing values which I knew were wrong. The desired end result is "xn--qck5b9a5eml3bze.jp". Encoding is done on a per-label basis and the "xn--" prefix is added for DNS, so the test is "can I convert グランピートロル to qck5b9a5eml3bze?"

To get the actual value needed, I ended up running an IDN-capable dig(1), capturing the traffic with tcpdump and shoving it to my laptop to use Wireshark with a filter of [dns.qry.name contains "jp"]. That this was the first method which worked is problematic.

The basic problem was obvious once I spotted it: the interpreter did not know that the input was UTF-8, so instead of decoding the already-encoded string into eight characters, it treated every byte as a character in its own right.

My first attempt was with ipython; the second was with Perl. Neither worked. It turns out my Python problem was ipython, not Python itself.

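The simple one is Perl. Without the utf8 pragma, the string literal in the source is treated as raw bytes; with it, the literal is decoded as UTF-8 first: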

% perl -MNet::IDN::Punycode -le '$d="グランピートロル"; print Net::IDN::Punycode::encode_punycode($d)'
cacaaaaaa7b0h4j6ay6a0c9i30accccccc
% perl -Mutf8 -MNet::IDN::Punycode -le '$d="グランピートロル"; print Net::IDN::Punycode::encode_punycode($d)'
qck5b9a5eml3bze

Now for Python, this is what things look like when they go well:

>>> import codecs
>>> puny = codecs.lookup('punycode')
>>> label = u'グランピートロル'
>>> puny.encode(label)
('qck5b9a5eml3bze', 8)
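
For comparison, here is a minimal Python 3 sketch of the per-label rule: split the domain on dots, Punycode-encode only the labels which contain non-ASCII characters, and prepend "xn--" to those. The name to_dns_form is my own invention for illustration. The sketch also skips the Nameprep normalisation which real IDNA processing applies first, so treat it as a demonstration of the rule rather than a substitute for a proper IDNA library.

```python
ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded in DNS


def to_dns_form(domain):
    """Hypothetical helper: Punycode each non-ASCII label, leave the rest alone.

    Assumption: no Nameprep/normalisation is done here, unlike full IDNA.
    """
    encoded = []
    for label in domain.split("."):
        if all(ord(ch) < 128 for ch in label):
            # Plain ASCII labels such as "jp" go into DNS unchanged.
            encoded.append(label)
        else:
            # str.encode("punycode") returns bytes which are always ASCII.
            encoded.append(ACE_PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(encoded)


print(to_dns_form("グランピートロル.jp"))  # xn--qck5b9a5eml3bze.jp
```

Only the first label changes; "jp" is already ASCII and passes straight through.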

But it turns out that just creating the string from input was being broken by ipython. When the input is decoded correctly, we get: u'\u30b0\u30e9\u30f3\u30d4\u30fc\u30c8\u30ed\u30eb'

With ipython:

In [1]: u'グランピートロル'
Out[1]: u'\xe3\x82\xb0\xe3\x83\xa9\xe3\x83\xb3\xe3\x83\x94\xe3\x83\xbc\xe3\x83\x88\xe3\x83\xad\xe3\x83\xab'

That's what I get for wanting tab-completion in my Python REPL.
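
The failure is easy to reproduce on purpose. Below is a small Python 3 sketch, assuming the breakage amounts to reading the UTF-8 bytes one byte per character, which is what decoding as Latin-1 does; the 24 \x escapes in the ipython output above are consistent with that, but it remains my assumption about the mechanism. The variable names are just for illustration.

```python
label = "グランピートロル"
raw = label.encode("utf-8")      # the bytes a UTF-8 terminal hands over

good = raw.decode("utf-8")       # decoded properly: one character per kana
bad = raw.decode("latin-1")      # mis-decoded: one character per byte

print(len(good), len(bad))       # 8 24
print(ascii(bad[:3]))            # '\xe3\x82\xb0', the three bytes of the first kana

# Punycode works on characters, so the two strings encode differently.
print(good.encode("punycode"))   # b'qck5b9a5eml3bze'
print(bad.encode("punycode") == good.encode("punycode"))  # False

# The damage is reversible as long as no bytes were lost along the way.
repaired = bad.encode("latin-1").decode("utf-8")
print(repaired == label)         # True
```

The point is that Punycode itself never failed: it faithfully encoded 24 wrong characters instead of 8 right ones.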

End result: I wrote a tool to wrap all this for me in future.

% puny グランピートロル.jp
Input: グランピートロル.jp
Unicode: グランピートロル.jp
Punycode: xn--qck5b9a5eml3bze.jp
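
A wrapper along these lines does not need much. The following Python 3 sketch is not the tool itself, just one way to get the same three lines of output, assuming Python's built-in "idna" codec is acceptable; that codec implements the older IDNA 2003 rules (RFC 3490). The function names describe and main are hypothetical.

```python
import sys


def describe(domain):
    """Return (unicode_form, ascii_form) for a domain given in either form."""
    # The codec applies Nameprep, Punycode and the "xn--" prefix per label.
    ascii_form = domain.encode("idna").decode("ascii")
    # Decoding the ASCII form recovers the Unicode form, whichever was passed in.
    unicode_form = ascii_form.encode("ascii").decode("idna")
    return unicode_form, ascii_form


def main(argv):
    for arg in argv[1:]:
        unicode_form, ascii_form = describe(arg)
        print("Input:", arg)
        print("Unicode:", unicode_form)
        print("Punycode:", ascii_form)


if __name__ == "__main__":
    main(sys.argv)
```

Python 3 decodes command-line arguments using the locale, so this still depends on the terminal and locale agreeing on UTF-8, which is the same trap as before in a different place.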