The Grumpy Troll

Ramblings of a grumpy troll.

Tying oneself in Pythonic knots

A couple of weeks back, I knocked together a short script, “puny”, which takes a domain-name containing Punycode or Unicode and shows me the conversion in either direction. I opted for Python 3 because it was a simple script and would let me practice using the newer variant of the language.
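
Just for illustration, here's a minimal sketch of the kind of round trip involved, using Python 3's built-in 'idna' codec (which implements IDNA 2003); this is not the script itself, just the flavour of it:

label = 'テスト'
puny = label.encode('idna')      # b'xn--zckzah'
print(puny.decode('ascii'))      # xn--zckzah
print(puny.decode('idna'))       # テスト, back again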

Today, I decided I wanted to add automatic translation to the tool, using Google Translate's API. The problems which resulted were all Pythonic in origin.

You can grab the script from
http://people.spodhuis.org/phil.pennock/software/ together with the PGP signature. (nb: the revision number is the revision of the repository, not of the script, which has only 3 revisions so far).

So Python's developers have decided that the way to deal with the ambiguity between characters and raw data is to have stronger type enforcement. This is good and forces developers to do the right thing, but it could do with a little more loving care in having the interfaces do the right thing for the developer, instead of pushing the developer to work around the limitations.

When you fetch data via a URL, you now get back a response as "bytes", which json.loads() barfs at, complaining:
TypeError: can't use a string pattern on a bytes-like object
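
For concreteness, a tiny invented example of that boundary (the JSON blob is made up, and much later Python 3 releases did eventually teach json.loads() to accept bytes directly):

import json

# Simulate what urlopen().read() hands back: raw UTF-8 bytes, not a str.
raw = '{"word": "テスト"}'.encode('utf-8')
# json.loads(raw) is what tripped the TypeError on the Python 3 of the day;
# decoding the bytes to str first keeps it happy:
data = json.loads(raw.decode('utf-8'))
print(data['word'])     # テスト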

So, after taking my unicode string, converting it to UTF-8 and then %-encoding it for the URL in the query, I can get the HTTP response object, look at the Content-Type response header, extract a charset from that, decode the bytes per that charset and finally feed the result into json.loads(). I can't help but feel that if a data source could offer "bytes, but I have a method to convert to a unicode string" and the urllib machinery looked at the headers for you, it would be trivial to have The Right Thing happen automatically. As it is, since this was a short script, I went for the simple approach of reading the entire response into memory instead of streaming transcoding, but it did work.

import codecs
import json
import re
import urllib.parse
import urllib.request

GOOGLE_TRANSLATE_API_URL = 'https://www.googleapis.com/language/translate/v2'
GOOGLE_TRANSLATE_API_KEY = '...'
HUMAN_TARGET_LANGUAGE = 'en'

utf8_codec = codecs.lookup('utf-8')

def Translate(text):
    """Use Google Translate to get local language."""
    # Encode the Unicode text to UTF-8 bytes, then %-escape it for the URL.
    urltext = urllib.parse.quote(utf8_codec.encode(text)[0])
    url = '{url}?key={key}&target={targetlang}&q={q}'.format(
        url=GOOGLE_TRANSLATE_API_URL,
        key=GOOGLE_TRANSLATE_API_KEY,
        targetlang=HUMAN_TARGET_LANGUAGE,
        q=urltext)
    response = urllib.request.urlopen(url)
    # Default to ASCII, but honour any charset= in the Content-Type header.
    decode_charset = 'ASCII'
    m = re.search(r'(?i)\bcharset=([^\s;]+)',
                  response.getheader('content-type'))
    if m:
        decode_charset = m.group(1)
    # Decode the response bytes per that charset before handing them to json.
    transblob = json.loads(
        codecs.lookup(decode_charset).decode(response.read())[0])
    if 'data' in transblob:
        if 'translations' in transblob['data']:
            for d in transblob['data']['translations']:
                if 'translatedText' in d:
                    yield d['translatedText']
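
As it happens, the response's headers object (an email.message.Message subclass) can pull the charset out of Content-Type for you, so the regex dance isn't strictly necessary. A minimal alternative sketch, where fetch_json is just a hypothetical helper name and the UTF-8 fallback is an assumption rather than anything the API promises:

import json
import urllib.request

def fetch_json(url):
    response = urllib.request.urlopen(url)
    # get_content_charset() parses the charset parameter out of Content-Type;
    # defaulting to UTF-8 when it's absent is an assumption, not a guarantee.
    charset = response.headers.get_content_charset() or 'utf-8'
    return json.loads(response.read().decode(charset))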


% puny xn--zckzah
Input: xn--zckzah
Unicode: テスト
Punycode: xn--zckzah
Translation: "テスト" -> "Test"

(And yes, this does mean there's an API key in the downloadable script; I don't have a better security solution for preventing abuse. Please don't rip my key for your own purposes; it's truly trivial to sign up for your own.)

-The Grumpy Troll
Categories: IDN charset unicode python google punycode