The Grumpy Troll

Ramblings of a grumpy troll.

How long is a piece of string?

When I was a teenager and I asked a teacher how long they expected an essay to be, I'd get asked in response, “How long is a piece of string?” — which I found infuriating as I was trying to establish expectations to, uhm, determine how little work I needed to do. cough

Fast-forward to today and I'm banging my head against the way that in modern computing, where “string” has a specific meaning, there are three different answers to the question. The Perl scripting language thinks it's sophisticated for knowing about two of them. (Anthropomorphise much?)

They are:

  1. The byte length
  2. The character length
  3. The display length

In computing, a “string” is a holder for a piece of text, to be manipulated as a unit. So we might have:

my $short_one = "fred";
my $long_one = Storage->Retrieve("War and Peace");

This defines two strings, while using a third. The first might, in this case, consistently be said to have a length of 4 units, while the second has a length of rather more units. Strings are normally shown between quotes, although the permissible quote characters depend upon the programming language. In some, double-quotes are needed for a string and single-quotes for a single character, such as 'x'.

A character is a letter, digit, punctuation mark, etc.

Characters exist in a character set; the character set defines the number used by a computer to represent a given character. For instance, most modern systems start from ASCII, so the letter “f” has value 102, or in hexadecimal (base 16, often abbreviated to just “hex”) 0x66. The number is called the “codepoint”.
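To make the codepoint idea concrete, here is a small sketch in Python (not the Perl used above, simply because it is easy to run); `ord` and `chr` are Python built-ins, and the variable name is just for illustration.

```python
# ord() maps a character to its codepoint; chr() goes the other way.
codepoint = ord("f")

print(codepoint)       # 102
print(hex(codepoint))  # 0x66
print(chr(0x66))       # f
```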

If we allow just 256 possible characters, then we know that each character takes one octet (8-bit value), which in modern computers is the same thing as a byte. So, with at most 256 characters, the number of characters is the same as the number of bytes. For people using the romanised alphabets, this mostly worked for quite some time.

The venerable ASCII (American Standard Code for Information Interchange) displaced the chaotic selections of character sets which preceded it and defines 128 characters. This 7-bit character set would, on more modern systems, be encoded in bytes of 8 bits. This left another 128 characters which could be used to handle the characters not used in English-speaking North America. For instance, the local currency symbol, $, got a character of its own in ASCII, but £ and ¥ did not.

(Aside from this aside: mainframes with 36-bit words could pack five 7-bit ASCII characters into one word, with a bit to spare.)

Various character sets then developed which mostly built upon ASCII. Latin 1 (ISO-8859-1) was, and still is, used in Western Europe for providing basic combinations of characters with accents as used in languages of those regions, as well as £ (POUND SIGN) for UK currency, and so on. Later, Latin 9 (ISO-8859-15) attempted to displace Latin 1, since it provided a few replacements, including the needed € (EURO SIGN) for European currency, replacing the codepoint previously used for ¤ (CURRENCY SIGN). The Latin 1/9 character sets used some codepoints for control characters (not further covered here) and the home-computer software vendors often introduced their own proprietary character sets which replaced those with characters they deemed more useful. Perhaps the most widespread in the West is CP1252 from Microsoft Windows, their extension used in English-speaking regions, which provided characters for “some quoting” and † other ‡ mark-up.
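The "same number, different character" problem is easy to see by decoding the same octet under each character set. This is only a sketch using the codecs that ship with Python; the variable names are mine.

```python
# One octet, three character sets, three different meanings.
byte_a4 = b"\xa4"
latin1 = byte_a4.decode("iso-8859-1")    # CURRENCY SIGN
latin9 = byte_a4.decode("iso-8859-15")   # EURO SIGN took over this slot

# 0x86 is a control character in Latin 1 but a DAGGER in CP1252.
dagger = b"\x86".decode("cp1252")
control = b"\x86".decode("iso-8859-1")

print(latin1, latin9, dagger, repr(control))  # ¤ € † '\x86'
```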

Meanwhile those handling languages with more than 256 characters were having to jump through hoops to shuffle the meanings of numbers back and forth, to encode their characters in multiple pages of values. Those trying to handle text from multiple character sets at once were in even more difficulty.

One attempt to deal with this has been Unicode. It set out to encode all of the world's characters, which in itself has been seen as a form of cultural imperialism. Well, we try to make the best we can of the world and the Unicode developers have sincerely tried to help, since if we can exchange documents and work on them with common software then we can at least not have computers actively get in the way of communication.

Unicode initially tried to assign codepoints out of numbers up to 65535, the maximum number that can be stored as a whole number in two octets (16 bits). After a while, they went “oops!” and bumped this up to four octets (32 bits), giving 4,294,967,296 different possible values (0 through 4,294,967,295).

If we represent those characters by just encoding the codepoints as numbers in constant width storage, then our “fred” example jumps from a byte-length of 4 to either 8 or 16, depending on how many Unicode characters we want to handle. Further, the resulting bytes aren't compatible with ASCII: the codepoint numbers match, but every character is now padded out with zero octets that older software chokes on. Things get ugly. Fortunately, Ken Thompson and Rob Pike, then of Bell Labs, are ingenious engineers and one evening developed UTF-8, which is now predominant.
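For the curious, the size jump can be checked with Python's built-in codecs. A sketch only: I picked the big-endian variants without a byte-order mark so that the counts are exactly 2 and 4 octets per character.

```python
text = "fred"

ascii_bytes = text.encode("ascii")
utf16 = text.encode("utf-16-be")  # constant 2 octets per character here
utf32 = text.encode("utf-32-be")  # constant 4 octets per character
utf8 = text.encode("utf-8")

print(len(ascii_bytes), len(utf16), len(utf32), len(utf8))  # 4 8 16 4
print(utf16)                # b'\x00f\x00r\x00e\x00d'
print(utf8 == ascii_bytes)  # True: UTF-8 leaves ASCII text untouched
```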

Unfortunately, this means that codepoints are now encoded in a varying number of octets/bytes. The encoding scheme preserves compatibility with ASCII, so favours English text at the expense of various Asian texts and rarer texts; ah, an even more politically insensitive solution. But we're engineers and not known for being sensitive, the solution works, so it has spread.

UTF-8 takes 1-3 octets for 16-bit Unicode, or up to 6 octets for the first 31 bits of 32-bit Unicode. If that last bit ever gets used, then UTF-8 would have to use 7 octets, to be able to encode a 36-bit number (codepoint). (Going past that would twist a design assumption about the first octet, but appropriate design could keep us from being limited to 42 bits and instead let us truly expand, to perhaps handle every language of every sentient race in the Universe. Placing the English characters so favourably, of all the languages in the universe, might raise a few tentacular pseudo-eyebrows though.)
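Here is a rough sketch of the 1-to-6 octet layout just described, written by hand rather than lifted from any library; `utf8_encode` is my own hypothetical helper. Note that the UTF-8 in use today (RFC 3629) stops at four octets because Unicode is now capped at 0x10FFFF, and this sketch does not bother rejecting surrogates or anything else a real encoder must refuse.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint using the original 1-to-6 octet UTF-8 layout."""
    if codepoint < 0:
        raise ValueError("codepoints are non-negative")
    if codepoint < 0x80:
        return bytes([codepoint])  # plain ASCII, one octet

    # (exclusive upper bound, total octets, marker bits for the lead octet)
    for limit, count, marker in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if codepoint < limit:
            break
    else:
        raise ValueError("needs more than 31 bits")

    # Peel off 6 bits at a time for the continuation octets (10xxxxxx).
    tail = []
    for _ in range(count - 1):
        tail.append(0x80 | (codepoint & 0x3F))
        codepoint >>= 6
    # Whatever bits remain go into the lead octet, after the marker.
    return bytes([marker | codepoint] + tail[::-1])


print(utf8_encode(0x66))         # b'f'
print(utf8_encode(0x2020))       # b'\xe2\x80\xa0'
print(len(utf8_encode(0x7FFFFFFF)))  # 6
```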

So the string “fred†” can not be encoded in ASCII, but could be encoded in CP1252 as 5 characters, taking five bytes of memory. In Unicode's UTF-16 encoding, 10 bytes. In UTF-8, 7 bytes: four bytes for the first four characters and then three bytes for the fifth character, since † sits at codepoint 0x2020 and everything from 0x800 up to 0xFFFF needs three octets.

At this point, we have the simple case in Unicode encoding, where we have two different string lengths:

  1. The character length: 5
  2. The byte length: 7

For scripting languages like Perl, switching the encoding of the scripting engine to UTF-8 is sufficient to make operators such as “length” return 5 instead of 7.
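The same counts can be reproduced in Python, where `len` on a string counts characters and `len` on the encoded bytes counts octets. A sketch using only built-in codecs; for the record, these codecs report three octets for the dagger in UTF-8, for a total of seven.

```python
text = "fred\u2020"  # "fred†"

char_length = len(text)
cp1252_length = len(text.encode("cp1252"))
utf16_length = len(text.encode("utf-16-le"))
utf8_length = len(text.encode("utf-8"))

print(char_length, cp1252_length, utf16_length, utf8_length)  # 5 5 10 7

# ASCII has no dagger at all, so the encode simply fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```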

Alas, life is not so simple.

In order to use these characters, at some point they're displayed to a human (or other entity walking amongst us, only visible in their true form with special glasses).

Those who deal with pretty fonts are used to the idea that to know how much horizontal space a string uses, you need to know which font is in use, at what size, which characters are involved (“i” vs “w”) and have probably learnt that modern fonts have special rules to handle special combinations (the diphthong “ae” often being rendered as something closer to æ even if given as two characters). In short, they learn that the software libraries that deal with fonts have ways to ask “Hey, how much space will this bit of text take up?” and to just trust the results. This includes issues such as where to place line-breaks.

Those who deal in code and command-lines, as I normally do, live in a world of “cells”, fixed grids of positions where a character can be shown. Often an 80x24 grid. You can then use box-drawing characters to make pretty grids, and Unicode provides code-points for those. So, given a naïve understanding, where one character occupies one cell, we can have tools produce output like this:

┃ C ┃ Name                 ┃ CodePoint ┃ Block                   ┃ Info ┃
┃ 2 │ DIGIT TWO            │   [0x32]  │ (Basic Latin)           │      ┃
┃ f │ LATIN SMALL LETTER F │   [0x66]  │ (Basic Latin)           │      ┃
┃ ☺ │ WHITE SMILING FACE   │ [0x263a]  │ (Miscellaneous Symbols) │      ┃
┃ † │ DAGGER               │ [0x2020]  │ (General Punctuation)   │      ┃
┃ ♡ │ WHITE HEART SUIT     │ [0x2661]  │ (Miscellaneous Symbols) │      ┃

How well that renders for you depends upon the quality of the font used for fixed-width characters, whether the box-drawing characters are permitted to join to each other, width of space given by the browser/blog to this text, alignment of the moon with the sun, and more. In terminal emulators, it works very well indeed in xterm and slightly less well in Apple's Terminal app. It's 73 cells wide and 9 cells tall, with all box-lines being straight and joining.

Alas, Unicode is not so simple. Some characters do not occupy space on their own, others require two cells to be viewed. Why do some characters not occupy space? The most common cause is the combining characters, used to place accents on top of other characters. That cultural imperialism thing? Well, while we have a Unicode codepoint for “á”, “LATIN SMALL LETTER A WITH ACUTE” [0xE1], we don't encode all these variants for all languages. In fact, we can encode that same character as “LATIN SMALL LETTER A” [0x61] followed by “COMBINING ACUTE ACCENT” [0x0301]. Suddenly, two Unicode characters only occupy one cell. ☹
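The two spellings of “á” can be poked at with Python's standard `unicodedata` module. This is just a sketch to show that the character count differs while the rendered result should not.

```python
import unicodedata

precomposed = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False: different codepoints

# NFC normalisation folds the pair back into the single codepoint.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)          # True

# A non-zero combining class marks a character that attaches to its neighbour.
accent_class = unicodedata.combining("\u0301")
print(accent_class)                       # 230
```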

So knowing the display width of a character involves knowing both properties of the character and its surrounding context, even in cell-based environments where this has not been a concern.
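That gives the third length. Below is a deliberately crude sketch of a cell-width calculation, using only `unicodedata`: marks count as zero cells, East Asian wide and fullwidth characters as two, everything else as one. `display_width` is my own hypothetical helper; real terminals rely on `wcwidth()`-style tables and also have to deal with control characters, ambiguous-width characters and emoji sequences, none of which this handles.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal cells needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # non-spacing or enclosing mark: sits on its neighbour
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide or fullwidth: takes two cells
        else:
            width += 1
    return width


print(display_width("fred\u2020"))          # 5
print(display_width("a\u0301"))             # 1
print(display_width("\u65e5\u672c\u8a9e"))  # 6
```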

Perhaps unsurprisingly, text manipulation tools are not always ready for this.

The tool I wrote, some time ago, to produce that grid of decoded information about each character? It uses a module called Text::Table to produce the basic layout and some light munging is done afterwards. Text::Table uses length to get the length of a string in a single cell of the table, assuming that the number of characters is the same as the number of positions required to display it. The tool shows one character at a time, but to deal with combining characters, which need something to sit on, I combined each of them with a SPACE.

Providing a Text::Table::length() function in my code runs into a Perl limitation of its parsing rules and how it figures out which function you really meant to call, as suddenly warnings spew and … Perl still uses the CORE::length() function. sigh By contrast, Python makes it much easier to deceive modules by messing with their internals (at your own risk). I wrote this tool before I learnt Python.

So, the workaround: track which lines I've added to the table contain combining chars with no inherent width, then when emitting table lines munge those lines to make them align once more.
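The real tool is Perl and its munging is not shown here, so the following is only a Python sketch of the same idea, with hypothetical helper names. It pads each cell rather than patching the finished line: a layout that pads by character count comes up one cell short for every zero-width mark, so the fix adds that many spaces back.

```python
import unicodedata


def count_zero_width(text: str) -> int:
    """Count marks that occupy no cell of their own."""
    return sum(1 for ch in text if unicodedata.category(ch) in ("Mn", "Me"))


def naive_row(cells, widths):
    # Pads by character count, as a length()-based table layout would.
    return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"


def fixed_row(cells, widths):
    # Same padding, plus one extra space per zero-width mark in the cell.
    padded = [c.ljust(w) + " " * count_zero_width(c)
              for c, w in zip(cells, widths)]
    return "| " + " | ".join(padded) + " |"


widths = [2, 22]
plain = ["f", "LATIN SMALL LETTER F"]
accent = [" \u0301", "COMBINING ACUTE ACCENT"]  # mark combined with a SPACE

for row in (plain, accent):
    print(fixed_row(row, widths))
```

Printed in a terminal that handles combining characters, the two rows from `fixed_row` should line up, while the same rows from `naive_row` leave the second one a cell short.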

Categories: unicode perl