The Grumpy Troll

Ramblings of a grumpy troll.

Typing Weird Stuff

I speak English, for certain values of ‟English”. I speak varying amounts of other languages which use the Roman alphabet. For the most part, I can work in plain ASCII. I like to be able to use currency symbols too, whether working with £1 or €1. The former can be met with ISO-8859-1 (Latin1), the latter can be met by using ISO-8859-15 (Latin9). But that's not enough for me, because I'm picky enough to want to use accurate characters for many other purposes. Those ‟quotes” I used earlier? Those require CP-1252 or the like (a Microsoft extended character set).

There's a character set which has more characters than sometimes seems sane. It's Unicode, commonly encountered with the codepoints (the character numbers) encoded in a format called UTF-8, which is compatible with ASCII up until you use any of the fancy characters. UTF-8 is often called the character set, simply because it's convenient to do so. The problem is typing those characters on a typical US keyboard. That's what this post is about. I cover Unix, because that's what I mostly use.
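To make the split between a codepoint and its UTF-8 encoding a bit more concrete, here's a rough sketch that dumps the bytes of a string. The helper name utf8_bytes is my own invention, and it assumes the script itself is saved as UTF-8.

```shell
#!/bin/sh
# utf8_bytes is a made-up helper: print the bytes of $1 as hex pairs.
# Assumes this file is stored as UTF-8.
utf8_bytes() {
  bytes=$(printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ')
  # Trim the blanks that od and tr leave at either end.
  bytes=${bytes# }
  bytes=${bytes% }
  printf '%s\n' "$bytes"
}

utf8_bytes 'A'   # 41: plain ASCII stays one byte
utf8_bytes '€'   # e2 82 ac: codepoint U+20AC takes three bytes
```

The ASCII letter comes out unchanged, which is the compatibility mentioned above; the euro sign is where the multi-byte encoding kicks in.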

The normal way to enter characters not on your keyboard is with mnemonic sequences. Different applications have different ways to do this, under different names. If you only ever work through the graphical interface, then the graphical system is all you need to configure.

I'll cover the graphical system, a text editor and a command shell.

All of these use a common set of sequences for older characters, because RFC 1345 defines such a set. But they diverge for newer characters, so you may need to extend the supplied bindings.

X Windows (below GNOME, KDE)

On X Windows (the underlying system which provides graphics on any modern Unix system), the support has been there for decades. Older Unix-specific keyboards (such as the wonderful sun4m) had a Compose key. PC keyboards don't. These days, they do have some less-used keys in the same general area, though, including three added by Microsoft a while back. One of those can be repurposed.

Going back a bit in time, we'd use a tool called xmodmap(1) to do this. Today, that's not the sane way. Things break. There are new layers with all sorts of weird hooks, all designed to support people wanting to enter East Asian languages. Fair enough, but I don't need anything so heavyweight. So, what do we need to do?

First, find a place to put the commands. Running commands on every X Windows start-up is easy; doing so portably is not, especially if you don't want to learn how to interact with the system init scripts without breaking them. On modern systems, you may be able to just use ~/.xsessionrc, which will be sourced by a shell if readable. This is a Debianism dating from 2007-12-30, also supported by Ubuntu (as it's derived from Debian), and I feel the best way for other OSes to proceed is to hook into invoking this too. Otherwise, you might look at putting commands in ~/.gnomerc or ~/.kde/share/config/kdesktoprc.

In this, I put:
setxkbmap -option ctrl:nocaps,compose:menu
LC_CTYPE=en_US.UTF-8
GTK_IM_MODULE=xim
export LC_CTYPE GTK_IM_MODULE

The setxkbmap(1) command is how we modify the features used with our keyboard. For me, I loathe Caps Lock and prefer to have a Ctrl key to the left of the 'A', so that's what the "ctrl:nocaps" does. The only bit you need is the "compose:menu". This steals the Menu key, from the bottom right of the keyboard, and turns it into Compose.
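If you want to confirm the option actually took, something like the following sketch may help. It leans on `setxkbmap -query`, which prints the current settings including an "options:" line; the helper name has_xkb_option is made up for this example.

```shell
#!/bin/sh
# has_xkb_option is a made-up helper name. It reads text in the shape that
# `setxkbmap -query` prints on stdin and succeeds if option $1 is listed.
has_xkb_option() {
  sed -n 's/^options:[[:space:]]*//p' | tr ',' '\n' | grep -Fqx -- "$1"
}

# Only ask the X server when there is one to ask.
if [ -n "${DISPLAY:-}" ] && command -v setxkbmap >/dev/null 2>&1; then
  if setxkbmap -query | has_xkb_option compose:menu; then
    echo "Menu key is set up as Compose"
  else
    echo "compose:menu is not active"
  fi
fi
```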

The LC_CTYPE is a declaration that I work with US-variant English, with the UTF-8 character set. If you run locale charmap at a command-prompt, you should get back a UTF-8 response. If not, investigate and fix that.
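That check is easy enough to script. A minimal sketch, where is_utf8_charmap is a hypothetical helper that only looks at the string it's handed:

```shell
#!/bin/sh
# is_utf8_charmap is a hypothetical helper: succeed if $1 names UTF-8.
is_utf8_charmap() {
  case "$1" in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

if is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
  echo "charmap is UTF-8"
else
  echo "charmap is not UTF-8; look at LC_CTYPE, LC_ALL and LANG"
fi
```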

The GTK_IM_MODULE setting tells GTK that we're just using the normal X input methods and that nothing else should be hooking in. Why set this? Because GNOME by default uses hardcoded compose definitions copied from an old version of the XFree86 mappings, which are therefore both stale and inconsistent with those used by xterm, etc. So we're kicking GNOME out of the way.

So, where does the X Input Method get these compose bindings from? By default, X11/locale/${LC_CTYPE}/Compose under the normal X11 config base. So, for me, that might be:
  • /usr/share/X11/locale/en_US.UTF-8/Compose
  • /usr/local/lib/X11/locale/en_US.UTF-8/Compose
If you look in that file, you'll see a whole bunch of bindings such as:
<Multi_key> <l> <minus>  : "£" sterling # POUND SIGN
The name "Multi_key" is another name for "Compose" and is what is needed for this config file. (It's more complicated than that, but that doesn't matter).
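Since the base directory moves around between systems, here's a sketch of how you might hunt for the file. The function name find_compose_file is mine, and the two directories passed in are just the ones from my list above; yours may differ.

```shell
#!/bin/sh
# find_compose_file is a made-up helper. $1 is the locale name, the remaining
# arguments are candidate base directories. Prints the first readable
# Compose file it finds and fails if there is none.
find_compose_file() {
  locale_name=$1
  shift
  for base in "$@"; do
    candidate="$base/X11/locale/$locale_name/Compose"
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

find_compose_file en_US.UTF-8 /usr/share /usr/local/lib ||
  echo "no Compose file found in the usual places" >&2
```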

That's great, as far as it goes. But I want to be able to add my own! That we can do. We can create a file ~/.XCompose, which will be used instead of the default. Instead. But it can contain an "include" directive. Fantastic. But since the default bindings don't live in a portable location, portable configs need something more dynamic.

I found a way which appears to be undocumented. ‟Use the Source, Luke!” The parser for the file understands some %-escapes! In particular, %L is the default set of Compose bindings which I mentioned above. So, I can create ~/.XCompose which contains:

include "%L"
<Multi_key> <E> <u>         : "€" EuroSign
<Multi_key> <slash> <minus> : "†" U2020 # Dagger
<Multi_key> <slash> <equal> : "‡" U2021 # Double-dagger
<Multi_key> <colon> <parenleft>  : "☹" U2639
<Multi_key> <colon> <parenright> : "☺" U263A

And we're done! We can put whatever we want in there, and use the original bindings. Just be aware that the sequences after <Multi_key> can be more than two keystrokes long and the defaults contain some three-keystroke sequences.

So, why did I bind Eu to be the EuroSign, instead of sticking with C= or E= ?  This leads into ...

A Text Editor: vim

vim is a vi-style text-editor which is rich in features. One of those features is "digraph" support. A digraph is a two-character compose sequence. To get this, all we need is a "normal" vim. If you have a "lite" install, then digraphs might not be supported. Just run vim --version to see if the feature is +digraphs rather than -digraphs.
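If you'd rather not eyeball the feature list, a sketch along these lines should do the looking for you. The helper has_digraphs is a name I made up; it just scans `vim --version` style text for the feature flag.

```shell
#!/bin/sh
# has_digraphs is a made-up helper: reads `vim --version` style text on stdin
# and succeeds if the +digraphs feature flag appears in it.
has_digraphs() {
  tr -s ' \t' '\n' | grep -Fqx -- '+digraphs'
}

# Only run the real check when vim is installed.
if command -v vim >/dev/null 2>&1; then
  if vim --version | has_digraphs; then
    echo "digraphs: supported"
  else
    echo "digraphs: not supported"
  fi
fi
```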

If digraphs are in, then in insert-mode, you can just type Ctrl-K followed by the characters. A list of all supported digraphs can be obtained by typing :digraphs and you can add more using :digraph. Use :help digraphs for more information. Including the unfortunate news that the characters are entered in decimal, not hex. So, my ~/.vimrc file contains:
if has("digraphs")
  " Smiling and frowning faces, given as decimal codepoints
  digraph :) 9786 :( 9785
endif
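Since most references give codepoints in hex, you'll need to convert them before handing them to vim. A small sketch, with hex_to_dec being a helper name of my own:

```shell
#!/bin/sh
# hex_to_dec is a made-up helper: print the decimal value of a hex codepoint.
hex_to_dec() {
  # printf accepts a 0x prefix on numeric arguments.
  printf '%d\n' "0x$1"
}

hex_to_dec 263A   # 9786, the smiling face
hex_to_dec 2639   # 9785, the frowning face
hex_to_dec 20AC   # 8364, the euro sign
```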

And since vim uses Eu for € and I am often logged into a machine remotely from non-Unix systems, I got into the habit of using the vim sequence and configured X to match.

A Command Shell: zsh

Any zsh from the last several years also supports composed sequences, but the wide-character support has been improving markedly in the most recent releases, getting rid of rendering glitches. To turn this on, you can use:
autoload -U is-at-least insert-composed-char insert-unicode-char
is-at-least 4.3.0 && {
  zle -N insert-composed-char
  zle -N insert-unicode-char
  bindkey '^Xk' insert-composed-char
  bindkey '^XU' insert-unicode-char
}

In emacs-style bindings (such as I, ironically given the previous section, have gone back to using), Ctrl-K is delete-to-end-of-line. So I use Ctrl-X K instead. I then use Ctrl-X U which goes one step further: Ctrl-X U hex-sequence Ctrl-X U will insert the numbered character.

How do you add new characters? Uhm, well, that's slightly more awkward. I use the function below, invoked as:
add_composition_pair ':)' 263A ':(' 2639 'OK' 2713  # ☺ ☹ ✓

That's it.
-The Grumpy Troll

function add_composition_pair {
  emulate -L zsh
  if (( $# % 2 )); then
    print -u2 "Usage: $0 pair hexval [pair hexval ...]"
    return 1
  fi
  # Load the stock composition table if nothing has loaded it yet.
  if (( ${+zsh_accented_chars} == 0 )); then
    autoload -U define-composed-chars
    define-composed-chars
    unfunction define-composed-chars
  fi
  while (( $# )); do
    local pair="$1" val="$2"
    shift 2
    local -A existing
    existing=( ${=zsh_accented_chars[${pair[2]}]} )
    # Drop any existing binding for this pair so the new one wins.
    if [[ -n ${(k)existing[${pair[1]}]} ]]; then
      noglob unset existing[${pair[1]}]
      zsh_accented_chars[${pair[2]}]=" ${(kv)existing}"
    fi
    zsh_accented_chars[${pair[2]}]+=" ${pair[1]} $val"
  done
}

Categories: zsh compose x11 vim digraphs unicode unix