Sunday, May 18, 2008

Unicode and strlen()

One of the common troubles in the world of modern software is its continuing inability to deal correctly with Unicode. One common way this inability manifests itself is in counting the length of a string.

Case in point -- let's take Twitter. Earlier today, I submitted the following update (because I was reading Astrid Lindgren's Karlsson-on-the-Roof in Chinese translation that I found on the web):
我风华正茂:英俊、绝顶聪明、不胖不瘦! (красивый, умный, в меру упитанный мужчина в полном расцвете сил) :D
I was using the web interface to post it, and the counter at the top-right of the form dutifully told me that I had 53 more characters left, because Mozilla gets Unicode right. However, once I submitted the post, Twitter told me that oops, I went way over 140 characters and thus my post will be truncated when shown on the site.

Now, this disconnect happens because my post contained many multibyte characters -- in UTF-8, 3 bytes for each Chinese character and 2 bytes for each Cyrillic one:
print strlen('我') . "\n"; // output: 3
print strlen('я') . "\n"; // output: 2
A lot of software was written to deal with "unibyte" characters -- encodings where one byte represents one character, such as the venerable US-ASCII or ISO-8859-1. PHP is a case in point: when you use strlen() to calculate the length of a string, it gives you the length in bytes, not in characters.
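The same byte-versus-character split can be sketched in Python, using the very characters from the tweet above (a quick illustration, not part of the original post):

```python
# -*- coding: utf-8 -*-
# Sketch: the byte length of a character depends on the encoding,
# while the character count does not.
han = u'我'   # one Chinese character
cyr = u'я'   # one Cyrillic character

print(len(han.encode('utf-8')))   # 3 -- bytes in UTF-8
print(len(cyr.encode('utf-8')))   # 2 -- bytes in UTF-8
print(len(han))                   # 1 -- characters
print(len(cyr))                   # 1 -- characters
```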

While this is arguably a sane default behaviour (though if I wanted the byte size, I would have asked for the string's size, not its length), the trouble is that it chokes on multibyte characters when calculating string lengths. Worse, this practice often results in data mutilation -- for example, when auto-calculating the "short version" of a string before a "read more" link, or when truncating something to fit into visual space constraints.

Consider this:
print substr('你叫什么名字?', 0, 10) . " (read more)\n";
That just sliced the string mid-character, and the stray byte will usually show up as some version of an OS-specific "[?]" glyph. And, of course, it truncated the string to 3 characters (plus junk) instead of achieving the wanted result.

Different programming environments cope with this problem in different ways, but most of them require extra work. PHP deals with Unicode by providing the "mbstring" extension, which mirrors most string functions. For example, we can use mb_strlen() and mb_substr() to perform Unicode-aware string manipulation just as we would with regular strlen() and substr():
print strlen('你叫什么名字?') . "\n"; // output: 21
print mb_strlen('你叫什么名字?') . "\n"; // output: 7
This will also do what we actually want and won't chop things off mid-torso:
print mb_substr('你叫什么名字?', 0, 10) . " (read more)\n";
PHP even has an option to replace all string functions with their multibyte equivalents (mbstring.func_overload), but this is rife with danger, because there's a good chance that the software you use will actually want the byte length of a string rather than its character length, e.g. when figuring out the size of a binary blob. Read more about mbstring and PHP.
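To be fair, the byte count is sometimes exactly the right answer -- an HTTP Content-Length header, for instance, counts bytes on the wire, not characters. A quick Python sketch of the two different questions:

```python
# -*- coding: utf-8 -*-
# Sketch: bytes on the wire vs. characters on the screen.
body = u'我风华正茂'            # 5 characters...
wire = body.encode('utf-8')     # ...but 15 bytes in UTF-8

print(len(body))   # 5  -- what a character counter should report
print(len(wire))   # 15 -- what Content-Length should report
```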

Python, which is also internally all-ASCII all the time (until Python 3000 comes around, that is), deals with multibyte strings in similarly clunky ways:
mystr = '你叫什么名字?'
print mystr[:4] + ' (read more)' # outputs junk
In order to handle Unicode strings correctly, you first have to go from a string object to a unicode object by way of .decode('utf-8'):
myuni = mystr.decode('utf-8')
print myuni[:4] + ' (read more)' # yay!
Alternatively, you can prefix your string literals with u to go straight to a unicode object, bypassing the ASCII-centric string object:
myuni = u'你叫什么名字?'
print myuni[:4] + ' (read more)' # yay!
However, you'll still be doing a lot of .decode('utf-8') when you do things like reading data from a file. The linked talk is probably the most succinct and useful presentation on Python and Unicode I've found: you should read it.
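For file I/O in particular, the codecs module can do the decoding for you as you read, so you never hold undecoded bytes in the first place. A minimal sketch (the filename is made up):

```python
# -*- coding: utf-8 -*-
# Sketch: codecs.open() decodes on read, so f.read() hands back
# a unicode object and slicing counts characters, not bytes.
import codecs

# Write a small UTF-8 file first (hypothetical filename)...
with codecs.open('question.txt', 'w', encoding='utf-8') as f:
    f.write(u'你叫什么名字？')

# ...then read it back; no manual .decode('utf-8') needed.
with codecs.open('question.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(text[:4] + u' (read more)')   # slices 4 characters cleanly
```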


Yes, Unicode is a pain in the ass, and requires jumping through extra hoops whenever you deal with it. However, believe me when I tell you that if you embrace Unicode from the very first line of your application, you won't have to go back later and rewrite things, potentially subtly breaking them in the process (e.g. see Twitter). Retrofitting an existing application to support Unicode quite often involves lots and lots of eye-stabbing and pain.

Oh, and the first person who says "why doesn't everyone just use English?" will be promptly fed to the most rabid apparatchiks of the Office Québécois de la Langue Française. ;)


Mads Sülau Jørgensen said...

When reading Unicode files with Python, the codecs module provides a nice wrapper so that read() returns a unicode object and write() properly encodes the string it is given before writing it to the file.

Also, I think Unicode is one of the things PHP is (finally) going to tackle in PHP 6.
