Archive for March, 2007
As a sign on my skin
I have met my nemesis; its name is character encoding. It is the idea that, when programming, the physical memory size of a string does not necessarily equate to how many characters (or “glyphs”, to use the vernacular) are within that string.
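To make the bytes-versus-characters point concrete, here is a minimal sketch in Python 3 (my choice of language for illustration, as are the sample string and variable names). It assumes the accented letters are stored as single precomposed code points, which is why they are written as escapes.

```python
# One string, three encodings: the character count never changes,
# but the number of bytes needed to hold it does.
text = "na\u00efve caf\u00e9"  # "naïve café", 10 characters

utf8 = text.encode("utf-8")        # accented letters take 2 bytes each
latin1 = text.encode("latin-1")    # every character fits in 1 byte
utf16 = text.encode("utf-16-le")   # every character here takes 2 bytes

print(len(text))    # 10 characters
print(len(utf8))    # 12 bytes
print(len(latin1))  # 10 bytes
print(len(utf16))   # 20 bytes
```

Any code that measures, truncates or pads a string by its byte length is therefore only correct by accident once the text strays beyond ASCII.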
The antagonist in this comedy of errors is not ignorance of the situation, but knowledge of the ineptitude surrounding it. In theory, Unicode is the great equaliser, the One Ring and so forth. In practice, however, things are different. For web projects I boiled the dilemma down to four places where Things Can Go Wrong. The first is the web-page markup itself: nestled cosily in the <head> tag is the content-type, oft forgotten and left to Dreamweaver to assign. This is the encoding that is passed to your programming language of choice, which is the second choke point. If all you do is pass the string to your database, you may just be able to pull off the perfect murder; if, however, you wish to do any kind of modification to the string, then your programming language needs to know how to deal with the character encoding, or how to convert it to an encoding it can use. Once you have spent enough time with the jesters, it’s time to pass things over to your database which, just to be pedantic, has two points of failure. The first is the connection encoding, whereby the database tries to convert whatever it stores the data in to a suitable encoding for its client; and then there’s the minor issue of how your string is stored within the database.
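What a disagreement at any one of those four points looks like can be sketched in a few lines of Python 3. This is an illustration only: the variable names are made up, and the two encodings (UTF-8 on one side, Latin-1 on the other) are an assumed mismatch rather than a description of any particular browser, language or database.

```python
# The page sends UTF-8, but the next hop assumes Latin-1.
original = "caf\u00e9"                   # "café"
wire_bytes = original.encode("utf-8")    # b'caf\xc3\xa9'

# Wrong assumption on the receiving side: the two UTF-8 bytes of the
# accented letter are read as two separate Latin-1 characters.
mojibake = wire_bytes.decode("latin-1")
print(mojibake)                          # cafÃ©

# The damage is reversible only if you know exactly which wrong
# assumption was made.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)              # True

# The opposite mismatch fails loudly instead of silently:
# Latin-1 bytes are not valid UTF-8.
legacy_bytes = original.encode("latin-1")  # b'caf\xe9'
try:
    legacy_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)              # True
```

The silent case is the nastier of the two, because the garbled text can be stored and served for months before anyone notices.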
Projects that don’t need to worry about languages other than English may well ignore character sets entirely and pretend that everything is buttercups and puppies in the world of ASCII. Move even a little, though, even to Roman-based alphabets like French or Swedish, and things break down. The ideal, the blue-sky pie, would be UTF-8 from start to finish and back again, but this article wouldn’t exist if it were that simple.
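How little it takes to leave the world of ASCII can be shown with a short Python 3 sketch. The sample strings and the helper name `survives_ascii` are hypothetical, chosen purely for illustration.

```python
# English fits in ASCII; French and Swedish do not.
samples = [
    "buttercups and puppies",
    "cr\u00e8me br\u00fbl\u00e9e",   # "crème brûlée"
    "sm\u00f6rg\u00e5sbord",         # "smörgåsbord"
]

def survives_ascii(text):
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

results = [survives_ascii(s) for s in samples]
print(results)    # [True, False, False]

# Forcing the issue throws the accented letters away.
lossy = samples[2].encode("ascii", errors="replace").decode("ascii")
print(lossy)      # sm?rg?sbord

# UTF-8 from start to finish loses nothing.
roundtrip = [s.encode("utf-8").decode("utf-8") == s for s in samples]
print(roundtrip)  # [True, True, True]
```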
The reality of the situation is that browsers are like panes of glass: fit them correctly and you don’t need to worry about them. As long as you’re not using a database of antiquity, character encoding within the database is solid; the “major” databases have you covered. It’s a shame, then, that it all breaks down with the programming language or environment, especially the scripting languages.
I’ve been informed that Perl, the most distinguished of scripting languages (what other language allows a power-user to be called a “monk”?), has everything neatly arranged and ready for surgery. Python shares a similar preparedness. It’s unfortunate then that PHP (the greatest scripting language evar!) and Ruby (the new greatest scripting language evar!) are so arse-backward despite their popularity.
PHP I don’t necessarily blame for its inadequacy; it’s something that I’ve come to expect from the Quasimodo bell-ringing approach it takes: volume over grace. Sure, you have the multi-byte string module, which is not included by default, or the iconv conversion module, again not included by default, or the perpetually in-development PHP 6. These are just plasters over a gaping tumour that is the lack of built-in character encoding support.
Ruby, on the other hand, I held so much hope for. The fervent few claimed it would do so much for so little investment, and yet where PHP’s lack of character encoding support grew from stagnation and sloth-like speed, Ruby has simply omitted to deal with it at all. As a language it is actively developed, and yet once again character encoding support is relegated to labyrinthine workarounds and obscure modules (or “mixins”, as I’ve been commanded to call them).
Were I a petty man I would blame the programmers of yore for their lack of foresight, but when ASCII was developed most computers had trouble rendering more colours than I have teeth, so forgiveness I give retrospectively. What I can’t fathom is why two widely used (read: cosmopolitan) languages would be so flaccid when it comes to “simple” text. It’s not that I can read Chinese, or Japanese, or Korean, or Swedish, or even French, but it irks me that, were I minded to, so many obstacles would stand in the way.