By default, Ruby has no encoding or character set support in its String class. You’ll notice this when you try to slice a string on a multibyte character: you get "f\303" instead of "fé", because Ruby slices the string on the byte boundary and not on the character boundary.
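The effect can still be reproduced on a current Ruby, as a rough sketch. It assumes Ruby 2.0 or later, where source files default to UTF-8 and String is encoding-aware, so `byteslice` is used to mimic the byte-oriented slicing described above. The sample word "fée" is chosen here for illustration only.

```ruby
# Assumption: Ruby 2.0+ with UTF-8 source encoding.
word = "fée"                       # "é" occupies two bytes in UTF-8

by_bytes = word.byteslice(0, 2)    # "f" plus the first byte of "é"
by_chars = word.slice(0, 2)        # the first two characters

puts by_bytes.bytes.inspect        # [102, 195]; 195 is octal 303
puts by_bytes.valid_encoding?      # false: the string ends mid-character
puts by_chars                      # fé
```

Slicing on bytes leaves a dangling lead byte, which is exactly the `\303` seen in the broken result.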
Encodings and Character Sets
To represent a written language in a computer you need a character set and an encoding. The character set is a mapping from an integer to a glyph (an image representing a character). The encoding is the way you represent those integers in a sequence.
For instance the ASCII character set consists of only 128 characters and can thus be represented in just 7 bits. Encoding ASCII is trivial; each character fits in one byte and a string is a sequence of bytes.
Languages that use more than 256 glyphs can’t fit each character into one byte, so they need another way to encode their strings, for instance by using two bytes to encode one character. History has produced a long list of different character sets and encodings, and many of them are not interchangeable.
Unicode is an effort to unite all known character sets, so that we only have to deal with one. Unicode also makes it possible to mix languages with widely different characters in one document. UTF-8 and UTF-16 are the most widely used encodings for Unicode strings.
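To make the difference between a character set and an encoding concrete, here is a small sketch. It assumes Ruby 2.0 or later with UTF-8 source encoding; the variable names are illustrative.

```ruby
# Assumption: Ruby 2.0+ with UTF-8 source encoding.
char = "é"

code_points = char.unpack("U*")              # position in the Unicode character set
utf8_bytes  = char.bytes                     # the same character encoded as UTF-8
utf16_bytes = char.encode("UTF-16BE").bytes  # and encoded as big-endian UTF-16

puts code_points.inspect  # [233]
puts utf8_bytes.inspect   # [195, 169]
puts utf16_bytes.inspect  # [0, 233]
```

The code point stays the same; only the byte sequence changes with the encoding.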
Fixing String in Ruby
Ruby’s String class assumes that every byte in the string represents a single character. This means that String#slice may not return the result you expect if you’re handling multibyte characters.
Some of this behaviour can be fixed with the jcode library. The jcode library updates a number of methods on the String class: chop!, chop, delete!, delete, squeeze!, squeeze, succ!, succ, tr!, tr, tr_s!, and tr_s. It also adds jlength and jcount. The encoding assumed in a string is globally defined in the global variable
The drawback of this solution is that it doesn’t cover frequently used methods like
Another solution is the Unicode library by Yoshida Masato. This library contains some functions to handle UTF-8 encoded strings.
Fixing String in Rails
Following the howto on using Unicode strings in Rails can get a bit baroque, to say the least.
It’s easy to break things: when a helper uses the slice method on a multibyte string, it might chop a character in half, as we saw above. Even worse, validations like validates_length_of can fail or succeed when you don’t expect them to.
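The length problem can be illustrated with a small sketch. `within_limit?` is a hypothetical helper written for this example and is not part of Rails; it only contrasts counting bytes with counting characters. It assumes Ruby 2.0 or later with UTF-8 source encoding.

```ruby
# Hypothetical helper, not Rails code: checks a maximum length
# either in bytes or in characters.
def within_limit?(value, max, unit)
  size = unit == :bytes ? value.bytesize : value.unpack("U*").length
  size <= max
end

name = "Zoë"                                     # 3 characters, 4 bytes in UTF-8

puts within_limit?(name, 3, :bytes)       # false: "ë" takes two bytes
puts within_limit?(name, 3, :characters)  # true
```

A validation that counts bytes rejects a three-letter name against a limit of three, which is the kind of surprise described above.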
A nice way to fix these problems would be to have multibyte support in the String class in such a way that it works exactly like the current String class. It looks like this is planned for a future version of Ruby.
Julian Tarkhanov proposed to fix the String class by overriding its existing methods with Unicode-aware versions where necessary. He has hacked up a version using the Unicode library and packaged it as a plugin. The plugin also puts your database server in UTF-8 mode and updates the Content-Type header sent by Rails.
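What a Unicode-aware method looks like can be sketched in a few lines. This is not the plugin’s code and does not use the Unicode library; `Utf8Sketch` and its methods are hypothetical names, and the approach (decode to code points, operate on the array, encode again) is only one possible way to do it.

```ruby
# Hypothetical module for illustration; not taken from the plugin.
module Utf8Sketch
  # Slice by characters: decode to code points, slice the array, re-encode.
  def self.slice(str, start, length)
    part = str.unpack("U*")[start, length]
    part.nil? ? nil : part.pack("U*")
  end

  # Count characters instead of bytes.
  def self.length(str)
    str.unpack("U*").length
  end
end

puts Utf8Sketch.slice("fée", 0, 2)  # fé
puts Utf8Sketch.length("fée")       # 3
```

The sketch keeps the methods in a separate module instead of reopening String, which avoids the risks of overriding a core class but also means existing callers are not fixed automatically. It counts code points, so a letter followed by a combining accent still counts as two.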
The UTF-8 hack
Overriding default classes can be dangerous. Overriding widely used classes like the String class can be a recipe for disaster. Fortunately Rails has good test coverage, so we can get an idea of how well the hack works.
I did an svn export of the Rails stable branch (essentially 1.0 with a few patches). Then I put string-overrides.rb from the plugin in the root of the export and I copied the Unicode library from my RubyGems directory to the root of the export, so I wouldn’t have to use require_gem. After that I added the following lines to all the libraries in Rails (actionpack, activerecord, etc):
The changes to the String class introduced 3 errors and 1 failure in the Rails test suite. The failure is caused by a missing \n in the result, which seems to be harmless. I’m not sure if the three new errors are fatal, but I suppose they can be fixed.
Adding UTF-8 support to the String class is likely to result in a speed penalty. Most web applications push a lot of text around, so using this hack could potentially lower the performance of an application substantially.
Testing is knowing, so I tested the plugin with two Rails applications. The first is a small application that displays a page with some messages and a form. The second is a production application, on which I tested the account page. I used ab2, the Apache benchmark program, to measure the number of requests per second.
I have to warn in advance that this doesn’t give a very accurate impression of the actual performance, but it does give an impression of the speed penalty I got by using the Unicode plugin.
| | Plain Rails | With the plugin |
|---|---|---|
| Application 1 | 16.06 reqs/sec | 9.32 reqs/sec |
| Application 2 | 11.48 reqs/sec | 8.41 reqs/sec |
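To put the numbers in proportion, the relative drop in throughput can be worked out from the table above. This is plain arithmetic on the measured values; the variable names are illustrative.

```ruby
# Requests per second, copied from the table: [plain Rails, with the plugin].
results = {
  "Application 1" => [16.06, 9.32],
  "Application 2" => [11.48, 8.41]
}

slowdown = {}
results.each do |app, (plain, with_plugin)|
  # Percentage of throughput lost when the plugin is enabled.
  slowdown[app] = ((plain - with_plugin) / plain * 100).round(1)
  puts "#{app}: #{slowdown[app]}% fewer requests per second"
end
```

That works out to roughly 42% fewer requests per second for the first application and roughly 27% for the second.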
The performance impact is certainly noticeable. Although I’m not certain which part of the implementation causes the penalty, the extra levels of indirection introduced in the String class probably don’t help. A native C implementation could probably be a lot faster.
By creating this plugin we haven’t resolved all our problems. One of the biggest problems is that we can only process UTF-8 encoded strings. Although all the input from forms should be in UTF-8 as specified in our HTML documents and headers, information from other sources, like the filesystem, could still be in a different encoding. In this setup we have to make sure we don’t send data in a different encoding than we promise in our headers. Sure, there are solutions like iconv to re-encode this data, but life would be a lot simpler if we didn’t have to think about this.
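As an illustration of the re-encoding step, here is a minimal sketch. It uses `String#encode`, which assumes Ruby 1.9 or later and is a stand-in for the iconv approach mentioned above. `to_utf8` is a hypothetical helper, and the source encoding has to be known in advance; it cannot be detected reliably from the bytes alone.

```ruby
# Hypothetical helper: label raw bytes with their known encoding,
# then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "café" as ISO-8859-1 bytes, e.g. a file name read from disk (assumed source).
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")

utf8 = to_utf8(raw, "ISO-8859-1")
puts utf8                # café
puts utf8.bytes.inspect  # [99, 97, 102, 195, 169]
```

Converting at the boundary, where the data enters the application, keeps everything downstream in the single encoding promised in the headers.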
Although a native String class in Ruby that is aware of character sets and encodings would be the ultimate solution, the Unicode hack plugin for Rails provides you with the tools to use UTF-8 in your Rails application. This support comes with a noticeable performance penalty.