Ruby 1.9 character encoding field notes

Manfred Stienstra, 11 Jan 2010, 15:14 in unicode and ruby.

As you probably already know, the String class became encoding aware in Ruby 1.9. This makes it possible to manipulate strings on the character level instead of on the byte level. However, it’s still a general-purpose API, which means you have to write a few lines of code to get stuff done.

It’s common to choose one internal representation for character data in an application and convert all incoming strings to this representation. For example, in modern applications strings are often encoded in UTF-8 or UTF-16. I took some time to figure out how to do this in Ruby 1.9.

The biggest problem with receiving data from external sources is trust. Sources can lie about their encoding or provide broken data. Sometimes it gets mangled accidentally and sometimes someone is attacking your application with a carefully crafted payload.

Problems can arise on a lot of levels. Just think about receiving an HTTP response from a webserver. Things can go wrong in the proxy, in the client library, in the string implementation of your language. Meta-data about the encoding is stored in HTTP headers, the HTML, and now in String. The same problems exist with data coming from databases, filesystems, and caches.

You can trust some of these sources more than others. For example, you can control the data going into a database so you can trust the data coming out. In contrast, anything coming from the internet should be considered potentially dangerous.

My solution for these problems is a new method on String called ensure_encoding. It makes sure the data in the string is at least compatible with your internal strings. Depending on the options you pass it will respond differently to broken data.

As an example, let’s take an HTTP POST to a web API. Assume we’ve explained in the API documentation we only accept UTF-8 character data and will be very strict about this. Our code might look something like this:


require "ensure/encoding"
begin
  params.each do |key, value|
    params[key].ensure_encoding!(Encoding::UTF_8,
      :external_encoding  => Encoding::UTF_8,
      :invalid_characters => :raise)
  end
rescue Encoding::InvalidByteSequenceError => e
  send_response_document :unprocessable_entity,
    "Sorry, your request contains invalid encoding " +
    "and can't be processed (#{e.message})"
end
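
For comparison, the strict case above can be approximated with nothing but the core String API in Ruby 1.9 or later. This is only a sketch: strict_utf8 is a hypothetical helper name, not part of ensure-encoding, and it raises a plain ArgumentError instead of whatever the gem raises.

```ruby
# Hypothetical helper, not part of the ensure-encoding gem.
# Labels the incoming bytes as UTF-8 and refuses anything that is not
# a valid UTF-8 byte sequence. No transcoding takes place.
def strict_utf8(bytes)
  string = bytes.dup.force_encoding(Encoding::UTF_8)
  unless string.valid_encoding?
    raise ArgumentError, "invalid UTF-8 byte sequence"
  end
  string
end
```

This only covers the "reject broken data" policy; the other options the gem offers are not sketched here.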

You can find more examples on the GitHub project page and in the source. Normally I try to extract code from a running project, but we don’t run any production code on 1.9 yet. It would be great if you can help out with testing the code. I’ve released the code as a gem, so it should be really easy to install.

$ gem install ensure-encoding

Please leave any bugs, problems, or suggestions in the GitHub issue tracker.


Ordering and comparing text in Rails and MySQL

Manfred Stienstra, 22 Dec 2009, 12:20 in ruby on rails and unicode.

Ordering and comparing text is a lot trickier than most people expect. Computer scientists even came up with a complicated name for it: collation. There are two groups of problems associated with collation: cultural and technical. Today we’re not going to focus on the technical problems, but rather on how the cultural problems influence the technical solution.

An example of a cultural difference is letter ordering in a language: in Swedish the Ä is ordered after the z, and in German it follows the letter a. You can also discuss whether the ä is an alternative form of the a or a completely different character; this is relevant when implementing search. When searching for ‘nächste’ you might also be interested in text containing ‘nachste’.

MySQL implements a number of collations so you can use the one relevant for your application. For instance, use utf8_icelandic_ci when you want to order UTF-8 encoded text based on the cultural norm in Iceland. You can get a full list of supported collations from the mysql client.

mysql> SHOW COLLATION;

When you have an international site that might contain multiple languages you can use the default Unicode Collation Algorithm (UCA), which is called utf8_unicode_ci in MySQL. The UCA is pretty sensible so a lot of frameworks, including Rails, use it as a default. Unfortunately this collation also changes character equality. Let’s look at an example of how this might go wrong.

mysql> CREATE DATABASE books_example CHARACTER SET utf8
  COLLATE utf8_unicode_ci;
mysql> USE books_example;
mysql> CREATE TABLE books ( title VARCHAR(255) );
mysql> SHOW FIELDS FROM books;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| title | varchar(255) | YES  |     | NULL    |       | 
+-------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)
mysql> INSERT INTO books SET title = 'Pokemon';
mysql> INSERT INTO books SET title = 'pokemon';

Now we have a database with books, currently with two entries.

mysql> SELECT * FROM books;
+---------+
| title   |
+---------+
| Pokemon | 
| pokemon | 
+---------+
2 rows in set (0.00 sec)

A sensible action might be to select a book by name.

mysql> SELECT * FROM books WHERE title = 'pokemon';
+---------+
| title   |
+---------+
| Pokemon | 
| pokemon | 
+---------+
2 rows in set (0.00 sec)

But unfortunately this returns both books because in utf8_unicode_ci P equals p, in the same way a equals ä equals A. This is useful for ordering and searching but not for selecting.
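
To get a feel for this kind of fuzzy equality outside the database, here is a rough Ruby sketch. It is not the Unicode Collation Algorithm and not what MySQL does internally: fuzzy_key is a hypothetical helper that only strips combining accents and case, and it assumes Ruby 2.2 or later for String#unicode_normalize.

```ruby
# Hypothetical helper: builds a comparison key that ignores case and accents.
def fuzzy_key(text)
  decomposed = text.unicode_normalize(:nfd)  # split ä into a + combining mark
  without_marks = decomposed.gsub(/\p{Mn}/, "") # drop the combining marks
  without_marks.downcase
end

# Fuzzy comparison, in the spirit of utf8_unicode_ci.
puts fuzzy_key("Pokemon") == fuzzy_key("pokemon")

# Exact comparison, in the spirit of utf8_bin.
puts "Pokemon" == "pokemon"
```

Comparing keys is fine for search and grouping; for selecting one specific record you still want the exact comparison.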

We can fix it by specifying a binary collation algorithm for the select so it will not use these fuzzy equality rules.

mysql> SELECT * FROM books WHERE title = 'pokemon' COLLATE utf8_bin;
+---------+
| title   |
+---------+
| pokemon | 
+---------+
1 row in set (0.00 sec)

Note that this problem can introduce a lot of bugs and maybe even security problems. For instance, imagine two accounts: ‘Manfred’ and ‘manfred’. With the following query it is undetermined which of these two will be returned.

SELECT * FROM accounts WHERE username = 'manfred' LIMIT 1

My advice is to set the default collation for your database to utf8_bin and include the collation in queries where you want to order the entries nicely for the user interface or when you need fuzzy equality for searching. In Rails you can specify the collation in config/database.yml.

development:
  database: books_development
  adapter: mysql
  encoding: utf8
  collation: utf8_bin


MySQL character encoding trouble pre Rails 1.2

Manfred Stienstra, 24 Feb 2009, 17:44 in ruby on rails and unicode.

Back in the days when Rails didn’t care about the encoding of the database connection you sometimes ended up with UTF-8 encoded strings from the browser travelling over an ISO-8859-1 encoded connection with MySQL to a UTF-8 database. This isn’t a big problem in itself as long as you don’t slice strings on the database level. When you decide to migrate your application to a newer version of Rails you might run into trouble.

You know you have this problem when the characters from the first column of the following table look like the characters in the second column.

Normal  Broken
‘       â€˜
’       â€™
ë       Ã«
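
The broken forms can be reproduced in Ruby 1.9 or later by deliberately misreading UTF-8 bytes. This is a sketch under one assumption: MySQL's latin1 behaves like Windows-1252. The helper name misread_as_latin1 is made up for this example.

```ruby
# Hypothetical helper: shows what UTF-8 text looks like when its bytes are
# misread as latin1 (assumed here to be Windows-1252) and converted again.
def misread_as_latin1(text)
  bytes = text.encode(Encoding::UTF_8)
  bytes.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

["\u2018", "\u2019", "\u00EB"].each do |character|
  puts "#{character} -> #{misread_as_latin1(character)}"
end
```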

Fortunately it’s pretty easy to fix: you just need to convince mysqldump that it’s operating on a Latin-1 database and load the dump back into the database. Make sure you have a backup before you try this though!

$ mysqldump --skip-set-charset --default-character-set=latin1 databasename > fixed.sql
$ mysql databasename < fixed.sql

Now make sure that your Rails app talks UTF-8 to the database by setting the encoding in database.yml:

production:
  adapter: mysql
  database: databasename
  encoding: utf8


ActiveSupport::Multibyte Updated

Manfred Stienstra, 23 Sep 2008, 13:47 in ruby on rails, unicode, and releases.

Yesterday Michael Koziarski merged the updated version of ActiveSupport::Multibyte into Rails. The initial reason for the update was Ruby 1.9 compatibility but it turned into a complete overhaul. Not just the code, but also the documentation was revised.

For most people the only noticeable change is the move from String#chars to String#mb_chars. People relying heavily on ActiveSupport::Multibyte probably want to read on.

String#chars renamed to String#mb_chars

One of the initial reasons to use a proxy to access characters back in 2006 was to make Rails future proof in case Ruby got some kind of Unicode support on String. Unfortunately Matz decided to use String#chars for one of these features so we had to change the method name. People running on Ruby <= 1.8.6 will get a nice deprecation warning.

String#mb_chars now returns a proxy on Ruby 1.8 and returns self on Ruby 1.9.

Note that the Ruby 1.9 String class does not implement methods like String#normalize. We’re still trying to figure out how to approach this limitation. For now, you might want to do:

class String
  def normalize(normalization_form=ActiveSupport::Multibyte.default_normalization_form)
    ActiveSupport::Multibyte::Chars.new(self).normalize(normalization_form)
  end
end

No more automatic tidying of bytes

Multibyte no longer attempts to convert broken encoding in strings to valid UTF-8. The String#tidy_bytes method still exists if you need this functionality.

Duck-typing aid

Strings are notoriously hard to duck-type because they include Enumerable, which makes them hard to differentiate from Arrays. Rails already had some duck-typing help in place for Date, Time and DateTime. We decided to implement the same thing on String and Chars.

'Bambi and Thumper'.acts_like?(:string) #=> true
'Bambi and Thumper'.mb_chars.acts_like?(:string) #=> true

So if you catch yourself using str.is_a?(String) please consider using acts_like?.

Different way of registering backends

Instead of registering a handler on the Chars class, you now set the proxy_class on ActiveSupport::Multibyte.

ActiveSupport::Multibyte.proxy_class = UTF32Chars

Note that this removes a level of indirection, which speeds up the entire Multibyte implementation quite a bit.

If you’ve implemented your own handler, please look at the implementation of ActiveSupport::Multibyte::Chars on how to convert it to work with the new implementation. In most cases this should be a trivial exercise. Don’t hesitate to contact me if you need help.

Overrideable default normalization form

The default normalization form can now be set on ActiveSupport::Multibyte instead of updating a constant.

ActiveSupport::Multibyte.default_normalization_form = :kd

See ActiveSupport::Multibyte::NORMALIZATIONS_FORMS for valid normalization forms.


Quick ActiveSupport::Multibyte glossary trick

Manfred Stienstra, 23 Mar 2007, 14:02 in ruby on rails and unicode.

I was trying to make a glossary of words grouped by their first letter, but I wanted words starting with the letter é grouped with words starting with the letter e. No small feat you might imagine. Wrong.

dict = words.inject({}) do |dict, word|
  letter = word.chars.decompose[0..0].downcase.to_s
  dict[letter] ||= []
  dict[letter] << word; dict
end

The reason this works is that letters like é have a decomposed form in Unicode. This form consists of a Latin letter and an accent modifier. I’m not sure what happens if you run Arabic through this code, but we’ll cross that bridge when we get there.
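
The same idea can be sketched without the Multibyte proxy on Ruby 2.2 or later, where String#unicode_normalize is available. The method name glossary and the word list are just for illustration.

```ruby
# Groups words under their first letter with accents and case removed.
# Assumes Ruby 2.2+ for String#unicode_normalize and non-empty words.
def glossary(words)
  words.group_by do |word|
    # NFD splits é into e + combining accent, so [0] is the bare letter.
    word.unicode_normalize(:nfd)[0].downcase
  end
end

glossary(%w[école Eiffel été arbre Árbol]).each do |letter, entries|
  puts "#{letter}: #{entries.join(', ')}"
end
```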


Ruby and MySQL encoding flakiness

Manfred Stienstra, 20 Feb 2007, 12:01 in ruby on rails and unicode.

The last few weeks we noticed the dreaded question marks on our sites running against MySQL 5.0. We thought we did everything to make sure our servers, databases, tables, clients and connections understood UTF-8, but somehow connections to the database were reset back to Latin1 after some time.

Instead of trying to fix the problem in Rails/Ruby/libmysql I decided to squash the problem in the MySQL server configuration. By default we were seeing this:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 | 
| character_set_connection | latin1 | 
| character_set_database   | latin1 | 
| character_set_filesystem | binary | 
| character_set_results    | latin1 | 
| character_set_server     | latin1 | 
| character_set_system     | utf8   | 
+--------------------------+--------+

So I set the following in /etc/mysql/my.cnf:

[mysqld]
character-set-server = utf8

[client]
default-character-set = utf8

Which forces all the encoding to go to UTF-8 by default:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   | 
| character_set_connection | utf8   | 
| character_set_database   | utf8   | 
| character_set_filesystem | binary | 
| character_set_results    | utf8   | 
| character_set_server     | utf8   | 
| character_set_system     | utf8   | 
+--------------------------+--------+


Rails 1.2 Released

Manfred Stienstra, 19 Jan 2007, 16:49 in ruby on rails and unicode.

Today the Rails Core Team released Ruby on Rails 1.2. The long awaited new version is of course full of features and fixes. The thing we’re most excited about is the inclusion of ActiveSupport::Multibyte. But it doesn’t stop with multibyte support, Rails now ships with UTF-8 as a default in all parts of the framework, something we couldn’t have dreamed of a year ago.


UnSpun encoding problems

Manfred Stienstra, 07 Dec 2006, 12:44 in ruby on rails, web, broken, and unicode.

A few weeks ago Amazon launched UnSpun, a web application to collectively manage lists of all sorts.

During signup I was presented with the following.

Screenshot of UnSpun with a broken letter

I know Internet Explorer fixes a lot of broken encoding by guessing the true encoding for just about everything, maybe that’s why they never noticed during development?

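I’ve had this problem myself on a few occasions. Because geographical information is commonly extracted from text files and loaded into a database, you always have to be really careful to transcode any data extracted from text files to the same encoding as the database. In the case of ISO-8859-1, which is commonly used in Western European countries, there is a really simple one-liner to transcode to UTF-8. It works because every ISO-8859-1 byte value equals the Unicode code point of the character it represents. For ISO-8859-15 it gives the wrong result for the eight characters that differ from ISO-8859-1, such as the euro sign.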

source.unpack('C*').pack('U*')
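
A small check of the one-liner on Ruby 1.9 or later, compared against String#encode. The helper name latin1_to_utf8 is only there for the example.

```ruby
# Treat every byte as a code point (C*) and write those code points
# back out as UTF-8 (U*). Only correct for ISO-8859-1 input.
def latin1_to_utf8(source)
  source.unpack('C*').pack('U*')
end

latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)

# Both routes should give the same UTF-8 string.
puts latin1_to_utf8(latin1) == latin1.encode(Encoding::UTF_8)
```

On Ruby 1.9 and later, String#encode is the more obvious choice because it also handles other source encodings.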


ActiveSupport::MultiByte

Manfred Stienstra, 05 Oct 2006, 08:48 in ruby on rails and unicode.

As of revision #5223 ActiveSupport::Multibyte is part of Rails. Now everyone can enjoy multibyte safeness in their applications. Needless to say we are really happy. To show some of the features of Multibyte’s String#chars proxy I’ve put together a short screencast.

If you have any questions about ActiveSupport::Multibyte, please consult the API documentation and the Trac Wiki first. Enjoy.

Download screencast (QuickTime, 1.9MB)


Extensive testing

Manfred Stienstra, 12 Sep 2006, 16:04 in ruby on rails and unicode.

Today I’ve been merging the Rails multibyte support from Julik’s unicode hacks plugin into the current edge source. After a few test runs my Mac Mini started complaining about the normalization conformance tests…

Finished in 102.769516 seconds.

460 tests, 353652 assertions, 0 failures, 0 errors

I guess I’ll have to fix the Rakefile so I don’t overheat everyone’s computer.


URoR 2: KCODE

Thijs van der Vossen, 12 Sep 2006, 00:33 in ruby on rails and unicode.

URoR stands for ‘Unicode Ruby on Rails’ which is a series on using Unicode with Rails. In this second article I’ll show you how to enable the (somewhat limited) UTF-8 support in Ruby and Rails. (first article)

Let’s break a string

Suppose you’re using the truncate helper like this:

<p><%= truncate 'Iñtërnâtiônàlizætiøn', 12 %></p>

The result is something like:

Iñtërn?

Because the helper truncates the string to 12 bytes, it slices the codepoint for the ‘â’ in half. The result is an invalid sequence which can’t be rendered.

Fix this by adding the following to the top of config/environment.rb:

# Add basic utf-8 encoding support 
$KCODE = 'UTF8'

And the result will be:

Iñtërnâti…

The string is now truncated to 12 codepoints instead of 12 bytes.
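
The difference between cutting on bytes and cutting on characters can be shown on Ruby 1.9.3 or later, which has both String#byteslice and character-aware slicing. This is an illustration only and not how the truncate helper is implemented; cut_bytes and cut_characters are hypothetical names.

```ruby
# Cut after a number of bytes; this can end in the middle of a character.
def cut_bytes(text, count)
  text.byteslice(0, count)
end

# Cut after a number of characters; this never splits a character.
def cut_characters(text, count)
  text[0, count]
end

text = "Iñtërnâtiônàlizætiøn"
puts cut_bytes(text, 9).valid_encoding?
puts cut_characters(text, 9).valid_encoding?
```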

What’s happening here?

Setting KCODE to 'UTF8' tells Ruby that your source code is encoded as UTF-8. Some libraries like CGI and some parts of Rails look at KCODE to find out if they need to process strings in a UTF-8 friendly way.

You can now also require the jcode library to get some basic UTF-8 encoding support in Ruby. More about this in a future article.

Not all is good

Although it’s great that truncate has been fixed to work with UTF-8 you should be aware that this is not the case for all helpers:

<p><%= excerpt 'Iñtërnâtiônàlizætiøn', 'nâtiôn', 2 %></p>

This currently always breaks no matter if KCODE is set or not:

?rnâtiônàl…

You can again get the code from our subversion repository.


Unicode is part of the solution, not part of the problem

Thijs van der Vossen, 06 Sep 2006, 08:59 in ruby on rails and unicode.

Tim Bray in the (for now) final Ruby Ape Diaries entry (emphasis added):

It’s easy to make people angry about this subject, and some of the angry people have a point; certain aspects of Unicode are, on the surface at least, objectively racist; for example, why does UTF-8 encoding of characters become progressively less efficient as you move from the languages of the Western hemisphere to those of the East?

Having said all that, it is my opinion that Unicode works pretty well, and in terms of making the Internet useful to the many peoples of Earth, is part of the solution, not part of the problem. And for that reason, I think that any language that doesn’t do a real good job at Unicode isn’t a very good citizen. And I think Ruby has a major problem in this area. Solutions are promised; we’ll see. And hey, in a few weeks I’m going to get up a stage in a room in Denver full of Rubyists and talk about this stuff; we’ll see whether they let me out of town alive.

Someone please, please record his talk; I’m really looking forward to what he has to say on the subject.


URoR 1: Set the Content-Type

Thijs van der Vossen, 05 Sep 2006, 00:49 in ruby on rails and unicode.

URoR stands for ‘Unicode Ruby on Rails’ which is going to be a series on using Unicode with Rails. In this first article I’ll show you how to set the Content-Type header so that the browser knows what you’re sending. (second article)

Set it in an after filter

On the web, the One And Only Sensible Encoding for Unicode is UTF-8, so that’s what we’re going to use. First, make sure your editor is set to save all files encoded as UTF-8. Then create a new Rails application and generate a controller called ‘static’ with an ‘index’ action so that we have something to test with.

$ rails uror
$ cd uror/
$ ./script/generate controller static index

Now add the following to app/views/static/index.rhtml (just copy it from this page and paste it into your editor):

<p>Iñtërnâtiônàlizætiøn</p>

Run the Rails application with ./script/server and go to /static/index where you should get something garbled that looks like this:

IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n

The problem is that you haven’t told the browser that you’re using UTF-8. Fix this by changing app/controllers/application.rb to:

class ApplicationController < ActionController::Base
  after_filter :set_encoding
  
  protected
  
  def set_encoding
    headers['Content-Type'] ||= 'text/html'
    if headers['Content-Type'].starts_with?('text/') and !headers['Content-Type'].include?('charset=')
      headers['Content-Type'] += '; charset=utf-8'
    end
  end
end

The set_encoding after filter does two things:

  1. It sets the Content-Type header to text/html, but only if no Content-Type header has yet been set. This is exactly what Rails would have done anyway, but we’re doing it here so that…
  2. It adds charset=utf-8 to every Content-Type header for a text type when no charset has yet been set.

If you now reload the page the problem is fixed because the browser is no longer receiving a:

Content-Type: text/html

header, but:

Content-Type: text/html; charset=utf-8
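
The logic of the filter can be tried outside Rails with a plain Hash. This is a sketch, not the Rails implementation: with_utf8_charset is a hypothetical function, and it uses Ruby's own String#start_with? instead of the starts_with? alias from ActiveSupport.

```ruby
# Hypothetical stand-alone version of the set_encoding filter.
# Returns a copy of the headers with a UTF-8 charset added where needed.
def with_utf8_charset(headers)
  result = headers.dup
  # Default to text/html when no Content-Type has been set.
  result['Content-Type'] ||= 'text/html'
  content_type = result['Content-Type']
  # Only touch text types that don't declare a charset yet.
  if content_type.start_with?('text/') && !content_type.include?('charset=')
    result['Content-Type'] = content_type + '; charset=utf-8'
  end
  result
end

puts with_utf8_charset({})['Content-Type']
puts with_utf8_charset('Content-Type' => 'image/png')['Content-Type']
```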

Also set it in your Lighttpd or Apache configuration

It’s a good idea to set the UTF-8 encoding in your web server configuration too. For Apache add the following in public/.htaccess or your main configuration:

AddDefaultCharset utf-8

For Lighttpd, change mimetype.assign in config/lighttpd.conf to:

mimetype.assign = (  
  ".css"        =>  "text/css; charset=utf-8",
  ".gif"        =>  "image/gif",
  ".htm"        =>  "text/html; charset=utf-8",
  ".html"       =>  "text/html; charset=utf-8",
  ".jpeg"       =>  "image/jpeg",
  ".jpg"        =>  "image/jpeg",
  ".js"         =>  "text/javascript; charset=utf-8",
  ".png"        =>  "image/png",
  ".swf"        =>  "application/x-shockwave-flash",
  ".txt"        =>  "text/plain; charset=utf-8"
)

Now all static stuff like 404.html and cached pages is also sent with the correct encoding in the Content-Type header.

Even add it to the head

If you want to make it easy for people to save your pages to disk and open them with the correct encoding later on, you might want to add the following inside the head element of your HTML pages:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Always do this last as it may mask any trouble you might be having with the http headers.

Update: You can now get all URoR code from our subversion repository.

Update 2: The upcoming 1.2 release of Rails will add UTF-8 as the default charset for all renders, so you’ll no longer need the after filter.


Joel Spolsky got one thing right

Thijs van der Vossen, 01 Sep 2006, 09:11 in ruby on rails and unicode.

Although I fully agree with David that Joel Spolsky’s Language Wars is one of the purest forms of FUD against Ruby and Rails ever, I do think Joel got this one right:

I for one am scared of Ruby because (1) it displays a stunning antipathy towards Unicode […]

Sad as it may be, this fear is mostly justified.


We approve too

Thijs van der Vossen, 29 Jul 2006, 22:05 in unicode.

Tim Bray about Guido van Rossum’s OSCON talk:

Python 3 will have a String type that is 100% Unicode and that’s all it is, and separately a byte-array type that lets you indulge your most squalidly-perverse bit-bashing fantasies. I approve.

We at Fingertips approve too.


Discussing Unicode for Ruby

Thijs van der Vossen, 21 Jun 2006, 13:15 in ruby on rails, web, and unicode.

The Unicode roadmap thread on the Ruby Lang mailing list is now almost 100 messages long.

Yukihiro Matsumoto (matz):

I am too optimized for Ruby string operations using Regexp.

Tim Bray:

Julian ‘Julik’ Tarkhanov:

I think this thread is going to end the same as the one in 2002 did.

Read the whole damn thing if you want to know more about the gritty details. In case it makes your head hurt, go read On the Goodness of Unicode, Characters vs. Bytes and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) first.


An excellent and pragmatic proposal for easier Unicode support in Rails

Thijs van der Vossen, 16 Jun 2006, 10:45 in ruby on rails, web, and unicode.

The current Ruby version has no Unicode String class like in Python or Java. This makes it hard for Rails to support multibyte encodings.

The following code snippet from the truncate helper is a good example:

if $KCODE == "NONE"
  text.length > length ? text[0...l] + truncate_string : text
else
  chars = text.split(//)
  chars.length > length ? chars[0...l].join + truncate_string : text
end

This was added to make the helper work with multibyte characters, but it is far from beautiful.

A few days ago, Julian proposed to add a proxy to the string class for accessing characters instead of bytes. I think this is an excellent and very nice solution.

You access the proxy with the chars method on a string object. For example, you can get the number of characters with:

text.chars.length

The chars method is aliased as u, so you can also write:

text.u.length

Which to me looks even nicer.

Using the proxy, you could replace the six lines of code from the truncate helper with:

text.chars.length > length ? text.chars[0...l] + truncate_string : text

That’s a whole lot more obvious. And don’t be fooled, this is just as fast as the longer version because the proxy only uses the multibyte safe methods when $KCODE is set.

Apart from making the Rails code easier to understand and maintain, the proxy can also save application developers a lot of work.

The proxy and the patches to the Rails code are currently in development as a plugin you can get from Subversion. Even though the plugin is called ‘Unicode hacks’ for historical reasons, it’s actually a very clean solution by now. There’s also a proposed patch to the Rails source.

Please try this one out and give your feedback.


Rails encoding trouble in the wild

Thijs van der Vossen, 27 Jan 2006, 10:55 in ruby on rails and unicode.

If you’re still unsure if you should worry about utf-8 encoded strings, look no further than the Projectionist RSS feed:

See those &#82’s in the headlines list and the big R in the title? That’s where a poor right double quotation mark was split in half.


Encoding in Rails

Manfred Stienstra, 16 Jan 2006, 11:10 in ruby on rails and unicode.

By default, Ruby has no encoding or character set support in its String class. You’ll notice this when you try to slice a string on a multibyte character:

"Café"[2..3] #=> "f\303"

You get "f\303" instead of "fé" because Ruby slices the string on the byte and not on the character boundary.

Encodings and Character Sets

To represent a written language in a computer you need a character set and an encoding. The character set is a mapping from an integer to a glyph (an image representing a character). The encoding is the way you represent those integers in a sequence.

For instance the ASCII character set consists of only 128 characters and can thus be represented in just 7 bits. Encoding ASCII is trivial; each character fits in one byte and a string is a sequence of bytes.

Languages that use more than 256 glyphs can’t fit each character into one byte, so they need another way to encode their strings, for instance by using two bytes to encode one character. History has produced a long list of different character sets and encodings, and they are certainly not all interchangeable.

Unicode is an effort to unite all known character sets, so we have to deal with only one. Unicode also makes it possible to use multiple languages with widely different characters in one document. UTF-8 and UTF-16 are the most widely used encodings for Unicode strings.
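
The split between character set and encoding is easy to see on a Ruby with encoding support (2.0 or later for this sketch, which relies on String#bytes returning an array). The character keeps one code point but gets different bytes per encoding. The helper name encodings_of is made up for the example.

```ruby
# Returns the Unicode code point of a character together with the bytes
# used to store it in UTF-8 and in big-endian UTF-16.
def encodings_of(character)
  {
    code_point: character.ord,
    utf8_bytes: character.encode(Encoding::UTF_8).bytes,
    utf16_bytes: character.encode(Encoding::UTF_16BE).bytes
  }
end

p encodings_of("é")
p encodings_of("e")
```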

Fixing String in Ruby

The String class assumes that every byte in the string represents a single character. This means that String#length and String#slice may not return the result you expect if you’re handling multibyte characters.

Some of this behaviour can be fixed with the jcode library. The jcode library updates a number of methods on the String class: chop!, chop, delete!, delete, squeeze!, squeeze, succ!, succ, tr!, tr, tr_s!, and tr_s. It also adds jlength and jcount. The encoding assumed in a string is defined globally in the variable $KCODE.

$KCODE = 'UTF8'
require 'jcode'

"Café".jlength #=> 4

The drawback of this solution is that it doesn’t cover frequently used methods like String#slice.

Another solution is the Unicode library by Yoshida Masato. This library contains some functions to handle UTF-8 encoded strings.

Fixing String in Rails

Following the howto on using Unicode strings in Rails can get a bit baroque to say the least.

It’s easy to break things; when a helper uses the slice method on a multibyte string, it might chop a character in half as we saw above. Even worse, validations like validates_length_of can fail or succeed when you don’t expect them to.

A nice way to fix these problems would be to have multibyte support in the String class in such a way that it works exactly like the current String class. It looks like this is planned for a future version of Ruby.

Julian Tarkhanov proposed to fix the string class by overriding the existing one with Unicode aware methods where necessary. He has hacked up a version using the Unicode library and packaged it as a plugin. The plugin also puts your database server in UTF-8 mode and updates the Content-Type header sent by Rails.

The UTF-8 hack

Testing the hack

Overriding default classes can be dangerous. Overriding widely used classes like the String class can be a recipe for disaster. Fortunately Rails has good test coverage so we can get an idea of how well the hack works.

I did an svn export of the Rails stable branch (essentially 1.0 with a few patches). Then I put string-overrides.rb from the plugin in the root of the export and I copied the Unicode library from my RubyGems directory to the root of the export, so I wouldn’t have to use require_gem. After that I added the following lines to all the libraries in Rails (actionpack, activerecord, etc):

$KCODE = 'UTF8'

$:.unshift(File.dirname(__FILE__) + '/../../')
require 'unicode'
require 'string_overrides'

The changes to the String class introduced 3 errors and 1 failure. The failure is because of a missing \n in the result, which seems to be harmless. I’m not sure if the three new errors are fatal, but I suppose they can be fixed. You can download the complete test results.

Speed

Adding UTF-8 support to the String class is likely to result in a speed penalty. Most web applications push a lot of text around, so using this hack could potentially lower the performance of an application substantially.

Testing is knowing, so I tested the plugin with two Rails applications. The first is a small application that displays a page with some messages and a form. The second application is a production application, on which I tested the account page. I used ab2, the Apache benchmark program, to measure the number of requests per second.

I have to warn in advance that this doesn’t give a very accurate impression of the actual performance, but it does give an impression of the speed penalty I got by using the Unicode plugin.

               plain Rails      with the plugin
Application 1  16.06 reqs/sec   9.32 reqs/sec
Application 2  11.48 reqs/sec   8.41 reqs/sec

You can view the complete ab2 output.

The performance impact is certainly noticeable. Although I’m not certain what part of the implementation is causing this performance penalty, the introduced levels of indirection in the string class probably don’t help. A native C implementation could probably be a lot faster.
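
To put a number on it, the relative drop in throughput follows from the requests per second reported above. The helper name percentage_drop is only for this calculation.

```ruby
# Relative drop in requests per second, as a rounded percentage.
def percentage_drop(plain, with_plugin)
  ((plain - with_plugin) / plain * 100).round
end

puts "Application 1: #{percentage_drop(16.06, 9.32)}% fewer requests per second"
puts "Application 2: #{percentage_drop(11.48, 8.41)}% fewer requests per second"
```

That works out to roughly 42% for the small application and 27% for the production application.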

Unresolved issues

By creating this plugin we haven’t resolved all our problems. One of the biggest problems is that we can only process UTF-8 encoded strings. Although all the input from forms should be in UTF-8 as specified in our HTML documents and headers, information from other sources, like the filesystem, could still be in a different encoding. In this setup we have to make sure we don’t send data in a different encoding than we promise in our headers. Sure, there are solutions like iconv to re-encode this data, but life would be a lot simpler if we didn’t have to think about this.

Summary

Although a native Ruby character set and encoding aware String class would be the ultimate solution, the Unicode hack plugin for Rails provides you with the tools to use UTF-8 in your Rails application. This support comes with a noticeable performance penalty.
