Encoding in Rails

Manfred Stienstra, 16 Jan 2006, 11:10 in ruby on rails and unicode, last updated 16 Sep 2006, 12:19.

By default, Ruby has no encoding or character set support in its String class. You’ll notice this when you try to slice a string on a multibyte character:

"Café"[2..3] #=> "f\303"

You get "f\303" instead of "fé" because Ruby slices the string on the byte and not on the character boundary.
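To see where the byte boundary falls, you can unpack the string into its individual bytes. This is only a small sketch; the string is written with octal escapes so it does not depend on the encoding of the source file, and the variable names are made up for the example.

```ruby
# "Caf\303\251" is "Café" spelled out as raw UTF-8 bytes.
cafe = "Caf\303\251"

# 'C*' unpacks every byte as an unsigned integer.
bytes = cafe.unpack('C*')
p bytes        # [67, 97, 102, 195, 169]

# Positions 2 and 3 hold "f" and only the first of the two bytes of "é".
# 195 is octal 303, which is the "\303" in the result above.
p bytes[2..3]  # [102, 195]
```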

Encodings and Character Sets

To represent a written language in a computer you need a character set and an encoding. The character set is a mapping from an integer to a glyph (an image representing a character). The encoding is the way you represent those integers in a sequence.

For instance the ASCII character set consists of only 128 characters and can thus be represented in just 7 bits. Encoding ASCII is trivial; each character fits in one byte and a string is a sequence of bytes.

Languages that use more than 256 glyphs can’t fit each character into one byte, so they need another way to encode their strings, for instance by using two bytes to encode one character. History has produced a long list of different character sets and encodings, certainly not all interchangeable.

Unicode is an effort to unite all known character sets, so we have to deal with only one. Unicode also makes it possible to use multiple languages with widely different characters in one document. UTF-8 and UTF-16 are the most widely used encodings for Unicode strings.
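As a rough illustration of the difference between a character set and an encoding, the sketch below takes a single Unicode code point and shows the bytes it turns into under UTF-8 and under big-endian UTF-16. The 'n' directive simply writes a 16-bit big-endian integer, so it only matches UTF-16 for code points that fit in a single 16-bit unit; that shortcut is an assumption of this example, not a general UTF-16 encoder.

```ruby
# U+00E9 is the code point for "é" in the Unicode character set.
code_point = 0xE9

# 'U' packs a code point as UTF-8; 'C*' then lists the resulting bytes.
utf8_bytes = [code_point].pack('U').unpack('C*')
p utf8_bytes   # [195, 169]

# 'n' packs a 16-bit big-endian integer, which coincides with UTF-16BE
# for this code point (assumption: no surrogate pairs are needed).
utf16_bytes = [code_point].pack('n').unpack('C*')
p utf16_bytes  # [0, 233]
```

The same integer ends up as two different byte sequences, which is why a program has to know the encoding before it can interpret a string.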

Fixing String in Ruby

The String class assumes that every byte in the string represents a single character. This means that String#length and String#slice may not return the result you expect if you’re handling multibyte characters.

Some of this behaviour can be fixed with the jcode library. The jcode library updates a number of methods on the String class: chop!, chop, delete!, delete, squeeze!, squeeze, succ!, succ, tr!, tr, tr_s!, and tr_s. It also adds jlength and jcount. The encoding assumed for all strings is set globally through the variable $KCODE.

$KCODE = 'UTF8'
require 'jcode'

"Café".jlength #=> 4

The drawback of this solution is that it doesn’t cover frequently used methods like String#slice.
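Without any library you can approximate a character-aware slice by converting the string to code points, slicing that array, and packing the result again. The helper name char_slice is hypothetical; it is not part of jcode or the Unicode library, and it assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: slice a UTF-8 string on character boundaries.
def char_slice(string, range)
  code_points = string.unpack('U*')   # one integer per character
  selected = code_points[range]       # nil when the range is out of bounds
  selected && selected.pack('U*')     # back to a UTF-8 string
end

cafe = "Caf\303\251"                  # "Café" as raw UTF-8 bytes

sliced = char_slice(cafe, 2..3)       # "f" followed by the complete "é"
p sliced.unpack('C*')                 # [102, 195, 169]
p cafe.unpack('U*').length            # 4
```

This works on code points, not on what a reader sees as one character, so a letter followed by a combining accent still counts as two.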

Another solution is the Unicode library by Yoshida Masato. This library contains some functions to handle UTF-8 encoded strings.

Fixing String in Rails

Following the howto on using Unicode strings in Rails can get a bit baroque to say the least.

It’s easy to break things; when a helper uses the slice method on a multibyte string, it might chop a character in half as we saw above. Even worse, validations like validates_length_of can fail or succeed when you don’t expect them to.
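The validation problem comes down to counting bytes where you meant to count characters. The two methods below are hypothetical stand-ins for a maximum-length check, not the code Rails uses; they only show how the two ways of counting disagree.

```ruby
# Hypothetical check that counts bytes, like a byte-oriented String#length.
def byte_length_ok?(string, maximum)
  string.unpack('C*').length <= maximum
end

# Hypothetical check that counts UTF-8 code points instead.
def char_length_ok?(string, maximum)
  string.unpack('U*').length <= maximum
end

cafe = "Caf\303\251"            # "Café": 4 characters, 5 bytes

p byte_length_ok?(cafe, 4)      # false
p char_length_ok?(cafe, 4)      # true
```

A four-character name is rejected by a maximum of 4 when bytes are counted, which is the kind of surprise described above.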

A nice way to fix these problems would be to have multibyte support in the String class in such a way that it works exactly like the current String class. It looks like this is planned for a future version of Ruby.

Julian Tarkhanov proposed fixing the String class by overriding the existing methods with Unicode-aware ones where necessary. He has hacked up a version using the Unicode library and packaged it as a plugin. The plugin also puts your database server in UTF-8 mode and updates the Content-Type header sent by Rails.

The UTF-8 hack

Testing the hack

Overriding default classes can be dangerous. Overriding widely used classes like the String class can be a recipe for disaster. Fortunately Rails has good test coverage so we can get an idea of how well the hack works.

I did an svn export of the Rails stable branch (essentially 1.0 with a few patches). Then I put string_overrides.rb from the plugin in the root of the export and I copied the Unicode library from my RubyGems directory to the root of the export, so I wouldn’t have to use require_gem. After that I added the following lines to all the libraries in Rails (actionpack, activerecord, etc):

$KCODE = 'UTF8'

$:.unshift(File.dirname(__FILE__) + '/../../')
require 'unicode'
require 'string_overrides'

The changes to the String class introduced 3 errors and 1 failure. The failure is because of a missing \n in the result, which seems to be harmless. I’m not sure if the three new errors are fatal, but I suppose they can be fixed. You can download the complete test results.

Speed

Adding UTF-8 support to the String class is likely to result in a speed penalty. Most web applications push a lot of text around, so using this hack could potentially lower the performance of an application substantially.

Testing is knowing, so I tested the plugin with two Rails applications. The first is a small application that displays a page with some messages and a form. The second application is a production application, on which I tested the account page. I used ab2, the Apache benchmark program, to measure the number of requests per second.

I have to warn in advance that this doesn’t give a very accurate impression of the actual performance, but it does give an impression of the speed penalty I got by using the Unicode plugin.

                 plain Rails       with the plugin
Application 1    16.06 reqs/sec    9.32 reqs/sec
Application 2    11.48 reqs/sec    8.41 reqs/sec

You can view the complete ab2 output.

The performance impact is certainly noticeable. Although I’m not certain what part of the implementation is causing this performance penalty, the introduced levels of indirection in the string class probably don’t help. A native C implementation could probably be a lot faster.
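To put the measurements in perspective, the relative drop in throughput can be worked out from the numbers in the table. The method name slowdown_percent is made up for this calculation.

```ruby
# Percentage of requests per second lost when the plugin is enabled.
def slowdown_percent(plain, with_plugin)
  (plain - with_plugin) / plain * 100
end

# Requests per second taken from the table above.
results = [
  ['Application 1', 16.06, 9.32],
  ['Application 2', 11.48, 8.41]
]

results.each do |name, plain, with_plugin|
  puts "%s: %.1f%% fewer requests per second" % [name, slowdown_percent(plain, with_plugin)]
end
# Application 1: 42.0% fewer requests per second
# Application 2: 26.7% fewer requests per second
```

The smaller application loses more in relative terms, which would fit the idea that string handling makes up a larger share of its request time, although these runs are too rough to prove that.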

Unresolved issues

By creating this plugin we haven’t resolved all our problems. One of the biggest problems is that we can only process UTF-8 encoded strings. Although all the input from forms should be in UTF-8 as specified in our HTML documents and headers, information from other sources, like the filesystem, could still be in a different encoding. In this setup we have to make sure we don’t send data in a different encoding than we promise in our headers. Sure, there are solutions like iconv to re-encode this data, but life would be a lot simpler if we didn’t have to think about this.
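For one common case, re-encoding does not need iconv at all. The sketch below converts ISO-8859-1 to UTF-8 and relies on the fact that every ISO-8859-1 byte value equals its Unicode code point. It is a stand-in under that assumption only; the helper name latin1_to_utf8 is hypothetical, and any other source encoding needs a real conversion library.

```ruby
# Hypothetical helper: re-encode ISO-8859-1 data as UTF-8.
# Assumption: the input really is ISO-8859-1, where byte value == code point.
def latin1_to_utf8(string)
  string.unpack('C*').pack('U*')
end

latin1 = "Caf\351"               # "Café" in ISO-8859-1, é is the single byte 0xE9
utf8 = latin1_to_utf8(latin1)

p latin1.unpack('C*')            # [67, 97, 102, 233]
p utf8.unpack('C*')              # [67, 97, 102, 195, 169]
```

The hard part in practice is not the conversion but knowing which encoding the incoming data uses in the first place.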

Summary

Although a native Ruby String class that is aware of character sets and encodings would be the ultimate solution, the Unicode hack plugin for Rails provides you with the tools to use UTF-8 in your Rails application. This support comes with a noticeable performance penalty.

Further reading

Comments

  1. out there 1 day later:

    "Fixing String in Rails" - yuck, this is wrong solution.

    More correct solution is to go for
    "Fixing Rails for multibyte Strings",
    but that requires a lot of work:
    1. make a list of "dangerous" or "incorrect" String methods - i.e. methods, which don't work reliably for UTF8
    2. Trace calls to them from Rails. Investigate each case. Fix if broken. (for example, one could inject Kernel#set_trace_func method in process of running Rails test suite and analyze call logs :)

    Other semi-solution is to make a checklist of Rails points (I don't think there are many) that can break Strings and avoid them totally (say, never use validates_length_of, truncate etc. )

  2. Manfred 1 day later:

    I don't agree. If you fix Rails to work with the broken String class, you will have to un-fix it again when a Ruby version with a proper String class comes out. On top of that, it's _way_ more work than just fixing the String class.

    The second 'solution' is not viable. We want to use all the Rails goodies, those are what make Rails such a nice framework to work with.

  3. out there 1 day later:

    String class is not broken - it is a byte vector by design. Only expectations of it to behave well on multibyte encoded chars are false.
    Un-fix? I doubt. Anyway while fixing, such places can be once marked, documented and then easily found, if "unfix" will be necessary.
    And, even in Ruby 2.0 String will stay byte-oriented. Just maybe chars() and encoding() methods will be added, but anyway one will have to take care, and rewrite code to deal with it.

    When goodies are broken, they are indeed baddies :)

  4. Manfred 1 day later:

    I don't think there is much sense in arguing whether we should call it by design or broken, the fact is that the easiest and, in my opinion, the best place to resolve encoding problems is in the String class.

    If the String class stayed byte oriented in Ruby 1.9/2.0 I would be very sad. I think that every mature language should address these problems in an elegant way.

  5. rick 1 day later:

    Here's some info on how Ruby 2.0 strings will be:
    http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

  6. Manfred 1 day later:

    I know, it's linked in the third paragraph of `Fixing String in Rails' (:

  7. out there 1 day later:

    I did some tracing for String methods (rejecting "safe") from Rails over rails test unit suite, using following filter:
    set_trace_func proc { |event, file, line, id, binding, classname|
      if classname.to_s == 'String' && event =~ /call/ && file !~ /^\/usr\/lib/ && id.to_s !~ /^(\+|<<|concat|==|<=>|to_s|gsub|upcase|downcase|sub|gsub|\*|include|inspect|split|scan|to_i|intern|initialize|%)/
        printf("%8s %s:%-2d %10s %8s\n", event, file, line, id, classname)
      end
    }

    After sorting the log and reviewing it, I think that these are almost the only places to fix, not String class:
    active_support/core_ext/blank.rb
    active_support/core_ext/string/access.rb
    active_support/core_ext/string/starts_ends_with.rb
    action_view/helpers/text_helper.rb
    active_record/validations.rb
    In other places there is no real need to fix neither Rails, nor String.

    Btw, String#[] method is in top 5 requested (with +, *, <<), so overloading it is serious impact and Julian's string_overrides plugin can anyway fail in interesting manner:
    as overloaded String#length returns number of codepoints for NFKC string, so for string
    "éèùçàâêîôûæœëïÿü" it will be 16.
    Suppose we pass this string through validates_length_of :maximum => 16 to be stored in UTF8 VARCHAR(16) database field.
    so, obj.save will pass validation, but throw exception from database.

    Fix Rails for Strings.
    I'll just stay away from those methods, as being too lazy to fix, write tests and send patches.

  8. rick 1 day later:

    Maybe the length method shouldn't be overridden and a new method should be introduced to find the number of codepoints, like the jcode library does...

  9. izidor 3 days later:

    I tried to use Julik's plugin and found it causes problems when Rails is running with Webrick.

    To me it seems very wrong to override String#slice and similar, because there exists code which always uses #slice in byte-oriented manner (e.g. db adapters, network code, ...).

    Julik's plugin changes behaviour according to string content and this causes errors when code always wants byte-oriented #slice of data (e.g. net packets or files and such), but the data is also valid utf8.

  10. Julik 17 days later:

    Yes it does but it was an experiment! Which shows that _some_ things break (low level code), while most of the things start working much better (all of the frontend code).
    I still think that using String as a ByteArray is wrong (because String is a String, it's even called that). Every programmer out there will just use String instead of its Unicode variant - look at how Python people are struggling.

  11. phil76 122 days later:

    thanks for the good summary on the topic. the rails developers attention to multibyte / i18n issues is certainly disappointing.

  12. Thijs van der Vossen 122 days later:

    I'm not sure I agree with you on that. I think the problem is more the lack of a proper unicode string class in the current version of Ruby.

  13. Ben 749 days later:

    Thanks for the advice. This has been driving me up the wall. I was pretty shocked when I realised how poor multibyte string support is, considering how slick the rest of the setup is.

  14. Manfred Stienstra 749 days later:

    Actually, we've improved the situation quite a bit since 2006. Check out the cool .chars accessor on String!
