An excellent and pragmatic proposal for easier Unicode support in Rails

Thijs van der Vossen, 16 Jun 2006, 10:45 in ruby on rails, web, and unicode, last updated 16 Sep 2006, 12:18.

The current Ruby version has no Unicode String class like the ones in Python or Java. This makes it hard for Rails to support multibyte encodings.

The following code snippet from the truncate helper is a good example:

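# l is defined earlier in the helper (not shown here)
if $KCODE == "NONE"
  # no multibyte encoding set: slice the string by bytes
  text.length > length ? text[0...l] + truncate_string : text
else
  # multibyte encoding set: split into characters before slicing
  chars = text.split(//)
  chars.length > length ? chars[0...l].join + truncate_string : text
end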

This was added to make the helper work with multibyte characters, but it is far from beautiful.
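
To make the problem concrete, here is a small sketch of how byte-based and character-based operations diverge on a UTF-8 string. It is written for a current Ruby rather than the Ruby of this post, so `bytesize` and `byteslice` stand in for the byte-oriented `length` and `[]` of older versions. The sample string is made up for illustration.

```ruby
# Illustration only: "café" is four characters but five bytes in UTF-8.
text = "caf\u00e9"

byte_count = text.bytesize              # counts bytes, not characters
char_count = text.unpack("U*").length   # decodes UTF-8 and counts code points

# Cutting at a byte offset can split the two-byte "é" in half.
broken = text.byteslice(0, 4)

puts byte_count               # 5
puts char_count               # 4
puts broken.valid_encoding?   # false
```

A helper that slices by bytes can therefore hand back a string that ends in half a character, which is what the `split(//)` branch above works around.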

A few days ago, Julian proposed adding a proxy to the String class for accessing characters instead of bytes. I think this is an excellent solution.

You access the proxy with the chars method on a string object. For example, you can get the number of characters with:

text.chars.length

The chars method is aliased as u, so you can also write:

text.u.length

To me, that looks even nicer.
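
To give a feel for the idea, here is a minimal sketch of what such a proxy might look like. It is not the plugin's implementation: the class name `CharsProxy` is hypothetical, it covers only `length` and range slicing, it assumes UTF-8 input, and it deliberately leaves the String class itself alone.

```ruby
# Hypothetical, simplified stand-in for the proposed proxy (assumes UTF-8 input).
class CharsProxy
  def initialize(string)
    # Decode the UTF-8 bytes into an array of code points once, up front.
    @codepoints = string.unpack("U*")
  end

  # Number of code points rather than number of bytes.
  def length
    @codepoints.length
  end

  # Slice by code point positions and return an ordinary UTF-8 string.
  def [](range)
    @codepoints[range].pack("U*")
  end

  # Re-encode the code points back into a UTF-8 string.
  def to_s
    @codepoints.pack("U*")
  end
end

proxy = CharsProxy.new("na\u00efve")
puts proxy.length    # 5, although the string is 6 bytes long
puts proxy[0...3]    # first three characters, with the "ï" left intact
```

The real proxy aims to cover all string methods this way; the sketch only shows the shape of the approach.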

Using the proxy, you could replace the six lines of code from the truncate helper with:

text.chars.length > length ? text.chars[0...l] + truncate_string : text

That’s a whole lot more obvious. And don’t be fooled: this is just as fast as the longer version, because the proxy only uses the multibyte-safe methods when $KCODE is set.
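
As a rough, self-contained illustration of character-based truncation, the sketch below does the same job without the proxy. Everything here is an assumption made for the example: the method name `truncate_chars` is hypothetical, the default arguments are made up, the input is taken to be UTF-8, and `l` is taken to be the target length minus the length of the suffix, since the snippet above does not show where `l` comes from.

```ruby
# Hypothetical helper, not the Rails implementation (assumes UTF-8 input).
def truncate_chars(text, length = 30, truncate_string = "...")
  return text if text.nil?

  chars = text.unpack("U*")                         # code points, not bytes
  # Assumption: l leaves room for the suffix within the target length.
  l = length - truncate_string.unpack("U*").length

  # Only truncate when the text is longer than the limit in characters.
  chars.length > length ? chars[0...l].pack("U*") + truncate_string : text
end

puts truncate_chars("h\u00e9llo w\u00f6rld", 8)   # héllo...
puts truncate_chars("abc", 8)                     # abc
```

Unlike the proxy, this sketch always takes the multibyte-safe path, so it does not show the $KCODE check that keeps the single-byte case fast.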

Apart from making the Rails code easier to understand and maintain, the proxy can also save application developers a lot of work.

The proxy and the patches to the Rails code are currently in development as a plugin you can get from Subversion. Even though the plugin is called ‘Unicode hacks’ for historical reasons, it’s actually a very clean solution by now. There’s also a proposed patch to the Rails source.

Please try this one out and give your feedback.

Comments

  1. adnans about 1 hour later:

    Is the proxy only for determining the correct length of Unicode strings, or...?

  2. Thijs van der Vossen about 2 hours later:

    No, the proxy gives you access to *all* string methods in a multibyte-safe way.

  3. Thijs van der Vossen about 2 hours later:

    To put it differently, the proxy behaves like you would expect from a proper Unicode String object.

  4. dseverin about 4 hours later:

    As pointed out in the
    "RubyOnRails to russian" Google group (look at the benchmark results attached to the message at
    http://groups.google.com/group/ror2ru/msg/28a15f258abd2562 ),
    the Unicode module, heavily used in "unicode_hacks", is slow and buggy: several proxy methods call Unicode::normalize_KC, which is slow, and the module does not conform to a recent version of Unicode and breaks on strings containing a NULL (\000) character:
    > irb -r ./unicode.so
    irb(main):001:0> Unicode.upcase("test\0test")
    => "TEST"

    So, though the approach might seem good, the implementation is somewhat broken.
    Be careful!

  5. Julik about 4 hours later:

    No, the proxy is for managing everything in characters. Lengths, concatenation, slicing, iteration....

  6. Thijs van der Vossen about 4 hours later:

    I think we could really use a pure Ruby Unicode processing library that all interested parties can submit patches and test cases for.

  7. Julik about 4 hours later:

    Nick, the author of ICU4R, has been invited to take part in development. We really hope for his support.

  8. Peter Cooper about 9 hours later:

    Excellent work! I'm wondering why this is a Rails plugin rather than a RubyGem, though, as it seems more Ruby related than Rails related. I'll have to look at the source and see if this is possible, as I'd love this sort of support further out into Ruby.

  9. Julik about 9 hours later:

    Because we want this built into ActiveSupport, available to any Rails app and used in Rails itself. It will certainly be possible to use it outside of ActiveSupport when we're done, though.

  10. Thijs van der Vossen about 12 hours later:

    Peter, I was discussing the RubyGem thing with Julik earlier today and he had a very good point that stopped me whining almost immediately.

    There are a lot of different libraries and techniques already out there if you need to handle multibyte encodings in your Ruby application. There has also been a lot of talk about how to properly support Unicode in Ruby over the last I don't know how many years.

    We're really not trying to design *the* proper way to handle Unicode in Ruby, or as Julik said "the point is not in brokenness - it's more about providing Rails with some foundation to stand on until Matz gets his act together regarding M17N."

    What we want is something that makes it easier to support Unicode in Rails *now*, both for developers working on the framework itself and for developers using the framework to build web applications.

  11. Peter Cooper about 13 hours later:

    I certainly don't disagree with those arguments, but the support in Ruby is so spread out and incomplete that having one consistent library, such as this, elsewhere would be useful. Sadly it seems if you want to have semi-robust Unicode support you need to start coding your own hacks to deal with those odd methods that haven't been implemented yet.. but one reasonably proven mixin would be great.
