Rails encoding trouble in the wild

Thijs van der Vossen, 27 Jan 2006, 10:55 in ruby on rails and unicode.

If you’re still unsure if you should worry about utf-8 encoded strings, look no further than the Projectionist RSS feed:

See those &#82’s in the headlines list and the big R in the title? That’s where a poor right double quotation mark was split in half.

Chad Fowler on Monkeypatching

Manfred Stienstra, 26 Jan 2006, 09:26 in ruby on rails.

Chad Fowler explains why opening classes in Ruby is a good idea. He also mentions how plugins allow the Rails core team to push out ‘unstable’ features to the public.

The Rails team is already using plugins like this as a way to get new functionality into circulation as a kind of proving ground before things make it into Rails core.

I really like this model because it allows the Rails core team to keep the svn trunk free of unstable code. The code can be tested in the wild through plugins; once a feature is used enough and appears to be stable, it can be included. The downside is that plugins which open Rails classes have to be rewritten for every release.

Zeldman just made me smile

Thijs van der Vossen, 17 Jan 2006, 21:04 in web.

Read all about Web 3.0 on A List Apart:

To you who are toiling over an AJAX- and Ruby-powered social software product, good luck, God bless, and have fun. Remember that 20 other people are working on the same idea. So keep it simple, and ship it before they do, and maintain your sense of humor whether you get rich or go broke. Especially if you get rich. Nothing is more unsightly than a solemn multi-millionaire.

To you who feel like failures because you spent last year honing your web skills and serving clients, or running a business, or perhaps publishing content, you are special and lovely, so hold that pretty head high, and never let them see the tears.

Encoding in Rails

Manfred Stienstra, 16 Jan 2006, 11:10 in ruby on rails and unicode.

By default, Ruby has no encoding or character set support in its String class. You’ll notice this when you try to slice a string on a multibyte character:

"Café"[2..3] #=> "f\303"

You get "f\303" instead of "fé" because Ruby slices the string on the byte and not on the character boundary.
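For comparison, the same split can be reproduced on a current Ruby. This is a minimal sketch that assumes Ruby 2.0 or later, where source files default to UTF-8, String#[] counts characters and String#byteslice counts bytes. None of this applies to the Ruby version discussed in this post.

```ruby
# Assumes Ruby 2.0 or later (UTF-8 source, encoding-aware strings).
word = "Café"

# "é" takes two bytes in UTF-8, so the string is 4 characters but 5 bytes.
char_count = word.length
byte_count = word.bytesize

# Slicing by bytes cuts "é" in half and leaves an invalid string.
by_bytes = word.byteslice(2, 2)

# Slicing by characters keeps "é" intact.
by_chars = word[2..3]

puts char_count                 # 4
puts byte_count                 # 5
puts by_bytes.valid_encoding?   # false
puts by_chars                   # fé
```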

Encodings and Character Sets

To represent a written language in a computer you need a character set and an encoding. The character set is a mapping from an integer to a glyph (an image representing a character). The encoding is the way you represent those integers in a sequence.

For instance the ASCII character set consists of only 128 characters and can thus be represented in just 7 bits. Encoding ASCII is trivial; each character fits in one byte and a string is a sequence of bytes.

Languages that use more than 256 glyphs can’t fit each character into one byte, so they need another way to encode their strings, for instance by using two bytes to encode one character. History has produced a long list of different character sets and encodings, and they are certainly not all interchangeable.

Unicode is an effort to unite all known character sets, so we have to deal with only one. Unicode also makes it possible to combine multiple languages that use widely different characters in one document. UTF-8 and UTF-16 are the most widely used encodings for Unicode strings.
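The difference between the character set and the encoding shows up when one character is encoded in two ways. The sketch below assumes Ruby 2.0 or later, where String#encode is available and source files default to UTF-8; it is an illustration, not something the Ruby version discussed in this post can run.

```ruby
# Assumes Ruby 2.0 or later.
char = "é"

# The character set assigns the character a number (its code point).
code_point = char.unpack("U*").first

# An encoding decides how that number is laid out as bytes.
utf8_bytes  = char.encode("UTF-8").bytes
utf16_bytes = char.encode("UTF-16BE").bytes

puts code_point            # 233
puts utf8_bytes.inspect    # [195, 169]
puts utf16_bytes.inspect   # [0, 233]
```

The code point is the same in both cases; only the byte sequence differs.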

Fixing String in Ruby

The String class assumes that every byte in the string represents a single character. This means that String#length and String#slice may not return the result you expect if you’re handling multibyte characters.

Some of this behaviour can be fixed with the jcode library. The jcode library updates a number of methods on the String class: chop!, chop, delete!, delete, squeeze!, squeeze, succ!, succ, tr!, tr, tr_s!, and tr_s. It also adds jlength and jcount. The encoding assumed for all strings is set globally through the $KCODE variable.

$KCODE = 'UTF8'
require 'jcode'

"Café".jlength #=> 4

The drawback of this solution is that it doesn’t cover frequently used methods like String#slice.
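One way to get a character-aware slice without extra libraries is to go through code points with String#unpack and Array#pack, which both understand the U (UTF-8) directive. The helper name utf8_slice is hypothetical and is not part of jcode; this is only a sketch of the idea, written for a UTF-8 source file.

```ruby
# Hypothetical helper: slice a UTF-8 string by characters instead of bytes.
def utf8_slice(string, start, length)
  code_points = string.unpack("U*")               # one integer per character
  (code_points[start, length] || []).pack("U*")   # back to a UTF-8 string
end

puts utf8_slice("Café", 2, 2)   # fé
```

Unpacking the whole string for every slice is wasteful, so this is more of a demonstration than something to use on large texts.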

Another solution is the Unicode library by Yoshida Masato. This library contains some functions to handle UTF-8 encoded strings.

Fixing String in Rails

Following the howto on using Unicode strings in Rails can get a bit baroque to say the least.

It’s easy to break things; when a helper uses the slice method on a multibyte string, it might chop a character in half as we saw above. Even worse, validations like validates_length_of can fail or succeed when you don’t expect them to.
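To see why a length validation can misfire, compare the byte count with the character count of the same value. This is a minimal sketch, assuming Ruby 2.0 or later for String#bytesize and a UTF-8 source file. The helper names and the limit of 4 are made up for illustration and do not come from Rails.

```ruby
# What a byte-oriented String effectively checks.
def within_limit_by_bytes?(value, limit)
  value.bytesize <= limit
end

# What the user expects: one unit per character.
def within_limit_by_chars?(value, limit)
  value.unpack("U*").length <= limit
end

puts within_limit_by_bytes?("Café", 4)   # false, "Café" is 5 bytes
puts within_limit_by_chars?("Café", 4)   # true, "Café" is 4 characters
```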

A nice way to fix these problems would be to have multibyte support in the String class in such a way that it works exactly like the current String class. It looks like this is planned for a future version of Ruby.

Julian Tarkhanov proposed to fix the string class by overriding the existing one with Unicode aware methods where necessary. He has hacked up a version using the Unicode library and packaged it as a plugin. The plugin also puts your database server in UTF-8 mode and updates the Content-Type header sent by Rails.

The UTF-8 hack

Testing the hack

Overriding default classes can be dangerous. Overriding widely used classes like the String class can be a recipe for disaster. Fortunately Rails has good test coverage so we can get an idea of how well the hack works.

I did an svn export of the Rails stable branch (essentially 1.0 with a few patches). Then I put string_overrides.rb from the plugin in the root of the export and I copied the Unicode library from my RubyGems directory to the root of the export, so I wouldn’t have to use require_gem. After that I added the following lines to all the libraries in Rails (actionpack, activerecord, etc):

$KCODE = 'UTF8'

$:.unshift(File.dirname(__FILE__) + '/../../')
require 'unicode'
require 'string_overrides'

The changes to the String class introduced 3 errors and 1 failure. The failure is because of a missing \n in the result, which seems to be harmless. I’m not sure if the three new errors are fatal, but I suppose they can be fixed. You can download the complete test results.

Speed

Adding UTF-8 support to the String class is likely to result in a speed penalty. Most web applications push a lot of text around, so using this hack could potentially lower the performance of an application substantially.

Testing is knowing, so I tested the plugin with two Rails applications. The first is a small application that displays a page with some messages and a form. The second application is a production application, on which I tested the account page. I used ab2, the Apache benchmark program, to measure the number of requests per second.

I have to warn in advance that this doesn’t give a very accurate impression of the actual performance, but it does give an impression of the speed penalty I got by using the Unicode plugin.

               plain Rails      with the plugin
Application 1  16.06 reqs/sec   9.32 reqs/sec
Application 2  11.48 reqs/sec   8.41 reqs/sec

You can view the complete ab2 output.

The performance impact is certainly noticeable. Although I’m not certain what part of the implementation is causing this performance penalty, the introduced levels of indirection in the string class probably don’t help. A native C implementation could probably be a lot faster.
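To put a number on the penalty, the relative drop in throughput can be computed from the figures in the table above. The helper name slowdown_percent is just for this sketch.

```ruby
# Requests per second from the table above: [plain Rails, with the plugin].
RESULTS = {
  "Application 1" => [16.06, 9.32],
  "Application 2" => [11.48, 8.41]
}

# Percentage of throughput lost by enabling the plugin.
def slowdown_percent(plain, with_plugin)
  ((plain - with_plugin) / plain * 100).round(1)
end

RESULTS.each do |name, (plain, with_plugin)|
  puts "#{name}: #{slowdown_percent(plain, with_plugin)}% fewer requests per second"
end
# Application 1: 42.0% fewer requests per second
# Application 2: 26.7% fewer requests per second
```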

Unresolved issues

By creating this plugin we haven’t resolved all our problems. One of the biggest problems is that we can only process UTF-8 encoded strings. Although all the input from forms should be in UTF-8 as specified in our HTML documents and headers, information from other sources, like the filesystem, could still be in a different encoding. In this setup we have to make sure we don’t send data in a different encoding than we promise in our headers. Sure, there are solutions like iconv to re-encode this data, but life would be a lot simpler if we didn’t have to think about this.
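As an illustration of such a re-encoding step, here is a minimal sketch that converts ISO-8859-1 bytes to UTF-8. It assumes Ruby 2.0 or later and uses String#encode as a stand-in for iconv; the helper name to_utf8 is hypothetical.

```ruby
# Hypothetical helper: declare what the bytes are, then convert to UTF-8.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "Café" as it would arrive from an ISO-8859-1 source: "é" is the byte 233.
latin1_bytes = [67, 97, 102, 233].pack("C*")

utf8 = to_utf8(latin1_bytes, "ISO-8859-1")
puts utf8            # Café
puts utf8.bytesize   # 5
```

The hard part in practice is knowing the source encoding in the first place; the conversion itself is the easy step.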

Summary

Although a native Ruby character set and encoding aware String class would be the ultimate solution, the Unicode hack plugin for Rails provides you with the tools to use UTF-8 in your Rails application. This support comes with a noticeable performance penalty.

Poking at Java programmers

Thijs van der Vossen, 08 Jan 2006, 20:41 in ruby on rails.

David pokes at Java programmers

Stills from the Snakes and Rubies video

David explains how you should market your project at the Snakes and Rubies event:

Well, making a stir. Like taking a big target and then picking on it. I really recommend Java, it works great. There are so many Java programmers out there and you just have to poke at them a little bit and then they go like bananas and link to you like mad. And if you poke them in the eyes they’ll go even better bananas.

That works pretty well for like breaking through the ‘early awareness wall’ where basically what you need is just get it out there. Then you probably want to switch horses at some point in the game when you want those Java programmers to come over. That’s a good idea and I hope that we’ve done that at least somewhat when we can resist the temptations.

If you’re in any way interested in Rails or Django or both, you’ll probably enjoy the video of the event.

What is your TextMate serial number?

Thijs van der Vossen, 06 Jan 2006, 20:23 in tools.

Allan on the sales figures for TextMate:

[…] the serial number on your license (should you have bought one) is the actual customer number, so this will tell you how many licenses were sold before you purchased yours.

I’ve got 619. What’s your number?

Vim and TextMate followup

Manfred Stienstra, 04 Jan 2006, 15:35 in tools.

Yesterday Kevin dropped a message asking for more Vim mappings.

I was wondering how much could be done using Vim scripts, and it looks like I wasn’t the only one. Felix Ingram apparently created a TextMate snippet emulation script. As far as I can see, it emulates tabbing through the arguments in a snippet.

Luminous

Thijs van der Vossen, 03 Jan 2006, 20:40.

Michael Barrish, the writer of one of my favourite weblogs, just launched his new business site.

Lazily sweeping the whole Rails page cache

Thijs van der Vossen, 03 Jan 2006, 15:11 in ruby on rails.

One of the more convenient features in Ruby on Rails is page caching. Simply add caches_page :show to the top of a controller class, and all pages rendered by the show action are written to disk automatically. On subsequent requests, these pages will be served straight from disk without invoking Rails at all.

This works because of rewrite rules that basically tell the webserver to append .html to the request path. If the webserver can find a file using the resulting path, the webserver will send it. If not, then Rails will handle the request.
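The lookup the webserver performs can be imitated in a few lines of plain Ruby. This is a simplified sketch, not the webserver's actual logic: cached_file_for is a hypothetical helper, and it ignores details such as query strings.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the rewrite rules: return the cached file for a
# request path, or nil when the request has to be handled by Rails.
def cached_file_for(public_dir, request_path)
  relative  = request_path == "/" ? "/index" : request_path
  candidate = File.join(public_dir, relative + ".html")
  File.file?(candidate) ? candidate : nil
end

Dir.mktmpdir do |public_dir|
  # Pretend Rails has already cached /post/show/1.
  FileUtils.mkdir_p(File.join(public_dir, "post/show"))
  File.open(File.join(public_dir, "post/show/1.html"), "w") { |f| f.write("cached") }

  puts cached_file_for(public_dir, "/post/show/1").inspect  # path ending in post/show/1.html
  puts cached_file_for(public_dir, "/post/show/2").inspect  # nil, so Rails renders it
end
```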

Pages are removed from the cache simply by deleting them from the public directory. Rails provides the expire_page method and sweepers to help with this.

Sweeping is hard

Suppose you are writing a blogging application and you decide to add page caching. When a post is updated, the cached page that shows the post has to be removed. You write a post sweeper for this:

class PostSweeper < ActionController::Caching::Sweeper
  observe Post

  def after_save(post)
    expire_page(:controller => "post", :action => "show", :id => post.id)
  end
end

But wait… Your blog also has a front page listing the most recent posts. The updated post might be included there, so you need to expire that page too.

def after_save(post)
  expire_page(:controller => "post", :action => "show", :id => post.id)
  expire_page(:controller => "post", :action => "index")
end

Then you realize you also have archive pages and category overviews…

def after_save(post)
  expire_page(:controller => "post", :action => "show", :id => post.id)
  expire_page(:controller => "post", :action => "index")
  expire_page(:controller => "archive", :action => "show", 
    :year => post.published_at.year)
  expire_page(:controller => "archive", :action => "show", 
    :year => post.published_at.year, :month => post.published_at.month)
  expire_page(:controller => "archive", :action => "show", 
    :year => post.published_at.year, :month => post.published_at.month, 
    :day => post.published_at.mday)
  post.categories.each do |category|
    expire_page(:controller => "category", :action => "show", 
      :id => category.id)
  end
end

Ok, but what if a post is destroyed? And what exactly should happen when a category is renamed? And…

When you have an application where a single change can invalidate a large number of pages, the sweepers can get quite complex. It’s easy to forget to expire one or more pages, leading to subtle bugs where old pages are served from a stale cache.
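To get a feel for how quickly the list grows, the pages touched by a single post can be enumerated in plain Ruby. This is only a sketch: the Post struct, the pages_to_expire helper and the URL layout are hypothetical stand-ins for the model and routes of a real application.

```ruby
require "date"

# Hypothetical stand-in for the ActiveRecord model.
Post = Struct.new(:id, :published_at, :category_ids)

# Every cached page that shows (part of) the given post.
def pages_to_expire(post)
  date = post.published_at
  pages = [
    "/post/show/#{post.id}",
    "/post",
    "/archive/show/#{date.year}",
    "/archive/show/#{date.year}/#{date.month}",
    "/archive/show/#{date.year}/#{date.month}/#{date.mday}"
  ]
  pages + post.category_ids.map { |id| "/category/show/#{id}" }
end

post = Post.new(7, Date.new(2006, 1, 3), [1, 4])
puts pages_to_expire(post)
puts pages_to_expire(post).length   # 7
```

A post in two categories already touches seven pages, and the same list is needed again when the post is destroyed.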

An obvious solution to this would be to just sweep all pages after each change. Sadly, this is not possible with page caching because Rails does not keep a list of cached pages. The files are written directly to the public directory, so there’s no way to cleanly delete them all.

Lazy sweeping

The underlying problem is that we can’t be sure which files in the public directory are cached copies and which are static HTML. We’ve tried to solve this by moving all cached pages to a public/cache subdirectory. This seems to work fine for us.

In config/environment.rb, change the page cache directory from the default by adding the following line inside the Rails::Initializer.run block.

config.action_controller.page_cache_directory = RAILS_ROOT+"/public/cache/"

Then change the rewrite rules in the webserver configuration. For lighttpd (config/lighttpd.conf) these should be changed to:

url.rewrite = ( 
  "^/$" => "cache/index.html", 
  "^([^.]+)$" => "cache$1.html" )

For Apache (public/.htaccess) the first two rules probably need to be changed to:

RewriteRule ^$ cache/index.html [QSA]
RewriteRule ^([^.]+)$ cache/$1.html [QSA]
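The two lighttpd patterns can be tried out in Ruby to see which requests end up in the cache directory. This is a minimal sketch, assuming Ruby's regular expressions behave like the webserver's for these simple patterns; rewrite is a hypothetical helper and not part of lighttpd, Apache or Rails.

```ruby
# The lighttpd rules, expressed as [pattern, replacement] pairs.
REWRITE_RULES = [
  [/^\/$/,      "cache/index.html"],
  [/^([^.]+)$/, "cache\\1.html"]
]

# Hypothetical helper: apply the first matching rule, or return nil when no
# rule matches and the request is passed on unchanged.
def rewrite(path)
  REWRITE_RULES.each do |pattern, replacement|
    return path.sub(pattern, replacement) if path =~ pattern
  end
  nil
end

puts rewrite("/").inspect                      # "cache/index.html"
puts rewrite("/post/show/1").inspect           # "cache/post/show/1.html"
puts rewrite("/stylesheets/site.css").inspect  # nil
```

Paths that contain a dot, such as stylesheets and images, never match the second pattern, so they are served without a detour through the cache directory.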

We use the following in app/models/site_sweeper.rb as a single sweeper for all the models in our application.

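require 'fileutils'

class SiteSweeper < ActionController::Caching::Sweeper
  observe Post, Category

  # Any change to an observed model invalidates the whole cache.
  def after_save(record)
    self.class.sweep
  end

  def after_destroy(record)
    self.class.sweep
  end

  # Delete everything inside the page cache directory.
  def self.sweep
    cache_dir = ActionController::Base.page_cache_directory
    # Never sweep when pages are still cached directly in public/.
    unless cache_dir == RAILS_ROOT+"/public"
      begin
        FileUtils.rm_r(Dir.glob(cache_dir+"/*"))
      rescue Errno::ENOENT
        # A file was already gone, so there is nothing left to remove.
      end
      RAILS_DEFAULT_LOGGER.info("Cache directory '#{cache_dir}' fully swept.")
    end
  end
end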
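
The effect of the sweep can be checked outside Rails with a temporary directory. This is a sketch under stated assumptions: sweep_cache is a hypothetical plain-Ruby version of the sweep method, and the file names are invented for the demonstration.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical plain-Ruby version of the sweep: remove everything inside
# cache_dir, but refuse to run when cache_dir is public_dir itself.
def sweep_cache(public_dir, cache_dir)
  return false if File.expand_path(cache_dir) == File.expand_path(public_dir)
  FileUtils.rm_rf(Dir.glob(File.join(cache_dir, "*")))
  true
end

Dir.mktmpdir do |public_dir|
  cache_dir = File.join(public_dir, "cache")
  FileUtils.mkdir_p(File.join(cache_dir, "post/show"))
  File.open(File.join(cache_dir, "post/show/1.html"), "w") { |f| f.write("cached") }
  File.open(File.join(public_dir, "404.html"), "w") { |f| f.write("static") }

  sweep_cache(public_dir, cache_dir)

  puts Dir.glob(File.join(cache_dir, "*")).empty?      # true
  puts File.file?(File.join(public_dir, "404.html"))   # true
end
```

The static 404.html survives because only the contents of the cache subdirectory are removed.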

Finally assign the site sweeper to all controllers and actions that may invalidate the cache.

cache_sweeper :site_sweeper, :only => [:add, :update, :destroy]

We’ve also added the following script as script/sweep_cache to easily sweep the cache during development.

#!/usr/bin/ruby
require File.dirname(__FILE__) + '/../config/boot'
require File.dirname(__FILE__) + '/../config/environment'
SiteSweeper::sweep

This approach can be extended very nicely for the subdomains as account keys pattern. More on that later.
