URoR 2: KCODE

Thijs van der Vossen

URoR stands for ‘Unicode Ruby on Rails’, a series on using Unicode with Rails. In this second article I’ll show you how to enable the (somewhat limited) UTF-8 support in Ruby and Rails. (first article)

Let’s break a string

Suppose you’re using the truncate helper like this:

<p><%= truncate 'Iñtërnâtiônàlizætiøn', 12 %></p>

The result is something like:

Iñtërn?

Because the helper truncates the string to 12 bytes, it slices the two-byte UTF-8 sequence for the ‘â’ in half. The result is an invalid byte sequence which can’t be rendered.
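To make the byte arithmetic concrete, here is a small sketch that counts bytes and codepoints in the sample string and cuts it in the middle of the ‘â’. The helpers first_bytes and decodes_as_utf8? are hypothetical names made up for this illustration, and the cut point of 9 assumes that three of the 12 units are reserved for a trailing "...", which is what the outputs in this article suggest. It only uses String#unpack and Array#pack, so it does not depend on KCODE at all.

```ruby
# Sample string from the article; this file is assumed to be saved as UTF-8.
SAMPLE = 'Iñtërnâtiônàlizætiøn'

# Hypothetical helper: the first +count+ bytes of +str+ as a new string.
def first_bytes(str, count)
  str.unpack('C*')[0...count].pack('C*')
end

# Hypothetical helper: true if +str+ decodes cleanly as UTF-8.
def decodes_as_utf8?(str)
  str.unpack('U*') # raises ArgumentError on a malformed sequence
  true
rescue ArgumentError
  false
end

puts SAMPLE.unpack('C*').length   # 27 bytes
puts SAMPLE.unpack('U*').length   # 20 codepoints

# Assumed cut point: 12 minus 3 units reserved for a trailing "...".
cut = first_bytes(SAMPLE, 9)
puts cut.unpack('C*').last.to_s(16)   # c3, the lead byte of the two-byte 'â'
puts decodes_as_utf8?(cut)            # false
```

The ninth byte is the first half of ‘â’; without its second byte the sequence is malformed, which is why the browser shows a placeholder instead of a character.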

Fix this by adding the following to the top of config/environment.rb:

# Add basic utf-8 encoding support 
$KCODE = 'UTF8'

And the result will be:

Iñtërnâti…

The string is now truncated to 12 codepoints instead of 12 bytes.

What’s happening here?

Setting KCODE to 'UTF8' tells Ruby that your source code is encoded as UTF-8. Some libraries, like CGI, and some parts of Rails look at KCODE to find out whether they need to process strings in a UTF-8-friendly way.
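The pattern of looking at KCODE before deciding how to process a string might look roughly like the sketch below. This is not the actual Rails code: kcode_aware_truncate is a hypothetical helper, the "..." omission is an assumption, and the kcode argument stands in for the global $KCODE so that the two behaviours can be compared side by side. For simplicity, any value other than 'NONE' is treated as UTF-8.

```ruby
# Hypothetical helper, not the Rails implementation.
# +kcode+ stands in for the global $KCODE ('NONE' is the Ruby 1.8 default).
def kcode_aware_truncate(text, length, kcode, omission = '...')
  # Count bytes when no encoding is configured, codepoints otherwise.
  format = (kcode == 'NONE') ? 'C*' : 'U*'
  units  = text.unpack(format)
  return text if units.length <= length

  # Reserve room for the omission so the total stays at +length+ units.
  units[0...(length - omission.length)].pack(format) + omission
end

text = 'Iñtërnâtiônàlizætiøn'

broken = kcode_aware_truncate(text, 12, 'NONE')
fixed  = kcode_aware_truncate(text, 12, 'UTF8')

puts broken.unpack('C*').length   # 12 bytes, ending in half of 'â' plus "..."
puts fixed.unpack('U*').length    # 12 codepoints
puts fixed                        # Iñtërnâti...
```

Passing the setting in as an argument is only a convenience for the sketch; a helper in a real application of this era would read $KCODE directly.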

You can now also require the jcode library to get some basic UTF-8 encoding support in Ruby. More about this in a future article.

Not all is good

Although it’s great that truncate has been fixed to work with UTF-8, you should be aware that this is not the case for all helpers:

<p><%= excerpt 'Iñtërnâtiônàlizætiøn', 'nâtiôn', 2 %></p>

This currently always breaks, whether or not KCODE is set:

?rnâtiônàl…
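One way a byte-based window could produce that broken first character is sketched below. This is an illustration only, not the excerpt source: it assumes the window starts radius bytes before the byte offset of the match, and byte_offset and continuation_byte? are hypothetical helpers written for this example.

```ruby
# Hypothetical helper: byte offset of +phrase+ within +text+, or nil.
def byte_offset(text, phrase)
  haystack = text.unpack('C*')
  needle   = phrase.unpack('C*')
  (0..haystack.length - needle.length).find do |i|
    haystack[i, needle.length] == needle
  end
end

# True if +byte+ is a UTF-8 continuation byte (bit pattern 10xxxxxx).
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

text   = 'Iñtërnâtiônàlizætiøn'
radius = 2

offset = byte_offset(text, 'nâtiôn')   # 7: the match starts at byte 7
start  = [offset - radius, 0].max      # 5: two bytes back, inside 'ë'

puts continuation_byte?(text.unpack('C*')[start])   # true
```

Stepping back two bytes from the match lands on the second byte of ‘ë’, so under this assumption the excerpt would begin with a stray continuation byte, which matches the leading ‘?’ in the output above.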