Ruby 1.9 character encoding field notes

Manfred Stienstra

As you probably already know, the String class became encoding-aware in Ruby 1.9. This makes it possible to manipulate strings at the character level instead of at the byte level. However, it’s still a general-purpose API, which means you have to write a few lines of code yourself to get stuff done.
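To make the character/byte distinction concrete, here is a small sketch using only core String methods. The sample word is my own choice; the `\u` escape is used so the literal is UTF-8 regardless of the source file's encoding.

```ruby
# "café" written with a \u escape, so the literal is always UTF-8.
word = "caf\u00E9"

char_count = word.length    # counts characters
byte_count = word.bytesize  # counts bytes; é takes two bytes in UTF-8

# Character-level operations keep the multibyte é intact.
reversed = word.reverse
letters  = word.chars.to_a
```

Here `char_count` is 4 while `byte_count` is 5, and `reversed` starts with an intact é rather than with half of one.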

It’s common to choose one internal representation for character data in an application and convert all incoming strings to this representation. For example, in modern applications strings are often encoded in UTF-8 or UTF-16. I took some time to figure out how to do this in Ruby 1.9.
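As a rough illustration of the conversion step, assuming UTF-8 is the internal representation and the incoming bytes really are ISO-8859-1 (both are assumptions for the sake of the example), the core API looks something like this:

```ruby
# Raw bytes for "café" as a Latin-1 source would send them (é is 0xE9).
incoming = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# Tag the bytes with the encoding they are actually in...
latin1 = incoming.force_encoding(Encoding::ISO_8859_1)

# ...then transcode to the internal representation.
internal = latin1.encode(Encoding::UTF_8)
```

Note that `force_encoding` only changes the label on the bytes, while `encode` rewrites the bytes themselves. After the conversion, `internal` is five bytes long because é takes two bytes in UTF-8.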

The biggest problem with receiving data from external sources is trust. Sources can lie about their encoding or provide broken data. Sometimes it gets mangled accidentally and sometimes someone is attacking your application with a carefully crafted payload.
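A small sketch of what a lying source looks like from Ruby's point of view. The payload is made up for the example: Latin-1 bytes that claim to be UTF-8.

```ruby
# Latin-1 bytes for "café", but the source claims they are UTF-8.
payload = [0x63, 0x61, 0x66, 0xE9].pack("C*")
claimed = payload.force_encoding(Encoding::UTF_8)

# Ruby accepts the label without checking; you have to ask explicitly.
trustworthy = claimed.valid_encoding?

# Transcoding to a different encoding does check, and raises on bad bytes.
error = nil
begin
  claimed.encode(Encoding::UTF_16LE)
rescue Encoding::InvalidByteSequenceError => e
  error = e
end
```

In this case `trustworthy` is false and `error` holds the exception, so the broken data is only noticed if you go looking for it.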

Problems can arise at many levels. Just think about receiving an HTTP response from a web server. Things can go wrong in the proxy, in the client library, or in the string implementation of your language. Metadata about the encoding is stored in the HTTP headers, in the HTML, and now in the String object itself. The same problems exist with data coming from databases, filesystems, and caches.

You can trust some of these sources more than others. For example, you control the data going into a database, so you can trust the data coming out. In contrast, anything coming from the internet should be considered potentially dangerous.

My solution for these problems is a new method on String called ensure_encoding. It makes sure the data in the string is at least compatible with your internal strings. Depending on the options you pass, it responds differently to broken data.
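To give an idea of the kind of decision involved, here is a minimal sketch built only on core String methods. It is not the gem's implementation: the helper name coerce_to_internal, its arguments, and the :drop / :raise policies are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper, NOT the ensure-encoding gem's code.
# Labels the data with the claimed encoding, deals with invalid bytes
# according to on_invalid, and returns a UTF-8 string.
def coerce_to_internal(data, claimed_encoding, on_invalid = :raise)
  string = data.dup.force_encoding(claimed_encoding)

  unless string.valid_encoding?
    if on_invalid == :drop
      # Each invalid byte shows up as its own invalid "character".
      string = string.chars.select { |char| char.valid_encoding? }.join
    else
      raise ArgumentError, "data is not valid #{claimed_encoding}"
    end
  end

  string.encode(Encoding::UTF_8)
end

# Latin-1 bytes for "café" that claim to be UTF-8.
broken  = [0x63, 0x61, 0x66, 0xE9].pack("C*")
cleaned = coerce_to_internal(broken, Encoding::UTF_8, :drop)
```

With :drop the stray byte is thrown away and `cleaned` is "caf". With the default policy the same call raises, which is the stricter choice for data you don't trust.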

As an example, let’s take an HTTP POST to a web API. Assume we’ve explained in the API documentation that we only accept UTF-8 character data and that we will be very strict about this. Our code might look something like this:

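require "ensure/encoding"

begin
  params.each do |key, value|
    # The start of this call was missing from the listing; the target
    # encoding as first argument is an assumption, check the gem's README.
    params[key] = value.ensure_encoding("UTF-8",
      :external_encoding  => Encoding::UTF_8,
      :invalid_characters => :raise)
  end
rescue Encoding::InvalidByteSequenceError => e
  send_response_document :unprocessable_entity,
    "Sorry, your request contains invalid encoding " +
    "and can't be processed (#{e.message})"
end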

You can find more examples on the GitHub project page and in the source. Normally I try to extract code from a running project, but we don’t run any production code on 1.9 yet. It would be great if you could help out with testing the code. I’ve released it as a gem, so it should be really easy to install.

$ gem install ensure-encoding

Please leave any bugs, problems, or suggestions in the GitHub issue tracker.