ActiveSupport::Multibyte Updated

Manfred Stienstra

Yesterday Michael Koziarski merged the updated version of ActiveSupport::Multibyte into Rails. The update started out as a Ruby 1.9 compatibility fix but turned into a complete overhaul of both the code and the documentation.

For most people the only noticeable change is the move from String#chars to String#mb_chars. People relying heavily on ActiveSupport::Multibyte probably want to read on.

String#chars renamed to String#mb_chars

One of the initial reasons to use a proxy to access characters back in 2006 was to make Rails future-proof in case Ruby got some kind of Unicode support on String. Unfortunately, Matz decided to use String#chars for one of these features, so we had to change the method name. People running on Ruby <= 1.8.6 will get a deprecation warning when they call String#chars.

String#mb_chars now returns a proxy on Ruby 1.8 and returns self on Ruby 1.9.
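To make the proxy idea concrete, here is a minimal plain-Ruby sketch. CharsSketch and mb_chars_sketch are hypothetical names used for illustration only; this is not the Rails implementation, and it assumes the wrapped bytes are UTF-8.

```ruby
# Hypothetical stand-in for a multibyte proxy; not the Rails class.
class CharsSketch
  attr_reader :wrapped_string

  def initialize(string)
    @wrapped_string = string
  end

  # Count Unicode codepoints instead of bytes.
  def length
    @wrapped_string.unpack('U*').length
  end

  # Reverse by codepoint so multibyte sequences stay intact.
  def reverse
    self.class.new(@wrapped_string.unpack('U*').reverse.pack('U*'))
  end

  def to_s
    @wrapped_string
  end
end

# Hypothetical helper mirroring the behavior described above: wrap the string
# where String is byte-oriented (Ruby 1.8), and return it unchanged where
# String is already encoding-aware (Ruby 1.9 and later).
def mb_chars_sketch(string)
  string.respond_to?(:force_encoding) ? string : CharsSketch.new(string)
end
```

Checking for force_encoding is just one possible way to detect an encoding-aware String; the real library may decide this differently.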

Note that the Ruby 1.9 String class does not implement methods like String#normalize. We’re still trying to figure out how to approach this limitation. For now, you might want to do:

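class String
  def normalize(normalization_form=ActiveSupport::Multibyte.default_normalization_form)
    ActiveSupport::Multibyte::Chars.new(self).normalize(normalization_form) # assumed body: delegate to the Chars proxy
  end
end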

No more automatic tidying of bytes

Multibyte no longer attempts to convert broken encoding in strings to valid UTF-8. The String#tidy_bytes method still exists if you need this functionality.
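Since repair is no longer automatic, code that accepts untrusted input may want to check and tidy strings explicitly. The following plain-Ruby sketch shows the general idea. tidy_bytes_sketch is a hypothetical helper, not Rails' tidy_bytes; it assumes stray bytes were meant as Windows-1252 and relies on String#scrub, which needs Ruby 2.1 or later.

```ruby
# Hypothetical helper, not Rails' tidy_bytes: repairs invalid UTF-8 by
# assuming each stray byte sequence was meant as Windows-1252.
def tidy_bytes_sketch(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # scrub yields each invalid byte sequence; reinterpret it and transcode.
  utf8.scrub do |bad_bytes|
    bad_bytes.dup.force_encoding(Encoding::Windows_1252)
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end
```

Strings that are already valid UTF-8 pass through untouched, so the helper is safe to call on every input.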

Duck-typing aid

Strings are notoriously hard to duck-type because they include Enumerable, which makes them hard to differentiate from Arrays. Rails already had some duck-typing help in place for Date, Time and DateTime. We decided to implement the same thing on String and Chars.

'Bambi and Thumper'.acts_like?(:string) #=> true
'Bambi and Thumper'.mb_chars.acts_like?(:string) #=> true

So if you catch yourself using str.is_a?(String), please consider using str.acts_like?(:string) instead.
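A simplified sketch of how such a marker-method convention can work in plain Ruby follows. The marker method name acts_like_string?, the ProxySketch class, and the shout helper are assumptions for illustration and are not copied from the Rails source.

```ruby
class Object
  # True when the receiver declares that it behaves like +duck+ by
  # responding to a marker method named acts_like_<duck>?.
  def acts_like?(duck)
    respond_to?(:"acts_like_#{duck}?")
  end
end

class String
  def acts_like_string?
    true
  end
end

# Hypothetical proxy that is not a String subclass but behaves like one.
class ProxySketch
  def initialize(string)
    @string = string
  end

  def acts_like_string?
    true
  end

  def to_s
    @string
  end
end

# Accepts real strings and string-like proxies, rejects everything else.
def shout(value)
  raise ArgumentError, 'expected something string-like' unless value.acts_like?(:string)
  value.to_s.upcase
end
```

An is_a?(String) check would reject the proxy here, while the marker method lets both kinds of object through.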

Different way of registering backends

Instead of registering a handler on the Chars class, you now set the proxy_class on ActiveSupport::Multibyte.

ActiveSupport::Multibyte.proxy_class = UTF32Chars

Note that this removes a level of indirection, which speeds up the entire Multibyte implementation quite a bit.

If you’ve implemented your own handler, please look at the implementation of ActiveSupport::Multibyte::Chars to see how to convert it to work with the new implementation. In most cases this should be a trivial exercise. Don’t hesitate to contact me if you need help.
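The pluggable proxy class can be pictured with a small plain-Ruby sketch. MultibyteSketch, BasicCharsSketch, ByteLengthCharsSketch, and proxy_for are all hypothetical names; the sketch only illustrates the pattern of instantiating a configured class directly and does not reproduce the Rails code.

```ruby
# Hypothetical registry standing in for ActiveSupport::Multibyte.
module MultibyteSketch
  class << self
    attr_accessor :proxy_class
  end
end

# Hypothetical default proxy: wraps a string and forwards unknown calls to it.
class BasicCharsSketch
  def initialize(string)
    @wrapped_string = string
  end

  def method_missing(name, *args, &block)
    return super unless @wrapped_string.respond_to?(name)
    result = @wrapped_string.public_send(name, *args, &block)
    # Re-wrap string results so chained calls stay inside the proxy.
    result.is_a?(String) ? self.class.new(result) : result
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped_string.respond_to?(name, include_private) || super
  end

  def to_s
    @wrapped_string
  end
end

# Hypothetical custom backend that measures length in bytes.
class ByteLengthCharsSketch < BasicCharsSketch
  def length
    @wrapped_string.bytesize
  end
end

MultibyteSketch.proxy_class = BasicCharsSketch

# The configured class is instantiated directly, with no handler lookup.
def proxy_for(string)
  MultibyteSketch.proxy_class.new(string)
end
```

Swapping the backend is then a single assignment, for example MultibyteSketch.proxy_class = ByteLengthCharsSketch.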

Overrideable default normalization form

The default normalization form can now be set on ActiveSupport::Multibyte instead of updating a constant.

ActiveSupport::Multibyte.default_normalization_form = :kd

See ActiveSupport::Multibyte::NORMALIZATIONS_FORMS for valid normalization forms.
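For readers who want to experiment without Rails, the idea of a settable default form can be sketched in plain Ruby. NormalizationSketch and FORM_MAP_SKETCH are hypothetical, the mapping of :c, :kc, :d and :kd onto NFC, NFKC, NFD and NFKD is an assumption, and so is the :kc fallback default. The sketch uses String#unicode_normalize, which needs Ruby 2.2 or later.

```ruby
# Assumed mapping from the short form names to Ruby's built-in form names.
FORM_MAP_SKETCH = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

# Hypothetical module with an overridable default normalization form.
module NormalizationSketch
  class << self
    attr_writer :default_form

    # Falls back to :kc when nothing was configured (an assumption).
    def default_form
      @default_form ||= :kc
    end

    # Normalizes +string+ using +form+, or the configured default.
    def normalize(string, form = default_form)
      ruby_form = FORM_MAP_SKETCH.fetch(form) do
        raise ArgumentError, "unknown normalization form: #{form.inspect}"
      end
      string.unicode_normalize(ruby_form)
    end
  end
end
```

Because the default is read at call time, assigning NormalizationSketch.default_form = :kd affects every later call that omits the form argument.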