ActiveSupport::Multibyte
As of revision #5223 ActiveSupport::Multibyte is part of Rails. Now everyone can enjoy multibyte safeness in their applications. Needless to say we are really happy. To show some of the features of Multibyte’s String#chars proxy I’ve put together a short screencast.
If you have any questions about ActiveSupport::Multibyte, please consult the API documentation and the Trac Wiki first. Enjoy.
Comments
Dominic Mitchell about 2 hours later:
Congratulations! Good work on getting it into core! ¶
phil76 about 4 hours later:
this is brilliant; thanks so much! ¶
bicunisa 1 day later:
Just what I needed ;) ¶
Jacek Becela 4 days later:
Great news! Julik and others: you rock! ¶
omur 4 days later:
If the upcase/downcase methods are locale sensitive, we definitely need international versions of those as well. Some locales have surprisingly different rules for case conversion of Latin letters, and experience has shown that this causes subtle bugs. ¶
Alex 4 days later:
Looks like we're close to having one less thing for Joel et al to complain about ;) ¶
Bertrand 4 days later:
That's a much needed fix, but... what's the justification for Ruby not properly handling Unicode again? In 2006? ¶
Manfred Stienstra 4 days later:
Omur, none of the operations are locale dependent because that would take way too much space in our Unicode tables. And we're not aiming for a complete solution; we just want to keep your strings in one piece. If you want locale dependent operations, please use ICU4R.
Bertrand, that's a very long story. Maybe we will have time for this story on a very cold and dark winter night. ¶
Daniel 5 days later:
Great work!
Will this be used internally in Rails, i.e. in validations and such, or will I need custom validations to use this? ¶
Manfred Stienstra 5 days later:
Daniel, we're currently working on patches to use ActiveSupport::Multibyte in the different parts of Rails. Validations are a difficult problem.
On the one hand, validations guard the model against inserting strings that are too long for the database schema; on the other hand, they are a tool for the developer to limit the length of certain values. Most databases think in bytes. Varchar(255) is at most 255 bytes, which is somewhere between 63 and 255 characters in UTF-8, since a character takes one to four bytes.
So what does validates_length_of :name, :maximum => 128 mean? ¶
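The gap between bytes and characters can be seen in a few lines of plain Ruby. This is only a sketch: it assumes Ruby 1.9 or later, where String is encoding-aware, and the sample strings are made up for illustration.

```ruby
# Sample values are hypothetical; string literals are UTF-8 by default.
samples = ["name", "näme", "名前", "😀"]

samples.each do |s|
  # length counts code points, bytesize counts bytes in the string's encoding.
  puts "#{s.inspect}: #{s.length} characters, #{s.bytesize} bytes"
end

# Worst case for a 255-byte column: every character needs 4 bytes.
worst_case_characters = 255 / 4
puts worst_case_characters
```

A limit of 128 therefore fits a 255-byte column only when it is read as bytes, or when the text stays close to ASCII.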
omur 5 days later:
Manfred Stienstra, thanks for the clarification. No, I don't have a request for locale dependent operations. I just wanted to point out that if they were locale dependent, there should be locale independent versions available in the API too. The standard C library, for example, doesn't have locale insensitive case conversion functions, and this is an important source of locale related bugs in C programs because programmers are not aware of the issues of locale dependence. When locale sensitive functions are used, the solution is to design the API so that the programmer sees there are two versions of the case conversion functions (or one, with an extra locale parameter that defaults to the international locale). ¶
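As an illustration of the "one function with an optional locale parameter" design, here is a small sketch using core Ruby. It assumes Ruby 2.4 or later, where String#upcase accepts a mapping option such as :turkic; this is not the ActiveSupport::Multibyte API discussed in this thread.

```ruby
# Default call: locale independent Unicode case mapping.
default_upper = "i".upcase

# Opt-in call: Turkic rules map "i" to a dotted capital I (U+0130).
turkic_upper = "i".upcase(:turkic)

puts default_upper
puts turkic_upper
puts default_upper == turkic_upper   # the two results differ
```

Because the locale sensitive behaviour has to be requested explicitly, code that never passes the option gets the same result on every machine.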
Manfred Stienstra 5 days later:
By the way, Rails length validation currently already counts Unicode codepoints. In MySQL varchar(255) means 255 codepoints. ¶
Tuxie 27 days later:
Hi and thank you for this, it's great!
I'm not sure if it breaks anything (I have only done basic tests), but the following code could add some compatibility with older code:
<code><pre>
class ActiveSupport::Multibyte::Chars
  # Report the proxy as a String to type checks.
  def is_a?(klass)
    klass == String ? true : super
  end
  alias_method :kind_of?, :is_a?

  def class
    String
  end
end
</pre></code>
To make String === "foo".chars work you can add the following:
<code><pre>
class String
  def self.===(other)
    other.is_a?(String) ? true : super
  end
end
</pre></code> ¶
Tuxie 27 days later:
Regarding the above code, at least kind_of?() should be overloaded by default, so there is at least one standard way to check for strings in AS:MB:C-aware code...
Because ActiveSupport::Multibyte::Chars actually IS a kind of String, just not in the strict OO hierarchy as it is implemented. ¶
Manfred Stienstra 27 days later:
I'm not sure that is a good idea. kind_of? and is_a? specifically say something about the hierarchy and were never meant to say something about the purpose of the instance.
If you expect a String somewhere, you should either just treat it like a String and let the exceptions fly when it's not or you should 'cast' it to a String with .to_s. ¶
Tuxie 27 days later:
The problem is when you need to do different things depending on, for example, the argument sent to a method:
# Example over-simplified for readability :)
def find(arg)
  if arg.kind_of?(Integer)
    find_by_id(arg)
  elsif arg.kind_of?(String)
    find_by_sql(arg)
  else
    raise ArgumentError
  end
end

def find_by_sql(sql)
  DB.query(sql.to_s).result
end
find_by_sql() uses .to_s because it expects a String. However, find() is type-agnostic...
This kind of code is very common. I think overloading kind_of?() is correct for AS::MB::Chars because you _could_ have made it inherit from String, you just chose not to and that's an internal implementation detail... ¶
Manfred Stienstra 27 days later:
I know it's common, but that's no reason to do it (: And if you _really_ _really_ have to do it, there's no reason to also add a check for Chars.
You can easily check for numbers by creating an instance through Integer.
begin
  find_by_id Integer(arg)
rescue ArgumentError
  # do other stuff
end ¶
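A runnable sketch of that approach follows. The two finder methods are hypothetical stand-ins that only return a description of what they would do; the point is that the dispatch never asks for the class of the argument.

```ruby
# Hypothetical stand-ins for the finders mentioned above.
def find_by_id(id)
  "record ##{id}"
end

def find_by_sql(sql)
  "rows for: #{sql}"
end

def find(arg)
  id = begin
    # Integer() raises ArgumentError for non-numeric strings and
    # TypeError for objects it cannot convert at all, such as nil.
    Integer(arg)
  rescue ArgumentError, TypeError
    nil
  end

  # Anything that is not a number is cast to a String with to_s,
  # whatever class it started out as.
  id ? find_by_id(id) : find_by_sql(arg.to_s)
end

puts find(42)
puts find("42")
puts find("SELECT * FROM users")
```

Note that Integer() also accepts prefixed forms such as "0x1A", so a stricter check may be needed when only decimal ids are valid.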
Tuxie 29 days later:
Oh well, personally I'm overloading kind_of?() in environment.rb anyway. :)
Another question: you talk about normalizing all incoming params in ApplicationController, but is that so wise? I use (string1.chars.normalize(:kc) == string2.chars.normalize(:kc)) when comparing, but if a user submits something in decomposed form I think she wants it to be displayed in decomposed form as well. Doing :kc/:kd normalization of input seems especially destructive.
I'm not a unicode expert but doesn't this violate the "Never break other people's text" rule? :) ¶
Manfred Stienstra 29 days later:
Some characters have both a composed and a decomposed form, but they're the same characters. Users generally don't choose in what form the text is saved or transmitted because it doesn't matter.
Compatibility composition is destructive, but there are some characters, like ligatures, that don't really have a place in web applications. NFKC gets rid of the pesky ligatures; unfortunately it also decomposes characters like the 'half character'.
Normalizing all incoming parameters can be a solution depending on your problem, but it can also bring more problems. So depending on the application you either want it, or you don't want it. There is no absolute answer. ¶
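The difference between canonical and compatibility normalization can be sketched with core Ruby. This assumes Ruby 2.2 or later, which ships String#unicode_normalize; it is not the chars.normalize call used earlier in this thread, and the sample characters are chosen only for illustration.

```ruby
composed   = "\u00E9"    # e with acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same character for the reader, different code points for the machine.
puts composed == decomposed
puts composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)

ligature = "\uFB01"      # LATIN SMALL LIGATURE FI
half     = "\u00BD"      # VULGAR FRACTION ONE HALF

# NFKC replaces compatibility characters, which cannot be undone.
puts ligature.unicode_normalize(:nfkc)
puts half.unicode_normalize(:nfkc).codepoints.inspect
```

NFC only changes how a character is encoded, while NFKC changes which characters the text contains, which is why applying it to all input is an application-level decision.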
Tuxie 29 days later:
OK, I guess it's dependent on the application then; seems reasonable.
I guess the biggest problem is when people write their names in kanji: people can be really upset when their name kanji get transformed to a more common variant. But then again, Unicode doesn't have very good support for kanji family names anyway, because they often use ancient glyphs that haven't been used for like 100 years except for names. And the Mojikyo charset is kind of out of scope for this :)
Sorry for using this board as a support forum (is there an official one?) but I have one more question:
Does AS:MB:C have a CJK-aware way to split a string into words that also works with allglyphswrittentogether languages like Chinese? ¶
Tuxie 30 days later:
"foo".respond_to?(:strip) # => true
"foo".chars.respond_to?(:strip) # => false
If you don't want to overload #kind_of?, how about overloading #respond_to? so we can at least do duck typing?
def respond_to?(method)
  super or ''.respond_to?(method)
end
Consider this ActiveRecord example:
def before_validation
  self.attributes.each { |a| a.strip! if a.respond_to?(:strip!) }
end ¶
Tuxie 30 days later:
Make that self.attributes.values.each :) ¶
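To make the duck-typing idea concrete, here is a minimal sketch of a proxy that answers respond_to? for everything its wrapped String can do. CharsProxy is a hypothetical stand-in, not the real ActiveSupport::Multibyte::Chars, and it relies on respond_to_missing?, which assumes Ruby 1.9.2 or later.

```ruby
# CharsProxy is hypothetical and only illustrates the delegation pattern.
class CharsProxy
  def initialize(string)
    @string = string
  end

  # Forward any public method the wrapped String understands.
  def method_missing(name, *args, &block)
    if @string.respond_to?(name)
      @string.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? in step with what method_missing forwards.
  def respond_to_missing?(name, include_private = false)
    @string.respond_to?(name) || super
  end

  def to_s
    @string
  end
end

proxy = CharsProxy.new("  foo  ")
puts proxy.respond_to?(:strip)
puts proxy.strip
puts proxy.respond_to?(:no_such_method)
```

With a proxy like this, a check such as value.respond_to?(:strip!) treats plain strings and wrapped strings alike.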
Manfred Stienstra 35 days later:
Tuxie, please redirect the discussion to the Rails Core mailing list. Patches are always welcome. ¶
omur 49 days later:
Congratulations! Rails 1.2 RC1 is out with Multibyte support. ¶
murphy 50 days later:
thanks for the cast! ¶
Thiago Taranto 50 days later:
That's a very nice screencast presentation.
Thanks a lot. ¶
Walter Davis 99 days later:
Just curious, how was this screencast done? It doesn't look like any other I have seen.
Thanks, and thanks for the content, too! ¶
Manfred Stienstra 99 days later:
Walter, that's explained here: http://www.fngtps.com/2006/10/screencast-scripting. ¶
Simon de Haan 106 days later:
Hey fngtps & julik and others, congratulations!
Very cool.
Greetings from Arnhem ¶
shy 133 days later:
str[regexp, fixnum] doesn't seem to work ¶
Manfred Stienstra 136 days later:
Shy, please redirect any bugs and patches to http://dev.rubyonrails.org ¶
Garry 212 days later:
Thank you Manfred! ¶