Quick ActiveSupport::Multibyte glossary trick
I was trying to make a glossary of words grouped by their first letter, but I wanted words starting with the letter é grouped with words starting with the letter e. No small feat, you might imagine. Wrong.
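dict = words.inject({}) do |dict, word|
  # Decompose the word and keep only its first character, so "é" yields "e"
  letter = word.chars.decompose[0..0].downcase.to_s
  dict[letter] ||= []
  dict[letter] << word
  # Return the hash so inject passes it on to the next iteration
  dict
end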
The reason this works is that letters like é have a decomposed form in Unicode: a Latin base letter followed by a combining accent. Taking the first character of the decomposed word therefore yields the plain letter. I’m not sure what happens if you run Arabic through this code, but we’ll cross that bridge when we get there.
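For illustration, here is a minimal sketch of the same idea in plain Ruby, without ActiveSupport. It assumes Ruby 2.2 or later, where String#unicode_normalize(:nfd) performs canonical decomposition. The glossary_letter helper and the word list are hypothetical names made up for this example, not part of the original code.

```ruby
# "\u00E9" is the precomposed letter é; NFD splits it into a base
# letter plus a combining accent.
decomposed = "\u00E9".unicode_normalize(:nfd)
codepoints = decomposed.codepoints.map { |cp| format("U+%04X", cp) }

# Hypothetical helper: the glossary key is the first code point of the
# decomposed word, downcased.
def glossary_letter(word)
  word.unicode_normalize(:nfd)[0].downcase
end

# Hypothetical word list: école, Eiffel, été, zebra.
words = ["\u00E9cole", "Eiffel", "\u00E9t\u00E9", "zebra"]

# Same grouping as the inject version, building the Hash in place.
glossary = words.each_with_object({}) do |word, dict|
  (dict[glossary_letter(word)] ||= []) << word
end

puts codepoints.join(" ")
glossary.each { |letter, entries| puts "#{letter}: #{entries.join(', ')}" }
```

The words themselves are stored unchanged; only the grouping key is derived from the decomposed form.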
Comments
Erik van Oosten 44 minutes later:
I am not sure it is wise to use the same variable inside and outside the closure. Or is that on purpose?
Manfred Stienstra about 1 hour later:
You are right that a variable defined in the current scope would be sucked up by the block and changed, but because inject initializes dict with the empty Hash it doesn't matter.
The dict = part is there to make sure that dict is available in the current scope whether or not it was defined.
Ferdinand Svehla about 1 hour later:
I think this is also easily solvable in just one line with Rails' group_by.
I pasted the code here: http://p.caboo.se/49042, because I don't know how your blog handles source code.
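A rough sketch of what a one-line group_by version could look like. This is not the code from the paste linked above, which is not reproduced here. It assumes plain Ruby 2.2 or later, where Enumerable#group_by is built in, and it swaps the chars.decompose call from the post for String#unicode_normalize(:nfd). The word list is hypothetical.

```ruby
# Hypothetical word list: école, Eiffel, zebra ("\u00E9" is a precomposed é).
words = ["\u00E9cole", "Eiffel", "zebra"]

# group_by builds the letter => words Hash in a single expression;
# the block returns the downcased first code point of the decomposed word.
glossary = words.group_by { |word| word.unicode_normalize(:nfd)[0].downcase }

glossary.each { |letter, entries| puts "#{letter}: #{entries.join(', ')}" }
```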
Manfred Stienstra about 2 hours later:
Awesome, thanks Ferdinand.
Nicolas 5 days later:
Take into consideration that in German, for example, you should "decompose" "ö" into "oe"... so I guess there are many such cases in different languages. I primarily mention this because my first thought was "hey - nice for generating pretty URLs from titles...", which would give you the wrong transformation.
Manfred Stienstra 5 days later:
I was talking about decomposition as described in Unicode Normalization Forms [1]. As I hinted at with the Arabic joke, I'm not sure this creates satisfactory results in languages other than Dutch. Problems like this are constrained by culture and preference and are therefore very hard to solve in general.
[1] http://unicode.org/reports/tr15/
Thijs van der Vossen 5 days later:
Nicolas, you can safely use ö in your URL.
nicolas 9 days later:
@thijs:
Yes, I know (I can safely use ö in my URL), but other people can't (or at least they have no idea how to type it). So in terms of usability it is a bad idea to use URLs that cannot be easily typed by everybody, and that's why I simplify all these letters (and there are a lot of them). Try to type Iñtërnâtiônàlizætiøn ;-).
Thijs van der Vossen 9 days later:
I alt-n n t alt-u e r n alt-i a t i alt-i o n alt-` a l i z alt-' t i alt-o n
graste 23 days later:
Iñtërnâtiônàlizætiøn *copy&paste* :P
Julik 164 days later:
I stand by the assumption that if someone needs to enter ø in the URL, he knows how to type it. Unicode-aware transliteration is something you DO want to avoid at all costs (and something which is quite absurd in practice, because it's locale dependent).