Ordering and comparing text is a lot trickier than most people expect. Computer scientists even came up with a complicated name for it: “collation”.
There are two groups of problems associated with collation: cultural and technical. Today we’re not going to focus on technical problems, but rather how the cultural problems influence the technical solution.
An example of a cultural difference is letter ordering in a language, in Swedish the Ä is ordered after the z and in German it follows the letter a. You can also discuss whether the ä is an alternative form of the a or if it’s a completely different character, this is relevant when implementing search. When searching for ‘nächste’ you might also be interested in text containing ‘nachste’.
MySQL implements a number of collations solutions so you can use the one relevant for your application. For instance
utf8_icelandic_ci when you want to order UTF-8 encoded text based on the cultural norm in Iceland. You can get a full list of supported collations from the mysql client by running a
SHOW COLLATION query.
When you have an international site that might contain multiple languages you can use the default Unicode Collation Algorithm, which is called
utf8_unicode_ci in MySQL. This UCA is pretty sensible so a lot of frameworks, including Rails, use it as a default. Unfortunately this collation also changes character equality. Lets look at an example how this might go wrong.
Now we have a database with books, currently with two entries:
A sensible action might be to select a book by name:
But unfortunately this returns both books because in
utf8_unicode_ci P equals p, in the same way a equals ä equals A. This is useful for ordering and searching but not for selecting.
We can fix it by specifying a binary collation algorithm for the select so it will not use these fuzzy equality rules:
Note that this problem can introduce a lot of bugs and maybe even security problems. For instance, imagine two accounts: ‘Manfred’ and ‘manfred’. With the following query it is undetermined which of these two will be returned:
My advise is to set the default collation for your database to
utf8_bin and include the collation in queries where you want to order the entries nicely for the user interface or when you need fuzzy equality for searching..
In Rails you can specify the collation in