Ordering and comparing text in Rails and MySQL
Ordering and comparing text is a lot trickier than a lot of people expect, computer scientists even came up with a complicated name for it: collation. There are two groups of problems associated with collation: cultural and technical. Today we’re not going to focus on technical problems, but rather how the cultural problems influence the technical solution.
An example of a cultural difference is letter ordering in a language, in Swedish the Ä is ordered after the z and in German it follows the letter a. You can also discuss whether the ä is an alternative form of the a or if it’s a completely different character, this is relevant when implementing search. When searching for ‘nächste’ you might also be interested in text containing ‘nachste’.
MySQL implements a number of collations solutions so you can use the one relevant for your application. For instance utf8_icelandic_ci when you want to order UTF-8 encoded text based on the cultural norm in Iceland. You can get a full list of supported collations from the mysql client.
mysql> SHOW COLLATION;
When you have an international site that might contain multiple languages you can use the default Unicode collation algorithm, which is called utf8_unicode_ci in MySQL. The UCA is pretty sensible so a lot of frameworks, including Rails, use it as a default. Unfortunately this collation also changes character equality. Lets look at an example how this might go wrong.
mysql> CREATE DATABASE books_example CHARACTER SET utf8
COLLATE utf8_unicode_ci;
mysql> USE books_example;
mysql> CREATE TABLE books ( title VARCHAR(255) );
mysql> SHOW FIELDS FROM books;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| title | varchar(255) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)
mysql> INSERT INTO books SET title = 'Pokemon';
mysql> INSERT INTO books SET title = 'pokemon';
Now we have a database with books, currently with two entries.
mysql> SELECT * FROM books;
+---------+
| title |
+---------+
| Pokemon |
| pokemon |
+---------+
2 rows in set (0.00 sec)
A sensible action might be to select a book by name.
mysql> SELECT * FROM books WHERE title = 'pokemon';
+---------+
| title |
+---------+
| Pokemon |
| pokemon |
+---------+
2 rows in set (0.00 sec)
But unfortunately this returns both books because in utf8_unicode_ci P equals p, in the same way a equals ä equals A. This is useful for ordering and searching but not for selecting.
We can fix it by specifying a binary collation algorithm for the select so it will not use these fuzzy equality rules.
mysql> SELECT * FROM books WHERE title = 'pokemon' COLLATE utf8_bin;
+---------+
| title |
+---------+
| pokemon |
+---------+
1 row in set (0.00 sec)
Note that this problem can introduce a lot of bugs and maybe even security problems. For instance, imagine two accounts: ‘Manfred’ and ‘manfred’. With the following query it is undetermined which of these two will be returned.
SELECT * FROM accounts WHERE username = 'manfred' LIMIT 1
My advise is to set the default collation for your database to utf8_bin and include the collation in queries where you want to order the entries nicely for the user interface or when you need fuzzy equality for searching. In Rails you can specify the collation in config/database.yml.
development:
database: books_development
adapter: mysql
encoding: utf8
collation: utf8_bin
Subscribe