Ordering and comparing text in Rails and MySQL

Manfred Stienstra

Ordering and comparing text is a lot trickier than most people expect. Computer scientists even came up with a complicated name for it: “collation”.

There are two groups of problems associated with collation: cultural and technical. Today we’re not going to focus on technical problems, but rather how the cultural problems influence the technical solution.

An example of a cultural difference is letter ordering in a language, in Swedish the Ä is ordered after the z and in German it follows the letter a. You can also discuss whether the ä is an alternative form of the a or if it’s a completely different character, this is relevant when implementing search. When searching for ‘nächste’ you might also be interested in text containing ‘nachste’.

MySQL implements a number of collations solutions so you can use the one relevant for your application. For instance utf8_icelandic_ci when you want to order UTF-8 encoded text based on the cultural norm in Iceland. You can get a full list of supported collations from the mysql client by running a SHOW COLLATION query.

When you have an international site that might contain multiple languages you can use the default Unicode Collation Algorithm, which is called utf8_unicode_ci in MySQL. This UCA is pretty sensible so a lot of frameworks, including Rails, use it as a default. Unfortunately this collation also changes character equality. Lets look at an example how this might go wrong.

mysql> CREATE DATABASE books_example CHARACTER SET utf8
  COLLATE utf8_unicode_ci;
mysql> USE books_example;
mysql> CREATE TABLE books ( title VARCHAR(255) );
mysql> SHOW FIELDS FROM books;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| title | varchar(255) | YES  |     | NULL    |       | 
+-------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)
mysql> INSERT INTO books SET title = 'Pokemon';
mysql> INSERT INTO books SET title = 'pokemon';

Now we have a database with books, currently with two entries:

mysql> SELECT * FROM books;
+---------+
| title   |
+---------+
| Pokemon | 
| pokemon | 
+---------+
2 rows in set (0.00 sec)

A sensible action might be to select a book by name:

mysql> SELECT * FROM books WHERE title = 'pokemon';
+---------+
| title   |
+---------+
| Pokemon | 
| pokemon | 
+---------+
2 rows in set (0.00 sec)

But unfortunately this returns both books because in utf8_unicode_ci P equals p, in the same way a equals ä equals A. This is useful for ordering and searching but not for selecting.

We can fix it by specifying a binary collation algorithm for the select so it will not use these fuzzy equality rules:

mysql> SELECT * FROM books WHERE title = 'pokemon' COLLATE utf8_bin;
+---------+
| title   |
+---------+
| pokemon | 
+---------+
1 row in set (0.00 sec)

Note that this problem can introduce a lot of bugs and maybe even security problems. For instance, imagine two accounts: ‘Manfred’ and ‘manfred’. With the following query it is undetermined which of these two will be returned:

SELECT * FROM accounts WHERE username = 'manfred' LIMIT 1

My advise is to set the default collation for your database to utf8_bin and include the collation in queries where you want to order the entries nicely for the user interface or when you need fuzzy equality for searching..

development:
  database: books_development
  adapter: mysql
  encoding: utf8
  collation: utf8_bin

In Rails you can specify the collation in database.yml.