Terrible choices: MySQL

Sun 28 December 2014

I've used MySQL for a while now, and there were lots of surprising things I had to cater for. This is from a Django and MySQL 5.5 perspective [*]. Later on you'll see the horrible things I did to work around them. It was a terrible experience, especially as I've also used PostgreSQL ...

Feel free to add your own experiences in the comments section.

The defaults *

MySQL supports a large part of the ANSI SQL 99 standard; however, the default settings are nowhere close to it.

If you've used any other database, it's going to be a very perplexing experience.

SQL mode *

With the default settings MySQL truncates and does other unspeakable things to data for the sake of not giving errors. Where this is a correct choice is hard to say.

What the defaults allow:

  • Storing invalid dates like '0000-00-00' or '2010-01-00' [1].
  • Silently treating errors like:
    • Specifying an unavailable storage engine. It will use the default engine, silently [2].
    • Inserting invalid data. Larger strings get truncated to the maximum length. Larger integers get truncated to the maximum. Other things get converted to NULL if the column allows that. All silently [3].
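To make the silent coercion concrete, here's a toy Python emulation of what non-strict mode effectively does to your data. The function and the limits are illustrative only - this is not MySQL internals, just the shape of the behavior:

```python
# A toy emulation of non-strict mode's coercion rules -- illustrative
# Python, not real MySQL code: out-of-range data is mangled, silently.
def nonstrict_coerce(value, max_length=None, int_range=None):
    if isinstance(value, str) and max_length is not None:
        return value[:max_length]       # oversized strings: truncated
    if isinstance(value, int) and int_range is not None:
        lo, hi = int_range
        return min(max(value, lo), hi)  # out-of-range ints: clamped
    return value

nonstrict_coerce("x" * 300, max_length=255)  # quietly loses 45 chars
nonstrict_coerce(5_000_000_000, int_range=(-2**31, 2**31 - 1))  # clamped
```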

And no, ORMs won't save you from this pain by default. For Django you need to have something like this in the settings:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'whatever',
        'OPTIONS': {
            'sql_mode': 'TRADITIONAL',
        }  # Note that later we find out that this is not enough. Read on.
    }
}

There's an open Django ticket to make this the default.

And yes, you could set this in the MySQL settings instead, but you don't really want that. If you ever forget to change /etc/mysql/my.cnf before deployment, your app will misbehave, and your data is already going to be corrupted by the time you notice something is not quite right.

Collations and encodings *

So far I've used utf8 as the connection charset, because you never know what the default one is (probably latin1). However, note that the charset affects the maximum index size for VARCHAR columns. Ever wonder why you often see this:

myfield = models.CharField(max_length=255, db_index=True)

Yes, an index on utf8 columns is limited to 255 characters [4]. If you need to store emoticons like an angry face 😠 - and most probably you'd want to, given the predicament - then you're out of luck: you need a different charset, utf8mb4 [5]. But that means an even smaller index, only 191 characters [11].
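These numbers aren't arbitrary: InnoDB limits an index key to 767 bytes by default, and the limit is divided by the charset's maximum bytes per character:

```python
# InnoDB's default single-column index key limit is 767 bytes.
# Divide by the charset's maximum bytes per character to get the
# longest indexable VARCHAR length.
INDEX_KEY_BYTES = 767

utf8_max = INDEX_KEY_BYTES // 3     # utf8 stores up to 3 bytes/char -> 255
utf8mb4_max = INDEX_KEY_BYTES // 4  # utf8mb4 stores up to 4 bytes/char -> 191
```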

But that's not so bad - you'll at least get errors if you fail to set the correct encoding. Collations, however, are more trippy. A collation is a set of rules for comparing characters in a character set [6].

Notice a pattern here? Note the _ci suffix:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
+----------+-----------------------------+---------------------+--------+

All the default collations are case insensitive. This means your queries and indexes are also going to be case insensitive [6]. I'd say they are just insensitive. To your pain.

If you have a case-insensitive collation, all sorts of queries that aren't a text search will behave strangely [7]. You'll notice that get_or_create doesn't work as expected when you have accents or a different case.
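A rough Python sketch of the failure mode - here casefold() stands in for the collation's comparison rules (real _ci collations also fold some accented characters, which casefold() does not):

```python
# Sketch of what a WHERE name = %s lookup does under a _ci collation:
# comparisons ignore case, so "Jose" matches the existing row "jose".
rows = ["jose"]  # an existing row in the table

def ci_lookup(name):
    # casefold() approximates the case-insensitive comparison rules
    return [r for r in rows if r.casefold() == name.casefold()]
```

So get(name="Jose") happily returns the row "jose" - and, conversely, a unique index on the column will reject inserting "Jose" as a duplicate, even though your application considers the two strings different.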

The solution is to use the utf8_bin collation and specify a different one only when you need special accent handling and case folding. The settings now look like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydatabase',
        'OPTIONS': {
            'sql_mode': 'TRADITIONAL',
            'charset': 'utf8',
            'init_command': 'SET '
                'storage_engine=INNODB,'
                'character_set_connection=utf8,'
                'collation_connection=utf8_bin'
        }  # Note that later we find out that this is still not enough. Read on.
    }
}

Unfortunately I didn't know this from the start. So I had to fix the existing data and installation scripts. When you create a database, make sure that you specify the encoding and collation:

CREATE DATABASE mydatabase CHARACTER SET utf8 COLLATE utf8_bin;

To fix the existing data, a South migration would look like this:

from south.db import db
from south.v2 import DataMigration

class Migration(DataMigration):
    def forwards(self, orm):
        print('')
        print('    Altering database ...')
        db.execute("ALTER DATABASE CHARACTER SET utf8 COLLATE utf8_bin;")
        for table, in db.execute('SHOW TABLES'):
            print('    Altering table %s ...' % table)
            db.execute(
                "ALTER TABLE %s CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin" % table
            )

    def backwards(self, orm):
        # Altering the tables takes lots of time and
        # locks the tables, since it copies all the data.
        raise RuntimeError(
            "This migration probably took 2 hours, you don't really want to rollback ..."
        )

With the new migrations in Django 1.7:

from django.db import migrations

def forwards(apps, schema_editor):
    with schema_editor.connection.cursor() as cursor:
        print('')
        print('    Altering database ...')
        cursor.execute("ALTER DATABASE CHARACTER SET utf8 COLLATE utf8_bin;")
        cursor.execute("SHOW TABLES;")
        for table, in cursor.fetchall():
            print('    Altering table %s ...' % table)
            cursor.execute(
                "ALTER TABLE %s CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin" % table
            )

def backwards(apps, schema_editor):
    # Altering the tables takes lots of time and
    # locks the tables, since it copies all the data.
    raise RuntimeError(
        "This migration probably took 2 hours, you don't really want to rollback ..."
    )

class Migration(migrations.Migration):
    dependencies = [
        # Needs to be filled, to figure it out run:
        #   django-admin makemigrations myapp --empty
    ]
    operations = [migrations.RunPython(forwards, backwards)]

Note that this will take a long time to run (MySQL rebuilds the entire table for this kind of ALTER, copying all the data) and it will change the encoding and collation for all the columns [8].

The transaction isolation level *

Unfortunately, getting the collation and encodings right won't make get_or_create work flawlessly [9]. The default transaction isolation level for MySQL is REPEATABLE READ, which means reads are consistent within the same transaction - they return the same result even if the data has changed outside the transaction. In other words, REPEATABLE READ breaks get_or_create inside a transaction: code in a transaction may not "see" an object created outside the transaction (like in another process).
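A toy model of how the REPEATABLE READ snapshot hides a concurrent insert - plain Python, no database involved:

```python
# Toy model of REPEATABLE READ: a transaction reads from the snapshot
# taken when it started, never from live committed data.
committed = set()  # data visible to everyone, i.e. committed rows

class Txn:
    def __init__(self):
        self.snapshot = set(committed)  # snapshot taken at transaction start

    def exists(self, name):
        return name in self.snapshot    # all reads go through the snapshot

t = Txn()            # our transaction starts and takes its snapshot
committed.add("x")   # meanwhile, another process commits "x"

# t.exists("x") is still False: get_or_create running inside t would
# conclude "x" doesn't exist and try to create a duplicate.
```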

Unfortunately the defaults are hard to change without breaking existing apps [10], so you have to work this out in your connection settings. Now the settings are:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydatabase',
        'OPTIONS': {
            'sql_mode': 'TRADITIONAL',
            'charset': 'utf8',
            'init_command': 'SET '
                'storage_engine=INNODB,'
                'character_set_connection=utf8,'
                'collation_connection=utf8_bin,'
                'SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED',
        }  # Now we have a mild degree of confidence :-)
    }
}

DDL statements *

DDL statements in MySQL are not only slow and prone to locking tables (that means downtime) - they also ignore transactions, since each DDL statement causes an implicit commit. The almighty InnoDB can't save your sanity when an ALTER is used in a transaction. Thus, migrations in MySQL must be approached with great care (test them well), and to avoid downtime you need to use specific external tools.

Annoying, even for development:

! Error found during real run of migration! Aborting.

! Since you have a database that does not support running
! schema-altering statements in transactions, we have had
! to leave it in an interim state between migrations.

! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS.

If it's the same with the migration system in 1.7 (it seems it is, but you only get a plain traceback), then it's ideal to keep migrations small in scope if you have custom SQL that can easily fail, and squash them later, after you have successfully run them.

The inflexibility *

There's something wrong with the query optimizer: it tends to use temporary tables for most queries that have a handful of joins and a GROUP BY. This is quite common with Django, but the why and how is too large a topic to tackle here. Maybe later ...

For now, a story of grief.

I have a model like this:

class Item(models.Model):

    class Meta:
        unique_together = "type", "name", "parent"

    type = models.CharField(max_length=20, db_index=True)
    name = models.CharField(max_length=200, db_index=True)
    parent = models.ForeignKey(
        "self",
        null=True,  # Notice anything peculiar?
        blank=True
    )

For reasons unknown I went ahead with this model. Why it was designed like that, and why it remained like that, need not be questioned. But what is wrong with it?

You see, creating a unique index over a nullable column is a bad idea, as NULL values are never considered equal to each other, so the unique constraint doesn't apply to them. The database will disallow inserting "foobar" two times but will allow inserting a boundless number of NULL values. For example, this would be allowed:

>>> Item.objects.create(name="stuff", type="foobar").pk
1L
>>> Item.objects.create(name="stuff", type="foobar").pk
2L
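This is standard behavior, and it can be reproduced with SQLite (used here only because it ships with Python; the table is a toy stand-in for myapp_item):

```python
import sqlite3

# NULLs are never equal to each other, so a unique index won't stop
# duplicate rows whose "parent" column is NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (type TEXT, name TEXT, parent INTEGER)")
con.execute("CREATE UNIQUE INDEX uniq_item ON item (type, name, parent)")

con.execute("INSERT INTO item VALUES ('foobar', 'stuff', NULL)")
con.execute("INSERT INTO item VALUES ('foobar', 'stuff', NULL)")  # no error!

count = con.execute("SELECT COUNT(*) FROM item").fetchone()[0]  # 2 rows
```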

Given unfavorable conditions (parallelism), get_or_create would create duplicate objects - because the database lets it.

This is normal SQL behavior, and ideally you'd redesign the model - MySQL is not to blame, right? But, alas, no. You see, I was using an ORM and I really liked the convenience of the ForeignKey field. Pity me for my weakness, for I was like a thirsty fool in the desert.

The options I didn't have:

  • Keep the constraint but replace the parent in the index with a computed column. But MySQL doesn't have computed columns!
  • Use a conditional index. Nope, MySQL doesn't allow conditional indexes.
  • Create a view. Make the index on the view. No, MySQL doesn't allow creating indexes on views.

To add insult to injury, PostgreSQL had all of these.

Creating some workaround on the client side was not an option. The application was highly parallel - this had to be handled in the database. But I fought on and found a solution: TRIGGERS.

In my mad quest to solve this quickly I brought this into existence:

CREATE TRIGGER check_unique_on_item
BEFORE INSERT ON myapp_item
FOR EACH ROW BEGIN
    IF NEW.parent_id IS NULL
    THEN
        IF (
            SELECT COUNT(*)
            FROM myapp_item item
            WHERE item.parent_id IS NULL AND
                  item.name = NEW.name AND
                  item.type = NEW.type
        ) > 0
        THEN
            SET NEW = 'Error: Cannot insert this item. There is already an existing entry.';
        END IF;
    END IF;
END;

Because I was doing something invalid, just to stop the insert, get_or_create had to deal with a new type of error: DatabaseError. Thus the Item model needed a custom manager:

from django.db.utils import DatabaseError

class ItemManager(models.Manager):
    def get_or_create(self, **lookups):
        try:
            return super(ItemManager, self).get_or_create(**lookups)
        except DatabaseError as e:
            try:
                return self.get(**lookups), False
            except self.model.DoesNotExist:
                raise e

Now, if you look closely at the trigger you'll notice that it's not a real constraint, and thus not atomic. I realized during load tests that it doesn't really work ...

So I went back to the drawing board and came up with a new idea: create a shadow table that has all the constraints, and triggers to update that table before the real one gets changed.

So I removed the unique_together from the Item model and created this:

class ItemUniqueFixup(models.Model):

    class Meta:
        unique_together = "type", "name", "parent"

    type = models.CharField(max_length=20)
    name = models.CharField(max_length=200)
    parent = models.PositiveIntegerField()

And created the triggers and filled the shadow table:

DROP TRIGGER IF EXISTS check_unique_on_item;

TRUNCATE TABLE myapp_itemuniquefixup;

INSERT INTO myapp_itemuniquefixup (id, type, name, parent)
SELECT old.id AS id, old.type AS type, old.name AS name,
       IF(old.parent_id IS NULL, 0, old.parent_id) AS parent
FROM myapp_item old;

CREATE TRIGGER check_unique_on_item_insert
BEFORE INSERT ON myapp_item
FOR EACH ROW BEGIN
    INSERT INTO myapp_itemuniquefixup (id, type, name, parent)
    VALUES (
        NEW.id, NEW.type, NEW.name, IF(NEW.parent_id IS NULL, 0, NEW.parent_id)
    );
END;

CREATE TRIGGER check_unique_on_item_update
BEFORE UPDATE ON myapp_item
FOR EACH ROW BEGIN
    UPDATE myapp_itemuniquefixup fix
    SET fix.type = NEW.type,
        fix.name = NEW.name,
        fix.parent = IF(NEW.parent_id IS NULL, 0, NEW.parent_id)
    WHERE NEW.id = fix.id;
END;

CREATE TRIGGER check_unique_on_item_delete
BEFORE DELETE ON myapp_item
FOR EACH ROW BEGIN
    DELETE FROM myapp_itemuniquefixup WHERE id = OLD.id;
END;

Now you see, this was very tricky to get right. With conditional indexes in PostgreSQL this would have been as easy as:

CREATE UNIQUE INDEX myapp_item_noparent ON myapp_item(type, name)
       WHERE parent IS NULL;
CREATE UNIQUE INDEX myapp_item_withparent ON myapp_item(type, name, parent)
       WHERE parent IS NOT NULL;
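For comparison, SQLite also supports partial indexes, so the PostgreSQL approach can be sketched end-to-end without a server (again a toy stand-in for myapp_item):

```python
import sqlite3

# SQLite (like PostgreSQL, unlike MySQL 5.x) supports partial indexes,
# so the whole trigger dance collapses into a one-line constraint.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (type TEXT, name TEXT, parent INTEGER)")
con.execute(
    "CREATE UNIQUE INDEX noparent ON item (type, name) "
    "WHERE parent IS NULL"
)

con.execute("INSERT INTO item VALUES ('foobar', 'stuff', NULL)")
try:
    con.execute("INSERT INTO item VALUES ('foobar', 'stuff', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the duplicate NULL-parent row is refused, atomically
```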

In hindsight ... *

Had I known all this from the start, maybe the ride would have been in easy mode. But the more you go on, the more you can't stop thinking "what if I had used PostgreSQL". You can't shake off the dirty sensation of "I have made a terrible choice" when you have to explain these awful quirks of MySQL to every new programmer. In a way this is technical debt - you have to pay the cognitive burden to correctly use MySQL.

[*]If these are problems of the past, I don't care. It has been a terrible journey and it needs to be told.
[1]See: http://dev.mysql.com/doc/refman/5.5/en/sql-mode.html#sqlmode_no_zero_date and http://dev.mysql.com/doc/refman/5.5/en/sql-mode.html#sqlmode_no_zero_in_date
[2]See: http://dev.mysql.com/doc/refman/5.5/en/sql-mode.html#sqlmode_no_engine_substitution
[3]See: http://dev.mysql.com/doc/refman/5.5/en/sql-mode.html#sqlmode_strict_all_tables
[4]See: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html
[5]See: https://code.djangoproject.com/ticket/18392
[6](1, 2) See: http://dev.mysql.com/doc/refman/5.5/en/charset-general.html
[7]More details: http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/

[8]See: http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
[9]See: http://jmoiron.net/blog/innodb-transaction-isolation/
[10]See: #13906 #14026
[11]See: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html

This entry was tagged as django mysql postgresql python rant