The system in question
Our B2B product consists of a rails application where we use Active Record to handle our MySQL database. Our table relationships are connected using foreign keys in order to ensure data consistency.
The Problem
We needed to add a new column into a table that had almost 10 million records. A simple migration would cause the production database to lock and enqueue incoming requests from users, leading to timeouts. The only way to avoid that was to use an online migration solution.
Rejected Solutions
The following solutions were rejected either due to compatibility or usability reasons.
SoundCloud’s LHM
We tried using SoundCloud’s LHM, which another one of our services uses, but we discovered that it doesn’t support copying tables with foreign keys and it had stopped being maintained for many years.
GitHub’s gh-ost
Even though it is being actively maintained, it doesn’t support copying tables with foreign keys and it’s not easy to use either.
Facebook’s Solution
Not easy to use since it doesn’t have any gems connecting it to the rails app.
The Solution
pt-online-schema-change & departure gem
Percona’s pt-online-schema-change is a command-line tool that does the actual online migration. The departure gem connects pt-online-schema-change to the rails app.
How does it work?
When pt-online-schema-change is called from the departure gem, it creates a copy table and starts copying all the records from the original table to the copy table.
But what about the new records coming from production? How will they be copied to the new table?
The answer is Triggers. pt-online-schema-change creates interceptors called Triggers that intercept write requests to the server, ensuring the changes are reflected in the copy table as well.
Finally, when all the records have been successfully copied, the copy table replaces the original table.
Limitations
pt-online-schema-change only works for Debian and RedHat Linux distributions and only supports MySQL Databases.
AWS Challenges
Amazon RDS needs to be configured with an environment variable before running the migrations. The variable is log_bin_trust_function_creators and it needs to be set to 1 temporarily until the migration finishes. Leaving this option enabled is considered unsafe.
The reason why it should be disabled by default is to prevent creating a nondeterministic stored procedure marked as DETERMINISTIC. Otherwise, the bin log would become invalid and the replication to the failover instance might contain wrong data. We would not even notice this and if the snapshots are created from the failover instance, we might not even have proper backups.
Operating System Limitations
In order to run our application on production, we use an Alpine docker image mainly because of the much smaller size of the image and significantly fewer vulnerabilities.
Because pt-online-schema-change can only be used in Debian and RedHat builds, we had to create a development branch that used a Debian build in order to allow the tools to be installed and work smoothly.
Since we didn’t want the Debian build to become our production build, we decided to run our scripts from the development image directly to our production image by using our command runner tool. This tool took the docker image of the development branch and ran it on our production image using AWS Tasks without the production image being replaced.
You can think of it as an injection: we built the project within an image and ran the actual migration scripts in another image.
How do we migrate back if the code is in a development branch?
We have kept our development branch in case we need to rollback at some point, though it seems highly unlikely.
Isn’t it unsafe to delete the original table and keep the copy one?
The solution is well-tested by the creators and our team as well; we wouldn’t run something like this without thoroughly testing it first.
It is always recommended to backup your database before running things like this!
Why not InnoDB Storage Instant ADD COLUMN?
While we could have used it, we discovered it after implementing and testing pt-online-schema-change & departure gem. Another reason is that Instant ADD COLUMN has multiple constraints and it’s only used to add a new column to the table. We were searching for an all-around solution for any kind of database change not just column additions.