Microsoft SQL Server Database

Multi Tb migration using mydumper

4 min read

May 12, 2017 12:00:00 AM

In this post I will explain how to transfer a multi terabyte size database between two MySQL instances using mydumper, which is a logical backup and restore tool that works in parallel. I will also cover the case where you need to transfer only a subset of data. Big tables are often problematic because of the amount of time needed to do mysqldump/restore which is single threaded. It is not uncommon for this type of process to take several days. From MySQL 5.6 and on, we also have transportable tablespaces, but there are some limitations with that approach. Physical backups (e.g. xtrabackup) have the advantage to be faster, but there are some scenarios where you still need/want a logical backup (e.g migrating to RDS/Aurora/Google Cloud SQL).

Steps

Install mydumper/myloader

The first thing you'll need to do is get mydumper installed. Latest version at the time of this writing is 0.9.1 and you can get rpm here . You can also manually compile as follows (the instructions are for RHEL-based system):

wget https://launchpad.net/mydumper/0.9/0.9.1/+download/mydumper-0.9.1.tar.gz
 tar zxvf mydumper-0.6.2.tar.gz

You can also get the source from GitHub:

git clone https://github.com/maxbube/mydumper.git

Here's how to compile mydumper. This will put the files on /usr/local/bin/

sudo yum install glib2-devel mysql-devel zlib-devel pcre-devel openssl-devel cmake gcc-c++
 cmake .
 make
 make install

Export data using mydumper

By default, mydumper will work by creating several parallel processes (rule of thumb is 1 export process per core) that will read one table each and write table contents to one file for each table. In this case, I'm exporting City and State tables:

./mydumper -t 8 -B world_innodb -T City,State

This is an improvement over traditional mysqldump, but we can still do better. mydumper can split each table into chunks (e.g. 100k rows) and write each chunk to a separate file, allowing to parallelize the import later on. I have omitted the -T argument here, so all tables from world_innodb will be exported.

./mydumper -t 8 --rows=100000 -B world_innodb

If exporting Innodb tables only, it makes sense to use --trx-consistency-only so the tool uses less locking. You will still get the binlog file/pos needed to seed a slave.

./mydumper -t 8 --rows=100000 --trx-consistency-only -B world_innodb

You can also specify a regular expression to export only some databases, let's say db1 and db2.

./mydumper -t 8 --rows=100000 --trx-consistency-only --regex '^(db1|db2)' -B world_innodb

Other options include the ability to compress the exported data on the fly, and also export triggers, code and events. Finally, I also recommend the use of --verbose option for added visibility into what each thread is doing. Here is an example of the complete command:

./mydumper -t 8 \
  --rows=100000 \
  --regex '^(db1|db2)' \
  --compress \
  --triggers \
  --routines \
  --events \
  -v 3 \
  --compress \
  --trx-consistency-only \
  -B world_innodb \
  --outputdir /data/export \
  --logfile /data/mydumper.log

While running multiple threads, I noticed some contention related to adaptive hash index (this is on 5.5, I haven't tested if this happens on other versions as well). Disabling AHI can have an impact on read queries, so if you can afford having the host out of production while the export is running, I recommend to disable AHI temporarily. It is probably a good idea to run the export on a slave, as you will be hitting it hard with reads. Also if you keep the slave sql thread stopped, export will be much faster. I got up to 400% reduction in export time.

Import data using myloader

The load phase is the most painful part, usually taking way longer than the time it took to export data. If you run mydumper using the --rows option as described above, several myloader threads can insert concurrently on the same table, speeding up the process. Otherwise, you only get multiple tables imported in parallel, which is helpful but reduces the benefits if you have a handful of huge tables and mostly small tables. Keep in mind though, when using a single thread import rows can be inserted in primary key ascending order, which optimizes disk space. Running multiple insert threads on a single table will cause row distribution to be less optimal, potentially using significantly more disk space. Other potential way to reduce the import time is temporarily relax consistency by setting innodb_flush_log_at_trx_commit=0 sync_binlog=0. Also set query_cache_type=0 AND query_cache_size=0 to prevent the query cache mutex from being used. You can control myloader transaction size with queries-per-transaction parameter. Using the default value (1000) produced really big transactions, I had better results by reducing this to 100. The -o option will drop the tables on the destination database if they already exist. Here is an example of the command (rule of thumb in this case is have import_threads = cores / 2):

myloader --threads=4 -d /data/export/ -B db1 -q 100 -o -v 3

NOTE: myloader works by setting sql-log-bin=0 for the import session by default, so make sure to override that (option is -e) if you have any slaves down the chain.

Alternative myloader

There is a fork of myloader that defers creation of secondary indexes. The author suggests it is faster for tables over 150 MM rows. I haven't had the time to test it and, unfortunately, it is based on the older 0.6 myloader. Do let me know in the comments section if you have tried it.

Importing into RDS

If you are using the Multi-AZ feature of RDS, that means writes are synchronously applied to another standby RDS instance on a different availability zone. This greatly increases write latency, which is a performance killer for myloader. I advise to disable this feature until the import is complete.