BTRFS performance compared to LVM+EXT4 with regard to database workloads

Introduction
In many database builds, backups pose a very large problem. Most backup systems require an exclusive table lock and have no support for incremental backups; they require a full backup every time. Once a database grows to several terabytes, this becomes a serious problem. The usual solution is to rely on snapshots. In the cloud this is quite easy, since the cloud platform can take snapshots while still guaranteeing a certain level of performance. In the datacenter, few good solutions exist. One frequently used method is to use LVM on Linux to perform the snapshot at the block device layer.
LVM snapshots
LVM is a Linux technology that allows for advanced block device manipulation, including splitting block devices into many smaller ones and combining smaller block devices into larger ones through either concatenation or striping, which includes the redundant striping commonly referred to as RAID. In addition, it supports a copy-on-write (CoW) feature that allows for snapshots. This is implemented by allocating a section of the underlying physical volumes; before a block in the main logical volume is updated, its original data is copied into that section.
BTRFS
According to the Btrfs kernel wiki: “Btrfs is a modern copy-on-write (CoW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration.” It is an inherently CoW filesystem, which means it supports snapshotting at the filesystem level, in addition to many more advanced features.
Experiment 1: a simple benchmark
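Before the experiment itself, it may help to picture how the two copy-on-write approaches described above treat an overwrite once a snapshot exists. The toy model below is only a sketch: the class names `LvmStyleVolume` and `BtrfsStyleVolume` are hypothetical, in-memory dictionaries stand in for disk blocks, and only data block writes are counted (no metadata). It is not how either system is implemented; it only illustrates where the extra I/O goes.

```python
class LvmStyleVolume:
    """Copy-before-overwrite: the origin volume is updated in place."""

    def __init__(self, blocks):
        self.origin = dict(blocks)   # block number -> data
        self.snapshot_store = None   # preserved old blocks, once a snapshot exists
        self.writes = 0              # data block writes issued

    def snapshot(self):
        self.snapshot_store = {}

    def write(self, block, data):
        # First overwrite of a block after the snapshot: save the old data first.
        if self.snapshot_store is not None and block not in self.snapshot_store:
            self.snapshot_store[block] = self.origin[block]
            self.writes += 1
        self.origin[block] = data
        self.writes += 1

    def read_snapshot(self, block):
        # Blocks that never changed are still read from the origin.
        return self.snapshot_store.get(block, self.origin[block])


class BtrfsStyleVolume:
    """Redirect-on-write: new data goes to a new location and pointers move."""

    def __init__(self, blocks):
        self.storage = dict(enumerate(blocks.values()))    # location -> data
        self.live = {b: i for i, b in enumerate(blocks)}   # block -> location
        self.snap = None
        self.writes = 0

    def snapshot(self):
        self.snap = dict(self.live)   # copy the pointers, not the data

    def write(self, block, data):
        location = len(self.storage)  # always a fresh location
        self.storage[location] = data
        self.live[block] = location
        self.writes += 1

    def read_snapshot(self, block):
        return self.storage[self.snap[block]]


blocks = {0: "a", 1: "b", 2: "c"}
lvm, btrfs = LvmStyleVolume(blocks), BtrfsStyleVolume(blocks)
for vol in (lvm, btrfs):
    vol.snapshot()
    vol.write(1, "B")    # first overwrite after the snapshot
    vol.write(1, "BB")   # second overwrite of the same block
print(lvm.writes, btrfs.writes)  # 3 2
```

In this model the snapshot still reads `"b"` for block 1 in both cases; the difference is that the LVM-style volume pays one extra write the first time each block is touched after the snapshot.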
The hypothesis
Since both LVM with snapshotting and Btrfs are CoW, it stands to reason that the solution providing the feature at a higher layer will be more performant and more flexible than one at a lower layer, which has less information to work with for optimization. Because of this, Btrfs should perform better, or at least similarly, while providing more flexibility and simpler management.
The experiment
The experiment consisted of a custom-written script that would allocate a large block of data, pause to allow a snapshot to be taken, then randomly update sections of that block. A custom script was chosen because few benchmarks allow one to pause between the initialization and testing stages. LVM had an EXT4 filesystem on top of it, created using the following flags: -E lazy_itable_init=0,lazy_journal_init=0. Btrfs was created using the default options. The script is reproduced below:

    import datetime
    import multiprocessing
    import random

    EXTENT_SIZE = 4000
    EXTENTS = 100000000000 // EXTENT_SIZE  # integer division so range() accepts it
    THREADS = 8
    FRAGMENT_EXTENTS = 250000


    def thread_setup(file):
        # Runs once in each worker: open a private handle to the random
        # source and to the test file.
        global urandom
        global output
        urandom = open('/dev/urandom', 'rb')
        output = open(file, 'r+b')


    def fill_random(args):
        # Sequentially fill one contiguous slice of the file with random data.
        output.seek(args['start'] * EXTENT_SIZE)
        for i in range(args['size']):
            output.write(urandom.read(EXTENT_SIZE))
        output.flush()


    def fill_random_list(extents):
        # Overwrite the listed extents in place with fresh random data.
        for extent in extents:
            output.seek(extent * EXTENT_SIZE)
            output.write(urandom.read(EXTENT_SIZE))
        output.flush()


    if __name__ == '__main__':
        # Create the file once so the workers can open it without truncating it.
        open('test', 'wb').close()
        p = multiprocessing.Pool(THREADS, thread_setup, ('test',))

        args = []
        for i in range(THREADS):
            args.append({'start': (EXTENTS // THREADS) * i,
                         'size': EXTENTS // THREADS})

        start = datetime.datetime.now()
        # Fill a test file
        p.map(fill_random, args, chunksize=1)
        end = datetime.datetime.now()
        print(end - start)

        print("File made, please make a snapshot now.")
        input("Press enter when snapshot made.")

        # Randomly fragment X pages, giving each worker its own list of extents
        extents = random.sample(range(EXTENTS), FRAGMENT_EXTENTS)
        chunks = [extents[i::THREADS] for i in range(THREADS)]
        start = datetime.datetime.now()
        p.map(fill_random_list, chunks, chunksize=1)
        end = datetime.datetime.now()
        print(end - start)

        # Finally, a big linear read
        start = datetime.datetime.now()
        with open('test', 'rb') as f:
            for i in range(EXTENTS):
                f.read(EXTENT_SIZE)
        end = datetime.datetime.now()
        print(end - start)
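Because the full run writes 100 GB, it can be useful to check the three phases on a small file first. The following is a minimal single-process sketch under assumed, much smaller parameters (`extents=256`, `fragment=32`, a temporary directory); the function name `run_small` is hypothetical, and the timings it returns say nothing about real snapshot performance.

```python
import os
import random
import tempfile
import time

EXTENT_SIZE = 4000


def run_small(path, extents=256, fragment=32, seed=0):
    """Fill, randomly overwrite, then linearly read a small test file."""
    rng = random.Random(seed)
    timings = {}

    # Phase 1: sequential fill with random data.
    start = time.perf_counter()
    with open(path, 'wb') as f:
        for _ in range(extents):
            f.write(os.urandom(EXTENT_SIZE))
    timings['fill'] = time.perf_counter() - start

    # In the real test, the snapshot is taken at this point.

    # Phase 2: overwrite a random subset of extents in place.
    targets = rng.sample(range(extents), fragment)
    start = time.perf_counter()
    with open(path, 'r+b') as f:
        for extent in targets:
            f.seek(extent * EXTENT_SIZE)
            f.write(os.urandom(EXTENT_SIZE))
    timings['update'] = time.perf_counter() - start

    # Phase 3: linear read of the whole file.
    start = time.perf_counter()
    read_bytes = 0
    with open(path, 'rb') as f:
        while chunk := f.read(EXTENT_SIZE):
            read_bytes += len(chunk)
    timings['read'] = time.perf_counter() - start

    return timings, read_bytes


with tempfile.TemporaryDirectory() as tmp:
    timings, read_bytes = run_small(os.path.join(tmp, 'test'))
    print(read_bytes, sorted(timings))
```

The in-place overwrite must leave the file size unchanged, so the linear read should return exactly `extents * EXTENT_SIZE` bytes.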
The results
The results are tabulated below:
Value | LVM | BTRFS | Ratio (LVM / BTRFS) |
Initial Creation Time | 0:22:09.089155 | 0:28:43.236595 | 0.771 |
Time to Randomly Update | 0:03:22.869733 | 0:01:55.728375 | 1.753 |
Linear Read After Update | 0:16:46.113980 | 0:04:54.382375 | 3.418 |
Fragmentation before Update | 69 extents | 100 extents | 0.690 |
Fragmentation after Update | 70576 extents | 63848 extents | 1.105 |
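The Ratio column is the LVM value divided by the Btrfs value, so for the timing rows a ratio above 1 means LVM took longer. As a small sketch of that arithmetic (the helper name `to_seconds` is illustrative and not part of the original script):

```python
def to_seconds(stamp):
    """Convert an 'H:MM:SS.ffffff' duration string to seconds."""
    hours, minutes, seconds = stamp.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


timings = {
    # row name: (LVM, BTRFS), copied from the table
    'Initial Creation Time': ('0:22:09.089155', '0:28:43.236595'),
    'Time to Randomly Update': ('0:03:22.869733', '0:01:55.728375'),
    'Linear Read After Update': ('0:16:46.113980', '0:04:54.382375'),
}

for name, (lvm, btrfs) in timings.items():
    ratio = to_seconds(lvm) / to_seconds(btrfs)
    print(f'{name}: {ratio:.2f}')
```

This prints 0.77, 1.75 and 3.42 for the three rows, matching the table.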
Experiment 2: a real-world benchmark
With the success of the previous experiment, a more real-world benchmark was warranted.
The hypothesis
The hypothesis is that the previous findings will hold with a more mainstream benchmarking tool.
The experiment
I chose blogbench as the test platform since it provides a good mix of both linear and random writes and reads. I targeted 10 GB of space being used, which equated to 136 iterations. Blogbench 1.1 was used for the benchmark. The following script was used to automate the testing process:

    #!/bin/sh
    iterations=30

    # Do BTRFS
    mkfs.btrfs /dev/sdb
    mount /dev/sdb /mnt
    cd /root/blogbench-1.1/src
    # Baseline run, then a second run with a snapshot in place
    ./blogbench -d /mnt -i $iterations | tee ~/btrfs.blogbench.initial
    btrfs subvolume snapshot /mnt/ /mnt/snapshot
    ./blogbench -d /mnt -i $iterations | tee ~/btrfs.blogbench.snapshot
    umount /mnt
    wipefs -a /dev/sdb

    # Do LVM
    pvcreate /dev/sdb
    vgcreate vg0 /dev/sdb
    # Leave 25% of the volume group free for the snapshot
    lvcreate -l 75%FREE -n lv0 vg0
    mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vg0/lv0
    mount /dev/vg0/lv0 /mnt
    cd /root/blogbench-1.1/src
    # Baseline run, then a second run with a snapshot in place
    ./blogbench -d /mnt -i $iterations | tee ~/lvm.blogbench.initial
    lvcreate -l +100%FREE --snapshot -n lv0snap vg0/lv0
    ./blogbench -d /mnt -i $iterations | tee ~/lvm.blogbench.snapshot
    umount /mnt
    lvremove -f /dev/vg0/lv0snap
    lvremove -f /dev/vg0/lv0
    vgremove /dev/vg0
    wipefs -a /dev/sdb
The results
The results are tabulated below:
Value | LVM | BTRFS | Ratio (LVM / BTRFS) |
Initial Read Score | 167695 | 346567 | 0.484 |
Initial Write Score | 1155 | 1436 | 0.804 |
Post-snapshot Read Score | 88398 | 233204 | 0.379 |
Post-snapshot Write Score | 848 | 964 | 0.880 |
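The table compares the two systems against each other; the same numbers can also be read as how much each system's own score changed once a snapshot existed. A short sketch of that calculation, using only the scores from the table (the names `scores` and `percent_change` are illustrative):

```python
scores = {
    # system: (initial read, initial write, post-snapshot read, post-snapshot write)
    'LVM': (167695, 1155, 88398, 848),
    'BTRFS': (346567, 1436, 233204, 964),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


for system, (read0, write0, read1, write1) in scores.items():
    print(f'{system}: reads {percent_change(read0, read1):+.1f}%, '
          f'writes {percent_change(write0, write1):+.1f}%')
```

With the table's numbers, the read score falls by roughly 47% for LVM and roughly 33% for Btrfs after the snapshot, while the write score falls by roughly 27% and 33% respectively. These are single runs, so the figures are indicative only.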

    [root@btrfs-test ~]# cat lvm.blogbench.snapshot

    Frequency = 10 secs
    Scratch dir = [/mnt]
    Spawning 3 writers...
    Spawning 1 rewriters...
    Spawning 5 commenters...
    Spawning 100 readers...
    Benchmarking for 30 iterations.
    The test will run during 5 minutes.

      Nb blogs   R articles   W articles   R pictures   W pictures   R comments   W comments
           351       255030        17729       222611        19185       174246          354
           519        38783         8539        32165         8203        20519            0
           521        91712          195        75868          225        52156          486
           524       265205           44       219897           61       147229            0
           524          312            0          257            0          264            0
           524            0            0            0            0            0            0
           524            0            0            0            0            0            0
           524            0            0            0            0            0            0
           524            0            0            0            0            0            0
           524            0            1            0            0            0            0
           524            0            0            0            0            0            0
           524            0           49            0           44            0           61
           542       204263          869       170643         1062       113274         2803
           576       263147         1805       218163         1715       142694         1409
           601       223393         1474       186252         1326       120374            0
           630       229142         1252       191061         1876       122406            0
           658       230185         1437       191368         1241       117970            0
           693       294852         2044       240333         1635       144919          488
           737       330354         2093       272406         2153       174214          805
           778       379635         1635       313989         1963       184188            0
           812       302766         1697       248385         1608       151070            0
           814       385820           97       316903          143       184704            0
           814       275654            0       228639            0       132450            0
           814       412152            0       340600            0       195353            0
           814       276715            0       227402            0       131327            0
           842       230882         1243       191560         1226       113133         1314
           848       274873          209       226790          296       126418          257
           848       355217            0       291825            0       168253            0
           848       237893            0       196491            0       110130            0
           848       396703            0       323357            0       179002            0

    Final score for writes:           848
    Final score for reads :         88398
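Blogbench ends each run with two summary lines, visible at the bottom of the output above. A minimal sketch for pulling those two numbers out of the saved `tee` files follows; the function name `final_scores` and the regular expression are assumptions based only on the output shown here, not on blogbench's documentation.

```python
import re

# Matches both "Final score for writes: 848" and "Final score for reads : 88398".
SCORE_LINE = re.compile(r'Final score for (writes|reads)\s*:\s*(\d+)')


def final_scores(text):
    """Return {'writes': int, 'reads': int} from blogbench output text."""
    return {kind: int(value) for kind, value in SCORE_LINE.findall(text)}


sample = """
Final score for writes:           848
Final score for reads :         88398
"""
print(final_scores(sample))  # {'writes': 848, 'reads': 88398}
```

Applied to the four files written by the test script (for example `~/lvm.blogbench.snapshot`), this would yield the values used in the results table.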