InnoDB Transactional Characteristics
Jul 1, 2008 / By Keith Murphy
InnoDB is a storage engine that uses MVCC (described shortly) to provide ACID-compliant transactional data storage using row-level locking. MVCC stands for Multi-Version Concurrency Control. It is how InnoDB allows multiple transactions to look at a data set of one or more tables and have a consistent view of the data. MVCC keeps a virtual snapshot of the dataset for each transaction. An example will make this clear.
Let’s assume you have two transactions (and only two transactions) running on a system. If transaction A starts at 10:45:56 and ends at 10:45:89, it gets a consistent view of the dataset during the time that the transaction runs. If transaction B starts at 10:45:65, it would see exactly the same view of the dataset that transaction A saw when it began the transaction. If transaction B started at 10:45:95, it would see the modified dataset after transaction A made modifications. During the duration of each transaction, the dataset that each sees does not change, except for the modifications the transaction itself makes. Consider that a typical production database server is running hundreds of queries a second, and you realize that the job of MVCC/the InnoDB storage engine gets very complicated maintaining all these views of the data.
MySQL server storage engines use three different locking options: table-level locking, page-level locking, and row-level locking. With table-level locking, if a query accesses the table it will lock the entire table and not allow access to the table from other queries. The benefit of this is that it entirely eliminates deadlocking issues. The disadvantage is that, as mentioned, no other queries have access to the table while it is locked. If you had a table with 16,000,000 rows and needed to modify one row, the entire table is inaccessible by other queries. The MyISAM and memory storage engine use table-level locking.
Page-level locking is locking of a group of rows instead of a the entire table. The number of rows actually locked will vary based on a number of factors. Going back to our example of a 16,000,000-row table, lets assume a page-level lock is used. If a page consists of 1,000 rows (this would vary depending on the size of the rows and the actual amount of memory allocated to a page), a lock would lock only a thousand rows. Any of the other 15,999,000 rows could be used by other queries without interference. The BDB storage engine uses page-level locking.
Row-level locking, as the name suggests, acquires a lock on as small an amount as a single row from a table. This will block the minimal amount of table content and allows for the most concurrency on a table without problems. InnoDB and Falcon both use row-level locking.
While the InnoDB configuration options are not strictly related to the transactional characteristics (other than
innodb_flush_log_at_trx_commit), I thought it would be useful to have them here for reference. These are the most common or most important configuration parameters:
innodb_buffer_pool_size: the size in bytes of the memory buffer InnoDB uses to cache data and indexes of its tables. A larger memory buffer means less I/O is needed to access data in tables. On a dedicated database server using primarily InnoDB tables, this should be set to 80 percent or more of the size of the system’s RAM. Be careful with this as setting it too large will cause competition for physical memory might cause paging in the operating system.
innodb_data_file_path: the paths to individual data files and their sizes. The full directory path to each data file is formed by concatenating
innodb_data_home_dirto each path specified here. The file sizes are specified in MB or GB (1024MB) by appending
Gto the size value. The sum of the sizes of the files must be at least 10MB. If you do not specify
innodb_data_file_path, the default behavior is to create a single 10MB auto-extending data file named
ibdata1. The size limit of individual files is determined by your operating system. You can set the file size to more than 4GB on those operating systems that support big files. You can also use raw disk partitions as data files.
innodb_data_home_dir: the common part of the directory path for all
InnoDBdata files. If you do not set this value, the default is the MySQL data directory. You can specify the value as an empty string, in which case you can use absolute file paths in
innodb_file_per_table: If this variable is enabled, InnoDB will create each new table using its own
.ibdfile for storing data and indexes, rather than in the shared tablespace. The default is to create tables in the shared tablespace.
innodb_flush_log_at_trx_commit: When this parameter is set to
0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit.When this value is
1(the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file.
When set to
2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is
2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process-scheduling issues.
The default value of this variable is
1. This is required for ACID-compliance. You can, however, achieve better performance by setting the value other than
1. If you do, there is the possibility that you can lose at most one second worth of transactions in a crash. If you set the value to
0, then any
mysqldprocess crash can erase the last second of transactions. If you set the value to
2, then only an operating system crash or a power outage can erase the last second of transactions. If this is the case, InnoDB’s crash recovery is not affected and will continue to work.
Note that many operating systems and some disk hardware fool the flush-to-disk operation. They may tell
mysqldthat the flush has taken place even though a flush has not happened. Then the durability of transactions is not guaranteed even with a setting of
1. In the worse case, a power outage can even corrupt the InnoDB database. Using a battery-backed disk cache in the SCSI disk controller or in the disk itself speeds up file flushes, and makes the operation safer. You can also try using the Unix command
hdparmto disable the caching of disk writes in hardware caches, or use some other command specific to the hardware vendor.
innodb_log_file_size: the size in bytes of each of the InnoDB log files. The default is 5MB. The larger the value, the less checkpoint flush activity is needed in the buffer pool, which saves disk I/O. But larger log files also mean that recovery is slower in case of a crash. In my experience, reasonable values for this range between 128M and 256M.
I hope this has served as a good introduction both to the InnoDB storage engine and to how transactions are managed by InnoDB.