InnoDB Transactional Characteristics

Jul 1, 2008 / By Keith Murphy

Tags: ,

InnoDB is a storage engine that uses MVCC (described shortly) to provide ACID-compliant transactional data storage using row-level locking.  MVCC stands for Multi-Version Concurrency Control.  It is how InnoDB allows multiple transactions to look at a data set of one or more tables and have a consistent view of the data. MVCC keeps a virtual snapshot of the dataset for each transaction.  An example will make this clear.

Let’s assume you have two transactions (and only two transactions) running on a system. If transaction A starts at 10:45:56 and ends at 10:45:89, it gets a consistent view of the dataset during the time that the transaction runs.  If transaction B starts at 10:45:65, it would see exactly the same view of the dataset that transaction A saw when it began the transaction.  If transaction B started at 10:45:95, it would see the modified dataset after transaction A made modifications. During the duration of each transaction, the dataset that each sees does not change, except for the modifications the transaction itself makes.  Consider that a typical production database server is running hundreds of queries a second, and you realize that the job of MVCC/the InnoDB storage engine gets very complicated maintaining all these views of the data.

MySQL server storage engines use three different locking options: table-level locking, page-level locking, and row-level locking.  With table-level locking, if a query accesses the table it will lock the entire table and not allow access to the table from other queries.  The benefit of this is that it entirely eliminates deadlocking issues. The disadvantage is that, as mentioned, no other queries have access to the table while it is locked.  If you had a table with 16,000,000 rows and needed to modify one row, the entire table is inaccessible by other queries. The MyISAM and memory storage engine use table-level locking.

Page-level locking is locking of a group of rows instead of a the entire table. The number of rows actually locked will vary based on a number of factors. Going back to our example of a 16,000,000-row table, lets assume a page-level lock is used.  If a page consists of 1,000 rows (this would vary depending on the size of the rows and the actual amount of memory allocated to a page), a lock would lock only a thousand rows.  Any of the other 15,999,000 rows could be used by other queries without interference. The BDB storage engine uses page-level locking.

Row-level locking, as the name suggests, acquires a lock on as small an amount as a single row from a table. This will block the minimal amount of table content and allows for the most concurrency on a table without problems. InnoDB and Falcon both use row-level locking.

While the InnoDB configuration options are not strictly related to the transactional characteristics (other than innodb_flush_log_at_trx_commit), I thought it would be useful to have them here for reference.  These are the most common or most important configuration parameters:

  • innodb_buffer_pool_size: the size in bytes of the memory buffer InnoDB uses to cache data and indexes of its tables. A larger memory buffer means less I/O is needed to access data in tables. On a dedicated database server using primarily InnoDB tables, this should be set to 80 percent or more of the size of the system’s RAM. Be careful with this as setting it too large will cause competition for physical memory might cause paging in the operating system.
  • innodb_data_file_path: the paths to individual data files and their sizes. The full directory path to each data file is formed by concatenating innodb_data_home_dir to each path specified here. The file sizes are specified in MB or GB (1024MB) by appending M or G to the size value. The sum of the sizes of the files must be at least 10MB. If you do not specify innodb_data_file_path, the default behavior is to create a single 10MB auto-extending data file named ibdata1. The size limit of individual files is determined by your operating system. You can set the file size to more than 4GB on those operating systems that support big files. You can also use raw disk partitions as data files.
  • innodb_data_home_dir: the common part of the directory path for all InnoDB data files. If you do not set this value, the default is the MySQL data directory. You can specify the value as an empty string, in which case you can use absolute file paths in innodb_data_file_path.
  • innodb_file_per_table: If this variable is enabled, InnoDB will create each new table using its own .ibd file for storing data and indexes, rather than in the shared tablespace. The default is to create tables in the shared tablespace.
  • innodb_flush_log_at_trx_commit: When this parameter is set to 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit.When this value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file.

    When set to 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process-scheduling issues.

    The default value of this variable is 1. This is required for ACID-compliance. You can, however, achieve better performance by setting the value other than 1.  If you do, there is the possibility that you can lose at most one second worth of transactions in a crash. If you set the value to 0, then any mysqld process crash can erase the last second of transactions. If you set the value to 2, then only an operating system crash or a power outage can erase the last second of transactions. If this is the case, InnoDB’s crash recovery is not affected and will  continue to work.

    Note that many operating systems and some disk hardware fool the flush-to-disk operation. They may tell mysqld that the flush has taken place even though a flush has not happened. Then the durability of transactions is not guaranteed even with a setting of 1. In the worse case, a power outage can even corrupt the InnoDB database. Using a battery-backed disk cache in the SCSI disk controller or in the disk itself speeds up file flushes, and makes the operation safer. You can also try using the Unix command hdparm to disable the caching of disk writes in hardware caches, or use some other command specific to the hardware vendor.

  • innodb_log_file_size: the size in bytes of each of the InnoDB log files. The default is 5MB. The larger the value, the less checkpoint flush activity is needed in the buffer pool, which saves disk I/O. But larger log files also mean that recovery is slower in case of a crash. In my experience, reasonable values for this range between 128M and 256M.

I hope this has served as a good introduction both to the InnoDB storage engine and to how transactions are managed by InnoDB.

4 Responses to “InnoDB Transactional Characteristics”

  • Vladimir says:

    Keith,
    it’s really a good intro. It would be nice to hear about innodb-specific data-storage features (e.g. clustering or pk-index-postfixing, etc…) and how this affects the performance of various kinds of queries…

  • Xuekun says:

    Hi, Keith

    It is very good. Thank you.
    I have a question. I heard that Innodb use “Mixed MVCC”, while Falcon use “Pure MVCC”. So what is “Mixed” meanings? Could you give more details? :-)

    Thx, Xuekun

  • Keith Murphy says:

    Xuekun,

    Kevin Lewis, the Falcon team lead said this in his talk “Falcon is not Innodb” at the MySQL Conference.

    Unfortunately, at this point, I don’t know enough to be able to give you a reasonable explanation. However, here is the slides of the talk I mentioned:

    en.oreilly.com/mysql2008/public/asset/attachment/2155

    Keith

  • Keith Murphy says:

    Xuekun,

    One other note about this. From the Falcon information on the MySQL website:

    “True Multi Version Concurrency Control (MVCC) enables records and tables to be updated without the overhead associated with row-level locking mechanisms. The MVCC implementation virtually eliminates the need to lock tables or rows during the update process.”

    Honestly, will be interesting to see if this really works!!

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>