Exposing Innodb internals via system variables: part 3, I/O (table data)

Introduction
This is part three of a five part blog series to explore InnoDB internals by looking at the related tunable system variables. In part 2 we covered variables that had the greatest impact on the file structure of InnoDB as well as how data is written to logs. In this section we will continue looking at I/O but more specifically looking at the mechanics on how data gets written to table files as well as how background threads read from them. Just like in part two, I would like to emphasize something that was written in part one of this blog post series. “I should note that while tuning recommendations are provided, this objective of this blog post series was NOT meant to be a tuning primer, but instead to explore the mechanics that each variable interacts with. As such I would like to advise against reading this guide and trying to fine tune all of the available InnoDB variables. System variable tuning is an exercise in diminishing returns, the most benefit you’ll get out of tuning your MySQL server will occur within the first 15 minutes of configuration. In order to help prevent excess tuning I will provide a list of variables out of each section that should be set when configuring a new system.”Variables
Okay! Let’s have a look at the rest of the I/O related InnoDB system variables. innodb_adaptive_flushing Definition: Innodb has to flush dirty pages from the buffer pool back to the table space as part of the checkpointing process, which ensures that you don’t run out of redo log space. Note, the redo log is cyclical so data information must be purged in order to make space for transaction information that’s coming in. Removing entries from the redo log space requires flushing the associated dirty pages back to the table space. If adaptive flushing is enabled then the rate of dirty page flushing adjusts itself automatically based on the current system workload. Flushing too many pages at once may cause contention for other operations, both read and write. Flushing too few pages may cause InnoDB to fall behind in its checkpointing process and it may run out of redo log space. Adaptive flushing ensures that there is a maintained balance. Default: 1 (ON) Tuning notes: I would recommend leaving this enabled, but if you would like InnoDB to take on characteristics of previous MySQL versions, you can disable it. Without adaptive flushing InnoDB has a tendency to flush data based solely on the amount of remaining space in the redo log as well as the maximum dirty page percentage and can cause spike filled I/O traffic patterns as opposed to something more balanced. You can see examples of spiky flushing patterns that are typically observed with disabling adaptive flushing by looking at this blog post by Devananda van der Veen. Considering the amount of effort that’s been placed on advancing the algorithms that drive adaptive flushing I would recommend that you exercise caution when disabling it and make sure you’ve done extensive testing to support disabling this feature should you choose to explore that option. Associated Metrics: global status: innodb_buffer_pool_pages_dirty and innodb_buffer_pool_bytes_dirty, this will show you what is pending to be flushed. innodb_adaptive_flushing_lwm Definition: If you have chosen to disable adaptive flushing, this is the ‘low water mark’ of the redo log where adaptive flushing will be enabled. IE: When only x% of the redo log space is remaining, adaptive flushing will be enabled, even if you set it to be disabled. Default: 10 (percent) Tuning notes: Change this when you have adaptive flushing disabled and either want it to never be enabled (set to 0) or you want it to be enabled sooner (set higher than 10) Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk. innodb_max_dirty_pages_pct Definition: This is one of the elements that feeds into the flushing algorithm (adaptive or not) that determines how much data should be flushed. The algorithm is going to try and keep the buffer pool dirty pages below this percent based on the number of dirty pages in the buffer pool and the rate of accumulation of the redo log. It should be noted that this does not impact the rate of flushing, but is used to determine a target for flushing activity at each flush iteration. Default: 75 (percent) Tuning notes: Reduce this if you want to lower the amount of dirty pages in order to support more unwritten/unchanged pages for read heavy environments. You can also impact the target for flushing activity by adjusting this variable, lowering it will increase the activity per flush, raising it will decrease the activity. Also, reducing this variable will speed up the process of shutting down MySQL as InnoDB has to flush dirty pages to disk as part of the MySQL shutdown process. Associated Metrics: global status: innodb_buffer_pool_pages_dirty and innodb_buffer_pool_bytes_dirty, this will show you what is pending to be flushed. innodb_max_dirty_pages_pct_lwm Definition: When the dirty page percentage of the buffer pool reaches the number set by this variable, it will begin ‘pre-flushing’ dirty pages out of the buffer pool (essentially more aggressively flushing at each flush iteration) in hopes of avoiding the innodb_max_dirty_pages_pct value. Default: 0 (percent) (disabled) Tuning notes: Adjust this variable when you want InnoDB to more aggressively flush when a certain dirty page percentage is hit. Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk. innodb_flushing_avg_loops Definition: When InnoDB performs flushes, it keeps a snapshot of the flushing ‘state’, IE: how is the current rate of flushing working when we compare this against the growth of the redo log, dirty pages, etc. It keeps a history of these states in order to develop a trend in which it will use to forecast the needed rate / target for flushing. The value of this variable is the number of snapshots that are kept in history. Default: 30 (states) Tuning notes: This is like any kind of trending in that the more data you have, the smoother the trend line is going to be. If you have spiky IO loads on a server where the workload quickly changes, you can increase this value to smooth out the I/O pattern. However, if you have a system where the load is predictable, but need to respond faster (think milliseconds) to sudden spikes in activity, you should lower this value. Per the MySQL reference guide this should only be modified on servers that run write-heavy workloads for long periods of time and have little reduced load time to catch up with changes waiting to be written to disk. Associated Metrics: You need to have access to historical trending data in order to determine the effectiveness of this variable.You will want to see how spikes in transactions compare against the content of dirty pages and how quickly spikes are handled. innodb_lru_scan_depth Definition: InnoDB has a process that runs once a second called the ‘page_cleaner’ and this feeds into the algorithm that’s used to determine rate and target of flushing. This value specifies how far down the LRU list ( least recently used, see section 1 in the the innodb_old_blocks_pct section for how this applies to the InnoDB buffer pool) page_cleaner process should scan when looking for dirty pages to flush. Note that in the buffer pool, the LRU can consist of both old and dirty pages, and since this process is looking for dirty pages, the deeper into the LRU list the cleaner scans the more pages it will find to flush. Default: 1024 (pages) Tuning notes: If you have a large buffer pool, and if you have a write intensive workload that’s saturating I/O, then decrease this value. It’s also theorized that if you increase innodb_io_capacity, you should increase this variable as well, which you can read about in this blog post by Mark Callaghan. innodb_flush_neighbors Definition: If this variable is set to enabled, when InnoDB performs a flush of a page from the buffer pool to persistent storage, it will check to see if any other dirty pages in the buffer pool belong in the same extent, and if so it will flush those as well in order to group writes and avoid disk seek time latency. The options for this variable are as follows... 1 = Enabled, but only flushes dirty neighbor pages from the buffer pool if they are in the same extent and the are considered contiguous. 2 = Enabled, but the requirement for contiguous pages is not present in this setting, which means neighbor flushing is more likely to occur. 0 = Disabled Default: 1 Tuning notes: Disable when you are using SSDs or fast IO solution. innodb_flush_method Definition: This defines the system calls the InnoDB will use to push data to physical storage for both data and log files. I would recommend reading Ivan Groenewold’s blog post ‘ Innodb flushing and linux I/O’ to get a better understanding of what each option means and how it relates to the usage of fsync(), fdatasync(), O_SYNC, and O_DIRECT. Default: Null (On linux this defaults to fsync, on windows it defaults to async_unbuffered) Tuning notes: I agree with Ivan’s recommendation that O_DIRECT should be considered for most modern workloads, especially if you have a properly configured InnoDB buffer pool. It should also be noted that if you are using InnoDB file per table you may see overhead using fsync, which makes use of OS level write buffers to help restructure the order of write operations to disk. When InnoDB file per table is in use, these write operations cannot be merged and each file will need to have it’s own fsync process. As such O_DIRECT should be considered when using InnoDB file per table. If you are not using InnoDB file per table, then you may see better performance with fsync. innodb_doublewrite Definition: The double write buffer is used to ensure that a request to write a page to disk actually makes it to the disk before the associated change records is purged from the InnoDB redo log. More specifically it’s there to protect against partial page writes. Pages are written to the double write buffer first before being written to table space. When InnoDB performs recovery it will check page integrity by verifying the data against the associate checksum at the end of the page. It will do this for both pages in the table and in the double write buffer. If it find a page in the table space that’s inconsistent with its checksum it will recover it from the doublewrite buffer. If it finds a pages in the double write buffer that is inconsistent with its checksum, it discards it. I would recommend reading Peter Zaitsev’s blog post for more details on how this works. Default: On Tuning notes: The double write buffer is really small, only large enough to support 100 pages, and is sequential in nature. Despite having to write the page twice, the I/O overhead is not double, but it is notable. It’s been stated in the past that you can consider disabling the double write buffer if you’re using a file system that supports block level journaling. However, if you take a closer look in the comments section of this blog post by Yves Trudeau, you’ll notice that there have been several individuals that have reported issues with disabling the doublewrite buffer in this case. As such it’s only advisable to disable the double write buffer if you are using a file system that has atomic writes, such as FusionIO or ZFS, or if you are willing to sacrifice data integrity for performance. Otherwise my recommendation is to leave is enabled. Associated Metrics: Performance schema: wait/synch/mutex/InnoDB/buf_dblwr_mutex innodb_read_ahead_threshold Definition: Innodb has 2 different methods of read ahead and this variable ties directly to the linear read ahead method. This setting dictates how many pages must be read sequentially in one extent (1 meg of page data, typically 64 pages unless you’re using the compression row format) in order for read ahead to occur. When it does occur, it automatically pulls in all of the pages from the next extent. For example, if you stay with the default of 56, then when InnoDB reads data from an extent if 56 pages are pulled in sequentially, then it will automatically pull in the full contents of the next extent. Default: 56 (pages) Tuning notes: Read ahead was originally implemented to reduce the amount of seek time that was needed to pull information from disk to the buffer pool. If you have SSDs you may want to consider increasing the value of this variable to make linear read ahead less likely. Associated Metrics: In SHOW ENGINE INNODB STATUS under BUFFER POOL AND MEMORY section there is Pages read ahead and evicted without access. You will also see the values innodb_buffer_pool_read_ahead and innodb_buffer_pool_read_ahead_evicted in SHOW GLOBAL STATUS. The read ahead shows how many pages have been pulled into the buffer pool via read ahead requests, and the evicted status shows how many of them were evicted from the buffer pool without ever having been accessed. innodb_random_read_ahead Definition: This is like innodb_read_ahead_threshold, but this variable is associated with random read ahead. When this is enabled InnoDB will check the buffer pool and if 13 consecutive pages are found in the buffer pool it will make an I/O request for the rest of the pages in the associated extent. Default: OFF Tuning notes: Unlike linear read ahead where you can tune the likelihood of read head, random read ahead can only be enabled or disabled. It’s generally considered best to leave this at it’s default value of ‘off’, especially if you have SSDs, but if you have have a lot of read requests that pull data from a table in a way that doesn’t always coincide with the order of the clustered index, you may see improvement from enabling this feature. However you should keep a close eye on the associated metric to make sure you’re benefiting from having this enabled. Associated Metrics: In the output of SHOW ENGINE INNODB STATUS, in the BUFFER POOL AND MEMORY section you will find random read aheads and evicted without access. Much like linear read aheads, SHOW GLOBAL STATUS has a status to show how many pages have been pulled in from the random read ahead requests, innodb_buffer_pool_read_ahead_rnd. Unfortunately there is only one evicted status, innodb_buffer_pool_read_ahead_evicted, and that leaves you unable to know whether the pages that were evicted came from the linear or random read ahead. innodb_change_buffering Definition: When dirty pages are flushed in InnoDB, the flushing typically only pertains to the clustered index of the table, which is the main table data for InnoDB. Secondary indexes are typically not written as part of this process unless the associated secondary index page is already in the buffer pool as updates to these values typically leave the secondary index in a highly unsorted order. These modified secondary index pages are saved in the change buffer until the secondary index page is pulled into the buffer pool by other means. This process prevents InnoDB from performing unnecessary expensive I/O operations. This variable denotes whether the change buffer is enabled or disabled, and what types of data changes should be handled by the change buffer. Default: All (see the reference guide for all available options) Tuning notes: You can disable this if you have a very read heavy workload with minimal updates and want to preserve space in the buffer pool to allow for more unchanged pages. However I would recommend revising the innodb_change_buffer_max_size instead of disabling change buffering entirely. Associated Metrics: In the output of SHOW ENGINE INNODB STATUS in the INSERT BUFFER AND ADAPTIVE HASH INDEX section, you will see some basic information pertaining to the current state of the buffer. Please note that in the output the change buffer is referred to as the insert buffer. The reason for this is due to the fact that in previous versions of MySQL, changes to the secondary index were part of the buffer only if they were the result of an insert operation. This is no longer the case. Here are some of the more important metrics to consider.- Seg size: The total number of change buffer pages allocated. This includes 1 page for the change buffer header.
- Free list len: The number of available pages in the change buffer.
- Ibuf: size: The number of populated pages in the change buffer.
- Inserts: Number of inserts in the buffer waiting to be merged. It’s important to know that since the change buffer persists to the system tablespace, this is NOT the number of inserts that have gone into the change buffer since the server started and instead notes the total number of changes waiting to be merged.
- Merged Recs: The number of records that have been merged since the server started
- Merges: The number of merges that occurred since the server started.
- Merged operations: The represents records that were merged by change buffering
- Insert: number of inserts merged
- Delete mark: number of deleted records merged
- Delete: number of purge records merged
- Discarded operations: Number of operations in the change buffer that were discarded
- Insert: number of insert merged operations discarded
- Delete mark: number of deleted merged operations discarded
- Delete: number of purge merged operations discarded