Why do we sometimes want to keep duplicate indexes?

I’ve done dutiful DBA work in the past to identify and remove what are commonly called duplicate indexes. That is, those indexes that look like (a) and (a,b). The thought is that a query will utilize an index as easily on (a) as on (a,b), and removing (a) will save storage cost and write performance. I’ve had the experience, though, of removing (a) and seeing performance tank.

(As an aside, these are really redundant indexes. A duplicate index would be (a,b) and (a,b) by two different names – this can commonly be done by object relational mapping (ORM) or other automated schema creation tools. I’ll call (a) and (a,b) redundant indexes below.)

This test is on Percona Server 5.7.14 with the sys schema installed and performance schema enabled.

Given two tables with the same number of rows and with the same schema except an extra index on the second table:

mysql>SHOW CREATE TABLE mysqlslap.t5\G
       TABLE: t5
CREATE TABLE: CREATE TABLE `t5` (
  `intcol1` INT(11) DEFAULT NULL,
  `intcol2` INT(11) DEFAULT NULL,
  `intcol3` INT(11) DEFAULT NULL,
  `charcol1` VARCHAR(255) DEFAULT NULL,
  `charcol2` VARCHAR(255) DEFAULT NULL,
  `charcol3` VARCHAR(255) DEFAULT NULL,
  KEY `one` (`intcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 ROW IN SET (0.00 sec)
mysql>SHOW CREATE TABLE mysqlslap.t6\G
       TABLE: t6
CREATE TABLE: CREATE TABLE `t6` (
  `intcol1` INT(11) DEFAULT NULL,
  `intcol2` INT(11) DEFAULT NULL,
  `intcol3` INT(11) DEFAULT NULL,
  `charcol1` VARCHAR(255) DEFAULT NULL,
  `charcol2` VARCHAR(255) DEFAULT NULL,
  `charcol3` VARCHAR(255) DEFAULT NULL,
  KEY `one` (`intcol1`),
  KEY `two` (`intcol1`,`charcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 ROW IN SET (0.00 sec)

Notice that table t6 has a redundant index on intcol1.

The tables were populated with a statement like the following:

sudo mysqlslap –concurrency=2 –iterations=10000 –query=“insert into mysqlslap.t5(intcol1,intcol2,intcol3,charcol1,charcol2,charcol3,charcol4) values(FLOOR(RAND() * 10000),FLOOR(RAND() * 10000),FLOOR(RAND() * 10000),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),SUBSTRING(MD5(RAND()) FROM 1 FOR 50),now());” –delimiter=“;” –verbose

And then for the remaining tables, like this:

mysql>INSERT INTO t6 SELECT * FROM t5;

In both cases, the explain plan is showing the following query will use index one (look at the read_cost value, too):

mysql>EXPLAIN format=json SELECT intcol1 FROM mysqlslap.t5 WHERE intcol1>1910858200\G
EXPLAIN: {
  “query_block”: {
    “select_id”: 1,
    “cost_info”: {
      “query_cost”: “7092.73},TABLE”: {TABLE_NAME”: “t5”,
      “access_type”: “range”,
      “possible_keys”: [
        “one”
      ],KEY”: “one”,
      “used_key_parts”: [
        “intcol1”
      ],
      “key_length”: “5,
      “rows_examined_per_scan”: 17670,
      “rows_produced_per_join”: 17670,
      “filtered”: “100.00,
      “using_index”: TRUE,
      “cost_info”: {
        “read_cost”: “3558.73,
        “eval_cost”: “3534.00,
        “prefix_cost”: “7092.73,
        “data_read_per_join”: “13M”
      },
      “used_columns”: [
        “intcol1”
      ],
      “attached_condition”: “(`mysqlslap`.`t5`.`intcol1`>1910858200)}
  }
}
 
mysql>EXPLAIN format=json SELECT intcol1 FROM mysqlslap.t6 WHERE intcol1>1910858200\G
EXPLAIN: {
  “query_block”: {
    “select_id”: 1,
    “cost_info”: {
      “query_cost”: “7092.73},TABLE”: {TABLE_NAME”: “t6”,
      “access_type”: “range”,
      “possible_keys”: [
        “one”,
        “two”
      ],KEY”: “one”,
      “used_key_parts”: [
        “intcol1”
      ],
      “key_length”: “5,
      “rows_examined_per_scan”: 17670,
      “rows_produced_per_join”: 17670,
      “filtered”: “100.00,
      “using_index”: TRUE,
      “cost_info”: {
        “read_cost”: “3558.73,
        “eval_cost”: “3534.00,
        “prefix_cost”: “7092.73,
        “data_read_per_join”: “13M”
      },
      “used_columns”: [
        “intcol1”
      ],
      “attached_condition”: “(`mysqlslap`.`t6`.`intcol1`>1910858200)}
  }
}
 
mysql>EXPLAIN format=json SELECT intcol1 FROM mysqlslap.t6 force INDEX(two) WHERE intcol1>1910858200\G
EXPLAIN: {
  “query_block”: {
    “select_id”: 1,
    “cost_info”: {
      “query_cost”: “7508.43},TABLE”: {TABLE_NAME”: “t6”,
      “access_type”: “range”,
      “possible_keys”: [
        “two”
      ],KEY”: “two”,
      “used_key_parts”: [
        “intcol1”
      ],
      “key_length”: “5,
      “rows_examined_per_scan”: 17368,
      “rows_produced_per_join”: 17368,
      “filtered”: “100.00,
      “using_index”: TRUE,
      “cost_info”: {
        “read_cost”: “4034.84,
        “eval_cost”: “3473.60,
        “prefix_cost”: “7508.44,
        “data_read_per_join”: “12M”
      },
      “used_columns”: [
        “intcol1”
      ],
      “attached_condition”: “(`mysqlslap`.`t6`.`intcol1`>1910858200)}
  }
}

But the explain tool only shows what the optimizer predicts it will do; is the optimizer predicting the best execution plan? What really happens?

mysqlslap –concurrency=2 –iterations=1000 –create-schema=mysqlslap –query=“select intcol1 from t5 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t6 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t6 force index(two) where intcol1>FLOOR(RAND() * 10000000);” –delimiter=“;” –verbose
mysql>SELECT * FROM sys.schema_index_statistics;
------------------------------------------------------------------------------------------------------------
| table_schema | TABLE_NAME | index_name | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency |
------------------------------------------------------------------------------------------------------------
| mysqlslap    | t6         | two        |     200692044 | 5.32 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
| mysqlslap    | t6         | one        |     200692044 | 4.92 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
| mysqlslap    | t5         | one        |     200692044 | 4.87 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
| sys          | sys_config | PRIMARY    |             0 | 0 ps           |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
------------------------------------------------------------------------------------------------------------

The select_latency is higher for the index on (intcol1,charcol1). That value is coming from the performance schema via the sys schema:

`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_FETCH`) AS `select_latency`

The full sys schema view is:

CREATE ALGORITHM=MERGE DEFINER=`mysql.sys`@`localhost` SQL SECURITY INVOKER VIEW `sys`.`schema_index_statistics` AS SELECT `performance_schema`.`table_io_waits_summary_by_index_usage`.`OBJECT_SCHEMA` AS `table_schema`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`OBJECT_NAME` AS `table_name`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`INDEX_NAME` AS `index_name`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_FETCH` AS `rows_selected`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_FETCH`) AS `select_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_INSERT` AS `rows_inserted`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_INSERT`) AS `insert_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_UPDATE` AS `rows_updated`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_UPDATE`) AS `update_latency`,`performance_schema`.`table_io_waits_summary_by_index_usage`.`COUNT_DELETE` AS `rows_deleted`,`sys`.`format_time`(`performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_INSERT`) AS `delete_latency` FROM `performance_schema`.`table_io_waits_summary_by_index_usage` WHERE (`performance_schema`.`table_io_waits_summary_by_index_usage`.`INDEX_NAME` IS NOT NULL) ORDER BY `performance_schema`.`table_io_waits_summary_by_index_usage`.`SUM_TIMER_WAIT` DESC

Just to prove the results aren’t being confounded by the redundant index itself, here are two new tables with the same data:

mysql>SHOW CREATE TABLE mysqlslap.t7\G
       TABLE: t7
CREATE TABLE: CREATE TABLE `t7` (
  `intcol1` INT(11) DEFAULT NULL,
  `intcol2` INT(11) DEFAULT NULL,
  `intcol3` INT(11) DEFAULT NULL,
  `charcol1` VARCHAR(255) DEFAULT NULL,
  `charcol2` VARCHAR(255) DEFAULT NULL,
  `charcol3` VARCHAR(255) DEFAULT NULL,
  KEY `one` (`intcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 ROW IN SET (0.00 sec)
 
mysql>SHOW CREATE TABLE mysqlslap.t8\G
       TABLE: t8
CREATE TABLE: CREATE TABLE `t8` (
  `intcol1` INT(11) DEFAULT NULL,
  `intcol2` INT(11) DEFAULT NULL,
  `intcol3` INT(11) DEFAULT NULL,
  `charcol1` VARCHAR(255) DEFAULT NULL,
  `charcol2` VARCHAR(255) DEFAULT NULL,
  `charcol3` VARCHAR(255) DEFAULT NULL,
  KEY `two` (`intcol1`,`charcol1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 ROW IN SET (0.00 sec)

(Restarted to reset the performance schema and sys tables.)

mysqlslap –concurrency=2 –iterations=1000 –create-schema=mysqlslap –query=“select intcol1 from t7 where intcol1>FLOOR(RAND() * 10000000);select intcol1 from t8 where intcol1>FLOOR(RAND() * 10000000);” –delimiter=“;” –verbose
mysql>SELECT * FROM sys.schema_index_statistics;
------------------------------------------------------------------------------------------------------------
| table_schema | TABLE_NAME | index_name | rows_selected | select_latency | rows_inserted | insert_latency | rows_updated | update_latency | rows_deleted | delete_latency |
------------------------------------------------------------------------------------------------------------
| mysqlslap    | t8         | two        |     200012000 | 5.16 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
| mysqlslap    | t7         | one        |     200012000 | 4.67 m         |             0 | 0 ps           |            0 | 0 ps           |            0 | 0 ps           |
------------------------------------------------------------------------------------------------------------

Remember, this is the same data and schema except for the indexes.

Conclusion: Sometimes redundant indexes are helpful.

Thank you to Abhinav Gupta for the technical review.