Why are your indexes larger than your actual data?

Oct 11, 2011 12:00:00 AM

It’s a common but startling sight for any DBA: you look at your database statistics and realize your indexes are larger—sometimes much larger—than the actual data they are supposed to be organizing. When a table becomes this unwieldy, it naturally triggers a "crisis of architecture." Should you partition? Shard? Or just move the whole mess to a different database engine?

Before you start migrating your life's work to a new platform, let's look at why this happens and how to fix it without the nuclear option.

Understanding the InnoDB Index Bloat Phenomenon

The problem is especially common with InnoDB tables due to several structural realities:

Why InnoDB Tables Get So Large

  • Clustered Indexes: In InnoDB, the table itself is stored as the "clustered index," ordered by primary key. The primary key columns are then appended to every secondary index entry, so a large primary key (like a long VARCHAR or a wide composite key) has a detrimental "multiplier effect" on every other index on that table.
  • Slight Index Overhead: Each index record contains a five-byte header that may be preceded by a variable-length header. This header links consecutive records and manages row-level locking. It’s small individually, but it adds up across millions of rows.
  • Fragmentation: Indexes are rarely packed perfectly. They become fragmented over time, leaving "holes" in the B-tree pages.
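To make the multiplier effect concrete, here is a sketch (table and column names are invented for illustration) of two layouts of the same table. With a 100-character UTF8 primary key, every secondary index entry carries up to ~300 bytes of key; with an INT surrogate key, it carries 4 bytes.

```sql
-- Hypothetical example: InnoDB copies the primary key into every
-- secondary index entry.

-- Layout A: wide primary key (up to ~300 bytes with utf8),
-- repeated inside idx_email and idx_created for every row.
CREATE TABLE users_wide_pk (
    fullname   VARCHAR(100) NOT NULL,
    email      VARCHAR(100) NOT NULL,
    created_at DATETIME     NOT NULL,
    PRIMARY KEY (fullname),
    KEY idx_email (email),
    KEY idx_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Layout B: 4-byte surrogate key; the secondary indexes now
-- append only an INT to each entry.
CREATE TABLE users_int_pk (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    fullname   VARCHAR(100) NOT NULL,
    email      VARCHAR(100) NOT NULL,
    created_at DATETIME     NOT NULL,
    PRIMARY KEY (id),
    KEY idx_email (email),
    KEY idx_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

Same data, same secondary indexes, but layout B's indexes are a fraction of the size.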

When compared to MyISAM, which prefix-compresses its string indexes ("packed keys"), InnoDB can look like it’s put on some serious holiday weight.

Diagnosing the Problem: Is Your Indexing Out of Control?

To see if you're suffering from index bloat, you can run a high-level check on your storage engines.

Warning: Be careful with the following queries. Run them on a replica or other non-production server whenever possible, as scanning information_schema can seriously slow down a busy server.

Identifying Top Offenders via Information Schema

First, let's look at the "Index Fraction" across your engines:

SELECT
    engine,
    COUNT(*) tables,
    CONCAT(ROUND(SUM(table_rows)/1000000,2),'M') rows,
    CONCAT(ROUND(SUM(data_length)/(1024*1024*1024),2),'G') data,
    CONCAT(ROUND(SUM(index_length)/(1024*1024*1024),2),'G') idx,
    CONCAT(ROUND(SUM(data_length+index_length)/(1024*1024*1024),2),'G') total_size,
    ROUND(SUM(index_length)/SUM(data_length),2) idxfrac
FROM information_schema.TABLES
GROUP BY engine
ORDER BY SUM(data_length+index_length) DESC
LIMIT 10;

In typical results, you might see InnoDB with an idxfrac of 1.57, while MyISAM sits at 0.50. Now, let's find the specific tables where the index is 3 to 4.5 times the size of the data:

SELECT
    CONCAT(table_schema, '.', table_name),
    CONCAT(ROUND(table_rows / 1000000, 2), 'M') rows,
    CONCAT(ROUND(data_length / (1024 * 1024 * 1024), 2), 'G') data,
    CONCAT(ROUND(index_length / (1024 * 1024 * 1024), 2), 'G') idx,
    CONCAT(ROUND((data_length + index_length) / (1024 * 1024 * 1024), 2), 'G') total_size,
    ROUND(index_length / data_length, 2) idxfrac
FROM information_schema.TABLES
ORDER BY data_length + index_length DESC
LIMIT 3;

Root Causes of Index Overgrowth

In my experience, the technical reasons for bloat usually boil down to:

  1. Too many indexes (indexing every column "just in case").
  2. Indexes on massive columns (like VARCHAR(255) with UTF8).
  3. Redundant or duplicate indexes.
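A quick way to spot candidates for causes 1 and 3 is to dump the index definitions straight from information_schema. This is a sketch; the schema name is a placeholder you would replace with your own database:

```sql
-- List every index with its column list, one row per index, so
-- redundant pairs like index1(a,b) vs index2(a) are easy to eyeball.
SELECT
    table_name,
    index_name,
    GROUP_CONCAT(column_name ORDER BY seq_in_index) AS columns
FROM information_schema.STATISTICS
WHERE table_schema = 'your_db'   -- hypothetical schema name
GROUP BY table_name, index_name
ORDER BY table_name, index_name;
```

Any index whose column list is a left-prefix of another index on the same table is a drop candidate.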

But what are the human causes? Usually, it’s a mix of:

  • A "happy trigger" on the phpMyAdmin "add index" button.
  • Indiscriminately adding indexes to fix slow queries when the table was small.
  • Trying a dozen different combinations of multi-column indexes.
  • Using indexes as a "band-aid" for bad schema design.

Step-by-Step Solutions to Reclaim Space

Solving this is a learning process, but here is a four-step framework that consistently yields results.

1. Pruning Unused and Duplicate Indexes

You need to identify which indexes aren't pulling their weight. I recommend using pt-index-usage. By feeding it your general log (or a slow log with long_query_time=0), it can generate a report of which indexes were actually touched by your application and suggest DROP statements for the rest.

  • Pro Tip: Remember that queries not running during your log window (like monthly reports) won't show up. Don't run DROP statements blindly; use them as a guide.
  • Duplicate Check: Use pt-duplicate-key-checker to find obvious overlaps.

2. Deep-Diving Into Specific Tables

Pick a top offender and investigate the index definitions. Look for overlap like index1(a,b) and index2(a).

You can often combine indexes to increase selectivity. For example, if searching by fullname alone yields too many results (say, 100 "John Smiths"), adding an active column to the index might narrow that down to 30:

ALTER TABLE users DROP INDEX index1, ADD INDEX index1(fullname, active);

3. Implementing Prefix Indexing for Large Columns

If you have a VARCHAR(255) using UTF8, a full index can use up to 767 bytes per entry. You can drastically reduce size by indexing only the first few characters.

To find the "sweet spot" for precision:

SET @unique_total = (SELECT COUNT(DISTINCT fullname) FROM users);

SELECT
    ROUND(COUNT(DISTINCT LEFT(fullname,100))/@unique_total*100,2) AS c100,
    ROUND(COUNT(DISTINCT LEFT(fullname,30))/@unique_total*100,2) AS c30,
    ROUND(COUNT(DISTINCT LEFT(fullname,20))/@unique_total*100,2) AS c20
FROM users;

If 30 characters give you 99.81% uniqueness, there’s no reason to index all 255. Be careful, though—setting the prefix too low forces the database to do extra work checking rows that share the same prefix.
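Assuming 30 characters turns out to be the sweet spot, the change itself is a single ALTER (the index name idx_fullname here is hypothetical):

```sql
-- Replace the full-width index with a 30-character prefix index.
-- With utf8 this drops each entry from up to 765 bytes to at most 90.
ALTER TABLE users
    DROP INDEX idx_fullname,
    ADD INDEX idx_fullname (fullname(30));
```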

4. Optimizing Data Types and Normalization

If a VARCHAR column has a limited set of possible values, consider converting it to an ENUM (1-2 bytes). If the cardinality is high but repetitive, normalize it to another table and keep an INT ID in the original table. This saves a massive amount of space, especially when multiplied across clustered indexes.
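As a sketch of both options (the column and table names are invented): a low-cardinality status column becomes an ENUM, while a repetitive city column moves to a lookup table referenced by an INT:

```sql
-- Option 1: limited value set -> ENUM (1 byte for up to 255 values).
ALTER TABLE users
    MODIFY status ENUM('active','inactive','banned') NOT NULL;

-- Option 2: repetitive strings -> normalize into a lookup table.
CREATE TABLE cities (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uk_name (name)
) ENGINE=InnoDB;

-- The original table then stores only the 4-byte id, which also
-- shrinks every index that previously included the city string.
ALTER TABLE users
    ADD COLUMN city_id INT UNSIGNED NOT NULL,
    ADD INDEX idx_city (city_id);
```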
