Why are your indexes larger than your actual data?
It’s a common but startling sight for any DBA: you look at your database statistics and realize your indexes are larger—sometimes much larger—than the actual data they are supposed to be organizing. When a table becomes this unwieldy, it naturally triggers a "crisis of architecture." Should you partition? Shard? Or just move the whole mess to a different database engine?
Before you start migrating your life's work to a new platform, let's look at why this happens and how to fix it without the nuclear option.
Understanding the InnoDB Index Bloat Phenomenon
The problem is especially common with InnoDB tables due to several structural realities:
Why InnoDB Tables Get So Large
- Clustered Indexes: In InnoDB, the primary key is the "clustered index," and a copy of it is stored in every secondary index. If your primary key is large (like a long `VARCHAR` or a complex composite key), it has a detrimental "multiplier effect" on every other index on that table.
- Slight Index Overhead: Each index record contains a five-byte header that may be preceded by a variable-length header. This header links consecutive records and manages row-level locking. It's small individually, but it adds up across millions of rows.
- Fragmentation: Indexes are rarely packed perfectly. They become fragmented over time, leaving "holes" in the B-tree pages.
Compared to MyISAM, which compresses its string indexes, InnoDB can look like it has put on some serious holiday weight.
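To make the multiplier effect concrete, here is a minimal sketch with hypothetical table and column names. Every secondary index entry in InnoDB carries a full copy of the primary key, so the same `idx_email` index is far heavier under a wide composite key than under a surrogate integer:

```sql
-- Wide composite PK: each idx_email entry also stores up to
-- ~600 bytes of (order_ref, country) primary-key data.
CREATE TABLE orders_wide_pk (
  order_ref VARCHAR(100) NOT NULL,
  country   VARCHAR(100) NOT NULL,
  email     VARCHAR(100),
  PRIMARY KEY (order_ref, country),
  KEY idx_email (email)
);

-- Surrogate INT PK: each idx_email entry carries only 4 extra bytes.
CREATE TABLE orders_int_pk (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  order_ref VARCHAR(100) NOT NULL,
  country   VARCHAR(100) NOT NULL,
  email     VARCHAR(100),
  PRIMARY KEY (id),
  KEY idx_email (email)
);
```

The table definitions are illustrative only; the point is that the primary key's width is paid once per secondary index entry, across every secondary index.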
Diagnosing the Problem: Is Your Indexing Out of Control?
To see if you're suffering from index bloat, you can run a high-level check on your storage engines.
Warning: Be careful when running the following queries. Whenever possible, run them on a replica or other non-production server, as they can seriously slow down a busy server.
Identifying Top Offenders via Information Schema
First, let's look at the "Index Fraction" across your engines:
```sql
SELECT engine,
       COUNT(*) AS tables,
       CONCAT(ROUND(SUM(table_rows) / 1000000, 2), 'M') AS `rows`,
       CONCAT(ROUND(SUM(data_length) / (1024 * 1024 * 1024), 2), 'G') AS data,
       CONCAT(ROUND(SUM(index_length) / (1024 * 1024 * 1024), 2), 'G') AS idx,
       CONCAT(ROUND(SUM(data_length + index_length) / (1024 * 1024 * 1024), 2), 'G') AS total_size,
       ROUND(SUM(index_length) / SUM(data_length), 2) AS idxfrac
FROM information_schema.TABLES
GROUP BY engine
ORDER BY SUM(data_length + index_length) DESC
LIMIT 10;
```
In typical results, you might see InnoDB with an idxfrac of 1.57, while MyISAM sits at 0.50. Now, let's find the specific tables where the index is 3 to 4.5 times the size of the data:
```sql
SELECT CONCAT(table_schema, '.', table_name) AS table_name,
       CONCAT(ROUND(table_rows / 1000000, 2), 'M') AS `rows`,
       CONCAT(ROUND(data_length / (1024 * 1024 * 1024), 2), 'G') AS data,
       CONCAT(ROUND(index_length / (1024 * 1024 * 1024), 2), 'G') AS idx,
       CONCAT(ROUND((data_length + index_length) / (1024 * 1024 * 1024), 2), 'G') AS total_size,
       ROUND(index_length / data_length, 2) AS idxfrac
FROM information_schema.TABLES
ORDER BY data_length + index_length DESC
LIMIT 3;
```
Root Causes of Index Overgrowth
In my experience, the technical reasons for bloat usually boil down to:
- Too many indexes (indexing every column "just in case").
- Indexes on massive columns (like `VARCHAR(255)` with UTF8).
- Redundant or duplicate indexes.
But what are the human causes? Usually, it’s a mix of:
- A trigger-happy approach to the "add index" button in phpMyAdmin.
- Indiscriminately adding indexes to fix slow queries when the table was small.
- Trying a dozen different combinations of multi-column indexes.
- Using indexes as a "band-aid" for bad schema design.
Step-by-Step Solutions to Reclaim Space
Solving this is a learning process, but here is a four-step framework that consistently yields results.
1. Pruning Unused and Duplicate Indexes
You need to identify which indexes aren't pulling their weight. I recommend using pt-index-usage. By feeding it your general log (or a slow log with long_query_time=0), it can generate a report of which indexes were actually touched by your application and suggest DROP statements for the rest.
- Pro Tip: Remember that queries not running during your log window (like monthly reports) won't show up. Don't run the suggested `DROP` statements blindly; use them as a guide.
- Duplicate Check: Use `pt-duplicate-key-checker` to find obvious overlaps.
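If the Percona Toolkit isn't available, a rough manual check is possible against `information_schema.STATISTICS`. This sketch (schema and table names are placeholders) lists each index with its columns in order, making it easy to spot one index that is merely a prefix of another:

```sql
-- List every index on one table with its column order, so that
-- redundant prefixes (e.g. idx_a(a) shadowed by idx_ab(a,b))
-- stand out at a glance.
SELECT index_name,
       GROUP_CONCAT(column_name ORDER BY seq_in_index) AS index_columns
FROM information_schema.STATISTICS
WHERE table_schema = 'mydb'    -- assumption: your schema name
  AND table_name   = 'users'   -- assumption: the table under review
GROUP BY index_name;
```

This only surfaces structural redundancy; it says nothing about whether an index is actually used, which is what `pt-index-usage` adds.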
2. Deep-Diving Into Specific Tables
Pick a top offender and investigate the index definitions. Look for overlap like index1(a,b) and index2(a).
You can often combine indexes to increase selectivity. For example, if searching by fullname alone yields too many matching rows (say, 100 "John Smiths"), adding an active column to the index might narrow that down to 30:

```sql
ALTER TABLE users
  DROP INDEX index1,
  ADD INDEX index1 (fullname, active);
```
3. Implementing Prefix Indexing for Large Columns
If you have a `VARCHAR(255)` using UTF8, a full index on it can consume up to 765 bytes per entry (right at InnoDB's historical 767-byte prefix limit). You can drastically reduce size by indexing only the first few characters.
To find the "sweet spot" for precision:
```sql
-- Baseline: the number of fully distinct values.
SET @unique_total := (SELECT COUNT(DISTINCT fullname) FROM users);

SELECT ROUND(COUNT(DISTINCT LEFT(fullname, 100)) / @unique_total * 100, 2) AS c100,
       ROUND(COUNT(DISTINCT LEFT(fullname, 30))  / @unique_total * 100, 2) AS c30,
       ROUND(COUNT(DISTINCT LEFT(fullname, 20))  / @unique_total * 100, 2) AS c20
FROM users;
```
If 30 characters give you 99.81% uniqueness, there’s no reason to index all 255. Be careful, though—setting the prefix too low forces the database to do extra work checking rows that share the same prefix.
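Once the sweet spot is chosen, applying it is a one-line change. A minimal sketch, assuming an existing full-width index named `idx_fullname` (the name is hypothetical):

```sql
-- Replace the full-width index with a 30-character prefix index,
-- assuming the ~99.8% uniqueness measured above is acceptable.
ALTER TABLE users
  DROP INDEX idx_fullname,
  ADD INDEX idx_fullname (fullname(30));
```

Note that prefix indexes cannot serve as covering indexes and cannot be used for `ORDER BY` optimization on the full column, so verify your key queries with `EXPLAIN` after the change.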
4. Optimizing Data Types and Normalization
If a VARCHAR column has a limited set of possible values, consider converting it to an ENUM (1-2 bytes). If the cardinality is high but repetitive, normalize it to another table and keep an INT ID in the original table. This saves a massive amount of space, especially when multiplied across clustered indexes.
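Both conversions can be sketched as follows; the `status` and `city` columns and the `cities` table are hypothetical stand-ins for your own schema:

```sql
-- Low-cardinality VARCHAR -> ENUM (stored in 1-2 bytes per row):
ALTER TABLE users
  MODIFY status ENUM('active', 'suspended', 'deleted') NOT NULL;

-- Repetitive high-cardinality VARCHAR -> lookup table + INT foreign key:
CREATE TABLE cities (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE KEY uk_name (name)
);

ALTER TABLE users
  ADD COLUMN city_id INT UNSIGNED,
  ADD INDEX idx_city_id (city_id);
```

The `INT` key is what gets copied into every secondary index instead of the repeated string, which is where the bulk of the savings comes from.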