It’s a common but startling sight for any DBA: you look at your database statistics and realize your indexes are larger—sometimes much larger—than the actual data they are supposed to be organizing. When a table becomes this unwieldy, it naturally triggers a "crisis of architecture." Should you partition? Shard? Or just move the whole mess to a different database engine?
Before you start migrating your life's work to a new platform, let's look at why this happens and how to fix it without the nuclear option.
The problem is especially common with InnoDB tables due to several structural realities:
- InnoDB stores the table itself as a clustered index on the primary key, and every secondary index entry carries a full copy of that primary key. If the primary key is large (a long VARCHAR or a complex composite key), it has a detrimental "multiplier effect" on every other index on that table.
- InnoDB does not prefix-compress its indexes. When comparing this to MyISAM, which compresses its string indexes, InnoDB can look like it's put on some serious holiday weight.
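To make the multiplier effect concrete, here is a hypothetical schema sketch (table and column names are illustrative, not from any real system):

```sql
-- A wide composite primary key: in InnoDB, every secondary index entry
-- stores a full copy of the primary key, so (country, email) is
-- duplicated into idx_created and idx_status alike.
CREATE TABLE orders_wide_pk (
    country CHAR(2)      NOT NULL,
    email   VARCHAR(100) NOT NULL,
    created DATETIME     NOT NULL,
    status  TINYINT      NOT NULL,
    PRIMARY KEY (country, email),
    KEY idx_created (created),   -- effectively stores (created, country, email)
    KEY idx_status  (status)     -- effectively stores (status, country, email)
) ENGINE=InnoDB;

-- A slim surrogate key avoids the duplication:
CREATE TABLE orders_slim_pk (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    country CHAR(2)      NOT NULL,
    email   VARCHAR(100) NOT NULL,
    created DATETIME     NOT NULL,
    status  TINYINT      NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_country_email (country, email),
    KEY idx_created (created),   -- stores only (created, id)
    KEY idx_status  (status)     -- stores only (status, id)
) ENGINE=InnoDB;
```

The two tables hold identical data; only the choice of primary key changes how much every secondary index weighs.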
To see if you're suffering from index bloat, you can run a high-level check on your storage engines.
Warning: Be careful with the following queries. Run them on a replica or another non-production server whenever possible, as scanning information_schema can seriously slow down a busy server.
First, let's look at the "Index Fraction" across your engines:
SELECT engine,
       COUNT(*) tables,
       CONCAT(ROUND(SUM(table_rows)/1000000, 2), 'M') `rows`,
       CONCAT(ROUND(SUM(data_length)/(1024*1024*1024), 2), 'G') data,
       CONCAT(ROUND(SUM(index_length)/(1024*1024*1024), 2), 'G') idx,
       CONCAT(ROUND(SUM(data_length+index_length)/(1024*1024*1024), 2), 'G') total_size,
       ROUND(SUM(index_length)/SUM(data_length), 2) idxfrac
FROM information_schema.TABLES
GROUP BY engine
ORDER BY SUM(data_length+index_length) DESC
LIMIT 10;
In typical results, you might see InnoDB with an idxfrac of 1.57, while MyISAM sits at 0.50. Now, let's find the specific tables where the index is 3 to 4.5 times the size of the data:
SELECT CONCAT(table_schema, '.', table_name) tbl,
       CONCAT(ROUND(table_rows/1000000, 2), 'M') `rows`,
       CONCAT(ROUND(data_length/(1024*1024*1024), 2), 'G') data,
       CONCAT(ROUND(index_length/(1024*1024*1024), 2), 'G') idx,
       CONCAT(ROUND((data_length+index_length)/(1024*1024*1024), 2), 'G') total_size,
       ROUND(index_length/data_length, 2) idxfrac
FROM information_schema.TABLES
ORDER BY data_length + index_length DESC
LIMIT 3;
In my experience, the technical reasons for bloat usually boil down to:
- Redundant or overlapping indexes that nobody ever cleaned up.
- Unused indexes that no query actually touches.
- Overly wide keys (the classic VARCHAR(255) with UTF8).

But what are the human causes? Usually, it's a mix of adding a new index for every slow query and a reluctance to ever drop one, just in case something still needs it.
Solving this is a learning process, but here is a four-step framework that consistently yields results.
You need to identify which indexes aren't pulling their weight. I recommend using pt-index-usage. By feeding it your general log (or a slow log with long_query_time=0), it can generate a report of which indexes were actually touched by your application and suggest DROP statements for the rest.
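A hypothetical invocation might look like this (log path, host, and credentials are illustrative):

```shell
# Capture every query for a sample window (long_query_time=0 logs all
# statements to the slow log), then turn logging back off.
mysql -e "SET GLOBAL long_query_time = 0; SET GLOBAL slow_query_log = ON;"
# ... let representative traffic run for a while ...
mysql -e "SET GLOBAL slow_query_log = OFF;"

# Feed the captured log to pt-index-usage; it connects to the server to
# read index definitions and prints suggested DROP statements.
pt-index-usage /var/log/mysql/mysql-slow.log \
    --host 127.0.0.1 --user audit_user --ask-pass > unused_indexes.sql
```

Make sure the capture window covers periodic jobs (nightly reports, monthly billing), or their indexes will look unused.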
Don't apply the DROP statements blindly; use them as a guide. You can also run pt-duplicate-key-checker to find obvious overlaps. Pick a top offender and investigate the index definitions. Look for overlap like index1(a,b) and index2(a): any query that can use index2 can already use the leftmost column of index1, so index2 is pure dead weight.
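Unlike pt-index-usage, pt-duplicate-key-checker needs only a connection, since it works from the schema alone (host, user, and database name below are illustrative):

```shell
# Report exact duplicates and leftmost-prefix overlaps for one schema,
# along with suggested ALTER TABLE ... DROP INDEX statements.
pt-duplicate-key-checker --host 127.0.0.1 --user audit_user --ask-pass \
    --databases shop
```

For the index1(a,b) versus index2(a) case above, the report flags index2 as a duplicate of the leftmost prefix of index1.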
You can often combine indexes to increase cardinality. For example, if searching by fullname yields too many results (like 100 "John Smiths"), adding an active column to the index might narrow that down to 30:

ALTER TABLE users DROP INDEX index1, ADD INDEX index1 (fullname, active);
If you have a VARCHAR(255) using UTF8, a full index can use up to 767 bytes per entry. You can drastically reduce size by indexing only the first few characters.
To find the "sweet spot" for precision:
SET @unique_total = (SELECT COUNT(DISTINCT fullname) FROM users);

SELECT ROUND(COUNT(DISTINCT LEFT(fullname, 100))/@unique_total*100, 2) AS c100,
       ROUND(COUNT(DISTINCT LEFT(fullname, 30))/@unique_total*100, 2) AS c30,
       ROUND(COUNT(DISTINCT LEFT(fullname, 20))/@unique_total*100, 2) AS c20
FROM users;
If 30 characters give you 99.81% uniqueness, there’s no reason to index all 255. Be careful, though—setting the prefix too low forces the database to do extra work checking rows that share the same prefix.
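Once you have picked a prefix length, applying it is a one-statement change (the index name idx_fullname is assumed for illustration):

```sql
-- Replace the full-width index with a 30-character prefix index.
-- 30 utf8 characters cost at most 90 bytes per entry instead of 765.
ALTER TABLE users
    DROP INDEX idx_fullname,
    ADD INDEX idx_fullname (fullname(30));
```

Note that a prefix index cannot serve ORDER BY fullname or act as a covering index, so verify your key queries with EXPLAIN after the change.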
If a VARCHAR column has a limited set of possible values, consider converting it to an ENUM (1-2 bytes). If the cardinality is high but repetitive, normalize it to another table and keep an INT ID in the original table. This saves a massive amount of space, especially when multiplied across clustered indexes.
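As a sketch of both options (table, column, and value names are hypothetical):

```sql
-- Option 1: small, fixed value set -> ENUM
-- (1 byte for up to 255 values, 2 bytes up to 65,535).
ALTER TABLE users
    MODIFY status ENUM('active', 'suspended', 'deleted') NOT NULL;

-- Option 2: high-cardinality but repetitive strings -> lookup table.
CREATE TABLE cities (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_name (name)
) ENGINE=InnoDB;

-- Replace the VARCHAR column with a 4-byte integer reference.
ALTER TABLE users
    ADD COLUMN city_id INT UNSIGNED NOT NULL,
    ADD INDEX idx_city (city_id);
-- (Backfill city_id from the old column, then drop the old column.)
```

The ENUM route is simpler but makes adding new values a schema change; the lookup table keeps the value set open-ended at the cost of a join.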