Redshift table maintenance: vacuuming
Overview

Part of the appeal of AWS' Redshift is that it's a managed service, which means lower administration costs. While you don't have to hire a full-time DBA to make sure it runs smoothly (from Pythian's experience it takes ~10-20 hours/month to manage Redshift), there are still some tasks that should be attended to in order to keep it happy:
- Skew analysis
- Compression analysis
- Query monitoring
Operations

Let us start with the process itself. It's simple enough, and you can get syntax documentation from AWS. There's not too much that's tricky with the syntax, and for most use cases

[code language="sql"]
VACUUM myschema.mytablename;
[/code]

will suffice. Note that INTERLEAVED sort keys need the REINDEX parameter added for all re-indexing to occur. You can discern which tables have this set up by using the query:

[code language="sql"]
select schemaname, tablename
from (select n.nspname as schemaname,
             c.relname as tablename,
             min(attsortkeyord) as min_sort
      from pg_namespace as n
      inner join pg_class as c on n.oid = c.relnamespace
      inner join pg_attribute as a on c.oid = a.attrelid
      where c.relkind = 'r'
        and abs(a.attsortkeyord) > 0
        and a.attnum > 0
      group by 1, 2) as t
where min_sort < 0
[/code]

In order to give the vacuum process more resources, we preface this command with
[code language="sql"]
SET wlm_query_slot_count TO <N>;
[/code]

where N is the maximum number of query slots we think we can get away with. If you're not sure what that number should be (we'll discuss WLM queues in another post), 5 is usually a safe number, though be warned that if the value of wlm_query_slot_count is larger than the number of available slots for the service class, the vacuum command will fail.

Knowing when to vacuum is reasonably straightforward. Any time after substantial inserts, updates, or deletes is appropriate, but you can be more precise by querying two tables:
[code language="sql"]
select * from STL_ALERT_EVENT_LOG where Solution LIKE '%VACUUM command%';
[/code]

and

[code language="sql"]
select * from SVV_TABLE_INFO where unsorted > 8;
[/code]

The latter check works great for daily loads. We will often set the threshold at 8 (percent) immediately after the loads, then run another vacuum process in the evening with a lower threshold (4 percent) that addresses the larger tables, which take a fair amount of time to vacuum; we want to avoid that situation in the morning.

Last fall AWS released a nice tool to automate vacuums, the Analyze & Vacuum Schema Utility, which incorporates these queries. It works quite well, and we recommend it to our clients as a simple way to set up this maintenance. However, note that it does not automatically add the REINDEX parameter for tables with INTERLEAVED sort keys. The code is all available, so it is easy enough to adjust it to do more custom filtering of tables (on fact_* and dim_*, for instance) within a schema.

AWS has also built a very useful view, v_get_vacuum_details (and a number of others that you should explore if you haven't already), in their Redshift Utilities repository that you can use to gain some insight into how long the process took and what it did. None of the system tables for vacuuming keep historical information, which would be nice for tracking growing process times, but you can see recent vacuums in STL_QUERY, which is purged to a history of about 7 days. I recommend creating a simple process to track the vacuum data. First, create the table:
[code language="sql"]
create table vacuum_history
sortkey (xid)
as select * from v_get_vacuum_details where processing_seconds > 0;
[/code]

Then set up a cron process to populate it:
[code language="bash"]
0 18 * * * psql -h myRScluster -U myUser -p5439 -c "INSERT INTO vacuum_history SELECT * FROM v_get_vacuum_details WHERE xid > (SELECT MAX(xid) FROM vacuum_history) AND processing_seconds > 0;" &> /var/log/vacuum_history.log
[/code]

Once you start to see tables taking an inordinate amount of time to vacuum, some additional intervention may be appropriate. Our team recently ran into a sizable table (3 billion records) that had been taking 3 hours to vacuum daily. An issue occurred where the table needed a partial reload of 2 billion rows. Once that finished, we ran a vacuum, which kept going all afternoon. Checking SVV_VACUUM_PROGRESS, we could see that it would take almost 30 hours to complete. Note that restarting a stopped vacuum does not mean the process will pick up where it left off. Since this would have impacted the daily load performance, we killed the vacuum with "cancel <pid>", using the pid pulled from
[code language="sql"]
select pid, text from SVV_QUERY_INFLIGHT where text like '%Vacuum%';
[/code]

We then ran a deep copy (created a new version of the table and reloaded it with a SELECT INTO), which took about 5 hours. The load into an empty table triggers the correct sorting, so a subsequent vacuum took only a few minutes to complete.

Just a note on killing long-running vacuums: it sometimes doesn't work, especially once the vacuum is in the initialize merge phase. We've found that continually issuing the cancel command while it's still in the sort phase is effective, but the point is to be wary of vacuuming large tables for the first time. Vacuums on large, unsorted tables write temporary data to disk, so there is also the potential to run out of disk space and freeze the cluster; be sure to check that up to 3x the table's size in free disk space is available before you start. There are a few simple strategies to prevent long-running vacuums:
- Load your data in SORTKEY order: the incoming data doesn't have to be pre-ordered within the load, it just needs sort key values greater than those already in the table.
- Vacuum often: A table with a small unsorted region vacuums faster than one with a large unsorted region.
- If tables become too large to vacuum within a maintenance window, consider breaking them apart: We often see multi-billion record tables where the only data being queried is from the last month or two.
- Deep copies can be a faster solution than vacuums.
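Pulling the pieces above together, a nightly maintenance pass might look something like the following sketch. The table names are illustrative, and the 8 percent threshold and slot count of 5 are just the example values from the discussion above; tune them for your cluster.

[code language="sql"]
-- 1. Find candidate tables: more than 8 percent of rows unsorted
select "table", unsorted, tbl_rows
from svv_table_info
where unsorted > 8
order by unsorted desc;

-- 2. Claim extra query slots for the session running the vacuum
SET wlm_query_slot_count TO 5;

-- 3. Vacuum each candidate; add REINDEX for INTERLEAVED sort keys
VACUUM myschema.my_fact_table;
VACUUM REINDEX myschema.my_interleaved_table;

-- 4. Return the session to a single slot when finished
SET wlm_query_slot_count TO 1;
[/code]

The Analyze & Vacuum Schema Utility automates essentially this loop, minus the REINDEX step for interleaved tables.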
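When a table has grown past the point where vacuuming is practical, the deep copy described above might be sketched as follows. Table names are illustrative, and this variant uses CREATE TABLE ... LIKE (one of the deep-copy patterns AWS documents) rather than SELECT INTO; run it in a window when no loads are touching the table.

[code language="sql"]
-- Recreate the table structure; LIKE carries over the dist and sort keys
create table myschema.mytable_new (like myschema.mytable);

-- Loading into an empty table writes the rows in sorted order
insert into myschema.mytable_new
select * from myschema.mytable;

-- Swap names, keeping the original until the copy is verified
alter table myschema.mytable rename to mytable_old;
alter table myschema.mytable_new rename to mytable;
[/code]

As with a first-time vacuum, check free disk space before starting: the cluster briefly holds two full copies of the table.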