Oracle Amazon RDS

Technical Track Cloud Amazon Web Services (AWS) Amazon Relational Database Service (Rds)

AWS Redshift cluster sizing

3 min read

Dec 3, 2015 12:00:00 AM

Sizing a database is one of the primary DBA functions no matter what the technology. Though Redshift is highly managed for us, we must still address this task. The first thing to note is that in sizing a cluster, we start with an estimated need of storage capacity, since the amount of storage available per node of the cluster is a fixed amount. While you get the disk space you pay for, AWS guidelines and user experience shows that performance can suffer when space becomes tight (>80%). So when sizing the cluster for a specific capacity an extra 20% will need to be tacked onto your calculations. AWS currently offers two types of instances for Redshift clusters, dense compute (dc1.*) or dense storage (ds2.*) servers. The dc1 series offers about 6x the CPU and 6x the memory per terabyte of storage. These are good for use cases where there is a regular load of heavy analytics querying involving multiple joins. The ds2 series is more cost effective when the tables are highly denormalized and the analytics can be offloaded into a BI tool such as Microstrategy. It is possible to migrate to a new node type through snapshots, but the process will involve some downtime. To address a few points of the Redshift architecture, note that only the compute nodes hold storage so the leader node is not considered (nor do you have to pay for it). Each storage disk is replicated to two others for redundancy, but these additional disks are not part of the calculations or specifications used for sizing (though they affect the disk monitoring calculations that we see later). Here we examine example clusters on a budget of around $10k using yearly contracted instances (saving 30+% over on-demand instances):

(7) dc1.large with 160Gb SSD @ $1380/year = 1120Gb @ $9660/year. (2) ds2.xlarge with 2Tb HHD @ $4295/year = 4000Gb @ $8590/year.

The dc1 instances in this case are around 4x as expensive per terabyte (though still quite the bargain as compared to hosting the cluster in-house) and give a 30-80% performance gain (depending on the benchmarks, ( example)). And while you can always add nodes to accommodate data growth, you can only add nodes of the same instance type which could potentially become quite expensive if you’re using instances of the small disk capacity dc1’s. Once your cluster is up, it is vital to monitor disk and CPU utilization so you know when to add new nodes. It is highly advisable to watch the graphs under the Performance tab in the Redshift Console as you add new load to the cluster. There are built in Cloudwatch alarms for disk usage, and these should be configured to alert above 70%. I like to know well in advance when it is getting there, so I regularly use Period= 5 minutes, Statistic = average, over 1 consecutive period, but since loads and vacuums can create usage surge spikes, you might want to configure the alert over more or longer periods. While Cloudwatch is great for this monitoring, it is convenient to also be able to compute capacity. There are several ways to query disk usage that render subtly different results, unfortunately none of which will yield the stats given by Cloudwatch. Here’s an example for a 6-node 12Tb cluster that currently shows disk space as 32% used on each node in the Console yet displays as 23%:

select host ,sum(capacity)/3 as total ,sum(used)/3 as used ,sum(capacity)/3 - sum(used)/3 as free ,(((sum(used)/3)/(sum(capacity)/3.00)) * 100.00) as pct_used from STV_PARTITIONS group by host

host	total	used	free	pct_used
0	1904780	450450	1454330	23.65
1	1904780	449895	1454885	23.62
2	1904780	449776	1455004	23.61
3	1904780	450673	1454107	23.66
4	1904780	451483	1453297	23.7
5	1904780	447840	1456940	23.51

The point here is to be wary of querying disk space and rely on Cloudwatch and the Console. Compression encoding on your tables can save substantial space (50-75% on average depending on the encoding) and can improve performance by 100+%. Encoding can also increase CPU usage, so we want to monitor it as we implement encoding. A cluster can be brought to a standstill by just one node hitting 100% CPU utilization, so we also need to setup Cloudwatch alarms to make sure we average < 70% or don’t hit spikes > 90% for more than 10 minutes. Once we bump up against those metrics, it’s time to add another node. Discover more about our expertise in the Cloud.

On this page

Ready to unlock value from your data?

With Pythian, you can accomplish your data transformation goals and more.

Speak with Pythian consultants now →

What We Do

AWS Redshift cluster sizing

Share this

Share this

More resources

Azure Backup for SQL Server public preview

Options for Tracing Oracle dbms_stats

How to run DBSAT on RAC Instances

Ready to unlock value from your data?