A couple of weeks ago I had a discussion about AWS RDS with one of my colleagues, and he mentioned an unexpected I/O problem during a migration. It happened during the production cutover, when they switched from the old on-premises environment to a freshly restored database on RDS. The migration itself is out of scope for today's topic; we are going to focus on the unexpected I/O problem. They should have had plenty of I/O bandwidth, and everything was fine when they tested beforehand, but many queries against the database performed extremely slowly for 30 to 40 minutes, and even after that they observed sporadic spikes in the number of sessions waiting on I/O. After a couple of additional questions, it was more or less clear that they had most likely hit a known problem described in the AWS documentation: the "first touch penalty". For this topic, I will use an Oracle RDS database to demonstrate the issue and show how you can prepare for it.
The AWS documentation no longer calls it the "first touch penalty", or perhaps the page with that definition has moved; I was not able to find it, even though I know it was there before. Still, you can read about it in the storage section on Amazon Elastic Block Store (EBS) volume performance. In short, if you restore your RDS database or your EBS volume(s) from a snapshot, I/O performance can drop below 50 percent of the expected level until each block has been read at least once. This does not apply to newly created volumes, only to those restored from a snapshot.
When will it hit you? In my experience, I've seen it happen when people test a migration procedure by saving EBS volumes or taking snapshot backups of an RDS database before a migration to AWS. When the actual migration starts, the snapshot is restored, and the migration process is severely delayed or even cancelled because the final cutover takes much longer than expected or performance is badly impacted.
In some cases it was the final copy of the data to AWS, and in other cases it was the final replication piece that worked more slowly than during the pre-migration tests.
How bad can it be? The penalty applies only the first time a block is read, so the impact depends on how many distinct blocks are touched for the first time. All subsequent I/O operations on those blocks run at the expected speed, and performance stays good even after rebooting the RDS instance. To demonstrate the issue, I prepared a simple test on an Oracle RDS database: a straightforward select from a big 4 GB table, run once just after restoring from a snapshot and again after restarting the instance. In both cases Oracle chose a direct path read to access the data, and we can see the difference in the average direct path read waits and in the total execution time. Let's look a bit closer at both runs. Here is the table used for the tests:
-- Create the table structure with a single seed row
CREATE TABLE test.testtab02 AS
  SELECT LEVEL AS id,
         dbms_random.string('x', 8) AS rnd_str_1,
         SYSDATE - ( LEVEL + dbms_random.value(0, 1000) ) AS use_date,
         dbms_random.string('x', 8) AS rnd_str_2,
         SYSDATE - ( LEVEL + dbms_random.value(0, 1000) ) AS acc_date
    FROM dual
  CONNECT BY LEVEL < 1
/
-- Fill the table with roughly 100 million rows (9,999 x 9,999 cartesian join)
INSERT /*+ append */ INTO test.testtab02
  WITH v1 AS (SELECT dbms_random.string('x', 8) AS rnd_str_1,
                     SYSDATE - ( LEVEL + dbms_random.value(0, 1000) ) AS use_date
                FROM dual
              CONNECT BY LEVEL < 10000),
       v2 AS (SELECT dbms_random.string('x', 8) AS rnd_str_2,
                     SYSDATE - ( LEVEL + dbms_random.value(0, 1000) ) AS acc_date
                FROM dual
              CONNECT BY LEVEL < 10000)
  SELECT ROWNUM AS id,
         v1.rnd_str_1,
         v1.use_date,
         v2.rnd_str_2,
         v2.acc_date
    FROM v1,
         v2
/
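To confirm the resulting size, you can query the segment size for the table. A minimal check, assuming you have the privileges to read dba_segments:

```sql
-- Report the test table's segment size in GB (requires access to dba_segments)
SELECT ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb
  FROM dba_segments
 WHERE owner = 'TEST'
   AND segment_name = 'TESTTAB02';
```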
The table is simple, about 4 GB in size, and has no indexes. To eliminate the potential impact of the network and other factors, I used a "select count(*)". After creating and filling the table with data, I took a snapshot of my RDS instance and restored it to another RDS instance. All tests were done on the latest RDS Oracle 12.1.0.2 v11 EE on a db.t2.medium instance. Now we can look at the query and the results. The first run is against the freshly restored instance:
orcl> set timing on
orcl> set autotrace traceonly
orcl> select count(*) from TESTTAB02;
Elapsed: 00:04:43.56
Execution Plan
----------------------------------------------------------
Plan hash value: 3686556234
------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| TESTTAB02 | 1 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------
Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
630844 consistent gets
630808 physical reads
0 redo size
530 bytes sent via SQL*Net to client
511 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed
orcl>
And here is an excerpt from an AWR report for the run:

[AWR report excerpt for the first run]

After restarting the instance, I ran the same query again:
orcl> set timing on
orcl> set autotrace traceonly
orcl> select count(*) from TESTTAB02;
Elapsed: 00:01:19.98
Execution Plan
----------------------------------------------------------
Plan hash value: 3686556234
------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| TESTTAB02 | 1 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------
Statistics
----------------------------------------------------------
26 recursive calls
0 db block gets
630844 consistent gets
630808 physical reads
0 redo size
530 bytes sent via SQL*Net to client
511 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed
orcl>
And here is the AWR for the second run:

[AWR report excerpt for the second run]

The second run completed in 1 minute 20 seconds instead of 4 minutes 43 seconds, with the same execution plan and the same number of physical reads. The only difference is that every block had already been read once, so the first touch penalty no longer applied.
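Since RDS gives you no OS access to run block-level initialization tools such as dd or fio, one way to prepare for the penalty is to touch every block from inside the database before the cutover. A minimal sketch, assuming the application data lives in the TEST schema (the schema name and hint usage here are illustrative, not a one-size-fits-all recipe):

```sql
-- Warm up the restored volume by full-scanning every table in the schema,
-- forcing each data block to be read from the snapshot at least once.
DECLARE
  v_count NUMBER;
BEGIN
  FOR t IN (SELECT owner, table_name
              FROM dba_tables
             WHERE owner = 'TEST') LOOP
    EXECUTE IMMEDIATE 'SELECT /*+ FULL(t) */ COUNT(*) FROM '
                      || t.owner || '.' || t.table_name || ' t'
      INTO v_count;
  END LOOP;
END;
/
```

Note that this loop only touches table blocks; index blocks would need a similar pass, or you could simply run the application's heaviest queries against the restored instance before declaring the cutover complete.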