Virtual CPUs with Amazon Web Services

8 min read

Jun 24, 2014 12:00:00 AM

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running E5-2670 v2 processor at 2.5GHz on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type ImageID Name Owner State Accessibility ProductCodes Architecture ImageType KernelId RamdiskId Platform RootDeviceType VirtualizationType Hypervisor
IMAGE ami-fb8e9292 amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon available public x86_64 machine aki-919dcaf8 ebs paravirtual xen
BLOCKDEVICEMAPPING /dev/sda1 snap-b047276d 8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 pending marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled ebs paravirtual xen sg-5fc61437 default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type ReservationID Owner Groups Platform
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 ec2-54-242-182-88.compute-1.amazonaws.com ip-10-145-209-67.ec2.internal running marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled 54.242.182.88 10.145.209.67 ebs paravirtual xen sg-5fc61437 default
BLOCKDEVICE /dev/sda1 vol-1633ed53 2014-06-16T20:23:52.000Z true

Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:

[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 2
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 3
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. Or in other words, one single-core processor with 4 threads. We know that the E5-2670 v2 processor is actually a 10-core processor, so the information we see at the OS level is not quite corresponding.

Nevertheless, we’ll proceed with a few simple tests. I’m going to run “gzip”, an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we can avoid effects of disk I/O. I’m going to combine this test with taskset comments to impose processor affinity on the process.

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.

[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s

Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/sec, showing that processors 0 and 1 are threads sharing a single core. (But note that Hyperthreading is giving performance benefits here: 79.9 MB/s on a shared core is higher than then 60.5 MB/s we see when sharing a single hyperthread.)

Trying the exact same test, but this time, non-adjacent processors 0 and 2:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

All the way up to full-speed, showing dedicated cores.

What does this all mean? Let’s go back to the Amazon’s vCPU definition

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a core. It’s a half of a shared core, or one hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id : 0
core id : 0
core id : 0
core id : 0

This means that the OS scheduler has no way of knowing which processors have shared cores, and can not schedule tasks around it. Let’s go back to our two-thread test, but instead of restricting it to two specific processors, we’ll let it run on any of them.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82 MB/sec and 120 MB/sec, for the exact same workload. To get some more performance information, we’ll configure top to run 10-second samples with per-processor usage information:

[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows" # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
winflags=25913, sortindx=10, maxtasks=2
summclr=1, msgsclr=1, headclr=3, taskclr=1
Job fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
winflags=62777, sortindx=0, maxtasks=0
summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
winflags=62777, sortindx=13, maxtasks=0
summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr fieldscur=ABDECGfhijlopqrstuvyzMKNWX
winflags=62777, sortindx=4, maxtasks=0
summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.7%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu2 : 96.0%us, 4.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip

Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:

top - 21:07:53 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.3%us, 3.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 96.0%us, 3.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu3 : 0.3%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip

Although usage percentages are similar, we’ve seen earlier that throughput drops by a third when cores are shared, and we see varied throughput as the processes are context-switched between processors.

This type of situation arises where compute-intensive workloads are running, and when there are fewer processes than total CPU threads. And if only AWS would report correct core IDs to the system, this problem wouldn’t happen: the OS scheduler would make sure processes did not share cores unless necessary.

Here’s a chart summarizing the results:

Summing up

Over the course of the testing I’ve learned two things:

A vCPU in an AWS environment actually represents only half a physical core. So if you’re looking for equivalent compute capacity to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. So take it into account in your costing models!
The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.