Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?
I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running an E5-2670 v2 processor at 2.5GHz on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).
I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:
$ ec2-describe-images ami-fb8e9292 -H
Type ImageID Name Owner State Accessibility ProductCodes Architecture ImageType KernelId RamdiskId Platform RootDeviceType VirtualizationType Hypervisor
IMAGE ami-fb8e9292 amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon available public x86_64 machine aki-919dcaf8 ebs paravirtual xen
BLOCKDEVICEMAPPING /dev/sda1 snap-b047276d 8
Launching the instance:
$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 pending marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled ebs paravirtual xen sg-5fc61437 default
The instance is up and running within a few minutes:
$ ec2-describe-instances i-b5f5a2e6 -H
Type ReservationID Owner Groups Platform
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 ec2-54-242-182-88.compute-1.amazonaws.com ip-10-145-209-67.ec2.internal running marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled 54.242.182.88 10.145.209.67 ebs paravirtual xen sg-5fc61437 default
BLOCKDEVICE /dev/sda1 vol-1633ed53 2014-06-16T20:23:52.000Z true
Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:
[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 2
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 3
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. In other words, one single-core processor with 4 threads. We know that the E5-2670 v2 processor is actually a 10-core processor, so what we see at the OS level does not match the physical hardware.
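Besides /proc/cpuinfo, Linux also exposes its view of the CPU topology under sysfs. Here is a minimal sketch that lists, for each processor, the threads the kernel believes share its core. The function name and its optional directory argument are my own additions for illustration, and I did not run this on the test instance. On a paravirtualized guest the values only reflect what the hypervisor chooses to expose, so they may be no more trustworthy than the core ids above.

```shell
#!/usr/bin/env bash
# show_thread_siblings [DIR]: print each logical CPU and the threads the
# kernel thinks share its core. DIR defaults to the standard sysfs location;
# the argument is a hypothetical convenience for pointing at another tree.
show_thread_siblings() {
    local root="${1:-/sys/devices/system/cpu}"
    local dir f
    for dir in "$root"/cpu[0-9]*; do
        f="$dir/topology/thread_siblings_list"
        # Skip entries without topology information (or an unmatched glob).
        [ -r "$f" ] || continue
        printf '%s: %s\n' "${dir##*/}" "$(cat "$f")"
    done
}

show_thread_siblings
```

On a physical machine with hyperthreading enabled, sibling threads normally show up as a pair on each line. If every processor lists only itself, the kernel has no sibling information to schedule around.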
Nevertheless, we’ll proceed with a few simple tests. I’m going to run “gzip”, an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we can avoid effects of disk I/O. I’m going to combine this test with taskset commands to impose processor affinity on the process.
The simplest case: a single thread, on processor 0:
[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s
With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s
Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.
[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s
We have our answer: throughput has dropped by a third, to 79.9 MB/sec, showing that processors 0 and 1 are threads sharing a single core. (But note that Hyperthreading is giving performance benefits here: 79.9 MB/s on a shared core is higher than the 60.5 MB/s we see when sharing a single hyperthread.)
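To make the arithmetic behind those two statements explicit, here is a small sketch using the throughput figures from the runs above. The helper names pct_drop and aggregate_gain are mine, chosen for illustration.

```shell
#!/usr/bin/env bash
# Figures copied from the runs above (MB/s per gzip process).
single=121        # one process alone on processor 0
shared_ht=60.5    # two processes sharing processor 0
shared_core=79.9  # two processes on processors 0 and 1

# pct_drop A B: per-process drop from A to B, as a whole percentage.
pct_drop() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) / a * 100 }'
}

# aggregate_gain A B: combined throughput of two processes at B each,
# relative to a single process running at A.
aggregate_gain() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 2 * b / a }'
}

pct_drop "$single" "$shared_ht"          # prints 50
pct_drop "$single" "$shared_core"        # prints 34
aggregate_gain "$single" "$shared_core"  # prints 1.32
```

So each process loses about a third of its speed, but the two hyperthreads together still deliver roughly 1.3 times the throughput of one thread on its own.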
Trying the exact same test, but this time, non-adjacent processors 0 and 2:
[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s
All the way up to full-speed, showing dedicated cores.
What does this all mean? Let’s go back to Amazon’s vCPU definition:
Each vCPU is a hyperthreaded core
As our tests have shown, a vCPU is most definitely not a core. It’s half of a shared core, or one hyperthread.
There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:
[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id : 0
core id : 0
core id : 0
core id : 0
This means that the OS scheduler has no way of knowing which processors share cores, and cannot schedule tasks around it. Let’s go back to our two-thread test, but instead of restricting it to two specific processors, we’ll let it run on any of them.
[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s
We see throughput varying between 82 MB/sec and 120 MB/sec, for the exact same workload. To get some more performance information, we’ll configure top to take 3-second samples with per-processor usage information:
[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows" # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
winflags=25913, sortindx=10, maxtasks=2
summclr=1, msgsclr=1, headclr=3, taskclr=1
Job fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
winflags=62777, sortindx=0, maxtasks=0
summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
winflags=62777, sortindx=13, maxtasks=0
summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr fieldscur=ABDECGfhijlopqrstuvyzMKNWX
winflags=62777, sortindx=4, maxtasks=0
summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.7%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu2 : 96.0%us, 4.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip
Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:
top - 21:07:53 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.3%us, 3.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 96.0%us, 3.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu3 : 0.3%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip
Although usage percentages are similar, we’ve seen earlier that throughput drops by a third when cores are shared, and we see varied throughput as the processes are context-switched between processors.
This type of situation arises where compute-intensive workloads are running, and when there are fewer processes than total CPU threads. And if only AWS would report correct core IDs to the system, this problem wouldn’t happen: the OS scheduler would make sure processes did not share cores unless necessary.
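Until the topology is reported correctly, one possible workaround is to do by hand what the scheduler cannot: restrict compute-heavy processes to one processor per core. The sketch below rests on an assumption I have only partly verified, namely that processors 2k and 2k+1 share a core (the tests above only covered the pairs 0/1 and 0/2). The one_per_core helper is hypothetical, and you should repeat the pairwise gzip test on your own instance before relying on it.

```shell
#!/usr/bin/env bash
# ASSUMPTION: processors 2k and 2k+1 are hyperthreads of the same core,
# as processors 0 and 1 appeared to be in the tests above.

# one_per_core N: print the even-numbered processors below N as a
# comma-separated list suitable for taskset -c.
one_per_core() {
    local n="$1" i list=""
    for ((i = 0; i < n; i += 2)); do
        list+="${list:+,}$i"
    done
    printf '%s\n' "$list"
}

cpus=$(one_per_core "$(nproc)")

# Show the command rather than running it, so the sketch has no side effects.
# taskset -pc LIST PID changes the affinity of an existing process, exactly
# as in the tests above; child processes inherit it.
echo "taskset -pc $cpus \$\$"
```

On the 4-vCPU instance used here this produces the list 0,2, the same affinity that gave full-speed results in the non-adjacent test. The cost is that half of the hyperthreads sit idle, so it only makes sense when there are fewer busy processes than cores.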
Here’s a chart summarizing the results:
Over the course of the testing I’ve learned two things:
Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.