Virtual CPUs with Amazon Web Services

Jun 24, 2014 / By Marc Fielding

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running on E5-2670 v2 processors at 2.5GHz on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-fb8e9292    amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon  available       public          x86_64  machine aki-919dcaf8                      ebs     paravirtual     xen
BLOCKDEVICEMAPPING      /dev/sda1               snap-b047276d   8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292                    pending marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000  us-east-1d      aki-919dcaf8                    monitoring-disabled                              ebs                                      paravirtual     xen             sg-5fc61437     default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type    ReservationID   Owner   Groups  Platform
RESERVATION     r-cde66bb3      462281317311    default
INSTANCE        i-b5f5a2e6      ami-fb8e9292    ec2-54-242-182-88.compute-1.amazonaws.com       ip-10-145-209-67.ec2.internal     running marc-aws        0               m3.xlarge       2014-06-16T20:23:48+0000        us-east-1d      aki-919dcaf8                      monitoring-disabled     54.242.182.88   10.145.209.67                   ebs                      paravirtual      xen             sg-5fc61437     default
BLOCKDEVICE     /dev/sda1       vol-1633ed53    2014-06-16T20:23:52.000Z        true

Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:

[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0: in other words, one single-core processor with 4 threads. We know that the E5-2670 v2 is actually a 10-core processor (and the plain E5-2670 that the model name string suggests is an 8-core part), so what we see at the OS level does not match the underlying hardware.

Nevertheless, we’ll proceed with a few simple tests. I’m going to run “gzip”, an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we can avoid effects of disk I/O. I’m going to combine this test with taskset commands to impose processor affinity on the process.

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.

[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s

Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/sec, showing that processors 0 and 1 are threads sharing a single core. (But note that Hyperthreading is giving performance benefits here: 79.9 MB/s on a shared core is higher than the 60.5 MB/s we see when sharing a single hyperthread.)
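
The two comparisons in this paragraph can be checked with a little arithmetic on the measured figures. This is a minimal sketch using awk; the helper name pct_change is made up for illustration, and the inputs are simply the MB/s values reported by dd above.

```shell
# pct_change BASE CURRENT
# Print the percentage change from BASE to CURRENT, to one decimal place.
pct_change() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - base) / base * 100 }'
}

# Per-process view: one gzip alone on a thread vs. one of two gzips on sibling threads.
pct_change 121 79.9     # prints -34.0

# Whole-core view: one thread busy (121 MB/s) vs. both threads busy (2 x 79.9 = 159.8 MB/s).
pct_change 121 159.8    # prints 32.1
```

So each process loses about a third of its throughput, while the core as a whole does roughly a third more work with both hyperthreads busy.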

Trying the exact same test, but this time, non-adjacent processors 0 and 2:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

Throughput is all the way back up to full speed, showing that processors 0 and 2 are on separate, dedicated cores.
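
The same comparison could be repeated for every pair of processors rather than just 0,1 and 0,2. The sketch below is one way to do that, not something run for this article: cpu_pairs and pair_sweep are hypothetical helper names, and pair_sweep assumes taskset and a GNU dd that prints a "bytes ... copied" summary, as in the transcripts above.

```shell
# cpu_pairs CPU...
# Print every unordered pair of the given processor numbers, one "a,b" per line.
cpu_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            echo "$first,$other"
        done
    done
}

# pair_sweep MEGABYTES CPU...
# For each pair, run two gzip jobs restricted to that pair of processors
# and print dd's summary line for each job.
pair_sweep() {
    mb=$1
    shift
    for pair in $(cpu_pairs "$@"); do
        echo "== processors $pair =="
        for job in 1 2; do
            # fd 3 carries dd's statistics past gzip to grep
            taskset -c "$pair" sh -c \
                'dd if=/dev/zero bs=1M count="$1" 2>&3 | gzip -c >/dev/null' \
                sh "$mb" 3>&1 | grep bytes &
        done
        wait
    done
}
```

Calling `pair_sweep 2070 0 1 2 3` would run the 2.2GB test on all six pairs. Going by the results so far, pairs that share a core should be the ones reporting noticeably lower throughput.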

What does this all mean? Let’s go back to Amazon’s vCPU definition:

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a core. It’s half of a shared core, or one hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id         : 0
core id         : 0
core id         : 0
core id         : 0

This means that the OS scheduler has no way of knowing which processors share cores, and cannot schedule tasks around them. Let’s go back to our two-thread test, but instead of restricting it to two specific processors, we’ll let it run on any of them.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82 MB/sec and 120 MB/sec for the exact same workload. To get some more performance information, we’ll configure top to take 3-second samples with per-processor usage information:

[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.7%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  1.4%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu2  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.7%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip

Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:

top - 21:07:53 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.3%us,  3.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.0%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip

Although usage percentages are similar, we’ve seen earlier that throughput drops by a third when cores are shared, and we see varied throughput as the scheduler migrates the processes between processors.
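
Another way to watch a process move between processors, without top, is to read the "processor" field (field 39, per the proc(5) manual page) from /proc/PID/stat. This is a minimal sketch of that alternative and was not used for the measurements above; stat_cpu and watch_cpu are hypothetical names.

```shell
# stat_cpu
# Read one /proc/PID/stat line on stdin and print field 39, the processor
# the task last ran on (field numbering per proc(5)).
stat_cpu() {
    # The command name (field 2) is parenthesised and may contain spaces,
    # so drop everything up to the last ") " before splitting into fields.
    # Field 39 overall is then the 37th remaining field.
    sed 's/^.*) //' | awk '{ print $37 }'
}

# watch_cpu PID [SAMPLES]
# Print the processor PID last ran on, once per second, SAMPLES times (default 10).
watch_cpu() {
    n=${2:-10}
    while [ "$n" -gt 0 ] && [ -r "/proc/$1/stat" ]; do
        stat_cpu < "/proc/$1/stat"
        n=$((n - 1))
        sleep 1
    done
}
```

Running `watch_cpu 1766` while one of the gzip processes is active would print a column of processor numbers. A change from one line to the next means the process has been moved.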

This type of situation arises when compute-intensive workloads are running and there are fewer processes than total CPU threads. And if only AWS would report correct core IDs to the system, this problem wouldn’t happen: the OS scheduler would make sure processes did not share cores unless necessary.
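
Until the topology is reported correctly, one possible workaround is to pin each compute-heavy process to its own processor by hand, skipping every other processor number. The sketch below rests on a loud assumption: that adjacent processor numbers (0/1, 2/3, and so on) are sibling hyperthreads. The tests above only established this for 0/1 versus 0/2, so verify it on your own instance. The names spread_cpus and run_spread are hypothetical.

```shell
# spread_cpus COUNT
# Print COUNT processor numbers, skipping every other one (0, 2, 4, ...).
# ASSUMPTION: processors 2k and 2k+1 are sibling hyperthreads on one core.
spread_cpus() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo $((i * 2))
        i=$((i + 1))
    done
}

# run_spread COUNT COMMAND [ARG...]
# Start COUNT copies of COMMAND, each pinned with taskset to its own processor,
# then wait for all of them to finish.
run_spread() {
    count=$1
    shift
    for cpu in $(spread_cpus "$count"); do
        taskset -c "$cpu" "$@" &
    done
    wait
}
```

For example, `run_spread 2 sh -c 'dd if=/dev/zero bs=1M count=2070 | gzip -c > /dev/null'` would start one pipeline pinned to processor 0 and another pinned to processor 2.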

Here’s a chart summarizing the results:

[Chart: aws-cpu]

Summing up

Over the course of the testing I’ve learned two things:

  • A vCPU in an AWS environment actually represents only half a physical core. If you’re looking for compute capacity equivalent to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. Take this into account in your costing models!
  • The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.
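
For costing purposes, the first point reduces to a one-line conversion. This is a trivial sketch with a made-up helper name, assuming two vCPUs (hyperthreads) per physical core as measured above.

```shell
# vcpus_for_cores CORES
# vCPUs needed to match CORES physical cores, at two hyperthreads per core.
vcpus_for_cores() {
    echo $(( $1 * 2 ))
}

vcpus_for_cores 8    # prints 16
```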

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.
