Comparing CPU Throughput of Azure and AWS EC2

Jul 10, 2014 / By Marc Fielding

Tags: , , , , , , , , ,

After observing CPU core sharing with Amazon Web Services EC2, I thought it would be interesting to see if Microsoft Azure platform exhibits the same behavior.

Signing up for Azure’s 30-day trial gives $200 in credit to use over the next 30-day period: more than enough for this kind of testing. Creating a new virtual machine, using the “quick create” option with Oracle Linux, and choosing a 4-core “A3″ standard instance.

I must say I like the machine naming into built-in “clouadpp.net” DNS that Azure uses: no mucking around with IP addresses. The VM provisioning definitely takes longer than AWS, though no more than a few minutes. And speaking of IP addresses, both start with 191.236. addresses assigned to Microsoft’s Brazilian subsidiary through the Latin American LACNIC registry, due to the lack of north american IP addresses.

Checking out the CPU specs as reported to the OS:

[azureuser@marc-cpu ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

2.2GHz rather than 2.6GHz, but otherwise the same family and architecture as the E5-2670 under AWS. Identified as a single-socket, 4-core processor, without hyperthreads at all.

Running the tests

[azureuser@marc-cpu ~]$ taskset -pc 0 $$
pid 1588's current affinity list: 0-3
pid 1588's new affinity list: 0
[azureuser@marc-cpu ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 36.9319 s, 58.8 MB/s
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 72.8379 s, 29.8 MB/s
2170552320 bytes (2.2 GB) copied, 73.6173 s, 29.5 MB/s

Pretty low; that’s half the throughput we saw on AWS, albeit with a slower clock speed here.

[azureuser@marc-cpu ~]$ taskset -pc 0,1 $$
pid 1588's current affinity list: 0
pid 1588's new affinity list: 0,1
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.4285 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.7957 s, 59.0 MB/s

[azureuser@marc-cpu ~]$ taskset -pc 0,2 $$
pid 1588's current affinity list: 0,1
pid 1588's new affinity list: 0,2
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.3998 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.776 s, 59.0 MB/s

Pretty consistent results, so no core sharing, but running considerably slower than we saw with AWS.

Kicking off 20 runs in a rows:

[azureuser@marc-cpu ~]$ taskset -pc 0-3 $$
pid 1588's current affinity list: 0,2
pid 1588's new affinity list: 0-3
[azureuser@marc-cpu ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      1 59.1
      4 59.2
      1 59.3
      2 59.4
      2 59.5
      8 59.6
     12 59.7
      7 59.8
      3 59.9

We get very consistent results, between 59.1 and 59.9 mB/sec

Results from “top” while running:

cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[azureuser@marc-cpu ~]$  top -b -n20 -U azureuser
...
top - 14:38:41 up 2 min,  2 users,  load average: 2.27, 0.78, 0.28
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.4%us,  4.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1606 azureuse  20   0  4292  800  400 R 97.0  0.0   0:03.49 gzip
 1604 azureuse  20   0  4292  796  400 R 96.7  0.0   0:03.50 gzip

top - 14:38:44 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 72.3%us,  3.9%sy,  0.0%ni, 23.4%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu3  : 12.0%us,  0.7%sy,  0.0%ni, 85.6%id,  1.4%wa,  0.0%hi,  0.4%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.8  0.0   0:06.42 gzip
 1606 azureuse  20   0  4292  800  400 R 96.4  0.0   0:06.40 gzip

top - 14:38:47 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.9%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  9.7%us,  0.3%sy,  0.0%ni, 89.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 51.8%us,  2.8%sy,  0.0%ni, 45.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 17.9%us,  1.4%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.5  0.0   0:09.34 gzip
 1606 azureuse  20   0  4292  800  400 R 95.5  0.0   0:09.29 gzip

It’s using full CPUs and all from gzip, so no large system overhead here. Also, “%st”, time reported “stolen” by the hypervisor, is zero. We’re simply getting half the throughput of AWS.

Basic instances

In addition to standard instances, Microsoft makes available basic instances, which claim to offer “similar machine configurations as the Standard tier of instances offered today (Extra Small [A0] to Extra Large [A4]). These instances will cost up to 27% less than the corresponding instances in use today (which will now be called “Standard”) and do not include load balancing or auto-scaling, which are included in Standard” (http://azure.microsoft.com/blog/2014/03/31/microsoft-azure-innovation-quality-and-price/)

Having a look at throughput here, by creating a basic A3 instance “marc-cpu-basic” that otherwise matches exactly marc-cpu created earlier.

[azureuser@marc-cpu-basic ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

CPU specs are identical to marc-cpu. Running the same tests:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0 $$
pid 1566's current affinity list: 0-3
pid 1566's new affinity list: 0
[azureuser@marc-cpu-basic ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 54.6678 s, 39.7 MB/s
for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu-basic ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 107.73 s, 20.1 MB/s
2170552320 bytes (2.2 GB) copied, 107.846 s, 20.1 MB/s

Now that’s very slow: even with the identical stated CPU specs as marc-cpu, marc-cpu-basic comes in with 33% less throughput.

Doing 20 runs in a rows:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0-3 $$
pid 1566's current affinity list: 0
pid 1566's new affinity list: 0-3
[azureuser@marc-cpu-basic ~]$ for run in {1..20}; do
> for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu-basic ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      4 40.4
     15 40.5
     14 40.6
      7 40.7

Very consistent results, but consistently slow. They do show that cores aren’t being shared, but throughput is lower than even a shared core under AWS.

Wrapping up

Comparison chart

Under this simple gzip test, we are testing CPU integer performance. The Azure standard instance got half the throughput of the equivalent AWS instance, in spite of a clock speed only 15% slower. But the throughput was consistent: no drops when running on adjacent cores. The basic instance was a further 33% slower than a standard instance, in spite of having the same CPU configuration.

Under Azure, we simply aren’t getting a full physical core’s worth of throughput. Perhaps the hypervisor is capping throughput, and capping even lower for basic instances? Or maybe the actual CPU is different than the E5-2660 reported? For integer CPU-bound workloads like our gzip test, we would need to purchase at least twice as much capacity under Azure than AWS, making Azure considerably more expensive as a platform.

4 Responses to “Comparing CPU Throughput of Azure and AWS EC2”

  • Warner says:

    Hey Marc, did you have a single disk attached to the VM? The max per sec throughput of a single disk in Azure Standard is documented at 60MB/Sec and 500 IOPS. I think it’s 300 IOPS for Basic (not sure about MB rate). Looks like you’re hitting the IO limits before maxing the CPU.

  • nv says:

    This is quite an innovative benchmarking method, thumbs up! I’m not familiar with Azure but it would be interesting to see the results of benchmarking Azure CPU with sysbench and comparing the results with a similar sized EC2 if possible.

    n

    • Hello nv,

      I installed the latest sysbench, and here are the results using
      sysbench –test=cpu –cpu-max-prime=50000 –num-threads=2 run

      AWS m3.xlarge (E5-2670 v2 @ 2.50GHz this time):

      total time: 51.6462s
      total number of events: 10000
      total time taken by event execution: 103.2857
      per-request statistics:
      min: 10.29ms
      avg: 10.33ms
      max: 10.42ms
      approx. 95 percentile: 10.35ms

      Azure A3 (E5-2660 0 @ 2.20GHz):

      total time: 117.0863s
      total number of events: 10000
      total time taken by event execution: 234.1447
      per-request statistics:
      min: 13.92ms
      avg: 23.41ms
      max: 37.12ms
      approx. 95 percentile: 27.83ms

      Azure A3 basic (E5-2660 0 @ 2.20GHz):

      total time: 166.9538s
      total number of events: 10000
      total time taken by event execution: 333.8806
      per-request statistics:
      min: 24.61ms
      avg: 33.39ms
      max: 46.80ms
      approx. 95 percentile: 37.22ms

      I should note that although the Azure results show high pre-request variability, the results are very consistent across runs.

      Marc

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>