Comparing CPU Throughput of Azure and AWS EC2

Posted in: Technical Track

After observing CPU core sharing on Amazon Web Services EC2, I thought it would be interesting to see whether the Microsoft Azure platform exhibits the same behavior.

Signing up for Azure’s 30-day trial gives $200 in credit to use during the trial period: more than enough for this kind of testing. I created a new virtual machine using the “quick create” option with Oracle Linux, choosing a 4-core “A3” standard instance.

I must say I like the machine naming under the built-in “cloudapp.net” DNS that Azure uses: no mucking around with IP addresses. VM provisioning definitely takes longer than on AWS, though no more than a few minutes. And speaking of IP addresses, both machines got 191.236.x.x addresses, assigned to Microsoft’s Brazilian subsidiary through the Latin American LACNIC registry, presumably due to the shortage of North American IP addresses.

Checking out the CPU specs as reported to the OS:

[azureuser@marc-cpu ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

2.2GHz rather than 2.6GHz, but otherwise the same family and architecture as the E5-2670 under AWS. It’s presented as a single-socket, 4-core processor, with no hyperthreads at all.
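For a more compact view of the same topology, lscpu from util-linux summarizes sockets, cores, and threads in a few lines (assuming the tool is installed on the image; field names vary a little between versions, hence the loose pattern):

# quick topology summary; -i because capitalization of the field names varies
lscpu | egrep -i 'socket|core|thread|model|mhz'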

Running the tests

[azureuser@marc-cpu ~]$ taskset -pc 0 $$
pid 1588's current affinity list: 0-3
pid 1588's new affinity list: 0
[azureuser@marc-cpu ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 36.9319 s, 58.8 MB/s
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 72.8379 s, 29.8 MB/s
2170552320 bytes (2.2 GB) copied, 73.6173 s, 29.5 MB/s

Pretty low; that’s half the throughput we saw on AWS, albeit with a slower clock speed here.
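To make it easier to repeat this measurement on different CPU combinations, the same dd | gzip pipeline can be wrapped in a small script. This is just a sketch of my own, not something from the original runs; the script name and arguments are made up, but the pipeline and byte count match the commands above:

#!/bin/bash
# gzip-throughput.sh: run N concurrent dd | gzip streams pinned to a CPU list.
# Usage: ./gzip-throughput.sh <cpu-list> <streams>, e.g. ./gzip-throughput.sh 0,2 2
CPUS=${1:-0}       # taskset CPU list, e.g. "0" or "0,2"
STREAMS=${2:-2}    # number of concurrent dd | gzip pipelines
LOG=$(mktemp)

for i in $(seq 1 "$STREAMS"); do
  # dd generates zeroes, gzip burns CPU compressing them; dd's stats go to the log
  taskset -c "$CPUS" sh -c \
    "dd if=/dev/zero bs=1M count=2070 2>>'$LOG' | gzip -c > /dev/null" &
done
wait
grep copied "$LOG"   # one throughput line per stream
rm -f "$LOG"

Running it as ./gzip-throughput.sh 0 2 reproduces the two-streams-on-one-core case above, while 0,1 and 0,2 cover the two-core cases that follow.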

[azureuser@marc-cpu ~]$ taskset -pc 0,1 $$
pid 1588's current affinity list: 0
pid 1588's new affinity list: 0,1
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.4285 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.7957 s, 59.0 MB/s

[azureuser@marc-cpu ~]$ taskset -pc 0,2 $$
pid 1588's current affinity list: 0,1
pid 1588's new affinity list: 0,2
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.3998 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.776 s, 59.0 MB/s

Pretty consistent results, so no core sharing, but running considerably slower than we saw with AWS.
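Another way to double-check the no-hyperthreading reading from inside the guest is the sibling map the kernel exposes in sysfs: each logical CPU lists the CPUs it shares a physical core with. A quick check along these lines (standard Linux sysfs paths, not taken from the original session):

# list each logical CPU and the siblings it shares a physical core with
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$(basename "$c"): $(cat "$c"/topology/thread_siblings_list)"
done

On this instance each CPU should list only itself, consistent with the /proc/cpuinfo output earlier.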

Kicking off 20 runs in a row:

[azureuser@marc-cpu ~]$ taskset -pc 0-3 $$
pid 1588's current affinity list: 0,2
pid 1588's new affinity list: 0-3
[azureuser@marc-cpu ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      1 59.1
      4 59.2
      1 59.3
      2 59.4
      2 59.5
      8 59.6
     12 59.7
      7 59.8
      3 59.9

We get very consistent results, between 59.1 and 59.9 MB/s.
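If you’d rather see min/mean/max than a frequency count, a small awk one-liner over the same output file works too (my own addition; it assumes the MB/s figure is the eighth field of the dd summary line, as in the uniq -c command above):

awk '/copied/ {v=$8; s+=v; n++; if (min=="" || v<min) min=v; if (v>max) max=v}
     END {printf "n=%d  min=%.1f  mean=%.2f  max=%.1f MB/s\n", n, min, s/n, max}' output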

Results from “top” while running:

cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[azureuser@marc-cpu ~]$  top -b -n20 -U azureuser
...
top - 14:38:41 up 2 min,  2 users,  load average: 2.27, 0.78, 0.28
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.4%us,  4.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1606 azureuse  20   0  4292  800  400 R 97.0  0.0   0:03.49 gzip
 1604 azureuse  20   0  4292  796  400 R 96.7  0.0   0:03.50 gzip

top - 14:38:44 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 72.3%us,  3.9%sy,  0.0%ni, 23.4%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu3  : 12.0%us,  0.7%sy,  0.0%ni, 85.6%id,  1.4%wa,  0.0%hi,  0.4%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.8  0.0   0:06.42 gzip
 1606 azureuse  20   0  4292  800  400 R 96.4  0.0   0:06.40 gzip

top - 14:38:47 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.9%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  9.7%us,  0.3%sy,  0.0%ni, 89.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 51.8%us,  2.8%sy,  0.0%ni, 45.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 17.9%us,  1.4%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.5  0.0   0:09.34 gzip
 1606 azureuse  20   0  4292  800  400 R 95.5  0.0   0:09.29 gzip

It’s using full CPUs, and nearly all of the time is spent in gzip, so there’s no large system overhead here. Also, “%st”, the time reported as “stolen” by the hypervisor, is zero. We’re simply getting half the throughput of AWS.
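Incidentally, if you don’t want to fiddle with a custom .toprc, mpstat shows the same per-CPU breakdown including %steal (this assumes the sysstat package is installed, which I haven’t verified on this image):

# per-CPU utilization, including %steal, every 3 seconds
mpstat -P ALL 3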

Basic instances

In addition to standard instances, Microsoft also offers basic instances, which it says provide “similar machine configurations as the Standard tier of instances offered today (Extra Small [A0] to Extra Large [A4]). These instances will cost up to 27% less than the corresponding instances in use today (which will now be called “Standard”) and do not include load balancing or auto-scaling, which are included in Standard” (http://azure.microsoft.com/blog/2014/03/31/microsoft-azure-innovation-quality-and-price/).

Let’s have a look at throughput here, by creating a basic A3 instance, “marc-cpu-basic”, that otherwise exactly matches the marc-cpu instance created earlier.

[azureuser@marc-cpu-basic ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

CPU specs are identical to marc-cpu. Running the same tests:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0 $$
pid 1566's current affinity list: 0-3
pid 1566's new affinity list: 0
[azureuser@marc-cpu-basic ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 54.6678 s, 39.7 MB/s
[azureuser@marc-cpu-basic ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 107.73 s, 20.1 MB/s
2170552320 bytes (2.2 GB) copied, 107.846 s, 20.1 MB/s

Now that’s very slow: even with the same stated CPU specs as marc-cpu, marc-cpu-basic comes in with 33% less throughput.

Doing 20 runs in a row:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0-3 $$
pid 1566's current affinity list: 0
pid 1566's new affinity list: 0-3
[azureuser@marc-cpu-basic ~]$ for run in {1..20}; do
> for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu-basic ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      4 40.4
     15 40.5
     14 40.6
      7 40.7

Very consistent results, but consistently slow. They do show that cores aren’t being shared, but throughput is lower than even a shared core under AWS.

Wrapping up

[Comparison chart]

Under this simple gzip test, we are testing CPU integer performance. The Azure standard instance got half the throughput of the equivalent AWS instance, in spite of a clock speed only 15% slower. But the throughput was consistent: no drops when running on adjacent cores. The basic instance was a further 33% slower than a standard instance, in spite of having the same CPU configuration.
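A quick back-of-the-envelope check of those percentages, using awk and the figures measured above (my own arithmetic, not part of the original tests):

awk 'BEGIN {
  printf "clock: 2.2 vs 2.6 GHz -> %.0f%% slower\n", (1 - 2.2/2.6) * 100
  printf "basic vs standard, two streams on one core: 20.1 vs 29.8 MB/s -> %.0f%% lower\n", (1 - 20.1/29.8) * 100
}'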

Under Azure, we simply aren’t getting a full physical core’s worth of throughput. Perhaps the hypervisor is capping throughput, and capping it even lower for basic instances? Or maybe the actual CPU is different from the E5-2660 reported? For integer CPU-bound workloads like our gzip test, we would need to purchase at least twice as much capacity under Azure as under AWS, making Azure considerably more expensive as a platform.


About the Author

Marc is a passionate and creative problem solver, drawing on deep understanding of the full enterprise application stack to identify the root cause of problems and to deploy sustainable solutions. Marc has a strong background in performance tuning and high availability, developing many of the tools and processes used to monitor and manage critical production databases at Pythian. He is proud to be the very first DataStax Platinum Certified Administrator for Apache Cassandra.

10 Comments

Hey Marc, did you have a single disk attached to the VM? The maximum per-second throughput of a single disk in Azure Standard is documented at 60 MB/s and 500 IOPS. I think it’s 300 IOPS for Basic (not sure about the MB rate). Looks like you’re hitting the I/O limits before maxing out the CPU.

Marc Fielding
July 16, 2014 11:13 am

Hi Warner,

I’m reading from /dev/zero and writing to /dev/null. No disk I/O at all. Benchmarking disk I/O throughput is an entire other can of worms :-)

Marc


This is quite an innovative benchmarking method, thumbs up! I’m not familiar with Azure but it would be interesting to see the results of benchmarking Azure CPU with sysbench and comparing the results with a similar sized EC2 if possible.

nv

Marc Fielding
July 17, 2014 8:07 pm

Hello nv,

I installed the latest sysbench, and here are the results using
sysbench --test=cpu --cpu-max-prime=50000 --num-threads=2 run

AWS m3.xlarge (E5-2670 v2 @ 2.50GHz this time):

total time: 51.6462s
total number of events: 10000
total time taken by event execution: 103.2857
per-request statistics:
min: 10.29ms
avg: 10.33ms
max: 10.42ms
approx. 95 percentile: 10.35ms

Azure A3 (E5-2660 0 @ 2.20GHz):

total time: 117.0863s
total number of events: 10000
total time taken by event execution: 234.1447
per-request statistics:
min: 13.92ms
avg: 23.41ms
max: 37.12ms
approx. 95 percentile: 27.83ms

Azure A3 basic (E5-2660 0 @ 2.20GHz):

total time: 166.9538s
total number of events: 10000
total time taken by event execution: 333.8806
per-request statistics:
min: 24.61ms
avg: 33.39ms
max: 46.80ms
approx. 95 percentile: 37.22ms

I should note that although the Azure results show high per-request variability, the results are very consistent across runs.

Marc


Hello Marc, does the amount of RAM impact the benchmark? The m3.xlarge has more than twice the RAM; the c3.xlarge has a similar RAM spec. I am not an expert in the subject, which is why I am asking.

Haz


Hi Haz,

One of the reasons I chose this particular workload is that it is almost entirely CPU-bound. I would not expect an increase in RAM size to have any impact.

Marc


I’ve done the same test internally in our cloud environment using a Dell M820 blade (E5-4650 2.70 GHz, 20M cache, 8.0 GT/s QPI, Turbo, 8C, 130W) and Dell EqualLogic PS6210XS iSCSI storage, with a VM with 4 cores and 8GB of RAM.

Gzip throughput was 134 MB/s
CPU benchmark: 53

A very nice way to compare whether the cloud could really replace what you have…

thanks for the article!

Raf


Hi Raf,

Thanks for sharing your results. Based on your numbers, I assume you were doing a single-threaded test; at 134 MB/s, you’re getting slightly faster throughput than I got under AWS, but then again you have a slightly faster clock speed as well.

The point of my original post is that if you run the same test on two CPU cores simultaneously, you should see both get more or less the same throughput, unlike AWS, where each “core” is actually sharing a physical core.

Marc


Sounds like you are postulating that A3 Standard should be equivalent to Mx.xlarge? On what basis are you calling those an equivalent comparison? At a quick glance, there are no directly comparable products based on MHz/$ when the other specs of the tiers are taken into account. Just sayin’…


Very nice stats. Thanks for sharing.

The discrepancies in Azure’s results could possibly be attributed to Azure oversubscribing the CPU. They may allocate a ‘full core’ to your VM, but that same ‘full core’ might also be shared by other VMs running on the same server.

AWS, on the other hand, according to a very outdated source (http://itknowledgeexchange.techtarget.com/cloud-computing/amazon-does-not-oversubscribe/), does not oversubscribe its CPUs.

Just me guessing. It’s hard to find authoritative information from AWS or Azure on how they allocate their CPU/memory/disk resources.

