Performance tuning: HugePages in Linux
Recently, we quickly and efficiently resolved a major performance issue for one of our New York clients. In this post, I will discuss the issue and its solution.
Problem statement
The client’s central database was intermittently freezing because of high CPU usage, and their business was severely affected. They had already worked with vendor support, and the problem was still unresolved.
Symptoms
The symptom was intermittent high kernel-mode CPU usage. The server had 4 dual-core CPUs with hyperthreading enabled and 20GB of RAM, running Red Hat Linux with a 2.6 kernel.
During a freeze, all CPUs were busy in kernel mode and the database was almost unusable. Even log-ins and simple SQL such as SELECT * FROM DUAL; took a few seconds to complete. A review of the AWR report did not help much, as expected, since the problem was outside the database.
Analyzing system activity reporter (sar) data, we could see that kernel-mode CPU usage (%system) was about 25% in the sample ending at 8:32 and almost 70% at 8:40. It is also interesting to note that SADC (the sar data collector) itself suffered from this CPU spike: the sample due at 8:30 completed more than two minutes late, at 8:32:50, as shown below.
A similar issue repeated at 10:50AM:
07:20:01 AM  CPU   %user   %nice  %system  %iowait   %idle
07:30:01 AM  all    4.85    0.00    77.40     4.18   13.58
07:40:01 AM  all   16.44    0.00     2.11    22.21   59.24
07:50:01 AM  all   23.15    0.00     2.00    21.53   53.32
08:00:01 AM  all   30.16    0.00     2.55    15.87   51.41
08:10:01 AM  all   32.86    0.00     3.08    13.77   50.29
08:20:01 AM  all   27.94    0.00     2.07    12.00   58.00
08:32:50 AM  all   25.97    0.00    25.42    10.73   37.88  <--
08:40:02 AM  all   16.40    0.00    69.21     4.11   10.29  <--
08:50:01 AM  all   35.82    0.00     2.10    12.76   49.32
09:00:01 AM  all   35.46    0.00     1.86     9.46   53.22
09:10:01 AM  all   31.86    0.00     2.71    14.12   51.31
09:20:01 AM  all   26.97    0.00     2.19     8.14   62.70
09:30:02 AM  all   29.56    0.00     3.02    16.00   51.41
09:40:01 AM  all   29.32    0.00     2.62    13.43   54.62
09:50:01 AM  all   21.57    0.00     2.23    10.32   65.88
10:00:01 AM  all   16.93    0.00     3.59    14.55   64.92
10:10:01 AM  all   11.07    0.00    71.88     8.21    8.84
10:30:01 AM  all   43.66    0.00     3.34    13.80   39.20
10:41:54 AM  all   38.15    0.00    17.54    11.68   32.63  <--
10:50:01 AM  all   16.05    0.00    66.59     5.38   11.98  <--
11:00:01 AM  all   39.81    0.00     2.99    12.36   44.85
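In a longer report, spikes like these are easier to find with a filter than by eye. The following is a minimal sketch, assuming the 12-hour sar CPU layout shown above (time, AM/PM, CPU, %user, %nice, %system, ...). The function name high_system_intervals and the default threshold of 20 are illustrative choices, not part of sar.

```shell
# Print sar CPU samples whose %system exceeds a threshold (default 20).
# Assumes the 12-hour layout above: time, AM/PM, CPU, %user, %nice, %system, ...
# Reads the report on stdin, for example:  sar -u | high_system_intervals 20
high_system_intervals() {
    awk -v limit="${1:-20}" '
        # Skip the header row (its third field is "CPU", not "all").
        $3 == "all" && ($6 + 0) > (limit + 0) {
            printf "%s %s %%system=%s\n", $1, $2, $6
        }'
}
```

If sar prints 24-hour timestamps on your system, there is no AM/PM column and the field numbers shift down by one.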
Performance forensic analysis
The client had access to a few tools, none of which were very effective. We knew that there was excessive kernel-mode CPU usage. To understand why, we needed to look at various metrics at 8:40 and 10:10.
Fortunately, sar data was handy. Looking at free memory, we saw something odd. At 8:32, free memory was 86MB; by 8:40 it had climbed to 1.1GB. Between 10:41 and 10:50, free memory went from 78MB to 4.7GB. So, within about ten minutes, free memory climbed to 4.7GB.
07:40:01 AM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached
07:50:01 AM     225968   20323044     98.90     173900   7151144
08:00:01 AM     206688   20342324     98.99     127600   7084496
08:10:01 AM     214152   20334860     98.96     109728   7055032
08:20:01 AM     209920   20339092     98.98      21268   7056184
08:32:50 AM      86176   20462836     99.58       8240   7040608
08:40:02 AM    1157520   19391492     94.37      79096   7012752
08:50:01 AM    1523808   19025204     92.58     158044   7095076
09:00:01 AM     775916   19773096     96.22     187108   7116308
09:10:01 AM     430100   20118912     97.91     218716   7129248
09:20:01 AM     159700   20389312     99.22     239460   7124080
09:30:02 AM     265184   20283828     98.71     126508   7090432
10:41:54 AM      78588   20470424     99.62       4092   6962732  <--
10:50:01 AM    4787684   15761328     76.70      77400   6878012  <--
11:00:01 AM    2636892   17912120     87.17     143780   6990176
11:10:01 AM    1471236   19077776     92.84     186540   7041712
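The jumps in free memory stand out more when each sample is compared with the one before it. This is a rough sketch, assuming the 12-hour sar memory layout shown above with kbmemfree in the third column. The function name free_memory_deltas is made up for illustration.

```shell
# Report the change in kbmemfree between consecutive sar memory samples, in MB.
# Assumes the layout above: time, AM/PM, kbmemfree, ...
# Reads the report on stdin, for example:  sar -r | free_memory_deltas
free_memory_deltas() {
    awk '
        # Only data rows have a number in the third field; the header is skipped.
        $3 ~ /^[0-9]+$/ {
            if (seen) printf "%s %s %+d MB\n", $1, $2, int(($3 - prev) / 1024)
            prev = $3
            seen = 1
        }'
}
```

For the 10:41:54 and 10:50:01 samples above, this prints a change of +4598 MB in a single interval.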
This tells us that there is a correlation between the kernel-mode CPU usage and the increase in free memory. If free memory goes from 78MB to 4.7GB, then the paging and swapping daemons must be working very hard. Releasing 4.7GB of memory to the free pool sharply increases paging/swapping activity, which leads to a massive increase in kernel-mode CPU usage.
Most likely, much of the SGA can also be paged out, since the SGA is not locked in memory.
Memory breakdown
The client’s question was: if paging/swapping is indeed the issue, then what is using all my memory? It’s a 20GB server, the SGA size is 10GB, and no other application is running. It gets a few hundred connections at a time, and PGA_AGGREGATE_TARGET is set to 2GB. So why would it be suffering from memory starvation? If memory is the issue, how can there be 4.7GB of free memory at 10:50AM?
Recent OS architectures are designed to use all available memory. Therefore, the paging daemons don’t wake up until free memory falls below a certain threshold. It’s possible for free memory to drop near zero and then climb quickly as the paging/swapping daemon works harder and harder. This explains why free memory went down to 78MB and rose to 4.7GB ten minutes later.
What is using my memory though? /proc/meminfo is useful in understanding that, and it shows that the pagetable size is 5GB. How interesting!
Essentially, a page table is a mapping mechanism between virtual and physical addresses. For a default OS page size of 4KB and an SGA size of 10GB, there will be 2.6 million OS pages for the SGA alone. (Read Wikipedia’s entry on page tables for more information.) On this server, there will be about 5 million OS pages for the 20GB of total memory. It is an enormous workload for the paging/swapping daemon to manage all these pages.
cat /proc/meminfo
MemTotal:       20549012 kB
MemFree:          236668 kB
Buffers:           77800 kB
Cached:          7189572 kB
...
PageTables:      5007924 kB  <--- 5GB!
...
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
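The page counts above, and one plausible way the page tables could reach 5GB, can be worked out with shell arithmetic. This is only a sketch under stated assumptions: 8-byte page table entries (as on x86-64), every server process mapping the entire SGA through its own page table, and a hypothetical 250 connected sessions standing in for "a few hundred connections". None of these figures were measured on the client’s server. The function names are illustrative.

```shell
# Number of pages needed to map a region: size in GB divided by page size in KB.
page_count() {
    size_gb=$1
    page_kb=$2
    echo $(( size_gb * 1024 * 1024 / page_kb ))
}

# Rough page table size in MB if every process maps all of those pages itself.
# Assumes 8-byte page table entries and ignores the upper-level tables.
page_table_mb() {
    pages=$1
    processes=$2
    echo $(( pages * 8 * processes / 1024 / 1024 ))
}

page_count 10 4                           # 4KB pages in a 10GB SGA: 2621440
page_count 20 4                           # 4KB pages in 20GB of RAM: 5242880
page_table_mb "$(page_count 10 4)" 250    # 250 hypothetical sessions: 5000
```

Under these assumptions each process needs about 20MB of page table entries to map the SGA, so a few hundred processes land in the same range as the PageTables value shown above.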
HugePages
Fortunately, we can use HugePages in this version of Linux. There are a couple of important benefits of HugePages:
- Page size is set to 2MB instead of 4KB
- Memory used by HugePages is locked and cannot be paged out.
With a page size of 2MB, a 10GB SGA will have only about 5,000 pages (5,120, to be exact), compared to 2.6 million pages without HugePages. This drastically reduces the page table size. Also, HugePages memory is locked, so the SGA can’t be swapped out. The working set of buffers for the paging/swapping daemon will be smaller.
To set up HugePages, the following changes must be completed:
- Set the vm.nr_hugepages kernel parameter to a suitable value. In this case, we decided to use 12GB and set the parameter to 6144 (6144*2MB=12GB). You can run:
echo 6144 > /proc/sys/vm/nr_hugepages
or
sysctl -w vm.nr_hugepages=6144
Of course, you must make sure this is set across reboots too.
- The oracle userid needs to be able to lock a greater amount of memory. So, /etc/security/limits.conf must be updated to increase the soft and hard memlock values for the oracle userid:
oracle soft memlock 12582912
oracle hard memlock 12582912
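Both values in the steps above follow from the 12GB target, so they can be derived instead of typed by hand. This is a small sketch, assuming a 2048 kB huge page size as reported by Hugepagesize in /proc/meminfo. The function names hugepages_for and memlock_kb_for are illustrative.

```shell
# Pages to reserve for a target size: target in GB divided by huge page size in KB.
hugepages_for() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# memlock limit for the same target, in KB (the unit limits.conf uses).
memlock_kb_for() {
    echo $(( $1 * 1024 * 1024 ))
}

target_gb=12
echo "vm.nr_hugepages = $(hugepages_for "$target_gb" 2048)"
echo "oracle soft memlock $(memlock_kb_for "$target_gb")"
echo "oracle hard memlock $(memlock_kb_for "$target_gb")"
```

The first line of output is in the form that /etc/sysctl.conf accepts, which is one common way to keep the setting across reboots. After adding it there, sysctl -p reloads the file.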
After setting this up, we need to make sure that the SGA is indeed using HugePages. The value (HugePages_Total - HugePages_Free)*2MB will be the approximate size of the SGA (or it will equal the shared memory segment shown in the output of ipcs -ma).
cat /proc/meminfo | grep -i hugepages
HugePages_Total:  6144
HugePages_Free:   1655  <-- Free pages are less than total pages.
Hugepagesize:     2048 kB
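The same check can be scripted. The following is a minimal sketch that applies the formula above to /proc/meminfo, or to a saved copy passed as the first argument. The function name hugepages_used_mb is made up for illustration.

```shell
# Huge pages in use, in MB: (HugePages_Total - HugePages_Free) * Hugepagesize.
# Takes an optional file name so saved meminfo output can be checked too.
hugepages_used_mb() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END { print (total - free) * size_kb / 1024 }
    ' "${1:-/proc/meminfo}"
}
```

For the output above, this gives (6144 - 1655) * 2MB = 8978MB. On kernels that also report HugePages_Rsvd, pages that are reserved but not yet touched are still counted as free, so this figure may understate the size of the SGA.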
Summary
Using HugePages resolved our client’s performance issues. The page table size also went down to a few hundred MB. If your database is running on Linux and HugePages is available, there is no reason not to use it.
This can be read in a presentation format at Investigations: Performance and hugepages (PDF). See also our hugepages tag.
November 10th, 2008 at 4:21 pm
[…] After contacting Oracle Support with this stack, they confirmed it to be Bug #6752308 which was closed as Duplicate of Bug 6139856. There is patch for 10.2.0.3 available and they also recommend to implement hugepages. By the way, there is an interesting article on the effect of utilizing - or not utilizing - hugepage… […]
November 11th, 2008 at 1:21 pm
What you have found is that badly tuned VM system can cause trouble. Your solution was to exempt large part of the system memory from the paging system. Of course, there is a price to pay for that, too. You have to turn off dynamic SGA sizing, very convenient feature in 10.2. In other words, you need to set up shared pool, buffer cache, large pool and java pool and get it right based on the rules of thumb. I have tested huge pages and found out that there is not much difference with the VM parameters set right. In other words, hugepages setup is a crutch and you pay a high price for using that crutch. I prefer doing things the right way and that is to correctly set up the paging parameters.
Kindest regards,
Mladen Gogala
November 12th, 2008 at 2:51 am
Thanks for all to share this invaluable experience.
November 12th, 2008 at 11:06 am
Hi Mladen
Thanks for reading our blog.
I am afraid that there is some form of nomenclature issues here. dynamic_sga is a term associated with 9i. You probably are referring to ASMM (Automatic Shared Memory Management) or VLM. Are you saying that use of hugepages will exclude use of ASMM? I doubt that.
So, VLM [ _use_indirect_buffers] is what’s in question. Well, in a 64 bit software, there is no need for indirect buffers. In a 32 bit environment, I guess, it needs to be carefully considered: Effect of SGA size increase vs effect of excessive scanning for free pages. Either way, I am biased against indirect buffers due to its overhead.
But, specifically,
1. How would you control paging daemons from scanning 10GB SGA pages, looking for free memory?
2. How would you reduce size of paging tables using just vm setup?
Cheers
Riyaj
November 13th, 2008 at 9:40 am
I’m just curious, i’ve systems with linux kernel 2.6 and SGA between 4 and 10 GB with RAM installed between 7 and 12 GB by i’ve never seen such behaviour. What does it mean PageTables: 5007924 kB ?
On my systems i’ve never seen such an increase in memory free, why?
November 14th, 2008 at 8:01 am
Riyaj, your blog is great and I read it on the regular basis. Let me answer your questions:
1) ASMM and HugePages are mutually exclusive, at least in Oracle10g. Look at the ML note 317141.1 which explicitly asks you to remove SGA_TARGET. And yes, I am a bit old, my nomenclature is from 9i. I do prefer descriptive names like “dynamic SGA management” to the alphabet soup like “ASMM”.
2) There is no scanning of 10GB of memory. System is scanning page tables, not pages themselves. Page tables are 4096 times smaller. The scanning, however, is not a problem. If you leave enough free memory, searching for free memory will not be a problem. In particular, setting vm.min_free_kbytes to 1048576 would make sure that the system will always maintain 1GB of free memory. Also, setting vm.overcommit to 1 would eliminate the need for checking swap every time the memory is allocated. The page cluster should be set to 5, to enable fast writing where possible. Also, you should turn off that pesky swappiness as it would devour resources needlessly.
Kindest regards,
Mladen Gogala
November 14th, 2008 at 12:54 pm
[…] on the Pythian Group Blog, Riyaj Shamsudeen contributed an item on performance tuning with HugePages in Linux, showing again the real advantages of knowing your way around the host […]
November 17th, 2008 at 12:21 pm
Hi Mladen
Thanks for your kind words.
1. I just tested it out on my Linux server running a 2.6 kernel. ASMM uses hugepages, as long as the available hugepages are greater than sga_max_size. The ML note you referred to is for 32 bit + use_indirect_buffers and, of course, use_indirect_buffers will not work with hugepages. But ASMM itself works fine with hugepages.
11g AMM will not work with hugepages though.
2. You are right, I should have said “5GB pagetable” need to be scanned. Nevertheless, scanning 5GB of page table will consume enormous amount of CPU.
Thanks for those paging parameters. I see your point that if all these parameters are optimally setup, we might be able to reduce this effect.
I would rather prefer to keep page table itself much smaller, two reasons: 1)Bigger page table results in higher CPU usage from user processes due to higher TLB misses 2) Unnecessary waste for page table memory. For e.g., in this specific scenario, after setting up hugepages, pagetable size went down from 5GB to 400MB, a net gain of 4.4GB. We could allocate this memory to SGA allowing further gain.
Hi Cristian
Thanks for reading our blog. We might need more data to understand your specific situation.
Cheers
Riyaj