Pythian Goodies: The Answer to Free Memory, Swap, Oracle, and Everything

Dec 19, 2007 / By Christo Kutrovsky


I gave this talk at the UKOUG, and I have received a few requests to post the slides online. Instead of just posting the PowerPoint I took some time to give the presentation again (internally here at Pythian) and this time we recorded the session and we are posting it here in a variety of formats. This is a bit of a departure from the typical Pythian Goodies, in that it is scripted, and there is a lot of content here in the whitepaper, but there hasn’t been a Goodie in a while so why not!

I’d love to hear from you, so please feel free to ask any follow-up questions to this post in the comments.

Abstract

Do I have enough memory? Why is my free memory so low? Am I swapping to disk? Can I increase my SGA (db cache) size? Can I add another instance to this server? Are my system resources used optimally? These are all questions that often haunt DBAs. This presentation is The Answer. It covers, in detail, the different types of memory, how to monitor memory, and how to optimally use it with Oracle. Multiple examples in the presentation demonstrate how certain actions on the database side cause different memory areas to be allocated and used on the OS side. Key underlying differences in operating systems' approaches to managing memory will be highlighted, with special attention given to Linux, Solaris, and Windows. Using Linux as an example throughout, this presentation explains how to effectively use tools such as "top", "vmstat" and "/proc/meminfo" to look into a system's allocation and use of memory.

Below you should see a flash video with me giving the session.

Download this presentation!
Powerpoint
IPod video (right-click and Save As . . .)
MP3 audio only

And below you will find the complete contents of the whitepaper. This is intended to be a good overall reference resource for how memory works in Oracle, using Linux as an example.

What is memory?

Memory in the computer world is the same as storage shelves in the real world. Its purpose is to store items (data) and allow easy access to them.

Computers are all about manipulating data based on a predefined set of rules. The efficiency of a computer is a function of how fast it can manipulate the data (CPU speed) and how fast it can retrieve and store that data (CPU cache, RAM, Disk).

In this paper I will concentrate only on the data-access component. I will refer to it as ‘memory’ and I will go into some practical details of how to make the most of it with Oracle.

Types of Memory in a Modern Computer

Building a computer for performance means using the fastest CPU (or CPUs) available and the fastest memory possible. Unfortunately, building such a computer would be expensive, and as human beings we always look for more cost-efficient ways of accomplishing what we need.

Welcome to the world of compromises.

There are three main properties of memory:

  1. Latency: how much time it takes between making a request and receiving the data requested
  2. Capacity: how much it can hold (remember)
  3. Throughput: how fast this data can be retrieved or stored

Every memory chip or device has these three basic characteristics. These characteristics differ between chips or devices, based on whether the operation performed is reading or writing, and the location of reading/writing within the device in question.

Based on the above characteristics, in a typical computer there are four physical types of memory:

  1. CPU registers – this is where the CPU manipulates the data – no latency, very low capacity
  2. CPU cache – memory with direct access to the CPU unit – some latency when requesting data
  3. RAM (main memory) – some noticeable latency associated with accessing this memory but, at the same time, significantly larger capacity
  4. Disk – significant latency – very large capacity (in the database world). This is also sometimes referred to as 'main storage' as it is the only non-volatile memory, that is, memory that does not get reset on computer shutdown

A modern computer uses sophisticated rules and techniques to manage these. Typically the hardware manages the CPU cache automatically, and the software controls how CPU registers, RAM, and Disk are used. The CPU registers are a bit more special, as they are usually not dynamically controlled. Rather, when a given program is compiled, its CPU register usage is built in.

Virtualization and 32bit vs 64bit

In order to save programmers from recoding the same idea over and over again, the operating system (OS) provides certain abstractions. The most significant one is memory virtualization. To understand that, you must first understand how memory is accessed.

Memory is used to store data. The smallest unit of memory is a byte, which represents 8 bits. Memory capacity is measured in bytes. In today's large-capacity servers, "gigabyte" is more often used. A gigabyte is 1024*1024*1024 bytes. 1 Gb (gigabyte) is approximately 1 billion bytes.

When storing and retrieving data, an offset is used to determine the location in which the data is stored. For example a data request would be "retrieve 8 bytes of data starting at offset 8100". This is similar to saying "give me the fifth binder from the second row" in the real world.

Currently, the world is slowly in transition from 32-bit computers to 64-bit computers. The "bits" number of a computer represents how big the memory offset number can be. For example, a 32-bit computer can "address" up to 4 Gb of memory. In contrast, a 64-bit computer can represent 17,179,869,184 Gb. That's not an error, it truly is an 11-digit number in Gb.
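These numbers are easy to verify. A quick sanity check using shell arithmetic (here a gigabyte is taken as 2^30 bytes):

echo $(( 2**32 / 2**30 ))   # 4            -> Gb addressable by a 32-bit process
echo $(( 2**34 ))           # 17179869184  -> Gb addressable with 64 bits (2^64 / 2^30)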

Current 32-bit computers use some workarounds to be able to address more than 4Gb of RAM by adding extra registers and extra bits into the addressing portion. This, however, still does not allow for processes (running programs) to be able to freely address more than 4Gb of RAM.

Modern operating systems run processes in "protected mode". What this means is that a process can only "see" its own memory. It cannot interfere with other processes' memory or with kernel memory (memory used by the OS). This is accomplished by mapping the offsets of the memory used by a process to real memory. This mapping is maintained by mapping tables. The whole process is managed behind the scenes by the OS.

For example, a process may request the value at offset 5000, but the OS translates this to offset 134000 in real memory and retrieves the data from there.

The result is that each process has its own "addressing space" which it can use as it likes. The only limitation imposed is by the 32-bit or 64-bit mode of the process. A 64-bit OS can execute both 32-bit and 64-bit programs. A 32-bit OS can only execute 32-bit programs.

Virtual Memory Management

As a very complicated structure, virtual memory requires management. For kernel developers there are a number of very good articles at http://linux-mm.org/LinuxMMDocumentation. In this document, I will discuss only the structures that concern us.

Virtual memory is managed in pages. This means that memory is split into chunks. Each chunk can have different states such as “empty”, “in memory”, or “on disk”. In Linux, those chunks are typically 4Kb.

The chunk mappings and states are managed by a sophisticated memory structure. Part of this structure is the "page table entries" or PTEs. Each memory page is described by one PTE. PTEs are similar to primary key indexes in Oracle. They describe the location of "this" page. The PTE memory structure is organized in a similar way to an Oracle index. It has three layers, and up to three pointers need to be accessed to find where a given page is located. This can be a very time-consuming operation, so each modern CPU has a fixed number of cache entries called the "TLB cache" (Translation Lookaside Buffer). Each PTE entry also takes some small amount of memory. If you have many processes consuming a lot of memory, the memory needed for PTEs can be significant.

Virtual Memory Benefits

Having this virtual memory abstraction allows the OS to provide some advanced services to applications. Some of them are:

  • Swapping: the ability to use the disk to extend the amount of RAM available. This is accomplished by temporarily writing out some of the less-needed memory pages to disk
  • Memory-mapped files: reading files as if they were fully loaded in memory, with the OS performing reads behind the scenes
  • Shared memory: the ability for multiple processes to have access to the same memory. Modifications to this shared memory are immediately visible for all processes. This is extensively used by Oracle.

Virtual Memory Simplified

Simplifying all the above information, conceptually speaking, there are only three types of memory:

  1. Kernel memory: memory needed by the OS to manage itself
  2. Private memory: memory used by user programs
  3. Shared memory: memory that can be used by multiple programs simultaneously

There are many types of memory statistics available. They are named differently depending on the OS and the monitoring tool. Reading and understanding them is a matter of figuring out which of the three types is shown where. The biggest confusion comes from counting the same memory multiple times.

Each memory “page” (chunk) can be in either one of these following states:

  1. in RAM or not in RAM
  2. on disk or not on disk (here, disk also could be swap)
  3. shared or not shared

A memory page can be both in memory and on disk, and if you sum up everything in memory and everything on disk you will count the page twice. Shared pages are even more complicated, as only the fact that they are shared is known. How many processes share a given page is not known. If you sum up all processes' memory that is "in RAM" you will count the shared pages multiple times.

Oracle Memory Types

The Oracle database has two main memory areas:

  1. PGA – Process Global Area. This memory is private for each Oracle process. It is not shared, and it is dynamically allocated during runtime.
  2. SGA – Shared Global Area. This memory is shared amongst all Oracle processes. The two major sections of this memory are the “db cache” and the “pools” (shared pool, large pool etc.)

Each Oracle process has its own PGA and attaches to the single, shared SGA.

Linux Memory Areas

Many of the Linux memory areas are similar to other OSes, such as Solaris and AIX. They may be called or displayed differently, but in essence, it is the same concept.

Direct user-dependent types of memory

Consumption of memory in these areas is directly related to user activities.

  • Buffers (shared) – raw buffers. Usually the biggest consumer is meta-data blocks of an ext3 file system
  • Cached (shared) – files cached at a logical level. This includes the data portion of the "file system cache", including binary files "cached" for execution.
  • Anonymous (private or shared) – memory allocated by programs. This includes stack, private variables, etc.
  • HugePages – locked in memory, 2 Mb pages (chunks), usable only by special users via modified system calls.

Indirect (system) managed areas

Memory in these structures is a result of allocating memory in the above structures, or based on user activity, number of processes, etc.

  • Slab – kernel structures. This is an optimized memory allocation structure for small objects. The biggest consumers are inode_cache (file entity cache), dentry_cache (directory entry cache) and buffer_head (headers of buffers, any type).
  • PageTables – the "leaf" objects of the PTE tables. These can consume a lot of memory.

Virtual Memory, Oracle and the Filesystem Cache

When an Oracle instance is started, the first process allocates a relatively large shared memory segment. The size of this segment depends on multiple parameters. This segment is called the SGA — shared global area. Each following process "attaches" to this memory segment by mapping it into its own addressing space. This allows each process to manipulate the SGA. This also includes shadow processes that are a direct result of a user connecting to the database.

The SGA memory is reported into the “cached” section for Linux. The Oracle SGA does not have a disk representation, it is anonymous.

Monitoring Memory Usage — vmstat

Vmstat is one of the most commonly used tools for monitoring virtual memory usage and file system activity. It is available on almost all Unix platforms.

When running, vmstat prints a line every x seconds, containing information about the current state of key memory and disk statistics. The command vmstat 2 means one line every two seconds. Example output:

vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 3631424  11096 120204    0    0    35    31  255    20  0  0 99  0
 0  0      0 3631488  11096 120204    0    0     0     0 1014    18  0  0 100  0
 0  0      0 3631488  11096 120204    0    0     0     0 1012    16  0  0 100  0

A quick overview of the columns.

  1. r – run queue: how many processes are currently waiting for, or running on, the CPU
  2. b: how many processes are blocked, usually waiting on IO
  3. swpd: swap memory usage in Kb
  4. free: free memory in Kb
  5. buff: raw buffer size in Kb
  6. cache: file system cache size in Kb
  7. si/so: swap in / out – in Kb/sec
  8. bi/bo: blocks in / out (disk reads and writes) – in Kb/sec
  9. in: interrupts per second
  10. cs: context switches per second
  11. us/sy/id/wa: user/system/idle/wait time for the CPUs, as percentages

More details about each value are discussed further below.
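As a small practical example, here is a minimal sketch (assuming the default Linux vmstat column layout shown above, where si and so are columns 7 and 8) that flags any sample with swap activity:

vmstat 5 | awk 'NR > 2 && ($7 + $8) > 0 { print "swapping detected:", $0 }'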

Monitoring Memory Usage — top

One of the most common tools for monitoring a *nix OS is the tool called “top”. Almost all OSes have top or similar (it’s topas on AIX). This tool uses information from the /proc filesystem to gather overall and per-process information.

top is also the most misunderstood tool. Its output can be confusing, as the information it presents is explained in the documentation with single words or, at most, single sentences. This leaves it open to interpretation.

Since top takes all its information from the /proc filesystem, specifically /proc/meminfo and /proc/pid/statm, I will be covering /proc/meminfo in detail further in this paper.

Sample output of top:

top - 22:03:11 up  3:19,  2 users,  load average: 2.98, 1.22, 0.52
Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.7% us,  0.8% sy,  0.0% ni,  0.3% id, 98.0% wa,  0.2% hi,  0.0% si
Cpu1  :  0.0% us,  0.8% sy,  0.0% ni, 97.6% id,  1.4% wa,  0.2% hi,  0.0% si
Cpu2  :  0.0% us,  0.2% sy,  0.0% ni, 99.7% id,  0.2% wa,  0.0% hi,  0.0% si
Cpu3  :  0.2% us,  0.2% sy,  0.0% ni, 33.6% id, 66.1% wa,  0.0% hi,  0.0% si
Mem:   8310308k total,  8049068k used,   261240k free,    36620k buffers
Swap:  7823644k total,      572k used,  7823072k free,  3395900k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8494 oracle    16   0 1662m 1.6g 1.5g D  2.0 19.8   0:03.15 oracletest(LOCAL=YE
 4796 oracle    16   0 1626m 1.5g 1.5g S  1.0 19.5   0:03.91 ora_dbw1_test
 4794 oracle    15   0 1626m 1.5g 1.5g S  0.7 19.5   0:12.23 ora_dbw0_test
 4798 oracle    16   0 1626m 1.5g 1.5g S  0.7 19.5   0:03.97 ora_dbw2_test
 4800 oracle    16   0 1626m 1.5g 1.5g S  0.7 19.5   0:04.09 ora_dbw3_test
    1 root      16   0  2384  600  512 S  0.0  0.0   0:00.86 init [3]
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/0]
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 [ksoftirqd/0]

NOTE: This is only a sample output. I do not recommend running multiple dbwriter processes.

top has two sections: the overall statistics and the process list. The overall statistics display:

  1. the load average
  2. per-CPU usage (toggled with the key “1”)
  3. memory summaries (discussed in next section)

The second section displays details per-process. The default row ordering is by %CPU. Additional columns can be added, or the existing ones re-ordered. Press the "h" key for online help on doing that. To go over the columns shown above:

  1. PID – the process ID of that process. A unique identifier for the process.
  2. USER – the user executing that process. You can press "u" and type in a user to filter the process list.
  3. PR – current priority of the process. A lower number indicates a higher priority
  4. NI – nice value, the priority modifier for that process. Allows you to raise or lower priority for the process.
  5. VIRT – the virtual memory size of the process. How much memory the process can address in its private addressing space.
  6. RES – resident size. The amount of memory currently in RAM. NOTE: shared memory can be resident for multiple processes. In the example above, each dbwriter process has 1.5gb resident.
  7. SHR – the amount of memory shared. Usually matches the SGA size
  8. S – state. This is sampled, not an aggregate over the interval monitored. The common states are:
    1. S – sleeping (an Oracle process sleeping while waiting for a lock, for example)
    2. D – waiting on disk IO (actively reading)
    3. R – currently running on a CPU
  9. %CPU – percent CPU usage, relative to a single CPU (toggle with Shift-I). Note that this number can go above 100% if the process is multi-threaded, as each thread can leverage a different CPU.
  10. %MEM – percentage of memory taken, counting the RES column. Note: shared memory is counted multiple times.
  11. TIME+ – how much overall CPU time this process has consumed in its lifetime
  12. COMMAND – name of the process running. By default it displays only the binary file name. I always switch it to display the actual command by pressing "c". This allows me to see which Oracle process is running (background, shadow).

Reading /proc/meminfo

The /proc file system is a text-file interface to kernel structures. It represents both read-only and modifiable settings.

/proc/meminfo is one of the read-only text files. It is user-friendly, meaning that each line has its own description. Other files are intended for automated reading by programs.

To read meminfo issue cat /proc/meminfo. Here’s sample output:

cat /proc/meminfo
MemTotal:      8310308 kB
MemFree:         93448 kB
Buffers:        132036 kB
Cached:        3413324 kB
SwapCached:          0 kB
Active:        1658252 kB
Inactive:      1942032 kB
HighTotal:     7470528 kB
HighFree:         8768 kB
LowTotal:       839780 kB
LowFree:         84680 kB
SwapTotal:     7823644 kB
SwapFree:      7823072 kB
Dirty:             100 kB
Writeback:           0 kB
Mapped:          82500 kB
Slab:            92028 kB
Committed_AS:   490700 kB
PageTables:       3952 kB
VmallocTotal:   106488 kB
VmallocUsed:      5964 kB
VmallocChunk:    99900 kB
HugePages_Total:  2200
HugePages_Free:   1088
Hugepagesize:     2048 kB

Here is the explanation of the significant values.

MemTotal

The total memory visible to the OS. If it's not what you've put in the machine, you probably have a bad SIMM/DIMM.

MemFree

Memory that is currently unoccupied and available to use immediately. This is not the maximum amount of memory available at the moment.

If this value goes low, do not panic. Every bit of memory in this section is currently unused. You can even call it wasted. Some tools offer checks, and can page you, if “free memory” goes too low. You can remove those checks because in most cases all you need is a very small amount of memory free to allow programs to quickly allocate memory without the OS having to free some additional memory.
In some cases, you may need to keep more free memory available for rare, spiky, memory-hungry applications.

The amount of memory that Linux will target to keep free can be controlled by (Linux RH4) /proc/sys/vm/min_free_kbytes. Note that this is only a target. The idea behind free memory is to have asynchronous memory management, i.e. not to cause programs to wait for resources to be deallocated. Instead, the OS deallocates some resources in anticipation of memory needs.

grep MemFree /proc/meminfo
MemFree:         26568 kB
echo 900000 > /proc/sys/vm/min_free_kbytes
grep MemFree /proc/meminfo            
MemFree:        210056 kB

Buffers

The cache of raw disk blocks. Do not confuse it with filesystem cache (next).

Usually occupied with ext3 data pointers (extent management data). Generally, if you have a large and active database, there’s 1 Mb of metadata associated with each 1 Gb of data. Not all of it needs to be buffered, but the active portion does. The ext3 pointers, depending on the file size, are direct, indirect, and double indirect. On large files, when accessing the tail end of the file, up to 2 additional disk reads may be required.

It does not matter whether you use DirectIO for your datafiles or not — buffers are still needed. ASM (Oracle’s Automatic Storage Manager) manages data in much larger chunks, so it requires less memory. Its metadata is in the shared pool. If you have a large database, it may make sense to have a larger shared pool for the ASM metadata.

Cached

Often referred to as filesystem cache. This is the actual data of the files. Each block is associated with a file (or inode). Thus the file cache is logical. For example: when removing a file, its associated cached pages are removed. Note that removing the "file" portion of the file does not necessarily free up the inode immediately. If the file is opened by an application, the inode (and "the file") will remain open and its cache will not be released.

This file system cache is also used for executing binary files. For example: if you copy a binary executable file and run it immediately, it will already be in memory and will not need to be loaded. For reference, the “Oracle” binary is 100 Mb. Usually, at least 50% of it is cached. Also a number of binary libraries end up in this area.

On systems with large amounts of RAM, file system cache would be the largest area, unless DirectIO is used.

If you do not have DirectIO enabled in your database and you use a normal file system to store your data (ASM implies DirectIO), then your cache will be polluted with database activity. The same data will be cached both in the database and in the OS cache. Usually the result is the system “feels” sluggish, as any rare activity will require its data to be re-read from disk.

The Oracle SGA is also in this area, when not using HugePages. This can be quite confusing. Especially since 10g, the SGA is allocated slowly on a touch basis. An example of this behavior is in the case studies section further below.

Here are a few examples of filesystem cache usage behaviour. The machine is freshly started, so there is lots of free memory, which will be consumed by the file system cache as files are read.

The test case involves reading a few large files and observing memory behavior.

[root@ ~]#  cat /proc/meminfo 
...
MemFree:       8232512 kB
Buffers:          9328 kB
Cached:          28372 kB
...
du -smc indx01_*
1714    indx01_01.dbf
1761    indx01_02.dbf
1722    indx01_03.dbf
5197    total
...
cat indx01_* > /dev/null

[root@ ~]# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 8093888  10808 163392    0    0     0     0 1012    17  0  0 100  0
 0  0      0 8093952  10808 163392    0    0     0     0 1012    16  0  0 100  0
 0  1      0 7956736  10948 300272    0    0 68602     0 1567  1126  0  2 76 22
 0  1      0 7808576  11092 448068    0    0 73992    80 1623  1210  0  2 75 23
...
 0  1      0 2847616  16104 5397616    0    0 65792     0 1542  1076  0  2 75 23
 0  0      0 2766272  16180 5479180    0    0 40698     0 1341   675  0  1 85 14
 0  0      0 2766208  16192 5479168    0    0     0   114 1033    22  0  0 100  0
cat /proc/meminfo 
...
MemFree:       2766464 kB
Buffers:         16192 kB
Cached:        5479168 kB
...

The next test case reads the same files (already cached in memory) and writes them to a new file, replicating a number of data blocks. Then I remove the newly created file; its cache is released, resulting in a lot of free memory.

cat indx01_* >newfile
vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 2765312  17044 5479356    0    0     0     0 1012    17  0  0 100  0
 0  3      0 2405376  17428 5833612    0    0    16 36866 1324   144  1 18 76  6
 0  2      0 2143616  17688 6091532    0    0     4 111748 2000   213  0 16 50 34
...
 0  1      0  16832   6784 8198556    0    0  8556 26684 1942  1267  0  2 74 24
 1  1      0  16832   6856 8198744    0    0 12518 20720 2130  1767  0  3 74 23
...
cat /proc/meminfo 
...
MemFree:         16768 kB
Buffers:          2192 kB
Cached:        8196908 kB
...
Dirty:          277468 kB
Writeback:           0 kB
...

cat /proc/meminfo 
...
MemFree:         20672 kB
Buffers:          3300 kB
Cached:        8191900 kB
...
Dirty:               0 kB
Writeback:           0 kB
...
rm newfile
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  23296   3380 8189480    0    0     0    28 1015    18  0  0 100  0
 0  1      0 3257472   3948 4996372    0    0   284     0 1084   160  0 14 78  8
 0  1      0 3255552   5828 4996572    0    0   940     0 1247   485  0  1 75 24
 0  1      0 3253696   7616 4996344    0    0   884    96 1237   470  0  2 75 23
 0  0      0 3253440   7988 4996492    0    0   186     0 1061   112  0  0 95  4
 0  0      0 3253440   7988 4996492    0    0     0     0 1012    14  0  0 100  0

SWAP

The swap section displays a summary of less-used data that has been written to disk. Only anonymous data requires a swap file, since it does not have an existing on-disk representation, i.e. it is not associated with any inode.

There are three values in the swap section:

  1. SwapTotal – the total amount of swap space available
  2. SwapFree – unused amount of swap space
  3. SwapCached – data that has been swapped out to disk, but still remains in memory.

Swap caching is another attempt by the OS to anticipate memory needs. Rarely-changing data can be written to the swap file while still remaining in memory. Under eventual memory pressure, these pages can then be deallocated and made available immediately, without having to be written out first.

Actual swapping (memory that will need to be read from disk) can be calculated as:

SwapTotal - SwapFree - SwapCached
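A minimal sketch of the same formula, reading /proc/meminfo directly (the field values are in kB):

awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} /^SwapCached:/ {c=$2} END {print "swapped out and not cached (kB):", t-f-c}' /proc/meminfo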

It is perfectly fine to have data swapped out and cached. This means that some application has requested and used some memory, but has not been using it for a long period of time. For example, the Oracle Clusterware tends to do that.

Active/Inactive

These values can be used to determine whether you have sufficient memory for your workload. If "active" is very close to your total amount of memory, you would probably benefit from extra memory. However, these are not long-term numbers; they refer to a very recent state of the system. Having a low "active" right now does not mean that overnight your "active" won't be very high and cause some swapping.

High/Low Free

In 32-bit systems, there are limitations on what memory the Linux kernel can address directly or indirectly. High memory in 32-bit systems is, roughly, the memory above the first gigabyte (about 896 Mb, to be precise), which the kernel cannot map directly. Linux needs to do extra work when using high memory. Some of the Linux kernel structures cannot reside in high memory. The newer 2.6 kernel does a very good job of managing high memory and reserving low memory for kernel structures.

On 64-bit systems there is no high memory, and meminfo will show 0 for high total and high free. All the memory can be accessed directly without any additional overhead.

Dirty & Writeback

Dirty memory is cache/buffers that need to be written to disk.

Writeback – memory actively being written to disk. This can reach high values with asynchronous writes with a large queue.

Committed and mapped

Committed_AS – total memory requested by all running applications. Note that this is requested memory, not actually used memory. A process may "request" a large amount of memory and never use it. There is no need to allocate it to that process until it actually uses it. If every process in the system were to touch and use all the memory it has requested, the total memory usage would be Committed_AS.

Mapped – memory used for memory-mapped files. Anonymous memory that has been committed and touched is also counted here.

Here is a small test case demonstrating the difference between committed (requested) and used (touched). This small C program requests 1 Gb of memory, but free memory remains the same — committed increases by 1 Gb.

cat grab.c
#include <stdlib.h>
#include <unistd.h>
/* request 1 Gb but never touch it: the memory is committed, not used */
int main() { void *p = malloc(1073741824); sleep(60); return 0; }

cat /proc/meminfo 
...
MemFree:       3230592 kB
...
Committed_AS:    49972 kB

./grab
cat /proc/meminfo 
...
MemFree:       3230464 kB
...
Committed_AS:  1098808 kB

SLAB

Slab is a special type of memory. It is a way to manage a large number of small objects that are constantly allocated and de-allocated. Most of the objects are a direct result of kernel structures allocated as part of process activity. In a way, it is similar to Oracle's shared pool.

Details about the content of the "slab" can be obtained from the user-unfriendly file /proc/slabinfo. A more user-friendly interface is provided by slabtop. It's similar to "top" for processes, but it only displays stats about slabs, by type.

slabtop sample output (ordered by size, by pressing the C key):

 Active / Total Objects (% used)    : 88874 / 139343 (63.8%)
 Active / Total Slabs (% used)      : 5839 / 5846 (99.9%)
 Active / Total Caches (% used)     : 90 / 132 (68.2%)
 Active / Total Size (% used)       : 17286.03K / 23311.27K (74.2%)
 Minimum / Average / Maximum Object : 0.01K / 0.17K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 32382  24900  76%    0.27K   2313       14      9252K radix_tree_node
 56925  40013  70%    0.05K    759       75      3036K buffer_head
   364    363  99%    4.00K    364        1      1456K size-4096
  2485   2471  99%    0.54K    355        7      1420K ext3_inode_cache
  2376    413  17%    0.50K    297        8      1188K size-512
   256    256 100%    3.00K    128        2      1024K biovec-(256)
  4576   4481  97%    0.15K    176       26       704K dentry_cache    
 10248   4548  44%    0.06K    168       61       672K size-64
  4340   1215  27%    0.12K    140       31       560K size-128
  1980    316  15%    0.25K    132       15       528K size-256

A few of the types that are usually taking considerable space:

  • ext3_inode_cache – caches the file object entity: information about timestamps, permissions, etc.
  • dentry cache – caches the contents of directories
  • buffer_head – metadata about each buffer. The actual data is not part of this structure.

Since in some cases the largest portion of your slab is the inode/dentry cache, it can be controlled via /proc/sys/vm/vfs_cache_pressure. This parameter allows you to change the priority of deallocation to favour logical objects (the slabs) or file system data ("cached").
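Reading and (as root) changing the setting works just like the min_free_kbytes example earlier. The default is 100; values above 100 make the kernel reclaim inode/dentry slabs more aggressively, while values below 100 favour keeping them over "cached" file system pages:

cat /proc/sys/vm/vfs_cache_pressure
echo 200 > /proc/sys/vm/vfs_cache_pressure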

PageTables

The PageTables value (new in the 2.6 kernel) shows the amount of memory used by the leaf objects of the tree-like structure of process page tables. Similar to Oracle's B-Tree indexes, the leaves represent the greatest portion of this structure.

PageTables is memory to manage memory.

Each entry takes 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. Each 4 Kb of virtual (not swap) memory requires one PTE to manage. This also applies to shared memory: there is one physical segment, but each process has its own virtual memory which maps to this segment.

In Oracle’s case, assuming an SGA of 2 Gb, that is:

524,288 pages * 8 bytes = 4 Mb per process

If you have 1000 sessions, that’s 4 Gb of memory to manage 2 Gb of SGA. Not efficient at all. Huge Pages provides a solution to this problem.
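The same arithmetic as a small shell sketch; the SGA size and session count are just example values, and 8 bytes per PTE (64-bit) is assumed:

SGA_MB=2048; SESSIONS=1000
PAGES=$(( SGA_MB * 1024 / 4 ))                                      # 4 Kb pages in the SGA
echo $(( PAGES * 8 * SESSIONS / 1024 / 1024 )) "Mb of page tables"  # ~4000 Mb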

HugePages

“HugePages_Total” and “HugePages_Free” provide information about the amount of hugepages allocated (removed from “normal” memory) and the amount free (available for use).

HugePages is memory managed in 2 Mb blocks (this is huge compared to 4 Kb). This memory is also locked in RAM and cannot be swapped out. It is not even considered for swapping.

Programs that want to use this memory need to pass additional flags. Oracle attempts to use hugepages memory for the SGA if it is available. There are a number of permissions associated with using hugepages as well. See the next sections for details.

VLM – Workaround for 32-bit

Some 32-bit systems have the Intel Physical Address Extension, or PAE. This allows 36 bits to be used to address memory. This is only for the OS. Processes are still limited to 4 Gb. However, multiple processes can each use 4 Gb, and the file system cache can use the extra memory.

For Oracle, each process needs to be able to address (reach) the entire SGA memory. The way this is worked around is by creating an in-memory file system, mounting it in /dev/shm, and setting USE_INDIRECT_DATA_BUFFERS=TRUE in the Oracle init.ora file. There are limitations — you have to use the old db_block_buffers parameter instead of the new db_cache_size. This also prevents you from using multiple cache sizes.

What will happen is that Oracle will still create a shared memory segment (IPC Shared memory) for the shared_pool and other pools, however it will create a file in /dev/shm for the block buffer’s contents. Then each individual process will map the pages it needs to access into its own addressing space using the remap_file_pages system call.
The reason for this split is that we still want to access the block headers containing the state and information about each buffer directly.

Note that in Linux RedHat 4 there are three types of shared memory file systems available.

  1. shmfs – needs a preset size
  2. tmpfs – does not need size
  3. ramfs – does not need size + locked in memory (non swappable)

None of these can use hugepages for the data. But, the shared memory segment with the shared pool can and should be in hugepages. Obviously, ramfs is the best option as it locks the data in memory.
Using this approach with large amounts of memory often results in a lot of time spent in “system” CPU, and usually it’s in the remap_file_pages call.

Oracle uses a narrow window (usually 512 Mb) to map the buffers it needs to read. Each process still has its own mapping table. In terms of PTE usage, this would be the same as having a 512 Mb SGA, i.e. about 500 Kb per process. That’s assuming the shared_pool section is in hugepages.

If the system is capable of running 64-bit Linux then it should. Using the VLM workaround adds significant overhead to database cache access.

Page Tables Abuse — An Example

As discussed earlier, each process has a page table. This page table is private for the process and cannot be shared. (Solaris is different in this respect.)

In Oracle, there is usually a large shared memory segment shared amongst multiple processes. Each process still has a page table that is maintained. For example, for a 1.7 Gb SGA (the typical 32-bit limit), 445,440 x 4 Kb pages are needed. We would need 445,440 leaf PTE entries times 4 bytes each — that's about 2 Mb. Each process would need a PTE table that is 2 Mb in size to fully describe its 1.7 Gb of mappings. If you have a large number of processes, say 1000, then you will need 2000 Mb of RAM to manage a 1.7 Gb SGA. Quite inefficient.

Here is a test case demonstrating this behavior. Note that the PTE table is initialized as it is used. Thus just starting a process does not consume 2 Mb immediately. The process needs to access its shared memory, initializing its PTE table (and taking a CPU penalty the first time each page is accessed). The test case has a specially crafted table that fits exactly in the 1400 Mb db_cache that is configured.

Before starting the sessions (db is up).

cat /proc/meminfo
...
MemFree:       1070472 kB
...
Committed_AS:  1881544 kB
PageTables:       4932 kB
...

After sessions have finished touching the memory.

cat /proc/meminfo
...
MemFree:        473496 kB
...
Committed_AS:  4708552 kB
PageTables:     295068 kB

Note that PageTables has grown by more than 290 Mb for 100 sessions. Doing the math:

290 Mb / 100 sessions = 2.9 Mb per session = ~3,040,870 bytes
1400 mb cache is 358'400 x 4k pages
~3,040,870 bytes used for 358'400 pages = ~8 bytes per page

The math is not 100% accurate due to a number of private pages that we are not counting.

The same test repeated, but with hugepages configured and the database using hugepages:

Before starting 100 sessions (db up).

cat /proc/meminfo 
...
Committed_AS:   332264 kB
PageTables:       3056 kB
...
After
...
Committed_AS:  3124640 kB
PageTables:      23100 kB
...

Notice that PageTables increases by only 20 Mb.

Additional Benefits of HugePages

The mappings from virtual to physical addresses are performed so often that CPU manufacturers have created a special cache inside the CPU called the TLB cache – Translation Look-aside Buffer. It typically holds the latest 512 translations. If you have a larger page, then these 512 slots can cover much more memory, and the chances of having a cache hit increase. If there is a cache miss, then up to three additional memory reads (from RAM) need to be performed to find the correct RAM address to read. This can waste a significant amount of CPU time.
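To put the coverage difference in numbers, assuming a 512-entry TLB:

echo $(( 512 * 4 / 1024 )) "Mb covered with 4 Kb pages"    # 2 Mb
echo $(( 512 * 2 )) "Mb covered with 2 Mb HugePages"       # 1024 Mb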

In a TPC benchmark test I performed, there was an 8% improvement in transactions per second. The TPC test was in-memory only. I am not even including the fact that we could have used more memory for cache since it’s now available.

Configuring HugePages

Configuring HugePages is easy. The following steps are needed:

Edit /etc/security/limits.conf and allow the oracle user to lock large amounts of memory in RAM. Add (for 40 Gb limit):

oracle soft memlock 41943040
oracle hard memlock 41943040

Edit /etc/sysctl.conf to set up the group ID of the group allowed to access hugepages (the DBA group) and how many hugepages to attempt to allocate. For example, add:

vm.hugetlb_shm_group=503
vm.nr_hugepages=2000

Issue sysctl -p to apply the changes.
Issue cat /proc/meminfo to see how many HugePages are available. If the number available is much less than the number you configured, reboot the system to defragment the RAM; Linux needs to find contiguous 2 Mb chunks.
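Once the instance is up, a quick way to verify that the hugepages are actually being consumed is shown below. HugePages_Total minus HugePages_Free, times 2 Mb, should roughly match the size of the Oracle shared memory segment reported by ipcs:

grep -i huge /proc/meminfo
ipcs -m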

Case Studies – Free Memory is Going Down

This demonstrates what happens to a freshly-started machine. Free memory seems to disappear. Some monitoring tools may report this as an error when in fact, it is not.

The database has just been started and I am about to run a query that will full scan a large table, touching and initializing the db cache. The table size is 1.2 Gb.

vmstat 2
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 7725800  13036 497864    0    0     0    16 1013    25  0  0 100  0
 0  1      0 7663272  13144 556516    0    0 14806    70 1164   393  2  2 80 16
 0  1      0 7513128  13252 706168    0    0 37494     0 1318   631  2  3 75 20
 0  1      0 7310824  13408 908032    0    0 50502    64 1429   862  3  4 75 18
...
 0  1      0 5503208  14724 2709556    0    0 59144    16 1493   987  3  4 75 18
 1  0      0 5263080  14856 2948884    0    0 59838   128 1518   995  3  5 75 18
 0  0      0 5111272  14944 3106096    0    0 39344     6 1332   663  2  5 82 11
 0  0      0 5111272  14944 3106096    0    0     0    16 1013    26  0  0 100  0
 0  0      0 5111144  14960 3106080    0    0     0    30 1016    36  0  0 100  0

Notice we started with 7,700 Mb free and at the end of the test we have 5,111 Mb free. A 2,589 Mb difference, yet our test reads only 1.2 Gb of data. What happened?

  1. First, since we are reading from disk, the file system cache got populated and the file we read is now in the Linux "cached" section
  2. Second, we're initializing the SGA, thus it's consuming memory, and because it's shared memory it is also counted in the "cached" counter.

Repeating the same test, but this time setting up HugePages, the SGA is pre-allocated in its own pool. The exact same test:

vmstat 2
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 3436280  14236 286324    0    0     0    56 1027    54  0  0 100  0
 0  1      0 3395192  14336 323924    0    0 18902    24 1193   447  2  1 81 16
 0  1      0 3303928  14480 415040    0    0 45572    48 1387   775  3  1 75 21
...
 0  1      0 2566776  15560 1150020    0    0 49228     6 1416   828  3  1 75 20
 0  1      0 2452152  15720 1264260    0    0 57230    18 1492   977  3  1 75 20

Notice that we start with far less free memory, as the HugePages pool has taken it. I have configured a 4000 Mb pool.

Free memory still goes down by 1.2 Gb, because the file system is caching the file.

Finally, configuring filesystemio_options to setall, enabling both AsyncIO and DirectIO (requires 2.6 kernel). We now expect the free memory to remain the same.
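For reference, a minimal sketch of how the parameter can be set, assuming an spfile is in use (the change takes effect after an instance restart):

sqlplus -s / as sysdba <<'EOF'
alter system set filesystemio_options = setall scope = spfile;
EOF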

vmstat 2
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 3436280  14236 286324    0    0     0    56 1027    54  0  0 100  0
 0  1      0 3395192  14336 323924    0    0 18902    24 1193   447  2  1 81 16
 0  1      0 3303928  14480 415040    0    0 45572    48 1387   775  3  1 75 21
...
 0  1      0 2566776  15560 1150020    0    0 49228     6 1416   828  3  1 75 20
 0  1      0 2452152  15720 1264260    0    0 57230    18 1492   977  3  1 75 20

No change — the exact same behaviour. Is something wrong in our monitoring? No!

Oracle Bugs

The above behaviour is the result of bugs in the 9i version. Here are the details:

    • bug 3186847
      • filesystemio_options=directio is ignored on Linux
      • fixed in 9.2.0.6
    • Note: 297521.1
      • bug 2448994 introduced – the O_DIRECT flag was not passed to the open() system call
      • fixed in 9.2.0.7
      • basically, you need 9.2.0.7

The 10g database doesn't have these bugs; it behaves as expected. Free memory will go down as you use the OS and other things are cached. It will also go down as private memory (PGA) is used for sorting or hashing tables.

Conclusions

  1. Linux Oracle servers must use HugePages.
  2. Being low on free memory is not an issue.
  3. Using DirectIO is important to maintaining a useful file system cache.

You should now have a good understanding of how to read what the various memory reporting tools are showing, and you should be able to make sense of it. Most of the concepts in this document apply to all Unix platforms; it's only a matter of figuring out the right terminology.

Don’t forget, if you have any questions or mystery cases, please leave a comment.

34 Responses to “Pythian Goodies: The Answer to Free Memory, Swap, Oracle, and Everything”


  • zach friese says:

    "I do not recommend running multiple dbwriter processes." – Could you please explain why? Thanks!

  • Christo Kutrovsky says:

    Zach,

    Multiple dbwriters are only needed if your one dbwriter is running so hard, that it’s maxed out by the capacity of a single CPU.

    Today’s CPUs are so fast, that the chances of this happening are close to none.

    With async IO, dbwriter can send literally thousands of IOs to the OS (this can be observed with iostat), so there is no need for extra dbwriters.
    If you cannot use async IO for whatever reason, then you should use multiple dbwriter slaves (like 32 or so) in order to simulate async IO.

    Multiple dbwriters will cause more stress on the buffer cache and, other than the case above, I can't see a reason to use them.
    Just for comparison, running with async IO is equivalent to having 1000 dbwriter slaves. Can you afford to have 1000 dbwriters?

  • Don Seiler says:

    Christo,

    My server reports this:
    # cat /proc/sys/vm/nr_hugepages
    10240

    # cat /proc/sys/vm/hugetlb_shm_group
    0

    I wasn't previously aware of the hugetlb_shm_group setting. Does this mean that Oracle is not using hugepages? meminfo is:

    HugePages_Total: 10240
    HugePages_Free: 2047
    Hugepagesize: 2048 kB

    Don.

  • Christo Kutrovsky says:

    I have gotten reports on this as well. It seems like the "hugetlb_shm_group" is not really honored. I haven't tested it, and I set it up every time.

    It looks like the hugepages are being used, since you have fewer free pages than the total. Which means *something* has taken them.

    You can do the math to be sure:

    do a “ipcs” then look how big the shared memory segment is, and compare it to the memory taken (in your case, 10240 – 2047 = 8193*2 = 16386 mb).

    SUSE users: beware, SUSE does "lazy" allocation of hugepages. This can be a problem if you start multiple databases that won't fit once they touch all of their SGAs.


  • jason arneil says:

    don,

    You can see that you have allocated 8193 hugepages to something, so on your system, unless something else is using 10g of ram, the chances are it’s oracle.

    Christo- as far as I can see it matters not to set the hugetlb_shm_group – oracle will be able to grab hugepages, certainly i see this behaviour on Redhat 4 Update 3 2.6.9-34.ELsmp kernel. (64-bit).

    HugePages_Total: 3250
    HugePages_Free: 114
    Hugepagesize: 2048 kB

    cat /proc/sys/vm/hugetlb_shm_group
    0

    my /etc/security/limits.conf is also unedited, and I'm still allocating – only when using srvctl though.

    Except my only allocate via srvctl also only works when setting both the group and the security limits. This is RAC though.

  • Video auto-plays. Not sure that is best if viewing in, say, a library.

  • Just a nit: I’d recommend another look at slide 11. 2ns for registers? That’s on the order of 7 clock cycles for a 3GHz processor. L2 isn’t even that slow.

  • Shahid says:

    Hi,
    I hope you can clear some of my doubts about memory concepts related to aix OS.

    Stats:
    AIX 5.3, P590 series. 8 CPUs, 27GB RAM allocated for the machine partition.
    DB SGA 4GB
    MAXCLIENT=MAXPERM% = 20% (MIN IS 10%)
    lru_file_repage =1
    strict value for maxperm and for maxclient both set to 1.
    File system=jfs2.
    paging = 1% all the time (means no paging)

    Consider the following at point in time:
    From vmstat -v, it is clear the file cache is at 19.8% meaning almost all of 20% of 27gb = 5.4gb is in used by file cache for jfs2.

    From svmon and vmstat , free memory is 6gb.
    From nmon also it says total mem=27gb, free=6gb and used=21gb and file cache used is 5.4 gb.

    I would like to know where i am doing wrong below:
    Objective:
    Try to add the os processes memory usage + shared segments +freemem,
    and sum should map to 27gb (which is the total memory).

    Tot_memory = File_cache + unixprocesses + shared segment memory +freemem
    27gb = 5.4 gb + 9gb + 4gb +6g (as you see i am missing 2.4gb uncounted for)

    Running the above at different times more or less am short of 2-2.4 gb always. Where is this 2.4 gb and who is using it? If you say to me that this 2.4 gb is actually persistent memory from disk files and the freemem shown my vmstat is actually not complete true, then i need to find out how to calculate this persistent memory on my machine and that better sum to 2.4 gb so that i can reconcile the total mem.

    Filecache, freemem and shared segment calculation is pretty much straight forward, and i think you all know how it is to be calculated. (I even checked ipcs -b and i do not have a memory segment orphan etc).

    How I come up with 9gb is simple.(9gb is sum of all unix processes memory usage)
    Running ps vg command and adding (rss-trs)/1024 (mb) to sum for those process ids where PID is also available in (select pid from v$process).
    To this i add sum of rss for rest of the remaining process(oracle listener, dbconsole, and root processes etc).

    Oracle processes adds up to major like 8gb, rest 1gb is for all remaining processes.

    What about jave heap, is this a separate memory that wont come in ps vg and also in ipcs -b?? Am at a loss.

    My only obj for now is to find ways to add all process memory + shared segments + free + file cache and this should sum to 27gb or at least the difference should not be so high as 2.4 gb missing.
    I checked i don’t have a leaking unless u tell me how to check??

    Cheers
    shahid

    CAN YOU PLEASE SEND THE REPLY TO MSHAHID@ALFRANSI.COM.SA OR ORACLEFUSIONS@GMAIL.COM

  • Christo Kutrovsky says:

    Hi Kevin,

    Yes, I know the times I picked no longer match reality. However, I think I do mention in the presentation (maybe I've forgotten this time) that the exact times are not as important as the ratio of difference between the different memory types.


  • Thanks for the writeup on meminfo, it’s scarcely documented. Would you be able to explain how to tell from these values how much memory is being used by user applications? I can’t get tools like ‘memstat’ to agree with the way munin calculates used memory from meminfo, which is making it difficult to troubleshoot what appears to be a leak based on munin’s data.

  • Steve Kirby says:

    This is an awesome presentation. Great discussion that covered issues I was currently having with a client. Thanks.

  • june says:

    Thank You


  • Yong Huang says:

    Hi, Christo, is there any reason you set memlock limit to 40G but you only set vm.nr_hugepages=2000 i.e. 4G? What’s the SGA size of the database on the box (plus ASM’s SGA if ASM is on it too)? And do you see Oracle process’s virtual and resident memory about the same value in top, after the database has reached its normal work load? Thanks. — Yong Huang

  • Christo Kutrovsky says:

    I set the memlock limit far enough not to bother me. Imagine adding more RAM to the server – I don't want to deal with memlock restrictions again.

    Yes, it's normal for resident=virtual. If this is not the case, then that means you have committed (allocated) memory but you are not using it.

  • Yong Huang says:

    Regarding my observation of resident and virtual memory being about the same (comment #18), I think there's a display issue on Red Hat 5 (RHEL5). On RHEL4, top (version 3.2.3) shows RES close to VIRT. But on RHEL5 configured similarly, top (version 3.2.7) shows far lower RES than VIRT even though HugePages is used by Oracle and the instance has been running for a long time. I wonder if others have seen this.

  • Pascal says:

    hi,

    thanks for your article.

    may be I have a question (may I don’t understand something)

    in PageTables part
    you write “Each entry takes 4 bytes” (entry of pte)
    and later you write “~ 8 bytes per page)
    “290 Mb / 100 sessions = 2.9 Mb per session = ~3,040,870 bytes”
    “1400 mb cache is 358’400 x 4k pages”
    “~3,040,870 bytes used for 358’400 pages = ~8 bytes per page”

    what is the number of bytes per pages ? 8bytes or 4 bytes.
    may be 4bytes in 32bits OS and 8bytes in 64 bits OS ?

  • Christo Kutrovsky says:

    Pascal,

    I haven’t researched the exact value. But for sure on 32 bit systems it’s at least 4 bytes and on 64 bit at least 8 bytes.

    But with lazy (dynamic) allocations it is difficult to compute. Regardless, the exact value is not important, as with 8 the amount of memory used is already significant.


  • odi says:

    I want to run Oracle without swap. Given the RAM size, how large can SGA/shm be to run it safely? See http://www.odi.ch/weblog/posting.php?posting=643

    • Hi Odi,

      I checked out your blog. Don’t use transparent hugepages for Oracle’s SGA. Use the fixed one for the sga.

      To answer, your question on SGA – you need to first estimate how much memory you will need for connections, sorting, and pl/sql arrays. i.e. it depends on your application.

      But the bigger question is … 6GB server running Oracle? In 2011? With the low prices of RAM, and the fact that you do not pay licensing fee for RAM, it only makes sense to use the maximum available on the server.


  • Olivier BERNHARD says:

    Hi Christo,
    really enjoyed your presentation at ukoug 2011, seems like the very short time you had to present your stuff was a little bit frustrating ;) Any chance you upload the slides you’ve used which looked like a little bit up to date as compared to the existing presentation from 2006 ?
    Thanks and Best Regards,
    Olivier.

  • Olivier BERNHARD says:

    Thanks Christo, a new audio presentation would be awesome as well !

  • Sergey says:

    Thanks a lot. That is very useful.

  • djeday84 says:

    can u reupload audio track ? it is only 11 minutes long =(

