John Gallagher

A Better Default for vm.overcommit_memory is 1 (Overcommit Always)

November 2022

Background

Many processes allocate more memory than they end up using, whether for the sake of efficiency (fewer syscalls) or simplicity. Rather than allow this memory to sit unused, an operating system can overcommit - essentially, allow the sum of the memory allocated by all processes to exceed the total memory of the system. The assumption is that most of them generally won't use all the memory they have allocated, so most of the time everything will be fine.

Although this makes it easier to achieve high utilization of the memory available to the system, there are downsides. If the processes do end up trying to use a large chunk of the memory they allocated, the sum of the memory they try to use can exceed the amount of physical memory on the system. There isn't an easy way for the OS to tell a process that there is a shortage of memory such that the process can handle the shortage gracefully; from the process's perspective, the memory shortage can happen on practically any load or store.

Instead, an OS that allows overcommit will generally handle system-wide memory shortages by killing some process and reclaiming its memory. On Linux, this has traditionally been handled by the OOM (Out Of Memory) Killer. The OOM Killer picks some unfortunate process (not always the same one you would pick if you were given the choice) and terminates it with a SIGKILL, repeating as necessary until the memory pressure is relieved.

When I first learned about the OOM killer, and overcommit in general, it seemed to me like a crazy way of managing system memory. How could anyone build reliable systems if their processes could be randomly killed at any time? I still think it's fairly painful, but a few years of working on Solaris/illumos, which doesn't overcommit, taught me that the alternative (strict memory accounting) isn't too pretty either. You either allow a large chunk of your memory to sit unused, or you can add swap, increasing the amount of 'memory' that is available. If you add enough swap, and tune your workload just right, the used memory of your system's processes can be resident in physical memory, and the swap is only there for the purpose of accounting for the inevitable unused portions of the allocations by the system's processes. This is tricky though - unless your workload is very consistent in terms of memory usage, it may sometimes grow too big, in which case the swap will actually be used as swap. For many workloads, swapping any part of the system's working set causes such a big performance hit that it would be better just to fail immediately rather than continue to operate with degraded performance.

And strict memory accounting doesn't solve the reliability issue either - unless you are reserving memory up front (which can be wasteful), a small allocation by a critical component of the system can fail due to high memory usage by some less important component.

Ultimately, running out of memory is just a hard situation for an OS to handle gracefully, whether it overcommits or not.

vm.overcommit_memory

One nice thing about Linux is that it lets you choose whether or not to overcommit, via the vm.overcommit_memory kernel parameter. Although in practice most Linux systems have this set in a way that generally allows overcommit, there are actually three possible settings:

0: Allow overcommit except for allocations that are "seriously wild", in the words of the kernel documentation
1: Allow overcommit always
2: Don't overcommit (with some subtlety)
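
As a minimal sketch of how to check which of these settings a machine is using, the following reads the value from procfs. The helper names (describe_overcommit_mode, current_overcommit_mode) are made up for illustration, and the descriptions are paraphrases of the list above rather than official kernel wording.

```python
from pathlib import Path

# Meanings of the three values, paraphrased from the list above.
OVERCOMMIT_MODES = {
    0: "heuristic: overcommit, but refuse seriously wild allocations",
    1: "always overcommit",
    2: "don't overcommit (strict accounting)",
}

# Standard procfs location of the setting on Linux.
OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")


def describe_overcommit_mode(value: int) -> str:
    """Return a one-line description of a vm.overcommit_memory value."""
    try:
        return OVERCOMMIT_MODES[value]
    except KeyError:
        raise ValueError(f"unknown vm.overcommit_memory value: {value}") from None


def current_overcommit_mode(path: Path = OVERCOMMIT_PATH) -> int:
    """Read the current setting; the file holds a single integer."""
    return int(path.read_text().strip())
```

Calling describe_overcommit_mode(current_overcommit_mode()) on a Linux machine reports the mode in effect. Reading the file does not require root.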

A Better Default

Overcommit should probably be the default in the Linux world, since so many programs written for Linux allocate many times more memory than they generally use, assuming overcommit will be enabled. The glibc memory allocator is a good example of this, allocating large memory arenas in proportion to the number of threads a process has. Turning off overcommit would be going against the grain in the Linux world, and probably wouldn't be a good default.

So if we are going to allow overcommit on a system, should the setting be 0 (overcommit almost always) or 1 (overcommit always)? All of the distros that I've come across default to 0, and this may seem like the best of both worlds. We get overcommit in general, which we want, but if a program has a bug that causes it to request an unrealistic amount of memory (a 'seriously wild allocation'), the responsible syscall (e.g. mmap) will return an error, which allows the bug to be identified more quickly.
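
To make the difference concrete, here is a rough sketch that asks the kernel for an anonymous mapping and reports whether the request was granted, without ever touching the memory. The function name try_anonymous_map is hypothetical, and the outcome for large sizes is an assumption about typical behavior, not a guarantee: it depends on the overcommit setting, the amount of RAM and swap, and any resource limits on the process.

```python
import mmap


def try_anonymous_map(size_bytes: int) -> bool:
    """Try to map size_bytes of private anonymous memory without touching it.

    Returns True if the kernel granted the mapping, False if it refused
    (mmap raises OSError, typically with errno ENOMEM).
    """
    try:
        region = mmap.mmap(
            -1,
            size_bytes,
            flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    except OSError:
        return False
    region.close()  # release the address space; no page was ever written
    return True
```

For example, try_anonymous_map(1 << 40) requests 1 TiB. On a machine with far less memory than that, I would expect the request to be refused with the setting at 0 and granted with the setting at 1, but it is worth trying on your own system rather than taking that on faith.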

In my experience though, it's better to set vm.overcommit_memory to 1 (overcommit always). The 'wild allocation' scenario doesn't happen that often, and when it does, the system is already headed for failure of some sort.

On the other hand, I have seen real situations where large but non-problematic allocations were denied for being too large, even though the system would not have experienced any memory shortage had they been allowed. One of the more common examples of such a scenario is when a large program forks, and the child immediately execs. The fork represents a large allocation - the child's memory is a copy of the parent's, so it's an allocation equal in size to the amount of memory used by the parent process. The memory is copy-on-write though, and if the child immediately execs, never writing to the vast majority of its memory in between the fork and exec, almost no additional physical memory needs to be used.
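
The fork-then-exec pattern described above looks roughly like this. It is only a sketch: fork_then_exec is a name chosen for illustration, and the example is here to show where a denial would surface, not how any particular program implements it. If the kernel refuses to account for a copy of the parent's address space, it is the fork call that fails, before the child ever gets a chance to exec.

```python
import os
import sys


def fork_then_exec(argv: list[str]) -> int:
    """Run argv via plain fork() followed immediately by exec.

    Returns the child's exit status. fork() itself raises OSError if the
    kernel refuses to account for a copy of the parent's address space.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace the process image right away, so almost none of the
        # copy-on-write pages inherited from the parent are ever written.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

A call such as fork_then_exec([sys.executable, "-c", "pass"]) starts a trivial child. In a small parent this always works; the interesting case is a parent whose memory usage is a large fraction of the machine's, which is where the setting starts to matter.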

Of course, vfork() exists to prevent exactly these sorts of problems, and use of plain fork() is discouraged these days. Nevertheless, there exist programs in the wild that still use plain fork(). For example, one of the most common large programs that you can find, the JVM, still uses vanilla fork() in some error handling paths.

These sorts of scenarios don't happen too often, but there's no need for them to occur at all. If you've already gone all-in on memory overcommitting (which you probably have if you are using Linux), you may as well set vm.overcommit_memory to 1 (overcommit always) and allow large-but-unproblematic allocations to succeed. The rare large-and-problematic-allocations-that-would-have-failed-if-overcommit-was-set-to-0 will succeed too, of course, but they'll just end up causing the OOM Killer to run, and you're already committed to dealing with the OOM Killer.

Another way of putting it: the number of occasions when 0 and 1 behave differently is small, but the false-positive behavior of 0 is worse: incorrectly identifying a non-problematic allocation as 'seriously wild' and denying it introduces a new failure into the system. Allowing a problematically-large allocation to succeed just delays an already existing problem momentarily.

Takeaway

If you are unsure how to set vm.overcommit_memory on Linux, your best option is generally to set it to 1 (overcommit always). That would be a better default for most distros, too.
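
For completeness, a small sketch of changing the setting at runtime by writing to procfs. The helper name set_overcommit_mode is hypothetical. Writing to the real file requires root, and the change takes effect immediately but does not survive a reboot; to make it permanent, the usual route is a line reading vm.overcommit_memory = 1 in the system's sysctl configuration (for example /etc/sysctl.conf).

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")


def set_overcommit_mode(mode: int, path: Path = OVERCOMMIT_PATH) -> None:
    """Write a new vm.overcommit_memory value (needs root for the real file)."""
    if mode not in (0, 1, 2):
        raise ValueError(f"vm.overcommit_memory must be 0, 1 or 2, got {mode}")
    path.write_text(f"{mode}\n")
```

Run as root, set_overcommit_mode(1) switches the running system to overcommit always. The path parameter exists only so the function can be exercised against an ordinary file.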