Very interesting note about the reclaiming. Yet another thing to watch out for when using a NUMA system transparently.
NUMA can be a real pain. You can take a ~40% hit just accessing memory on a remote node, and far worse if you're modifying a cache line owned by another processor. On one of our VoIP workloads, we saw a major (250%+) performance improvement and much more stable CPU usage after splitting a very thread-heavy process into multiple processes, each pinned to a particular core.
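The comment doesn't say how the pinning was done (taskset or numactl would be typical); here's a minimal sketch of the equivalent from C with sched_setaffinity, where the core number argument is purely illustrative:

    /* Minimal sketch: pin the calling process to one CPU core.
     * Assumes Linux; needs _GNU_SOURCE for cpu_set_t and sched_setaffinity(). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int core = (argc > 1) ? atoi(argv[1]) : 0;   /* which core to pin to */

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);

        /* pid 0 = the current process */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pinned to core %d\n", core);
        /* ... exec or run the real workload here ... */
        return 0;
    }

Run one copy per core and, under the kernel's default first-touch policy, each process's memory tends to stay on its local NUMA node.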
OSes try to help you, but they seem mostly tuned for many ordinary processes, not a single huge process like a database. Such processes should become NUMA-aware and handle placement themselves for best performance.
It might even make sense to ask if you can split the machine on NUMA boundaries and just act like they're separate systems. RAM's getting very cheap, and RAM/core is going up faster than CPU power is (it seems to me, anyways).
Also, is there a reason not to use large pages directly for the mmap'd sets if you know you're going to have them hot at all times? (I assume they read the entire file on start?)
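For reference, explicit huge pages can't back an ordinary file-backed mmap; they only work for anonymous mappings (MAP_HUGETLB) or files on a hugetlbfs mount, so the mmap'd files would have to live on hugetlbfs (or be copied into anonymous memory) first. A minimal sketch of the anonymous case, assuming huge pages have already been reserved via vm.nr_hugepages:

    /* Minimal sketch: back an anonymous mapping with explicit huge pages via
     * MAP_HUGETLB. Fails cleanly if no huge pages have been reserved. The size
     * is illustrative and must be a multiple of the huge page size. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SZ (512UL * 1024 * 1024)

    int main(void)
    {
        void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* typically ENOMEM if nothing is reserved */
            return 1;
        }
        memset(p, 0, SZ);                  /* touch it so the huge pages are faulted in */
        munmap(p, SZ);
        return 0;
    }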
"after rolling out our optimizations, we saw our error rates (ie. the proportion of slow or timed out queries) drop by up to 400%"
There is some good shared knowledge in the post (unlike this comment, to be fair), but what does drop by 400% mean?
If a rate drops by 100% it becomes zero. I get that.
If it increases by 400%, the outcome is slightly ambiguous (do we add 400% of the original for a total of 500%, or do we multiply up to 400% of the original value?).
But a rate decreasing by 400% - am I the only person who finds that (not uncommon) expression hard to conceptualize?
In regard to conclusion 2, there is another approach here - when you're finished with an old segment, posix_fadvise(..., POSIX_FADV_DONTNEED) can be used to drop it from the page cache.
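A minimal sketch of that approach (the segment path is made up, and dirty pages need an fsync first or the kernel will keep them):

    /* Minimal sketch: once we're done with an old segment, ask the kernel to
     * drop its pages from the page cache so they stop competing with hot data.
     * The path is made up. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int drop_from_page_cache(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        /* offset 0, length 0 means "the whole file"; note that posix_fadvise()
         * returns the error number directly instead of setting errno. */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: error %d\n", rc);
        close(fd);
        return rc == 0 ? 0 : -1;
    }

    int main(void)
    {
        return drop_from_page_cache("/data/index/old-segment.0001") ? 1 : 0;
    }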
I was hit by transparent huge pages on RHEL 6.2 in my workload. If you find your ordinary processes randomly taking up huge amounts of CPU time -- system CPU time -- while doing apparently ordinary tasks, you might be affected too. It was a real pain to diagnose when you're used to trusting the kernel not to do anything that weird. Running "perf top" helped narrow down what the system was REALLY doing.
I didn't have LI-sized databases -- just a dozen Python processes, each allocating perhaps 300MB and all restarting at the same time, were enough to trigger it, taking 10 minutes rather than 2 seconds to start up.
According to LWN, this is probably going to be automatic in the future:
http://lwn.net/Articles/568870/ (subscriber-only now, will be free in a week)
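Until then, if disabling THP system-wide via the transparent_hugepage sysfs knob isn't an option, a per-mapping opt-out is one workaround; a minimal sketch, assuming the kernel exposes MADV_NOHUGEPAGE (the mapping size is illustrative):

    /* Minimal sketch: opt a single region out of THP with MADV_NOHUGEPAGE
     * instead of turning THP off for the whole machine. The #ifdef guards
     * against older kernel headers that lack the flag. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define SZ (300UL * 1024 * 1024)

    int main(void)
    {
        void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    #ifdef MADV_NOHUGEPAGE
        if (madvise(p, SZ, MADV_NOHUGEPAGE) != 0)
            perror("madvise(MADV_NOHUGEPAGE)");   /* non-fatal: region just stays THP-eligible */
    #endif
        /* ... use the region ... */
        munmap(p, SZ);
        return 0;
    }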
"we saw our error rates (ie. the proportion of slow or timed out queries) drop by up to 400%."
Should that be 80%?
Edited to add: Apparently it should be 75%, per comments elsewhere (e.g. a drop from 4% of queries to 1% is a 75% reduction, while the old rate is 400% of the new one, which is presumably where the 400% figure comes from).
Does the information in this article apply to VMs (specifically AWS) or is it only relevant when you're running directly on hardware?
> On small setting for Linux, one dramatic performance improvement for LinkedIn!
should be...
you know what it should be ;)
You can also optimize memory and CPU management through Linux control groups. Oracle published a pretty good description (see Example 1: NUMA Pinning) of how to assign dedicated CPUs and memory to a process or group of processes [1], and you can read about the supporting cpuset and memory cgroup subsystems too [2, 3]. A rough C sketch of the same idea follows the links below.
p.s. I recently created a screencast about control groups (cgroups) for anyone interested @ http://sysadmincasts.com/episodes/14-introduction-to-linux-c...
[1] http://www.oracle.com/technetwork/articles/servers-storage-a...
[2] https://www.kernel.org/doc/Documentation/cgroups/memory.txt
[3] https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt
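For illustration, here is that rough sketch of cpuset-style NUMA pinning driven from C rather than the shell. The mount point, group name, and CPU/node numbers are all assumptions (cgroup v1 layout; v2 is organised differently), and it needs root:

    /* Rough sketch of cpuset-based NUMA pinning (cgroup v1): confine the
     * current process to the CPUs and memory of one NUMA node. The mount
     * point, group name, CPU list and node number are assumptions. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s", val);
        return fclose(f);
    }

    int main(void)
    {
        char pid[32];

        /* Creating the directory creates the cgroup (ignore EEXIST on reruns). */
        mkdir("/sys/fs/cgroup/cpuset/node0_workers", 0755);

        /* CPUs 0-7 and memory node 0: tasks in this group stay NUMA-local. */
        write_str("/sys/fs/cgroup/cpuset/node0_workers/cpuset.cpus", "0-7");
        write_str("/sys/fs/cgroup/cpuset/node0_workers/cpuset.mems", "0");

        /* Move ourselves into the group, then exec the real workload so it
         * inherits the placement. */
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_str("/sys/fs/cgroup/cpuset/node0_workers/tasks", pid);

        /* ... execv() the workload here ... */
        return 0;
    }

In practice you would normally drive this with the libcgroup tools (cgcreate/cgexec) or numactl, but the file interface shows what those tools are doing under the hood.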