Friday, June 10, 2011

Make Linux faster

CPU affinity

On a system with multiple cores (one dual-core CPU, two single-core CPUs, whatever), Linux detects each core as a separate CPU. The kernel has a feature that lets you bind a process to a particular set of CPUs. This is called CPU affinity.
CPU affinity can be changed while the system is running using the taskset command. taskset takes two arguments: an affinity mask and the PID of the process. The mask is a bitmask with one bit per CPU; a bit set to 1 means the process is allowed to run on that CPU.
By default, the kernel implements soft affinity: it tries as much as possible to keep each process running on the same CPU. Hard affinity is when the user (root) explicitly forces this behavior.
The following command forces the process with PID 1 (init) to run only on the first CPU (mask 0x1, which is CPU 0 in the kernel's numbering):
taskset -p 1 1
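To build a mask covering several CPUs, just OR the bits together. A quick hedged example (PID 1234 is a placeholder):
# allow PID 1234 to run on the first two CPUs (mask 0x3 = binary 11)
taskset -p 0x3 1234
# with no mask, taskset just prints the current affinity of the process
taskset -p 1234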
One last note: children inherit the affinity of their parent. So if you want to keep the kernel and everything else on the first CPU and leave the second one to a particular process, just change the affinity of init to 0x1 early in the boot process and nothing else will ever be scheduled on the second CPU.
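A hedged sketch of that trick (the daemon name is made up):
# pin init (PID 1) to the first CPU; everything forked later inherits this
taskset -p 0x1 1
# then start the special process with the second CPU all to itself
taskset 0x2 ./mydaemon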
I never managed to get better performance out of this feature, but I think it is worth mentioning.

NUMA

NUMA stands for Non-Uniform Memory Access. Basically this means that some memory is 'closer' to one CPU than to another.
Starting with the Opteron, AMD CPUs have their memory controller on the chip. Therefore, in a dual-CPU system, some memory banks are managed by CPU 1 and some by CPU 2. If a CPU wants to access memory managed by the other one, it has to ask the other CPU for the data, which is slower.
NUMA support makes the kernel aware of this architecture and tries as much as possible to allocate memory from the DIMMs controlled by the CPU on which the process is running.
NUMA support is a kernel option. To get the best out of it, spread your RAM across as many memory banks as possible.
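If the numactl tool is installed (an assumption; it usually ships as a separate package), you can inspect the topology and place a process by hand:
# show the nodes, their CPUs and their memory as seen by the kernel
numactl --hardware
# run a command with its CPUs and its memory both pinned to node 0
numactl --cpunodebind=0 --membind=0 ./mydaemon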

Tools

Some tools you might want to check out:
vmstat, mpstat, top, ps aux, iostat
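Typical invocations (the 1-second interval is just an example):
vmstat 1          # memory, swap and run queue, refreshed every second
mpstat -P ALL 1   # per-CPU utilization
iostat -x 1       # extended per-device I/O statistics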

Network option: net.core.netdev_max_backlog

When you run netstat -su, there is a line for 'packet receive errors'. This roughly corresponds to buffer overflows/packet loss at the socket level. Increasing the net.core.[rw]mem_default and net.core.[rw]mem_max options can help, but in my test cases, increasing them wasn't enough.
What I did was increase net.core.netdev_max_backlog from 300 to 2500, and we stopped seeing those errors.
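For the record, this is how you change it at runtime (put the same setting in /etc/sysctl.conf to make it survive a reboot):
# check the current value, then raise it
sysctl net.core.netdev_max_backlog
sysctl -w net.core.netdev_max_backlog=2500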
Here is how this option comes into play.
For NAPI drivers:
Every time the kernel polls frames from the NIC, it has to be a good citizen and not hold the CPU for too long. Each poll grabs at most netdev_max_backlog frames.
For Non-NAPI drivers:
When the NIC gets a packet, it sends an interrupt to the kernel. The kernel disables interrupts, does the first half of the handling (copies the packet into its own memory* and schedules a softirq), re-enables interrupts and goes back to what it was doing. Some time later, the kernel handles the softirq and runs the bottom half (decapsulates the packet and hands the data to the socket). netdev_max_backlog is the maximum number of packets in the queue marked by *.
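To see whether this queue actually overflows, you can look at /proc/net/softnet_stat: one row per CPU, and (on the kernels I know of) the second column counts packets dropped because the backlog queue was full:
cat /proc/net/softnet_stat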

TCP options

Buffer sizes

These are the files you want to look at:
/proc/sys/net/core/rmem_default
/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/core/wmem_max
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem
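A quick way to see what your system currently uses (the tcp_* files hold three values each, explained below):
cat /proc/sys/net/ipv4/tcp_rmem
sysctl net.core.rmem_max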
Here is a little bit of info on these variables:
  • everything with rmem is for reads (i.e. the incoming buffers and such), everything with wmem is for writes (i.e. the outgoing buffers and such)
  • (r,w)mem_default is the default size of the (incoming, outgoing) buffer for each socket
  • (r,w)mem_max is the max size of the (incoming, outgoing) buffer for each socket (note: to go above the default, the program itself has to change the buffer size with the setsockopt syscall)
  • tcp_(r,w)mem contain 3 values
    1. minimum size of the buffer per socket
    2. default size of the buffer per socket (overrides (r,w)mem_default)
    3. max size of the buffer per socket (is overridden by (r,w)mem_max)
  • tcp_mem: these are global settings for the whole TCP stack. The unit is pages (4 KB on x86_64). It contains 3 values:
    1. low threshold: below this, normal behavior
    2. memory-pressure threshold: above this, TCP puts pressure on the sockets to consume their data faster
    3. max: above this, packets get dropped.
So here you go: depending on which value you want to change, this is where to find it.
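For example, to raise the read side (the numbers here are purely illustrative, not a recommendation):
# per-socket TCP read buffer: min, default and max, in bytes
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# cap for buffers requested through setsockopt
sysctl -w net.core.rmem_max=16777216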

http://philly.astroboy.fr/index.php?page=linux/performance