Wednesday, June 29, 2011

Uptime Calculator

One interesting approach to visualise the effect of multiple load-balanced hosts on uptime is to consider two identical physical hosts. Presume that they are in two separate data centres on diverse networks. Presume further that each host can achieve 99.0 percent uptime and that the outages are random. Then, all else being equal, for the 1.0 percent of the time that host A is down, host B will be up for 99.0 percent of those same times. Therefore, the probability that both hosts are down simultaneously is:

1.0 percent * 1.0 percent = 0.01 percent

or

0.01 * 0.01 = 0.0001

and, more rigorously

0.01 - (0.01 * 0.99) = 0.0001

Conversely, at least one physical host is available for:

100 percent - 0.01 percent = 99.99 percent

when two hosts with 99.0 percent uptime are deployed.
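The same arithmetic is easy to check from a shell; the following one-liner is a minimal sketch that reproduces the numbers above for two hosts with 99.0 percent availability each:

awk 'BEGIN { a = 0.99; b = 0.99; printf "combined uptime: %.4f%%\n", (1 - (1 - a) * (1 - b)) * 100 }'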

-----

Instead of calculating your uptime, it's a bit easier to calculate the downtime.

The downtime calculation is something like this:-

D = Total downtime (in minutes)
T = Total time in the month (in minutes)

Downtime (in %) = D/T X 100

Example:

A site which has only 2 hours of downtime in a 30-day month will have a downtime % like the following :-

D = 2 X 60 minutes = 120 minutes
T = 30 days X 24 hours X 60 minutes = 43200 minutes

Downtime (in %)
= [120/43200] X 100
= 0.2777 %

Uptime (in %)
= 100 - 0.2777
= 99.7222 %

In other words, you have 99.722% uptime throughout the month.
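The same calculation as a shell one-liner, in case you want to plug in your own outage figures (a sketch using the 2-hour example above):

awk 'BEGIN { D = 2 * 60; T = 30 * 24 * 60; d = D / T * 100; printf "downtime: %.4f%%  uptime: %.4f%%\n", d, 100 - d }'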

http://edgedirector.com/htm/9999.htm
http://tycoontalk.freelancer.com/web-hosting-forum/34463-uptime-calculation.html
http://easyuptimecalc.com/index.php
http://squobble.blogspot.com/2008/02/hosting-uptimedowntime-calculation.html

Saturday, June 25, 2011

How to disable NetBIOS over TCP/IP on Windows-Clients via dhcp:

If anyone out there is trying to get rid of NetBIOS to reduce broadcast traffic and optimize their network, here are the options for ISC dhcpd.conf to quickly perform this operation on a large number of computers:

# to save the vendor id in the lease db:
set vendor-id = option vendor-class-identifier;

# specifying the option space name:
option space MSFT;
option MSFT.nbt code 1 = unsigned integer 32;

# subnet declaration:
subnet 192.168.100.0 netmask 255.255.255.0 {
    range 192.168.100.10 192.168.100.100;
    if substring ( option vendor-class-identifier, 0, 8 ) = "MSFT 5.0"
    {
        vendor-option-space MSFT;
        # 1 = enable, 2 = disable NetBIOS over TCP/IP:
        option MSFT.nbt 2;
    }
}

# single host with fixed IP:
host max {
    hardware ethernet 00:90:f5:05:3b:b4;
    fixed-address 192.168.100.101;
    if substring ( option vendor-class-identifier, 0, 8 ) = "MSFT 5.0"
    {
        vendor-option-space MSFT;
        # 1 = enable, 2 = disable NetBIOS over TCP/IP:
        option MSFT.nbt 2;
    }
}
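Before rolling this out, it is worth validating the configuration and then restarting the daemon. A minimal sketch, assuming the config lives at /etc/dhcp/dhcpd.conf and the usual init script name:

dhcpd -t -cf /etc/dhcp/dhcpd.conf    # syntax check only, does not touch the running daemon
/etc/init.d/dhcpd restart            # reload with the new vendor option space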



Result from "ipconfig /all" on client:

netbios-disabled

Download the full dhcpd.conf:

http://www.bakarasse.de/pages/en/linux/disable-netbios-via-dhcp.php?lang=EN

Saturday, June 11, 2011

Script of the Week: Tuning TCP Packets

In order to gain some practice building scripts from the ground up, this exercise provides basic resources to create a script that will monitor and tune the TCP stack on the server.
1. Background for Tuning TCP Sockets
The excellent book “Performance Tuning for Linux Servers”, published by IBM, documents one of the major issues with many Linux servers: the TCP settings used for networking have not been optimized in the Linux kernel.  As a result, networking performance on high-usage servers fails to provide the needed access to the server.  Tuning these TCP settings can yield significant increases in speed.  Be sure to test before and after to verify they are doing what you expect.
The tcp_max_syn_backlog parameter sets the number of TCP SYN packets (half-open connections) that the server will queue before new ones are dropped.  Here you can see the default:
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
1024

The recommended increase is to 30,000.
echo 30000 > /proc/sys/net/ipv4/tcp_max_syn_backlog
With a web server you will see a lot of TCP connections in the TIME_WAIT state.  TIME_WAIT is when a socket waits after close to handle packets still in the network.  The number of TIME_WAIT buckets should also be increased.  Here is the default:
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
180000

This should be updated to 2 million.
echo 2000000 > /proc/sys/net/ipv4/tcp_max_tw_buckets
The number of packets that can be queued on a network device's input side (netdev_max_backlog) defaults to 1000:
cat /proc/sys/net/core/netdev_max_backlog
1000

This should be increased to 50,000.
echo 50000 > /proc/sys/net/core/netdev_max_backlog
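The same three settings can also be applied through the sysctl interface, which is a convenient way to double-check them (a sketch using the values recommended above):

sysctl -w net.ipv4.tcp_max_syn_backlog=30000
sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
sysctl -w net.core.netdev_max_backlog=50000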
2. Create a script that will implement these changes to the TCP settings.  It should have these features:
* header with license information
* require root to be able to run script
* check command paths, if incorrect die
* send an email to your email address on completion
* automatically run the script at boot time
3. Debug the script
sh -x script_name

#!/bin/bash
#####################################################################
# The purpose of this script is to optimize TCP settings
#####################################################################
# Variables
ADMIN="admin_email"
CONLOG=/tmp/connections
# Feature Options
SENDMAIL=1
# Command Paths
MAIL=/bin/mail
ID=/usr/bin/id
# Make sure the script runs with the EUID of root
if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root" 1>&2
    exit 1
fi
# Print a message and quit if the script has problems
# (exit codes are limited to 0-255, so use 1 rather than 999)
die(){
    echo "$@" 1>&2
    exit 1
}
# Verify command paths and truncate the log
init(){
    [ ! -x $MAIL ] && die "$MAIL command not found."
    [ ! -x $ID ] && die "$ID command not found."
    >$CONLOG
}
init
echo 30000 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo "tcp_max_syn_backlog increased to 30,000" > $CONLOG
echo 2000000 > /proc/sys/net/ipv4/tcp_max_tw_buckets
echo "tcp_max_tw_buckets increased to 2 million" >> $CONLOG
echo 50000 > /proc/sys/net/core/netdev_max_backlog
echo "netdev_max_backlog increased to 50,000" >> $CONLOG
if [ $SENDMAIL -eq 1 ]; then
    $MAIL -s "TCP PACKET MANAGEMENT @ $(hostname)" "$ADMIN" < $CONLOG
fi
rm -f $CONLOG
exit 0
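One way to satisfy the "run at boot time" requirement from step 2 is to call the script from rc.local (a sketch; the path /usr/local/sbin/tcp_tune.sh is an assumption, adjust it to wherever you keep the script):

chmod +x /usr/local/sbin/tcp_tune.sh
echo "/usr/local/sbin/tcp_tune.sh" >> /etc/rc.local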


http://bashshell.net/script-of-the-week/script-of-the-week-tuning-tcp-packets/

Network Tool

Httcp


Flowgrind: TCP Performance Measurement Tool

Flowgrind is a tool similar to iperf, netperf to measure throughput and other metrics for TCP.

Features:

distributed architecture:
Flowgrind is split into two components: the flowgrind daemon and the flowgrind controller. Using the controller, flows between any two systems running the flowgrind daemon can be set up (third-party tests). At regular intervals during the test the controller collects and displays the measured results from the daemons. It can run multiple flows at once with the same or different settings and individually schedule every one. Test and control connections can optionally be diverted to different interfaces.

Advanced metrics
besides goodput, it measures application-layer inter-arrival time (IAT) and 2-way (RTT) delay, block count and network transactions per second

Stochastic Traffic Generation
bulk transfers, rate-limited flows, sophisticated request/response tests

Automatic Dump support
can use libpcap to automatically dump traffic for qualitative analysis

Special Linux TCP support
shows tcp_info stats (RTT, RTO, CWND, SSTHRESH, congestion control state, etc.), can set TCP options like TCP_NODELAY, set buffer sizes on a per-flow basis, set the congestion control algorithm, the DSCP field, etc.

Singlethreaded flow handling
handles all flows inside one thread to improve fairness between flows

Gnuplot-compatible, configurable output format

bwctl

NAME
bwctl - Client application to request throughput tests.
SYNOPSIS
bwctl [options] -c recvhost -s sendhost
bwctl [options] -c recvhost
bwctl [options] -s sendhost
DESCRIPTION
bwctl is a command line client application that is used to initiate throughput tests.

This version of bwctl is capable of initiating Iperf, Nuttcp, and Thrulay tests.

bwctl works by contacting a bwctld daemon on both the receiving host and the sending host. bwctld manages and schedules the resources of the host it runs on. In the case where only one of the receiving host or sending host is specified, bwctl assumes the local host is the other endpoint. bwctl will attempt to contact a local bwctld if it can. If there is no local bwctld running, bwctl assumes the local host does not require policy controls and will execute the bwctld functionality required to run the test directly.

In cases where bwctl is directly running the throughput test on the host, there are several configuration options that are shared with bwctld. Those configuration options can be set using the bwctlrc(5) configuration file in a way very similar to the way they are specified in the bwctld.conf(5) file.

The bwctl client is used to request the desired type of throughput test. Furthermore, it requests when the test is wanted. bwctld on each endpoint either responds with a tentative reservation or a test denied message. Once bwctl is able to get a matching reservation from both bwctld processes (one for each host involved in the test), it confirms the reservation. Then, the bwctld processes run the test and return the results. The results are returned to the client from both sides of the test from the respective bwctld processes. Additionally, the bwctld processes share the results from their respective side of the test with each other.

BWCTL (bwctl and bwctld) is used to enable non-specific throughput tests to hosts without having to give full user accounts on the given systems. Users want the ability to run throughput tests to determine the achievable or available bandwidth between a pair of hosts. It is often useful to test to multiple points along a network path to determine the network characteristics along that path. Typically, users who want to do this path decomposition have to directly contact the network/system administrators who control the hosts along the path. The administrator needs to either run half of the test for the user or give them a user account on the host. Also, network paths of interest are typically controlled by multiple administrators. These hurdles have made this kind of testing difficult in practice.

BWCTL was designed to help with this problem. It allows an administrator to configure a given host as an Iperf, Thrulay, or Nuttcp endpoint. The endpoint can be a packet sender (e.g. Iperf client) or a packet receiver (e.g. Iperf server). It can be shared by multiple users without concern that those users will interfere with each other. Specific policy limits can be applied to specific users, and individual tests are scheduled so they will not interfere with each other. Additionally, full user accounts are not required for the users running the tests.

BWCTL allows the administrator to classify incoming connections based upon a user name and AES key combination or, alternatively, based upon an IP/netmask. Once the connection is classified, the bwctld can determine the exact type and intensities of throughput tests that will be allowed. More information on the policy controls can be found in the bwctld(8) man page.

BWCTL makes use of a distributed scheduling algorithm. Each host maintains a schedule independently. As a client requests a test, the two endpoints are contacted and each bwctld server responds with the first available open schedule slot. This enables on-demand tests to co-exist with regularly scheduled tests since regularly scheduled tests are implemented by having the client request tests on regular intervals. Different priorities can be implemented using the event_horizon configuration directive to bwctld. (By allowing clients that implement regularly scheduled tests to reserve their time slots further into the future.)
ARGUMENTS

Connection/Authentication Arguments:

-A authmethod
authmethod is used to specify the authentication method the bwctl client is willing to use for communication with the bwctld on the sendhost and recvhost. The authentication options of bwctl are intended to be extensible. The communication from the bwctl client to each bwctld server may take different options for different types of authentication. If the authmethod option is specified for either the -s, or the -c argument, it overrides the authmethod specified with the -A option for communication with that particular host. (Therefore, the -A argument is really only useful if the same authentication can be used with both hosts.)

Allowing different authentication methods for each connection should allow a client to use different authentication methods with different servers which should in turn allow cross-domain tests to occur more easily.

The format for authmethod is:

authmode [authscheme schemeopts]

authmode
Specifies the authentication mode the client is willing to speak with a server. It must be set as a character string with any or all of the characters "AEO". The modes are:

A
[A]uthenticated. This mode encrypts the control connection.
E
[E]ncrypted. This mode encrypts the control connection. If the test supports encryption, this mode will additionally encrypt the test stream. (Encryption of the test stream is not currently supported, so this mode is currently identical to authenticated.)
O
[O]pen. No encryption of any kind is done.

The client can specify all the modes with which it is willing to communicate. The most strict mode that both the server and the client are willing to use will be selected.

Default:
"AEO"

authscheme schemeopts
authscheme indicates the authentication scheme that should be used to achieve the authenticated or encrypted modes. schemeopts are a list of arguments specific to each particular authentication scheme. Supported authscheme values follow (listed with the schemeopts each scheme requires):

AESKEY userid [keyfile]
This is the initial "simple" shared secret (AES key) model. userid is required to identify which shared secret the server and client should use. keyfile optionally specifies a file to retrieve the AES key from. If keyfile is not specified, the user will be prompted for a passphrase. keyfile can be generated using the aespasswd(1) application.
Default:
Unauthenticated

authscheme and schemeopts are only needed if authenticated communication (A or E modes of authmode) is wanted with sendhost and recvhost.

-B srcaddr
Bind the local address of the client socket to srcaddr. srcaddr can be specified using a DNS name or using standard textual notations for the IP addresses.

Default:
Unspecified (wild-card address selection).

-c recvhost[:port] [authmethod]
Specifies the host that will run the Iperf, Thrulay or Nuttcp server. The :port suffix is optional and is only needed if bwctld is being run on a non-default port number. If an IPv6 address is being specified, note that the accepted format contains the recvhost portion of the specification in square brackets as: [fe80::fe9f:62d8]:4823. This ensures the port number is distinct from the address specification, and is not needed if the :port suffix is not being used.

At least one of the -c or -s options must be specified. If one of them is not specified, it is assumed to be the local host.

authmethod is a specifically ordered list of keywords that is only needed if authenticated communication is wanted with recvhost. These keywords are used to describe the type of communication and authentication that should be used to contact the recvhost. If recvhost and sendhost share the same authentication methods and identities, it is possible to specify the authmethod for both recvhost and sendhost using the -A argument. An authmethod specified with the -c option will override an authmethod specified with the -A argument for communication with the recvhost.

The format for authmethod and a description of the currently available authentication methods are described with the -A argument.

-k

This option has been deprecated. Originally, it was used to specify the keyfile for authentication. All authentication options can now be specified using the -A argument. For the next several versions this option will report an error. Eventually, it may be reclaimed for another purpose.
-s sendhost[:port] [authmethod]
Specifies the host that will run the Iperf, Thrulay or Nuttcp client. The :port suffix is optional and is only needed if bwctld is being run on a non-default port number. If an IPv6 address is being specified, note that the accepted format contains the sendhost portion of the specification in square brackets as: [fe80::fe9f:62d8]:4823. This ensures the port number is distinct from the address specification, and is not needed if the :port suffix is not being used.

At least one of the -c or -s options must be specified. If one of them is not specified, it is assumed to be the local host.

authmethod is a specifically ordered list of keywords that is only needed if authenticated communication is wanted with sendhost. These keywords are used to describe the type of communication and authentication that should be used to contact the sendhost. If recvhost and sendhost share the same authentication methods and identities, it is possible to specify the authmethod for both recvhost and sendhost using the -A argument. An authmethod specified with the -s option will override an authmethod specified with the -A argument for communication with the sendhost.

The format for authmethod and a description of the currently available authentication methods are described with the -A argument.

-U

This option has been deprecated. Originally, it was used to specify the username to identify the AES key for authentication. All authentication options can now be specified using the -A argument. For the next several versions this option will report an error. Eventually, it may be reclaimed for another purpose.


Throughput Test Arguments:
The arguments were named to match their counterparts in Iperf as closely as possible.

Some of the options are not available for some of the throughput testers. BWCTL does not support UDP tests, changing the output format or changing the output units for either Nuttcp or Thrulay.

-T
Specify which throughput tester to use:

iperf
thrulay
nuttcp
Default:
None. Selects a tool that the client and server have in common

-b bandwidth
Limit UDP send rate to bandwidth (bits/sec).

Default:
1 Mb

-i interval
Report interval (seconds).

Default:
unset (no intervals reported)

-l len
length of read/write buffers (bytes).

Default:
8 KB TCP, 1470 bytes UDP

-P nStreams
Number of concurrent streams for the test. See the -P option of Iperf for details.
-S TOS
Set the TOS (See RFC 1349) byte in packets.

Default:
0 (not set)

-t time
Duration of test (seconds).

Default:
10

-u

UDP test.

Default:
TCP test

-W window
Same as the -w option, except that the value is advisory. bwctl will attempt to dynamically determine the appropriate TCP window, based upon RTT information gathered from the control socket. If bwctl is unable to dynamically determine a window, the value window will be used.

Default:
Unset (system defaults)

-w window
Socket buffer sizes (bytes). For TCP, this sets the TCP window size. For UDP, this sets the socket receive buffer size.

Default:
Unset (system defaults)


Scheduling Arguments:

-a syncfuzz

Allow bwctl to run without a synchronized system clock. Use this to specify how far off the local clock is from UTC. bwctl prefers to have an NTP synchronized system clock to ensure the two endpoints of the test are actually agreeing to the same scheduled time window for test execution.

If two systems do NOT have a close enough notion of time, then the throughput test will eventually fail because one endpoint of the test will attempt to run at a different time than the other.

If the operating system supports the NTP system calls, and the system clock is determined to be unsynchronized, error messages will still be reported depending upon the value of the -e flag.

When calculating the time errors, this value will be added in to account for the difference. The maximum time offset can be bounded on the server side, using the max_time_error directive, to prevent a denial of service attack. If set, the server will reject any requests to test with a peer that has too high a timestamp error.

Default:
Unset (Defaults to Set for systems without the NTP system calls)

-I interval
Specifies that bwctl should attempt to run a throughput test every interval seconds.

Default:
Unset. If it is unset, bwctl only runs the test once.

-L longest
Specifies the longest amount of time the client is willing to wait for a reservation window. When bwctl requests a test from the bwctld server, it specifies the earliest time and the latest time it is willing to accept. The latest time is determined by adding this longest option to the earliest time. The earliest time is essentially 'now'. The longest time is specified as a number of seconds.

Default:
If interval is set, the default is 50% of interval. Otherwise, the default is twice the test duration time but no smaller than 5 minutes. (See -t.)

-n nIntervals
Number of tests to perform if the -I option is set.

Default:
Continuous

-R alpha
Randomize the start time of the test within this alpha percent of the interval. Valid values for alpha are from 0-50. bwctl will attempt to run the test every interval +/- alpha percent. For example, if the interval is 300 seconds and alpha is set to 10 percent, then bwctl will attempt to run a test every 270-330 seconds. This option is only useful with the -I option.

Default:
0 (no randomness)


Output Arguments:

-d dir
Specifies directory for results files if the -p option is set.
-e facility
Syslog facility to log messages to.

Default:
LOG_USER

-f units
Specify the units for the tool to use when displaying the results. The accepted values for units are tool specific.

Iperf:

k
Kilobits per second

K
Kilobytes per second

m
Megabits per second

M
Megabytes per second

-h

Print a help message.
-p

Place test results in files. Print the filenames to stdout when results are complete.
-q

Quiet output. Output as little as possible.
-r

Send syslog messages to stderr. This is the default unless the -q option is specified so this option is only useful with the -q option.
-V

Print version information and exit.
-v

Verbose output. Specifying additional -v's increases the verbosity.
-x

Output sender (client) results as well as receiver results. By default, sender results are not output. If the -p option is specified, the sender results are placed in an additional file.
-y format
Specify the output format of the tool. The accepted values for format are tool specific.

Iperf:

c
[c]omma-separated output


ENVIRONMENT VARIABLES
Environment Variable     Use           Default
BWCTLRC                  Config file   ~/.bwctlrc
BWCTL_DEBUG_TIMEOFFSET   Offset        0.0 (seconds)

LIMITATIONS
Only tested with versions 1.7.0 and 2.0.b of Iperf.
EXAMPLES

bwctl -c somehost.example.com

Run a default 10 second TCP test as soon as possible with local as the sender and somehost.example.com as the receiver, using whichever tools they have in common. Return the results from the receive side of the test.

bwctl -x -c somehost.example.com

Like the previous test, but also return the results from the sender side of the test.

bwctl -x -c somehost.example.com -s otherhost.example.com

Like the previous test, but with otherhost.example.com as the sender instead of local.

bwctl -t 30 -T iperf -s somehost.example.com

Run a 30 second TCP Iperf test with somehost.example.com as the sender and local as the receiver.

bwctl -I 3600 -R 10 -t 10 -u -b 10m -s somehost.example.com

Run a 10 second UDP test about every hour (3600 +/- 360 seconds) with the sender rate limited to 10 Mbits per second from somehost.example.com to local.

bwctl -s somehost.example.com AE AESKEY someuser

Run the default 10 second TCP test. Authenticate using the identity someuser. bwctl will prompt for a passphrase that will be used to create an AES key.


SEE ALSO
bwctld(8) and the http://e2epi.internet2.edu/bwctl/ web site.

For details on Iperf, see the http://sourceforge.net/projects/iperf web site.

For details on Nuttcp, see the http://www.wcisd.hpc.mil/nuttcp/Nuttcp-HOWTO.html web site.

For details on Thrulay, see the http://e2epi.internet2.edu/thrulay/ web site.
ACKNOWLEDGMENTS
This material is based in part on work supported by the National Science Foundation (NSF) under Grant No. ANI-0314723. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.



https://launchpad.net/flowgrind
http://www.internet2.edu/performance/bwctl/bwctl.man.html


The perfSONAR-PS Status Collector and Service allows networks to monitor network elements and make their operational and administrative status information available. The perfSONAR-PS Status Collector can collect status information via a number of methods. It can be configured to run a script, allowing it to query devices that the service doesn't natively support, or to consult an existing database of status information. The collector can also be configured to obtain status information directly from the switches and routers.

http://psps.perfsonar.net/status/index.html

Reduce transmit queue length

As enough pieces have fallen into place to make actual predictions, (and the quarry’s spoor noticed), I decided to perform more deliberate experiments to see if I could capture more henchmen of the mastermind. I did.
I’ll start to provide puzzle pieces I’ve discovered here regularly, so you can help with the conviction of the criminal and repair of the damage they have caused. Today’s simple experiments only involve your router’s switch. We’ll do some experiments on the wireless side of the router next.
Conclusion: something stinks in operating systems' network stacks. Linux is often the worst, with two different but related problems, followed by Mac OS X; Microsoft Windows manages to obfuscate much of its problems, but also demonstrably suffers. After mitigation, Linux may be able to perform much better than either.

Experiment Setup

If your home router has a gigabit switch (a few do, these days), you’ll want to find a 100 meg switch to perform this experiment with. You may be able to achieve the same effect using “ethtool” and setting your ethernet link speed to 100Mbps. I presume your machines all have gigabit network interfaces; most have for a while.
Hook up your laptop directly to the switch’s ethernet port. Hook a second machine up to a second port to act as your server. In case one or the other of your computers is wimpy, let’s use nttcp for our testing. The point here is to transfer data over the link as fast as you can.
Install “nttcp“. Run “nttcp -i” on the machine you designate as your server.

Experiment 1a:

Run “nttcp -t -D -n2048000 server & ping -n server” on your laptop.
What do you observe after, say, 20 seconds? Is this what you would expect, given that a packet of 1500 bytes takes only .13 milliseconds to pass through a 100Mbps switch?

Experiment 1b:

Issue the command “ifconfig eth0“; look at the txqueuelen value. On my laptop, it is set by Linux to 1000.
Set the txqueuelen parameter to half of its initial size (e.g. “ifconfig eth0 txqueuelen 500“). What happens to the observed latency?
What do you observe? How does it differ from Experiment 1a?
Try playing with different values of txqueuelen while continuing to observe the ping latency. Is the latency constant, or variable, as you manipulate the txqueuelen parameter? On most current hardware, you can set the txqueuelen to zero; on some older hardware, you may have problems if you do so.
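If you want to automate that sweep, a small helper loop can step through values while you watch the ping latency (a minimal sketch; eth0, the host name “server” and the value list are assumptions, and changing txqueuelen requires root):

#!/bin/bash
# Sweep txqueuelen and record ping latency at each setting
for q in 1000 500 250 100 50 0; do
    ifconfig eth0 txqueuelen $q
    echo "txqueuelen=$q"
    ping -c 10 -n server | tail -2
done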

Experiment 1c:

Install the command “ethtool” if you don’t have it installed.
Set the txqueuelen to the minimum operating value (0 on my laptop) for this experiment.
Execute the command “ethtool -g” and note the current hardware settings for your ethernet interface. Note that not all device drivers support this interface. On my laptop, the ring size is 256 by default.
Run “nttcp -t -D -n2048000 server & ping -n server” on your laptop. What do you observe? Why?
Try playing with different values for the ring parameters (e.g. “ethtool -G eth0 tx 64”), and observe the ping latency. Your hardware will probably have some minimum ring size limit that you cannot go below. On my laptop, this is 64 entries.
Is the latency constant, or variable? Why?

Experiment 1d:

Note that you can perform similar experiments on Mac OSX and Windows, both of which behave much better than Linux “out of the box” (though Linux is better than OSX once the transmit queue is truncated). Note that the details of the hardware matter here: you should use the same hardware, or hardware using the same ethernet chip if possible.
For extra credit, explain why Windows default behavior is so much better than either Linux or OSX on 100Mbps Ethernet. (Hint: try setting the transmit speed on the Windows machine to 10Mbps; and search Windows technical notes about multimedia playing). Do you now believe Microsoft’s explanations? Or is there a different explanation given these experiments that makes more sense?

http://www.bufferbloat.net/projects/bloat/wiki/Linux_Tips

Friday, June 10, 2011

How to Optimize your Internet Connection using MTU and RWIN

The Maximum Transmission Unit (MTU) is the maximum size of a single IP packet that can pass through a TCP/IP network without being fragmented.


An easy way to figure out what your MTU should be is to use ping where you specify the payload size:

ping -s 1464 -c1 google.com

Note though that the total IP packet size will be 1464+28=1492 bytes, since there are 28 bytes of header info (20 bytes of IP header plus 8 bytes of ICMP header). Thus if the packet gets fragmented for payloads above 1464, then you should set your MTU=1492. Ping will let you know when it becomes fragmented with something like the following:

ping -s 1464 -c1 google.com

PING google.com (72.14.207.99) 1464(1492) bytes of data.
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=1 ttl=237 (truncated)

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 118.672/118.672/118.672/0.000 ms
john@TECH5321:~$ ping -s 1465 -c1 google.com
PING google.com (64.233.167.99) 1465(1493) bytes of data.
From adsl-75-18-118-221.dsl.sndg02.sbcglobal.net (75.18.118.221) icmp_seq=1 Frag needed and DF set (mtu = 1492)

--- google.com ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

In other words, to find your correct MTU, you would first start with a small packet size, and then gradually increase it until you see fragmentation; the cutoff point will be what to use for your MTU (using the formula payload + 28 = MTU). Note in the first case shown above where the payload size is 1464, the packet was transmitted fine, but in the second case where the payload size is 1465, ping complains "Frag needed"; to clarify, that means any packet with a payload of 1464 or less will be sent just fine, but a payload size of 1465 or above will end up being fragmented. Therefore, 1464 is the maximum payload, and that means the MTU is 1464+28=1492.
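That trial-and-error search is easy to script; here is a minimal sketch (it assumes the iputils ping, whose -M do flag forbids fragmentation, and uses google.com as the target):

#!/bin/bash
# Walk the payload size down until a ping goes through unfragmented
host=google.com
for payload in $(seq 1472 -1 1400); do
    if ping -M do -s $payload -c 1 $host > /dev/null 2>&1; then
        echo "Largest unfragmented payload: $payload (MTU = $((payload + 28)))"
        break
    fi
done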

To set the MTU temporarily (will be lost after a reboot), you can do:

sudo ifconfig <interface> mtu 1492

Note that unfortunately some NICs do not allow you to change their MTU. You can use "ifconfig" by itself to see what the MTU is for your NIC and whether the MTU changes when you use the above command.
Or to make the change permanent, you can add it to /etc/network/interfaces:

gksudo gedit /etc/network/interfaces

And then add "mtu <value>" in it for the particular interface. Here's an example of mine that uses my wireless interface wlan0:

iface wlan0 inet static
address 192.168.1.23
netmask 255.255.255.0
gateway 192.168.1.1
wireless-essid John's Home WLAN
mtu 1492

TCP Receive Window (RWIN)


In computer networking, RWIN (TCP Receive Window) is the maximum amount of data that a computer will accept before acknowledging the sender. In practical terms, that means when you download say a 20 MB file, the remote server does not just send you the 20 MB continuously after you request it. When your computer sends the request for the file, your computer tells the remote server what your RWIN value is; the remote server then starts streaming data at you until it reaches your RWIN value, and then the server waits until your computer acknowledges that you received that data OK. Once your computer sends the acknowledgement, then the server continues to send more data in chunks of your RWIN value, each time waiting for your acknowledgment before proceeding to send more.

Now the crux of the problem here is with what is called latency, or the amount of time that it takes to send and receive packets from the remote server. Note that latency will depend not only on how fast the connection is between you and the remote server, but it also includes all additional delays, such as the time that it takes for the server to process your request and respond. You can easily find out the latency between you and the remote server with the ping command. When you use ping, the time that ping reports is the round-trip time (RTT), or latency, between you and the remote server.

When I ping google.com, I typically get a latency of 100 msec. Now if there were no concept of RWIN, and thus my computer had to acknowledge every single packet sent between me and google, then transfer speed between me and them would be simply the (packet size)/RTT. Thus for a maximum sized packet (my MTU as we learned above), my transfer speed would be:

1492 bytes/.1 sec = 14,920 B/sec or 14.57 KiB/sec

That is pathetically slow considering that my connection is 3 Mb/sec, which is the same as 366 KiB/sec; so I would be using only about 4% of my available bandwidth. Therefore, we use the concept of RWIN so that a remote server can stream data to me without having to acknowledge every single packet and slow everything down to a crawl.

Note that the TCP receive window (RWIN) is independent of the MTU setting. RWIN is determined by the BDP (Bandwidth Delay Product) for your internet connection, and BDP can be calculated as:

BDP = max bandwidth of your internet connection (Bytes/second) * RTT (seconds)

Therefore RWIN does not depend on the TCP packet size, and TCP packet size is of course limited by the MTU (Maximum Transmission Unit).

Before we change RWIN, use the following command to get the kernel variables related to RWIN:

sysctl -a 2> /dev/null | grep -iE "_mem |_rmem|_wmem"

Note the space after the _mem is deliberate, don't remove it or add other spaces elsewhere between the quotes.

You should get the following three variables:

net.ipv4.tcp_rmem = 4096 87380 2584576
net.ipv4.tcp_wmem = 4096 16384 2584576
net.ipv4.tcp_mem = 258576 258576 258576

For tcp_rmem and tcp_wmem, the numbers are in bytes and represent the minimum, default, and maximum buffer sizes. For tcp_mem, the numbers are in memory pages and represent the low, pressure, and high thresholds for the TCP stack as a whole.

net.ipv4.tcp_rmem = Receive window memory vector
net.ipv4.tcp_wmem = Send window memory vector
net.ipv4.tcp_mem = TCP stack memory vector

Note that there is no exact equivalent variable in Linux that corresponds to RWIN, the closest is the net.ipv4.tcp_rmem variable. The variables above control the actual memory usage (not just the TCP window size) and include memory used by the socket data structures as well as memory wasted by short packets in large buffers. The maximum values have to be larger than the BDP (Bandwidth Delay Product) of the path by some suitable overhead.

To try and optimize RWIN, first use ping to send the maximum size packet your connection allows (MTU) to some distant server. Since my MTU is 1492, the ping command payload would be 1492-28=1464. Thus:

ping -s 1464 -c5 google.com

PING google.com (64.233.167.99) 1464(1492) bytes of data.
64 bytes from py-in-f99.google.com (64.233.167.99): icmp_seq=1 ttl=237 (truncated)
64 bytes from py-in-f99.google.com (64.233.167.99): icmp_seq=2 ttl=237 (truncated)
64 bytes from py-in-f99.google.com (64.233.167.99): icmp_seq=3 ttl=237 (truncated)
64 bytes from py-in-f99.google.com (64.233.167.99): icmp_seq=4 ttl=237 (truncated)
64 bytes from py-in-f99.google.com (64.233.167.99): icmp_seq=5 ttl=237 (truncated)

--- google.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 101.411/102.699/105.723/1.637 ms

Note though that you should run the above test several times at different times during the day, and also try pinging other destinations. You'll see RTT might vary quite a bit.

But for the above example, the RTT average is about 103 msec. Now since the maximum speed of my internet connection is 3 Mbits/sec, then the BDP is:

(3,000,000 bits/sec) * (.103 sec) * (1 byte/8 bits) = 38,625 bytes

Thus I should set the default value in net.ipv4.tcp_rmem to about 39,000. For my internet connection, I've seen RTT as bad as 500 msec, which would lead to a BDP of 187,000 bytes. Therefore, I could set the max value in net.ipv4.tcp_rmem to about 187,000. The values in net.ipv4.tcp_wmem should be the same as net.ipv4.tcp_rmem, since both sending and receiving use the same internet connection. And since net.ipv4.tcp_mem is the maximum total memory buffer for TCP transactions, it is usually set to the max value used in net.ipv4.tcp_rmem and net.ipv4.tcp_wmem.
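The arithmetic above is easy to wrap in a one-liner so you can recompute it for other link speeds and RTTs (a sketch; the 3 Mbit/s and 103 ms figures are just the values from this example):

awk -v bw=3000000 -v rtt=0.103 'BEGIN { printf "BDP = %d bytes\n", bw * rtt / 8 }'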

And lastly, there are two more kernel TCP variables related to RWIN that you should set:

sysctl -a 2> /dev/null | grep -iE "rcvbuf|save"

which returns:

net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1

Note that enabling net.ipv4.tcp_moderate_rcvbuf (setting it to 1) has Linux tune the TCP receive buffer dynamically within the bounds given in net.ipv4.tcp_rmem. And enabling net.ipv4.tcp_no_metrics_save removes an odd behavior in the 2.6 kernels, whereby the kernel caches connection metrics (such as the slow start threshold) between TCP sessions. That caching can cause undesired results, as a single period of congestion can affect many subsequent connections.

Before you change any of the above variables, try going to http://www.speedtest.net or a similar website and check the speed of your connection. Then temporarily change the variables by using the following command with your own computed values:

sudo sysctl -w net.ipv4.tcp_rmem="4096 39000 187000" net.ipv4.tcp_wmem="4096 39000 187000" net.ipv4.tcp_mem="187000 187000 187000" net.ipv4.tcp_no_metrics_save=1 net.ipv4.tcp_moderate_rcvbuf=1

Then retest your connection and see if your speed improved at all.

Once you tweak the values to your liking, you can make them permanent by adding them to /etc/sysctl.conf as follows:

net.ipv4.tcp_rmem=4096 39000 187000
net.ipv4.tcp_wmem=4096 39000 187000
net.ipv4.tcp_mem=187000 187000 187000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_moderate_rcvbuf=1

And then run the following command to apply the settings from /etc/sysctl.conf immediately (they will also be applied automatically at boot):

sudo sysctl -p


http://swik.net/Ubuntu/Only+Ubuntu/How+to+Optimize+your+Internet+Connection+using+MTU+and+RWIN/cbnda

Make linux faster

CPU affinity

On systems with multiple cores (either one dual-core CPU or two single-core CPUs or whatever), Linux actually detects each core as a CPU. There is a feature available in the kernel that allows you to bind processes to a particular set of CPUs. This is called CPU affinity.
CPU affinity can be changed while the system is running using the taskset command. taskset takes two arguments: the affinity mask and the process PID. The mask is a bitmask with a 1 in each bit position corresponding to a CPU on which the process is allowed to run.
By default, the kernel implements soft affinity. This consists of trying as much as possible to keep each process running on the same CPU. Hard affinity is when the user (root) forces this behavior.
The following command forces the process with PID 1 (init) to run only on the first CPU (affinity mask 0x1):
taskset -p 1 1
One last note: children inherit the affinity of their parent. This way, if you want to make everything run on the first CPU and leave the second CPU free for a particular process, just change the affinity of init to 1 in the boot process and nothing else will use the second CPU.
I never managed to get better performance out of this feature, but I think it is worth mentioning.
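taskset can also take a human-readable CPU list with -c instead of a raw mask, which is less error-prone (a sketch; PID 1234 is hypothetical and the commands need appropriate privileges):

taskset -cp 1234        # show the list of CPUs the process is currently bound to
taskset -cp 0,1 1234    # bind it to the first two CPUs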

NUMA

NUMA stands for Non Uniform Memory Access. Basically this means that some memory is 'closer' to a certain CPU than another.
Starting with the Opteron, AMD CPUs have their memory controller on the chip. Therefore, in a dual-CPU system, you will have some memory banks managed by CPU 1 and some by CPU 2. If a CPU wants to access memory managed by the other one, it has to ask the second CPU for the data.
NUMA support makes the kernel aware of this architecture, and it tries as much as possible to allocate memory on the DIMMs controlled by the CPU on which the process is running.
NUMA is a kernel option. The best way of benefiting from it is to populate as many memory banks as possible.
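To inspect the node layout, and to pin a process and its memory to one node, the numactl package is handy (a sketch; it assumes numactl is installed, and ./myapp is a placeholder for your own program):

numactl --hardware                              # list nodes with their CPUs and memory sizes
numactl --cpunodebind=0 --membind=0 ./myapp     # run a program with CPU and memory confined to node 0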

Tools

Some tools you might want to check out.
vmstat, mpstat, top, ps aux, iostat

Network option: net.core.netdev_max_backlog

When you run netstat -su, there is a line for 'packet receive errors'. This roughly corresponds to buffer overflows/packet loss at the network socket. Increasing the options net.core.[rw]mem_default and net.core.[rw]mem_max can help, but in my test cases, increasing them wasn't enough.
What I did was increase net.core.netdev_max_backlog from 300 to 2500, and we stopped seeing those errors.
Here is how this option comes into play.
For NAPI drivers:
Every time the kernel polls frames from the NIC, it has to be a good citizen and not take the CPU for too long. Each poll will process at most netdev_max_backlog frames.
For Non-NAPI drivers:
When the NIC gets a packet, it sends an interrupt to the kernel. The kernel disables interrupts, does half of the handling (copies the packet into its own memory* and triggers a softirq), re-enables interrupts and goes back to what it was doing. Sometime in the future, the kernel has to handle the softirq and do the bottom half (decapsulate and give the data to the socket). netdev_max_backlog is the number of messages allowed in the queue marked by *.
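To check whether you are hitting this, and to apply the increase described above (a sketch; 2500 is simply the value that worked in my case):

netstat -su | grep -i 'packet receive errors'   # a growing count suggests drops at the input queue
sysctl -w net.core.netdev_max_backlog=2500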

TCP options

Buffer sizes

These are the files you want to look at:
/proc/sys/net/core/rmem_default
/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/core/wmem_max
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem
Here is a little bit of info on these variables:
  • everything with rmem is for read (ie the incoming buffers and such), everything with wmem is for write (ie the outgoing buffers and such)
  • (r,w)mem_default is the default size of the (incoming, outgoing) buffer for each socket
  • (r,w)mem_max is the max size of the (incoming, outgoing) buffer for each socket (note the size of the buffer needs to be changed by the program using the setsockopt syscall)
  • tcp_(r,w)mem contain 3 values
    1. minimum size of the buffer per socket
    2. default size of the buffer per socket (overrides (r,w)mem_default)
    3. max size of the buffer per socket (is overridden by (r,w)mem_max)
  • tcp_mem: these are global variables for the TCP stack. The unit is pages (4 KB on x86_64). It contains 3 values:
    1. low_threshold: normal behavior
    2. high_threshold: puts pressure on the sockets to consume more data (faster)
    3. max: packets get dropped.
So here you go. Depending on which value you want, this is how to find it.

http://philly.astroboy.fr/index.php?page=linux/performance

Linux Tuning

This page contains a quick reference guide for Linux 2.6 tuning for hosts connected at speeds of 1Gbps or higher. For a detailed explanation of some of the advice on this page, see the Linux Tuning Expert page. Also see the tuning advice for test/measurement hosts page.

 General Approach

To check what setting your system is using, use 'sysctl name' (e.g.: 'sysctl net.ipv4.tcp_rmem'). To change a setting, use 'sysctl -w'. To make the setting permanent, add it to the file '/etc/sysctl.conf'.

 TCP tuning

Like most modern OSes, Linux now does a good job of auto-tuning the TCP buffers, but the  default maximum Linux TCP buffer sizes are too small. The following settings are recommended:
# increase TCP max buffer size settable using setsockopt()
# 16 MB with a few parallel streams is recommended for most 10G paths
# 32 MB might be needed for some very long end-to-end 10G or 40G paths
net.core.rmem_max = 16777216 
net.core.wmem_max = 16777216 
# increase Linux autotuning TCP buffer limits 
# min, default, and max number of bytes to use
# (only change the 3rd value, and make it 16 MB or more)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# recommended to increase this for 10G NICS
net.core.netdev_max_backlog = 30000
Note: you should leave net.ipv4.tcp_mem alone. The defaults are fine.
Linux supports pluggable congestion control algorithms. To get a list of congestion control algorithms that are available in your kernel (kernel 2.6.20+), run:
sysctl net.ipv4.tcp_available_congestion_control
If cubic and/or htcp are not listed try the following, as most distributions include them as loadable kernel modules:
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic
For long fast paths, we highly recommend using cubic or htcp. Cubic is the default for a number of Linux distributions, but if it is not the default on your system, you can do the following:
 sysctl -w net.ipv4.tcp_congestion_control=cubic
NOTE: There seem to be bugs in both bic and cubic for a number of versions of the 2.6.18 kernel used by Redhat Enterprise Linux 5.3 - 5.5 and its variants (Centos, Scientific Linux, etc.) We recommend using htcp with a 2.6.18.x kernel to be safe.

 NIC Tuning

These can be added to /etc/rc.local to get run at boot time.
# increase txqueuelen for 10G NICS
/sbin/ifconfig eth2 txqueuelen 10000
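The driver's ring buffer settings are worth checking as well on 10G NICs (a sketch; eth2 and the 4096 figure are illustrative, and not every driver supports ethtool -g/-G):

/sbin/ethtool -g eth2                    # show current and maximum RX/TX ring sizes
/sbin/ethtool -G eth2 rx 4096 tx 4096    # raise them, staying within the maxima reported above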

http://fasterdata.es.net/fasterdata/host-tuning/linux/#t3

Whose house is of glasse, must not throw stones at another.

In my last post I outlined the general bufferbloat problem. This post attempts to explain what is going on, and how I started on this investigation, which resulted in (re)discovering that the Internet’s broadband connections are fundamentally broken (others have been there before me). It is very likely that your broadband connection is badly broken as well; as is your home router; and even your home computer. And there are things you can do immediately to mitigate the brokenness in part, which will cause applications such as VOIP, Skype and gaming to work much, much better; I’ll cover those in more depth very soon. Coming also soon: how this affects the world-wide dialog around “network neutrality.”

Bufferbloat is present in all of the broadband technologies, cable, DSL and FIOS alike. And bufferbloat is present in other parts in the Internet as well.

As may be clear from old posts here, I’ve had lots of network trouble at my home,  made particularly hard to diagnose due to repetitive lightning problems. This has caused me to buy new (and newer) equipment over the last five years (and experience the fact that bufferbloat has been getting worse in all its glory).  It also means that I can’t definitively answer all questions about my previous problems, as almost all of that equipment is scrap.

Debugging my network

As covered in my first puzzle piece I was investigating performance of an old VPN device Bell Labs had built last April, and found that the latency and jitter when running at full speed was completely unusable, for reasons I did not understand, but had to understand for my project to succeed.  The plot thickened when I discovered I had the same terrible behavior without using the Blue Box.
I had had an overnight trip to the ICU in February; so did not immediately investigate then as I was catching up on other work. But I knew I had to dig into it, if only to make good teleconferencing viable for me personally. In early June, lightning struck again (yes, it really does strike in the same place many times). Maybe someone was trying to get my attention on this problem.  Who knows? I did not get back to chasing my network problem until sometime in late June, after partially recovering my home network, further protecting my house, fighting with Comcast to get my cable entrance relocated (the mom-and-pop cable company Comcast had bought had installed it far away from the power and phone entrance), and replacing my washer, pool pump, network gear, and irrigation system.
But the clear signature of the criminal I had seen in April had faded. Despite several weeks of periodic attempts, including using the wonderful tool smokeping to monitor my home network, and installing it in Bell Labs, I couldn’t nail down what I had seen again. I could get whiffs of smoke from the unknown criminal, but not the same obvious problems I had seen in April. This was puzzling indeed; the biggest single change in my home network had been replacing the old blown cable modem provided by Comcast with a new, faster DOCSIS 3 Motorola SB6120 I bought myself.
In late June, my best hypothesis was that there might be something funny going on with Comcast’s PowerBoost® feature. I wondered how that worked, did some Googling, and happened across the very nice internet draft that describes how Comcast runs and provisions its network. When going through the draft, I happened to notice that one of the authors lives in an adjacent town, and emailed him, suggesting lunch and a wide ranging discussion around QOS, Diffserv, and the funny problems I was seeing. He’s a very senior technologist in Comcast. We got together in mid-July for a very wide ranging lunch lasting three hours.

Lunch with Comcast

Before we go any further…
Given all the Comcast bashing currently going on, I want to make sure my readers understand that through all of this Comcast has been extremely helpful and professional, and that the problems I uncovered, as you will see before the end of this blog entry, are not limited to Comcast’s network: bufferbloat is present in all of the broadband technologies, cable, FIOS and DSL alike.
The Comcast technical people are as happy as the rest of us that they now have proof of bufferbloat and can work on fixing it, and I’m sure Comcast’s business people are happy that they are in a boat the other broadband technologies are in (much as we all wish the mistake was only in one technology or network, it’s unfortunately very commonplace, and possibly universal). And as I’ve seen the problem in all three common operating systems, in all current broadband technologies, and many other places, there is a lot of glasse around us. Care with stones is therefore strongly advised.
The morning we had lunch, I happened to start transferring the old X Consortium archives from my house to an X.org system at MIT (only 9ms away from my house; most of the delay is in the cable modem/CMTS pair); these archives are 20GB or so in size. All of a sudden, the whiffs of smoke I had been smelling became overpowering to the point of choking and death. ”The Internet is Slow Today, Daddy” echoed through my mind; but this was self-inflicted pain. But as I only had an hour before lunch, the discussion was a bit less definite than it would have been even a day later. Here is the “smoking gun” of the following day, courtesy of the DSL Reports Smokeping installation. You too can easily use this wonderful tool to monitor the behavior of your home network from the outside.
Horrifying Smokeping plot: terrible latency and jitter

As you can see, I had well over one second of latency, and jitter just as bad, along with high ICMP packet loss. Behavior from the inside out looked essentially identical. The times when my network connection returned to normal were when I would get sick of how painful it was to browse the web and suspend the rsync to MIT. As to why the smoke broke out: the upstream transfer is always limited by the local broadband connection; the server is at MIT’s colo center on a gigabit network that directly peers with Comcast. It is a gigabit (at least) from Comcast’s CMTS all the way to that server (and from my observations, Comcast runs a really clean network in the Boston area). It’s the last mile that is the killer.
As part of lunch, I was handed a bunch of puzzle pieces that I assembled over the following couple of months.  These included:
  1. That what I was seeing was more likely excessive buffering in the cable system, in particular, in cable modems.  Comcast has been trying to get definitive proof of this problem since Dave Clark at MIT had brought this problem to their attention several years ago.
  2. A suggestion of how to rule in/out the possibility of problems from Comcast’s Powerboost by falling back to the older DOCSIS 2 modem.
  3. A pointer to ICSI’s Netalyzr.
  4. The interesting information that some/many ISP’s do not run any queue management (e.g. RED).
Wireshark screen capture, showing part of a "burst"
I went home, and started investigating seriously.  It was clearly time to do packet traces to understand the problem. I set up to take data, and eliminated my home network entirely by plugging my laptop directly into the cable modem.
But it had been more than a decade since I last tried taking packet captures and staring at TCP traces.  Wireshark was immediately a big step up (I’d occasionally played with it over the last decade); as soon as I took my first capture I immediately knew something was gravely wrong, despite being very rusty at staring at traces. In particular, there were periodic bursts of illness, with bursts of dup’ed acks, retransmissions, and reordering.  I’d never seen TCP behave in such a bursty way (for long transfers).  So I really wanted to see visually what was going on in more detail. After wasting my time investigating more modern tools, I settled on the old standbys of tcptrace and xplot that I had used long before.  There are certainly more modern tools, but most are closed source and require Microsoft Windows. Acquiring the tools, their learning curve (and the fact I normally run Linux) militated against their use.
A number of plots show the results.  The RTT becomes very large a while (10-20 seconds) into the connection, just as the ICMP ping results do.  The outstanding data graph and throughput graph show the bursty behavior that was obvious even when browsing the wireshark results. Contrast this with the sample RTT, outstanding data, and throughput graphs from the tcptrace manual.
RTT (round trip time) plot
Outstanding data graph
Throughput graph
Also remember that buffering in one direction still causes problems in the other direction; TCP’s ack packets will be delayed.  So my occasional uploads (in concert with the buffering) were causing the “Daddy, the Internet is slow today” phenomenon; the opposite situation is of course also possible.

The Plot Thickens Further

Shortly after verifying my results on cable, I went to New Jersey (I work for Bell Labs from home, reporting to Murray Hill), where I stay with my in-laws in Summit. I did a further set of experiments.  When I did, I was monumentally confused (for a day), as I could not reproduce the strong latency/jitter signature (approaching 1 second of latency and jitter) that I saw my first day there when I went to take the traces. With a bit of relief, I realized that the difference was that I had initially been running wireless, and had then plugged into the router’s ethernet switch (which has about 100ms of buffering) to take my traces.  The only explanation that made sense to me was that the wireless hop had additional buffering (almost a second’s worth) above and beyond that present in the FIOS connection itself. This sparked my later investigation of routers (along with occasionally seeing terrible latency in other routers), which in turn (when the results were not as I had naively expected) sparked investigating base operating systems.
The wireless traces are much rattier in Summit: there are occasional packet drops severe enough to cause TCP to do full restarts (rather than just fast retransmits), and I did not have the admin password on the router to shut out access by others in the family.  But the general shape of both is similar to what I initially saw at home.
Ironically, I have realized that you don’t see the full glory of TCP RTT confusion caused by buffering if you have a bad connection, as loss resets TCP’s timers and RTT estimation; packet loss is always treated as possible congestion. This is a situation where the “cleaner” the network is, the more trouble you’ll get from bufferbloat. The cleaner the network, the worse it will behave. And I’d done so much work to make my cable as clean as possible…
At this point, I realized what I had stumbled into was serious and possibly widespread; but how widespread?

Calling the consulting detectives

At this point, I worried that we (all of us) are in trouble, and asked a number of others to help me understand my results, ensure their correctness, and get some guidance on how to proceed.  These included Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Dick Sites and others. They helped with the diagnosis from the traces I had taken, and confirmed the cause.  Additionally, Van notes that there is timestamp data present in the packet traces I took (since both ends were running Linux) that can be used to locate where in the path the buffering is occurring; real TCP wizards (which I am not) may not need my pings at all, though the pings are very easy to use, and probing loaded nodes raises its own questions of accuracy.
Dave Reed was shouted down and ignored over a year ago when he reported bufferbloat in 3G networks (I’ll describe this problem in a later blog post; it is an aggregate behavior caused by bufferbloat). With examples in broadband, and suspicions of problems in home routers, I now had reason to believe I was seeing a general mistake that (nearly) everyone is making, repeatedly. I wanted to build a strong enough case that the problem was large and widespread that everyone would start to search systematically for bufferbloat.  I have spent some of the intervening months documenting and discovering additional instances of bufferbloat, in my switch, my home router, results from browser experiments, and additional cases such as corporate and other networks, as future blog entries will make clear.

ICSI Netalyzr

One of the puzzle pieces handed me by Comcast was a pointer to Netalyzr.
ICSI has built the wonderful Netalyzr tool, which you can use to help diagnose many problems in your ISP’s network.  I recommend it very highly. Other really useful network diagnosis tools can be found at M-Lab, and you should investigate both; some of the tests can be run immediately from a browser (e.g. Netalyzr), though some tests are very difficult to implement in Java. By using these tools, you will also be helping researchers investigate problems in the Internet, and you may be able to discover and expose misbehavior of many ISPs. I have, for example, discovered that the network service provided on the Acela Express runs a DNS server that is vulnerable to man-in-the-middle attacks due to lack of port randomization, and I will therefore never consider doing anything on it that requires serious security.
At about the same time as I was beginning to chase my network problem, the first Netalyzr results were published at NANOG; more recent results have since been published in Netalyzr: Illuminating The Edge Network, by Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. This paper has a wealth of data on all sorts of problems that Netalyzr has uncovered; excessive buffering is covered in section 5.2. The scatterplot there, and the discussion, are worth reading. The ICSI group has kindly sent me a color version of that scatterplot, used in their presentations but not in the paper, which makes the technology situation (along with the magnitude of the buffering) much clearer. Without this data, I would still have been wondering whether bufferbloat was widespread, and whether it was present in different technologies or not. My thanks to them for permission to post these scatter plots.
Netalyzr uplink buffer test results
Netalyzr downlink buffer test results
As outlined in section 5.2 of the Netalyzr paper, the structure you see is very useful for seeing which buffer sizes and provisioned bandwidths are common.  The diagonal lines indicate the latency (in seconds!) caused by the buffering. Both wired and wireless Netalyzr data are mixed in the above plots. The structure shows common buffer sizes, sometimes as large as a megabyte. Note that Netalyzr may at times have been under-detecting and/or under-reporting the buffering, particularly on faster links; the Netalyzr group has been improving its buffer test.
I do have one additional caution, however: do not regard the bufferbloat problem as limited to interference caused by uploads. Certainly more bandwidth makes the problem smaller (for the same size buffers); the wired performance of my FIOS data is much better than what I observe for Comcast cable when plugged directly into the home router’s switch.  But since the problem is also present in the wireless routers often provided by those network operators, the typical latency/jitter results for the user may in fact be similar, even though the bottleneck may be in the home router’s wireless routing rather than the broadband connection.  Any time the downlink bandwidth exceeds the “goodput” of the wireless link that most users now connect by, the user will suffer from bufferbloat in the downstream direction in the home router (typically provided by Verizon) as well as upstream (in the broadband gear) on cable and DSL. I now commonly see downstream bufferbloat on my Comcast service too: since I upgraded to 50/10 service, it is much more common for my wireless bandwidth to be less than the broadband bandwidth.

Discarding various alternate hypotheses

You may remember that I started this investigation with a hypothesis that Comcast’s Powerboost might be at fault.  This hypothesis was discarded by dropping my cable service back to DOCSIS 2, which would have changed the signature in a different way if Powerboost had been the cause.
Secondly, those who have waded through this blog will have noted that I have had many reasons not to trust the cable to my house, due to the mis-reinstallation of a failed cable by Comcast when I moved in. However, because of the lightning events I have had, the cable to my house was relocated this summer, and a Comcast technician came to my house and verified the signal strength, noise and quality.  Furthermore, Comcast verified my cable at the CMTS end; there Comcast saw a small amount of noise (also evident as occasional packet loss in some of the packet traces) caused by the TV cable also being plugged in (the previous owner of my house loved TV, and the TV cabling wanders all over the house).  For later datasets, I eliminated this source of noise; the cable then tested clean at the Comcast end, and the loss is gone in subsequent traces. This cable is therefore as good as it gets outside a lab, with very low loss; you can consider some of these traces close to lab quality. Comcast has since confirmed my results in their lab.
Another objection I’ve heard is that ICMP ping is not “reliable”.  This may be true when pinging a particular node that is loaded, since ICMP may be handled on the node’s slow path.  However, it’s clear from the TCP traces that the major packet loss is real packet loss. I personally think much of the “lore” I’ve heard about ICMP is incorrect and/or itself a symptom of the bufferbloat problem. I’ve also worked with the author of httping, adding support for persistent connections, so that there is a commonly available tool (Linux and Android) for doing RTT measurements that is indistinguishable from HTTP traffic (because it is HTTP traffic!). In all the tests I’ve made, the results for ICMP ping match those of httping. And in any case, TCP shows the same RTT problems that ICMP and httping do.
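If you would rather not trust ICMP at all, a comparison along these lines is a reasonable sketch (the target host/URL is a placeholder, and exact flags can vary between httping versions):

# plain ICMP latency, 20 probes
ping -c 20 my.well-provisioned-server.example

# the same measurement over real HTTP: -g gives the URL to probe, -c the probe count
httping -c 20 -g http://my.well-provisioned-server.example/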

What’s happening here?

I’m not a TCP expert; if you are a TCP expert, and I’ve misstated or missed something, do let me know. Go grab your own data (it’s easy: just scp a large file to a well provisioned server while running ping), or you can look at my data.
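As a concrete sketch of that test (the host name, account and file size are placeholders; any well provisioned server you have shell access to will do):

# start a latency probe in the background and log it
ping my.well-provisioned-server.example > ping.log &

# saturate the uplink with a single bulk transfer
dd if=/dev/zero of=bigfile bs=1M count=100   # some data to upload
scp bigfile me@my.well-provisioned-server.example:/tmp/

# stop the probe and look at how the RTTs grew while the upload ran
kill %1
grep time= ping.log | tail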
The buffers are confusing TCP’s RTT estimator; the delay caused by the buffers is many times the actual RTT on the path.  Remember, TCP is a servo system, constantly trying to “fill” the pipe. If congestion is not signalled in a timely fashion, there is *no possible way* TCP’s algorithms can determine the correct rate at which to send data (it needs to compute the bandwidth-delay product, and the delay becomes hideously large). TCP sends data a bit faster (the usual slow start rules apply), re-estimates the RTT from that, and sends data faster still. This means that even in slow start, TCP ends up trying to run too fast, so the buffers fill (and the latency rises). Note that the actual RTT on the path of this trace is 10 milliseconds; TCP’s RTT estimator is misled by more than a factor of 100. It takes 10-20 seconds for TCP to become completely confused by the buffering in my modem, and there is no way back.
Remember, timely packet loss to signal congestion is absolutely normal; without it, TCP cannot possibly figure out the correct bandwidth.
Eventually packet loss occurs and TCP tries to back off, so a little bit of buffer space reappears; but TCP then exceeds the bottleneck bandwidth again very soon.  Wash, rinse, repeat…  High latency with high jitter, with the periodic behavior you see.  This is a recipe for terrible interactive application performance.  And it’s probable that the device is doing tail drop; head drop would be better.
There is significant packet loss as a result of “lying” to TCP.  In the traces I’ve examined using the TCP STatistic and Analysis Tool (tstat), I see 1-3% packet loss, a much higher loss rate than a “normal” TCP should be generating.  So, in the misguided belief that dropping data is “bad”, we’ve managed to build a network that is both lossier and exhibiting more than 100 times the latency it should.  Even more fun, the losses come in bursts; I hypothesize that this accounts for the occasional DNS lookup failures I see on loaded connections.
By inserting such egregiously large buffers into the network, we have destroyed TCP’s congestion avoidance algorithms. TCP is used as the “touchstone” of congestion avoiding protocols: in general, there is very strong pushback against any protocol which is less conservative than TCP. This is really serious, as future blog entries will amplify. I personally have scars on my back (on my career, anyway) partially induced by the NSFnet congestion collapse of the 1980s. And there is nothing here unique to TCP; any other congestion avoiding protocol will certainly suffer as well.
Again, by inserting big buffers into the network, we have violated the design presumption of all Internet congestion avoiding protocols: that the network will drop packets in a timely fashion.
Any time you have a large data transfer to or from a well provisioned server, you will have trouble.  This includes file copies, backup programs, video downloads and video uploads. A generally congested link (such as at a hotel) will also suffer, as will multiple streaming video sessions going over the same link in excess of the available bandwidth, or running current BitTorrent to download your Linux ISOs, or Google Chrome uploading a crash report to Google’s servers (as I found out one evening). I’m sure you can think of many others. Of course, to make this “interesting”, as in the Chinese curse, the problem comes and goes mysteriously as you happen to change your activity (or as things you aren’t even aware of happen in the background).
If you’ve wondered why most VOIP and Skype calls have been flaky, stop wondering.  Even though they are UDP based applications, it’s almost impossible to make them work reliably over links with such high latency and jitter. And since there is no traffic classification going on in broadband gear (or other generic Internet service), you just can’t win. At best, you can (greatly) improve the situation at the home router, as we’ll see in a future installment. Also note that broadband carriers may very well have provisioned their telephone service independently of their data service, so don’t jump to the conclusion that their telephone service will be unreliable too.

Why hasn’t bufferbloat been diagnosed sooner?

Well, it has been (mis)diagnosed multiple times before; but I believe the full breadth of the problem has been missed.
The individual cases have often been noticed, as Dave Clark did on his personal DSLAM, or as noted in the Linux Advanced Routing & Traffic Control HOWTO. (Bert Huber attributed much more blame to the ISPs than is justified: the blame should primarily be borne by the equipment manufacturers, and Bert et al. should have made a fuss in the IETF over what they were seeing.)
As to specific reasons why, these include (but are not limited to):
  • We’re all frogs in slowly heating water: the water has been getting hotter gradually as buffers grow in each generation of hardware and memory becomes cheaper. We’ve been forgetting what the Internet *should* feel like for interactive applications. Us old guys’ memories are fading of how well the Internet worked in the days when links ran at 64 Kb/s, fractional T1 or T1 speeds.  For interactive applications, it often worked much better than today’s Internet.
  • Those of us most capable of diagnosing the problem have tended to opt for the highest bandwidth tier of ISP service; this means we suffer less than the “common man” does (more about this later). And any time we try to diagnose the problem, it is most likely we ourselves were the cause; as soon as we stop whatever we were doing that caused “Daddy, the Internet is slow today”, the problem vanishes.
  • It takes time for the buffers to confuse TCP’s RTT computation. You won’t see the problem on a very short (several second) test using TCP (you can test for excessive buffers much more quickly using UDP, as Netalyzr does).
  • The most commonly used system on the Internet today remains Windows XP, which does not implement window scaling and will never have more than 64 KB in flight at once. Bufferbloat will become much more obvious and common as more users switch to other operating systems and/or later versions of Windows, any of which can saturate a broadband link with merely a single TCP connection.
  • In good engineering fashion, we usually run one test at a time, first testing bandwidth and then latency separately.  You only see the problem if you test bandwidth and latency simultaneously, and none of the common consumer bandwidth tests do. That is what I did for literally years as I tried to diagnose my personal network. Unfortunately, the emphasis has been on speed: the Ookla speedtest.net and pingtest.net tools, for example, are really useful, but they don’t run a latency test simultaneously with a bandwidth test.  As soon as you test for latency under load, the problem jumps out at you. Now that you know what is happening, if you have access to a well provisioned server on the network, you can run tests yourself that make bufferbloat jump out at you.
I understand you may be incredulous as you read this: I know I was when I first ran into bufferbloat.  Please run the tests for yourself. Suspect problems everywhere until you have evidence to the contrary.  Think hard about where the choke point is in your path; queues form only on either side of that link, and only when the link is saturated.
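For the downstream direction, a minimal sketch of such a combined test (the URL is a placeholder; any large file on a well provisioned server will do):

# watch latency while the downlink is saturated
ping my.well-provisioned-server.example > ping-down.log &

# saturate the downlink; throw the data away, we only care about the load
wget -O /dev/null http://my.well-provisioned-server.example/bigfile.iso

# stop the probe and compare the RTTs during the download with your idle RTT
kill %1
grep time= ping-down.log | tail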

https://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not-throw-stones-at-another/

http://networkmanagement.comcast.net/

Buffer Bloat: The calculations

The buffer-bloat blog posts by Jim Gettys[1][2] are very interesting and relevant for everybody with a broadband connection (especially asymmetric links). They are well written, but also long and spread across many posts.

In this blog post, I'll explain what is going on by showing how you can calculate your own latency issues.

The issue raised by Gettys is that buffer-bloat (too-big buffers on the network path) has fundamentally broken Internet broadband connections [2].
Buffer-bloat can introduce enough latency to cripple your line, basically killing the possibility of interactive and realtime services being delivered (e.g. by companies) over your broadband connection.

I'm very happy to see that Gettys is bringing this issue up again.
Back in 2005, I discovered the same issues as Gettys. I wrote my master's thesis[3] about the problem, and even created an Open Source "mitigation" solution, the ADSL-optimizer[4]. It seems my solution has not gained wider use.

The major contribution from my side is to take the ADSL overhead into account when doing QoS packet shaping. (Everything is in mainline; just use/add the TC options "linklayer adsl" and "overhead" if you already have a Linux box doing QoS on your line.)
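As a rough sketch of what that looks like with an HTB shaper (the device name, rates and overhead value below are placeholders you must adapt to your own line, and exact option syntax can vary between iproute2 versions):

# shape upstream traffic to just below the ADSL uplink rate, accounting for
# ATM/ADSL cell framing via "linklayer adsl" and per-packet "overhead"
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 454kbit ceil 454kbit linklayer adsl overhead 40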

The issue is that:
A single TCP upload causes a delay of 1.2 seconds (on a 512 Kbit/s ADSL line)
(see thesis[3] page 21).

Let's calculate what is happening, without going into the details of why TCP/IP misbehaves and causes queues to build (the details are in my thesis[3]).

Before starting the calculations, here is a beautiful quote from Jim Gettys[1]:
"Large network buffers can be thought of as 'dark buffers', analogous to 'dark matter' in the universe; they are undetectable under many/most circumstances, and you can detect them only by indirect means. Buffers do not cause problems when they are empty. But when they fill they introduce additional latency (and create other problems, possibly very severe) to other traffic sharing the link."

Given the line speed and the delay, we can calculate the buffer size
(this is the bandwidth-delay product). Due to ADSL overhead, the
effective bandwidth is actually 454 Kbit/s of the 512 Kbit/s line,
and the measured delay was 1138 ms:
454 Kbit/s * 1138 ms = 64581 bytes
This corresponds to the TCP window size; thus it is not the maximum buffer size of the modem. (Use several TCP connections, or UDP, to find your maximum ping RTT and calculate your real buffer size.)
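A one-liner to redo that arithmetic with your own numbers (the rate and delay below are simply the figures from the text; substitute your own measurements):

# bandwidth-delay product: 454 Kbit/s of rate for 1138 ms of queueing delay
awk -v rate_kbit=454 -v delay_ms=1138 \
    'BEGIN { printf "%.1f bytes\n", (rate_kbit * 1000 / 8) * (delay_ms / 1000) }'
# prints 64581.5 bytes, i.e. the ~64 KB TCP window mentioned above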

Where does the delay come from?
The delay consists of different components; the important one in our case is the transmission delay, multiplied up by the number of packets sitting in the queue.

The transmission delay of a 1500 byte (MTU) packet is:
1500 bytes / 454 Kbit/s = 26.4 ms

Thus, the experienced delay is the time it takes to empty the packets already in the queue, which depends greatly on the line speed.
E.g. 64000 bytes / 454 Kbit/s = 1127 ms.
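The same arithmetic as a sketch (again, the byte counts and rate are just the example figures from the text):

# time to serialize one MTU-sized packet, and to drain a 64000 byte queue,
# on a 454 Kbit/s uplink (454 kbit/s = 454 bits per millisecond)
awk -v rate_kbit=454 'BEGIN {
    printf "one 1500 byte packet: %.1f ms\n", 1500 * 8 / rate_kbit;
    printf "64000 byte queue:     %.1f ms\n", 64000 * 8 / rate_kbit;
}'
# prints roughly 26.4 ms and 1127.8 ms respectively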

With an RTT of 150 ms, your interactive SSH connection will feel sluggish.
(A side note on ADSL: the processing delay in the ADSL modem can be as large as 60 ms, caused by the interleaving depth, but it is "fortunately" a constant, fixed delay on the path.)

Increasing the bandwidth will reduce the latency, but it is not the
solution. Besides, ADSL technology is often limited to a 1024 Kbit/s
upstream link.

The Point:
"ISPs SHOULD configure the buffer size based upon the link bandwidth"

I have a feeling that ISPs just configure a default queue size, tuned for maximum throughput on their largest product.

On the line where I did my measurements, I could see delays of 3.3 seconds, which corresponds to a buffer-bloat size of 187332 bytes (454 Kbit/s * 3300 ms), or about 125 packets of 1500 bytes (MTU). Simply crazy!

For more details on the formulas and calculations, see the thesis, pages 19 to 27.

Links:
[1] Jim Gettys: introducing-the-criminal-mastermind-bufferbloat
[2] Jim Gettys: whose-house-is-of-glasse-must-not-throw-stones-at-another
[3] http://www.adsl-optimizer.dk/thesis/
[4] http://www.adsl-optimizer.dk/
[5] http://en.wikipedia.org/wiki/Jim_Gettys

http://netoptimizer.blogspot.com/2010/12/buffer-bloat-calculations.html