So, what am I reading, you ask…

How about some light reading before bed… -sigh- I would rather chew gravel.

Question: Sometimes my server gets slow or becomes unresponsive, then comes back to life. I’m using NFS over UDP, and I’ve noticed a lot of IP fragmentation on my network. Is there anything I can do?
Answer: UDP datagrams larger than the IP Maximum Transmission Unit (MTU) must be divided into pieces that are small enough to be transmitted. If, for example, your network’s MTU is 1524 bytes, the Linux IP layer must break any UDP datagram larger than 1524 bytes into separate packets, none of which may exceed the MTU. These separate packets are called fragments.
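
To get a feel for how many fragments a single NFS request turns into, here is a minimal sketch of the arithmetic. It assumes IPv4 with a 20-byte header and no IP options; the function name `fragment_count` is made up for illustration. The `udp_payload` argument is the whole UDP payload, so for NFS it will be somewhat larger than rsize or wsize once the RPC headers are added.

```python
import math

IP_HEADER = 20   # bytes; IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes; fixed UDP header size


def fragment_count(udp_payload, mtu):
    """Estimate how many IP packets are needed to carry one UDP datagram."""
    datagram = UDP_HEADER + udp_payload        # what the IP layer has to carry
    room = mtu - IP_HEADER                     # space left in each IP packet
    if datagram <= room:
        return 1                               # fits, no fragmentation needed
    # Every fragment except the last carries a multiple of 8 bytes,
    # because fragment offsets are expressed in 8-byte units.
    per_fragment = room // 8 * 8
    return math.ceil(datagram / per_fragment)
```

With the 1524-byte MTU from the example above, `fragment_count(8192, 1524)` gives 6, and losing any one of those six fragments loses the whole datagram.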

The Linux IP layer transmits each fragment as it is breaking up a UDP datagram, encoding enough information in each fragment so that the receiving end can reassemble the individual fragments into the original UDP datagram. If something happens that prevents a client from continuing to fragment a packet (e.g., the output socket buffer space in the IP layer is exceeded), the IP layer stops sending fragments. In this case, the receiving end has a set of fragments that is incomplete, and after a certain time window, it will drop the fragments if it does not receive enough to assemble a complete datagram. When this occurs, the UDP datagram is lost. Clients detect this loss when they have not received a reply from the server after a certain time interval, and recover by retransmitting the datagram.
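
The receiving side described above can be sketched as a toy model. This is not how the kernel implements it: real IP fragments carry a byte offset and a "more fragments" flag rather than the `index` and `total` fields used here, and the class name, method names, and 30-second timeout are all assumptions chosen for illustration.

```python
class Reassembler:
    """Toy model of IP reassembly: collect fragments, expire incomplete sets."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout   # seconds to wait for missing fragments (assumed)
        self.pending = {}        # datagram id -> (first_seen, {index: payload})

    def receive(self, dgram_id, index, total, payload, now):
        """Store one fragment; return the whole datagram once all pieces are in."""
        _first_seen, parts = self.pending.setdefault(dgram_id, (now, {}))
        parts[index] = payload
        if len(parts) == total:
            del self.pending[dgram_id]
            return b"".join(parts[i] for i in range(total))
        return None              # still waiting for more fragments

    def expire(self, now):
        """Drop incomplete datagrams older than the timeout; return how many."""
        stale = [d for d, (seen, _) in self.pending.items()
                 if now - seen >= self.timeout]
        for d in stale:
            del self.pending[d]  # the datagram is lost; the sender must retransmit
        return len(stale)
```

If the sender stops partway through a datagram, `receive` never returns it, and the partial set sits in `pending` until `expire` throws it away.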

Under heavy write loads, the Linux NFS client can generate many large UDP datagrams. This can quickly exhaust output socket buffer space on the client. If this occurs many times in a short time, the client sends the server a large number of fragments, but almost never gets a whole datagram’s worth of fragments to the server. This fills the server’s IP reassembly queue, causing it to become unreachable via UDP until it expels the useless fragments from the queue.

Note that the same thing can occur on servers that are under a heavy read load. If the server’s output socket buffers are too small, large reads will cause them to overflow during IP fragmentation. The client’s IP reassembly queue then fills with worthless fragments, and little UDP traffic can get to the client.

Here are some symptoms of this problem:

* You use NFS over UDP with a large wsize (relative to the network’s MTU) and a write-intensive workload, or with a large rsize and a read-intensive workload.
* You may see many fragmentation errors on your server or clients (netstat -s will tell the story).
* Your server may periodically become very slow or unreachable.
* Increasing the number of threads on your server has no effect on performance.
* One or a small number of clients seem to make the server unusable.
* The network path between your client and server may have a router or switch with small port buffers, or the path may contain links that run at different speeds (for example, 100Mb/s and GbE).
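
To watch the fragmentation counters over time rather than eyeballing netstat -s, one option on Linux is to read the `Ip:` lines of /proc/net/snmp, which expose the same kind of reassembly and fragmentation statistics. The helper below is a rough sketch; the function name `ip_fragment_counters` is made up, and it takes the file's text as an argument so it can be tried on a saved copy.

```python
def ip_fragment_counters(snmp_text):
    """Pick the IP reassembly/fragmentation counters out of /proc/net/snmp text."""
    # The file holds pairs of lines per protocol: a header line of names,
    # then a line of values, both starting with the protocol tag ("Ip:").
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        return {}
    names, values = ip_lines[0][1:], ip_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    wanted = ("ReasmReqds", "ReasmOKs", "ReasmFails",
              "FragOKs", "FragFails", "FragCreates")
    return {name: counters[name] for name in wanted if name in counters}
```

Call it with `open("/proc/net/snmp").read()` a few seconds apart; a ReasmFails count that climbs quickly while ReasmOKs barely moves matches the symptoms listed above.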

The fix is to make Linux’s IP fragmentation logic continue fragmenting a datagram even when output socket buffer space is over its limit. This fix appears in kernels newer than 2.4.20. You can work around this problem in one of several ways:

1. Use NFS over TCP. TCP splits data into segments sized to fit the MTU, so it does not rely on IP fragmentation and does not suffer from this problem. Using TCP may not be possible with older Linux NFS clients and servers that only support NFS over UDP.
2. If you can’t use NFS over TCP, upgrade your clients to 2.4.20 or later.
3. If you can’t upgrade your clients, increase the default size of your client’s socket buffers (see below). 2.4.20 and later kernels do this automatically for the NFS client’s socket buffers. See Section 5.3 <http://nfs.sourceforge.net/nfs-howto/ar01s05.html#fragmented_packets> of the NFS How-To for more information.
4. If your rsize or wsize is very large, reduce it. This will reduce the load on your client’s and server’s output socket buffers.
5. Reduce network congestion by ensuring your GbE links use full flow control, that your switch and router ports use adequate buffer sizes, and that all links are negotiating their fastest settings.
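
Before changing anything, it helps to know what the current limits are. The sketch below reads the socket buffer and IP fragment queue settings from /proc/sys. The choice of which tunables to list is my assumption about what is relevant here; not every kernel exposes all of them, so missing entries are reported as `None`. The function name `read_tunables` and its `root` parameter are made up for illustration.

```python
from pathlib import Path

# Settings related to socket buffers and the IP reassembly queue,
# given relative to /proc/sys (selection is an assumption, adjust as needed).
TUNABLES = [
    "net/core/wmem_default",
    "net/core/wmem_max",
    "net/core/rmem_default",
    "net/core/rmem_max",
    "net/ipv4/ipfrag_high_thresh",
    "net/ipv4/ipfrag_low_thresh",
    "net/ipv4/ipfrag_time",
]


def read_tunables(root="/proc/sys"):
    """Return {name: int value, or None if the entry is missing or unreadable}."""
    values = {}
    for name in TUNABLES:
        try:
            values[name] = int((Path(root) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None
    return values
```

Running `read_tunables()` on a client and on the server gives a baseline to compare against after raising the buffer sizes.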
