Hey! Where Did My Performance Go?
Rate Limiting Rears Its Ugly Head
A Case Study for E2Epi
Shawn, an assistant research scientist at the University
of Michigan, was seeing problems with routine data transfers.
As part of his research with the Large Hadron Collider, Shawn
routinely sends large streams of data to locations across
the nation and around the world. On June 11, 2003, when sending
a stream, Shawn experienced 20% packet loss to locations outside
of his departmental LAN. Why?
Recently, Shawn has been instrumental in developing
MGRID
(Michigan Grid Research and Infrastructure Development)
for the university; the project is attempting to develop a
scalable grid infrastructure such that the tools could be
replicated on a national and international level. As part
of the development effort, Shawn is documenting problems encountered,
along with the solutions discovered. Unlike many users confronted
with an end-to-end problem, Shawn was not only expecting trouble
but was prepared to deal with it expeditiously.
As the chair of Internet2’s End-to-End Performance Initiative
(E2Epi) Technical Advisory Group (TAG), Shawn is familiar
with many of the diagnostic tools that network performance
experts use to identify and solve performance problems. He
installed NDT
on a Web100 Kernel (on a Linux 2.4.20 box) and used the tool
to debug poor performance on his local network.
Using a tuned host connected via 100 Mbps Ethernet, a user
would normally see about 95 Mbps of throughput. Over many
tests from a tuned client to the NDT
server along a FastEthernet path, Shawn measured a maximum
throughput of only 20-60 Mbps. The University of Michigan
has a robust networking infrastructure; normally, he would
have no problem sending a 95 Mbps TCP stream across his own
campus between properly configured machines.
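The figure of roughly 95 Mbps follows from per-frame overhead on Ethernet. As a rough sketch (the function name is hypothetical, and the 1500-byte MTU, IPv4, and full-size frames are assumptions for illustration rather than details of Shawn's tests), the ceiling on TCP payload throughput can be estimated like this:

```python
# Per-frame byte counts for standard Ethernet carrying TCP over IPv4.
PREAMBLE_AND_SFD = 8      # preamble plus start-of-frame delimiter
ETHERNET_HEADER = 14      # destination MAC, source MAC, EtherType
FRAME_CHECK_SEQUENCE = 4  # CRC trailer
INTERFRAME_GAP = 12       # mandatory idle time, expressed in bytes
IP_HEADER = 20            # IPv4 header without options
TCP_HEADER = 20           # TCP header without options
MTU = 1500                # assumed standard Ethernet payload size


def max_tcp_goodput_mbps(link_mbps: float, tcp_option_bytes: int = 0) -> float:
    """Upper bound on TCP payload throughput when every frame is full size."""
    payload = MTU - IP_HEADER - TCP_HEADER - tcp_option_bytes
    on_wire = (PREAMBLE_AND_SFD + ETHERNET_HEADER + MTU
               + FRAME_CHECK_SEQUENCE + INTERFRAME_GAP)
    return link_mbps * payload / on_wire


if __name__ == "__main__":
    # 12 bytes is the usual size of the TCP timestamp option with padding.
    print(f"no TCP options:  {max_tcp_goodput_mbps(100):.1f} Mbps")      # 94.9 Mbps
    print(f"with timestamps: {max_tcp_goodput_mbps(100, 12):.1f} Mbps")  # 94.1 Mbps
```

A measured rate of 20-60 Mbps is therefore well below what framing overhead alone can explain.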
NDT
helped Shawn localize two problems by identifying the bandwidth
limitation, which wasn’t present earlier, as well as
indicating a significant amount of packet loss on an under-utilized
network. Further research with Ethereal,
a network packet capturing utility, showed bursts of broadcast
packets during normal network operation and coincident with
packet “loss” events. Testing within a local subnet
didn’t exhibit the problem. The problem seemed to involve
the connection between the local subnet and the rest of the
departmental LAN. When the department network administrator
was contacted about this specific network connection, Shawn
discovered that the department had established a broadcast
packet rate limit of 10 packets/second to protect against
a known bug that had caused a broadcast
storm.
The rate limiting had been introduced in response to an
earlier problem:
The departmental network was built from Ethernet switches.
The switches ran a protocol among themselves to form a spanning tree
– a single, loop-free path for broadcast packets.
A firmware upgrade on a network switch had silently turned off
this feature on that switch; now broadcast packets would arrive
on one link and be rebroadcast on all others.
Unfortunately, the departmental network design also had redundant
links; this meant that broadcast packets kept multiplying when
they passed the redundant links, until the network was completely
filled with these (now useless) broadcast packets.
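To make the multiplication concrete, here is a small simulation sketch. The four-switch full mesh, the switch names, and the function `frames_in_flight` are hypothetical and chosen only for illustration; the actual departmental topology is not described here. Each switch floods a broadcast out of every link except the one it arrived on, which is the behavior described above when spanning tree is off.

```python
from collections import Counter


def frames_in_flight(links, source, hops):
    """Flood one broadcast from `source`; return the frame count after each hop."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight frame is keyed by (sender, receiver).
    in_flight = Counter((source, n) for n in neighbors[source])
    history = [sum(in_flight.values())]
    for _ in range(hops - 1):
        forwarded = Counter()
        for (sender, receiver), count in in_flight.items():
            # Flood out of every link except the one the frame arrived on.
            for n in neighbors[receiver]:
                if n != sender:
                    forwarded[(receiver, n)] += count
        in_flight = forwarded
        history.append(sum(in_flight.values()))
    return history


# Hypothetical topologies: a full mesh (redundant links, no spanning tree)
# and the same switches reduced to a loop-free tree.
MESH = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
TREE = [("A", "B"), ("A", "C"), ("A", "D")]

if __name__ == "__main__":
    print(frames_in_flight(MESH, "A", 5))  # [3, 6, 12, 24, 48]
    print(frames_in_flight(TREE, "A", 5))  # [3, 0, 0, 0, 0]
```

In the mesh, a single broadcast doubles at every hop; on the loop-free tree it is delivered once and dies out.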
To try to protect the network against something like this in
the future, some of the links were rate limited. A traffic limiting
device was installed in the middle of the departmental LAN;
it was designed to turn on the rate limiting when it recognized
a certain level of broadcast packets, but, apparently, it limited
all traffic on the link (not just broadcasts).
NOTE: As a result of this problem, the University of Michigan
network administration team worked with the manufacturer of
the network switch to ensure both that the firmware upgrade
would not silently turn off spanning tree and that the
release notes warned users about the existing behavior.
The 10 packets/second limit, when it was originally configured,
was not a problem on the subnet, which typically had around
2-3 broadcast packets/second. However, some newly installed (and
apparently misconfigured) software was causing bursts of broadcast
packets, which intermittently pushed the subnet above 10
packets/second and so triggered the limit, causing intermittent
connectivity problems.
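The interaction between a low background rate, an occasional burst, and a fixed limit can be sketched as follows. The timestamps are invented for illustration (they are not data from Shawn's captures), and `windows_over_limit` and `sample_capture` are hypothetical helper names. In practice, the timestamps would come from a packet capture filtered to broadcast frames.

```python
from collections import Counter


def windows_over_limit(timestamps, limit_pps):
    """Return {second: packet_count} for each one-second window above the limit."""
    per_second = Counter(int(t) for t in timestamps)
    return {sec: n for sec, n in sorted(per_second.items()) if n > limit_pps}


def sample_capture():
    """Hypothetical timestamps: 3 background broadcasts/second plus one burst."""
    background = [s + offset for s in range(5) for offset in (0.0, 0.3, 0.6)]
    burst = [2 + 0.05 * k for k in range(12)]  # 12 extra packets within second 2
    return sorted(background + burst)


if __name__ == "__main__":
    print(windows_over_limit(sample_capture(), 10))  # {2: 15}
    print(windows_over_limit(sample_capture(), 30))  # {}
```

With a limit of 10 packets/second, the single burst trips the limit even though the background rate is far below it; with a limit of 30 packets/second, the same burst passes.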
When Shawn identified the problem and brought it to the attention
of his network staff, they increased the rate limit by a factor
of three, which decreased the probability that “normal”
broadcast traffic (or at least traffic with broadcast rates
significantly below a real broadcast storm) would trip the limit.
In addition, Shawn notified one of the subnet's users about
their misbehaving software, and they reconfigured it correctly.
Recommendations
Have a good understanding of your network topology.
When a good connection suddenly goes bad, talk to network
staff at the departmental and campus levels. Tell them about
the problem encountered; ask whether they have installed any upgrades
that could have silently modified the paths.
Learn about network tools available to help define and
isolate the problem, such as
NDT
(http://e2epi.internet2.edu/ndt/),
which is designed for novice (to expert) users, and Ethereal
(www.ethereal.com) or tcpdump,
which are primarily intended for expert users. Another option
is to use Iperf (http://dast.nlanr.net/Projects/Iperf/)
with a server at the edge of your campus as a testpoint.
Ask if your department or campus has initiated rate limiting;
not all the bugs have been worked out of traffic limiting
devices, and they could be causing a problem.
More generally, even simple switch and router configurations
can have unforeseen consequences, especially with regard to
performance.
Talk to the people who run your network; lack of communication
is often the largest part of the problem.
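For readers who want to see what a memory-to-memory throughput test of the kind Iperf performs looks like, here is a minimal sketch using only the Python standard library. It is not Iperf and does not reproduce its options or output; the function name `measure_throughput` and the transfer sizes are assumptions, and the test runs over the loopback interface, so the number it reports reflects the host rather than any network path.

```python
import socket
import threading
import time


def measure_throughput(total_bytes=4 * 1024 * 1024, chunk=64 * 1024):
    """Send total_bytes over a loopback TCP connection; return (bytes received, Mbps)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]

    def sink():
        # Accept one connection and count bytes until the sender closes it.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:
                    break
                received[0] += len(data)

    receiver = threading.Thread(target=sink)
    receiver.start()

    payload = b"\0" * chunk
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            client.sendall(payload)
            sent += chunk
    receiver.join()  # wait until the receiver has drained the connection
    elapsed = time.perf_counter() - start
    server.close()
    return received[0], received[0] * 8 / elapsed / 1e6


if __name__ == "__main__":
    nbytes, mbps = measure_throughput()
    print(f"received {nbytes} bytes at about {mbps:.0f} Mbps")
```

To test a real path, the receiving half would run on a machine at the edge of the campus, as suggested above, with the sender connecting to that machine's address instead of the loopback address.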