When 99% Isn't Quite Enough:
A Tale of Multicast Woe
A Case Study for E2Epi
On October 2, 2002, Internet2 began netcasting the EDUCAUSE
2002 conference in Atlanta; Internet2's President and CEO,
Doug Van Houweling, was scheduled to give a talk that would
be presented from Internet2’s Ann Arbor offices. Jon
and Chris were the Internet2 staff responsible for netcasting
the event.
During the test phase before the netcasting began, Chris told
Jon that the netcast transmission quality was poor; Jon’s
video transmission equipment showed significant packet loss
but, when he tried to pin down the location of the network
problem, he had no luck. Ping was showing no packet loss and
Iperf tests showed only 1% loss. Not knowing what else he
could do, Jon sent out an S.O.S. to the E2Eperf
Interest Group on October 3:
“We currently
have a network problem that is proving hard to pin down. Ping
(every .1 sec, 1300 byte packets) shows no packet loss. From
199.77.176.251 to Ann Arbor (say 207.75.164.56), there is
some problem. Iperf showed only 1% loss. Our video transmission
equipment shows major packet loss. Any thoughts? My guess
is some type of traffic shaping or perhaps back to back packets
problem.”
Russ and Matt, members of Internet2’s End-to-End Performance
Initiative (E2Epi) team, received the request for information;
they called Jon to discuss the problem – had he run
any other tests? What were the parameters and results?
They began by looking at the path: Russ ran a jalaaM
test that indicated some packet loss between Atlanta and Ann
Arbor, but it gave only a general indication of where the loss was
occurring. Matt had an Iperf server available at the SoX (Southern
Crossroads) GigaPoP, which is located in Atlanta. Matt began
running Iperf tests; he eventually discovered that 1% loss
occurred from Georgia Tech to Ann Arbor but not the other
way around. Matt also discovered that the loss
was 1% regardless of applied load; it remained
constant between 5 and 50 Mbps.
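A constant loss fraction across a tenfold range of load is itself a clue: loss from congestion usually grows as the offered load approaches capacity, while loss from corrupted packets tends to stay a fixed fraction of the traffic. The check can be sketched in Python as below. The function name, the tolerance, and the intermediate measurement points are illustrative assumptions; the case study reports only that loss stayed at 1% between 5 and 50 Mbps.

```python
def loss_is_load_independent(samples, tolerance=0.002):
    """Return True if the loss fraction stays roughly flat across offered loads.

    samples: list of (offered_mbps, loss_fraction) pairs from separate test runs.
    tolerance: largest spread in loss fraction still treated as "constant"
               (an arbitrary illustrative threshold).
    """
    if len(samples) < 2:
        raise ValueError("need at least two load levels to compare")
    losses = [loss for _, loss in samples]
    return max(losses) - min(losses) <= tolerance


# Illustrative values only: the endpoints match the 5-50 Mbps range in the
# case study; the intermediate points are assumed for the example.
flat = [(5, 0.010), (10, 0.010), (25, 0.011), (50, 0.010)]
# A made-up congestion-like pattern for contrast: loss climbs with load.
rising = [(5, 0.000), (10, 0.001), (25, 0.020), (50, 0.150)]

print(loss_is_load_independent(flat))    # True
print(loss_is_load_independent(rising))  # False
```

A flat result does not prove corruption on its own, but it is a reasonable prompt to look at interface error counters rather than link utilization.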
After discovering that the loss was occurring on the Atlanta
to Ann Arbor route, Matt tested various parts of the path.
First, he checked the Abilene
Weather Map, which displays
loss/error status on internal links. Then, he checked the
connection from Abilene to Ann Arbor (via the Merit GigaPoP)
and found no problems on the Ann Arbor end. Matt verified loss from
Indianapolis to the SoX GigaPoP.
At this point, Matt and Russ began researching the SoX GigaPoP
end of the path. They looked at the Abilene Atlanta router
interfaces and found that the interface to the SoX GigaPoP
had logged about 1% errors, in the inbound direction only,
since its counters were last reset. The errors were CRC
(cyclic redundancy check) errors: the router verifies a checksum
on each incoming packet to ensure it is intact and,
in this case, it was throwing away 1% of the packets as corrupt.
Matt spoke with the Abilene NOC staff and found that this
was a known problem with some optical gear through this GigaPoP;
the SoX GigaPoP had been working on the problem. Matt spoke
with one of the SoX GigaPoP network staff, who indicated that,
although they knew of the problem, there was not enough time
to fix it (or money in the budget to expedite the
correction).
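Because the interface counters had been accumulating since some earlier reset, a ratio taken from the raw totals averages over that whole period. One common way to estimate the current error rate is to sample the counters twice and compare the differences. The sketch below assumes hypothetical counter names and made-up values, and assumes the packet counter excludes errored frames, which varies by platform.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of an interface's inbound counters (hypothetical fields)."""
    input_packets: int  # packets received intact (assumed to exclude errors)
    crc_errors: int     # packets discarded because the CRC check failed


def crc_error_ratio(earlier: CounterSample, later: CounterSample) -> float:
    """Fraction of inbound packets dropped for CRC errors between two readings."""
    good = later.input_packets - earlier.input_packets
    bad = later.crc_errors - earlier.crc_errors
    total = good + bad
    if total <= 0:
        return 0.0  # no traffic between readings (or counters were reset)
    return bad / total


# Made-up readings taken some minutes apart, for illustration only.
first = CounterSample(input_packets=1_000_000, crc_errors=10_000)
second = CounterSample(input_packets=1_990_000, crc_errors=20_000)

print(f"{crc_error_ratio(first, second):.2%}")  # 1.00%
```

Comparing the same ratio for the outbound direction would show the asymmetry described above, where only inbound traffic was affected.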
While all the testing was going on, Jon and Chris managed
the netcast of the conference. Presentations were transmitted
live, but the packet loss seriously degraded the quality
and effectiveness of the netcast. Sections of presentations
lost both audio and video signals at times; as a result, the netcast
could be deemed a failure.
By 5:00 p.m. on October 3, Russ contacted Jon to inform him
that the problem was known and unfixable: the SoX GigaPoP
had a problem with their fiber connection and fixing the problem
would be time-consuming. The SoX GigaPoP did not communicate
to the outside world that the problem existed, but there
is no method, as yet, by which they could have done so. If
the conference planners (especially their networking/video
engineers) had known of it, they could have tried to
reroute the traffic, change the conference site, or warn
users that H.323 (and any other real-time) applications were
too sensitive for the network.
Recommendations
In the discussion that ensued after Jon posted the
problem to the E2Eperf
Interest Group, several recommendations were made and
issues raised:
While developers/users of applications want the network
to run perfectly – high speed and zero-loss –
network engineers feel that application designers need to
realize that the network never will be perfect; applications
need to be more robust to withstand common network errors.
Application users, on the other hand, want network engineers
to understand that there are families of applications that
require near-zero packet loss but are not speed-driven (so
systematic loss is significant).
Users and network engineers must keep each other informed of
changes if the network is to run smoothly. Engineers should
think of how a modification or system failure could affect users
and inform them of the change so that they can make adjustments,
where possible; users should inform network staff when they
plan to use a new type of application to ensure that the network
is capable of supporting it.
Communication is important but education is essential –
both users and network engineers need to understand that different
types of applications have different network needs; users should
be able to tell network engineers the type of application they
will be running, and that alone should convey the speed and
loss tolerance the application family requires, so that
a usable path can be made available.
In response to the requests from several members of the
E2Eperf
Interest Group after this problem occurred, more links
from the GigaPoPs have been added to the Abilene Weather Map;
users need to double-click on a node to see activity at
the GigaPoPs that connect to that node.
Members of the E2Eperf
Interest Group felt that campuses and GigaPoPs should
create a posting location to alert people to known or long-standing
problems.
Knowing the tolerance of an application is important.
There has been an ongoing discussion of compiling information
on the network requirements and tolerances of specific types
of applications.
In solving this problem, Russ and Matt had to use the
router proxy, through which users can issue commands that
return interface statistics. Matt and Russ were
able to get the Atlanta Abilene router to show them the interface
statistics, and that was where they first pinpointed the
1% loss. As a result of this problem (and its solution), this
information is now available on the Weather Map.
Keep track of loss/error statistics: Matt suspects that
Jon’s ping tests didn’t show loss because they weren’t
run long enough to catch
the 1-in-100 loss.
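Matt's suspicion can be checked with a short calculation. If each packet is lost independently with probability p, the chance that n probes all get through is (1 - p)^n, so a short ping run can easily miss a 1% loss. The Python sketch below works this through; the independence assumption and the function names are illustrative, and real loss may be bursty rather than independent.

```python
import math


def prob_no_loss_observed(loss_rate: float, probes: int) -> float:
    """Probability that all `probes` packets arrive, assuming independent loss."""
    return (1.0 - loss_rate) ** probes


def probes_needed(loss_rate: float, confidence: float) -> int:
    """Smallest probe count that shows at least one loss with probability
    `confidence`, assuming independent loss at `loss_rate`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


for n in (10, 100, 300, 1000):
    print(n, round(prob_no_loss_observed(0.01, n), 4))

print(probes_needed(0.01, 0.95))  # 299
```

Under these assumptions, 100 probes at 1% loss still show no loss about 37% of the time, and roughly 300 probes (about 30 seconds at one ping every 0.1 seconds) are needed to see at least one loss with 95% probability.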