Internet2
End-to-End Performance Initiative
When 99% Isn't Quite Enough:
A Tale of Multicast Woe
A Case Study for E2Epi
 

On October 2, 2002, Internet2 began netcasting the EDUCAUSE 2002 conference in Atlanta; Internet2's President and CEO, Doug Van Houweling, was scheduled to give a talk that would be presented from Internet2’s Ann Arbor offices. Jon and Chris were the Internet2 staff responsible for netcasting the event.

During the test phase before the netcasting began, Chris told Jon that the netcast transmission quality was poor; Jon’s video transmission equipment showed significant packet loss but, when he tried to pin down the location of the network problem, he had no luck. Ping was showing no packet loss and Iperf tests showed only 1% loss. Not knowing what else he could do, Jon sent out an S.O.S. to the E2Eperf Interest Group on October 3:

“We currently have a network problem that is proving hard to pin down. Ping (every .1 sec, 1300 byte packets) shows no packet loss. From 199.77.176.251 to Ann Arbor (say 207.75.164.56), there is some problem. Iperf showed only 1% loss. Our video transmission equipment shows major packet loss. Any thoughts? My guess is some type of traffic shaping or perhaps back to back packets problem.”

Russ and Matt, members of Internet2’s End-to-End Performance Initiative (E2Epi) team, received the request for information; they called Jon to discuss the problem – had he run any other tests? What were the parameters and results?

They began by looking at the path: Russ ran a jalaaM test that indicated some packet loss between Atlanta and Ann Arbor, but it gave only a rough indication of where the loss was occurring. Matt had an Iperf server available at the SoX (Southern Crossroads) GigaPoP, which is located in Atlanta. Matt began running Iperf tests; he eventually discovered that 1% loss occurred from Georgia Tech to Ann Arbor but not in the other direction. Matt also discovered that the loss was 1% regardless of applied load – it remained constant between 5 and 50 Mbps.
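The load sweep described above suggests a simple check: if the loss fraction stays flat as the offered rate rises, congestion becomes a less likely explanation, since congestion loss usually grows with load. The following is a minimal sketch of that check, assuming the per-run results have already been collected as (offered Mbps, packets sent, packets lost) tuples. The function names, the tolerance, and the sample numbers are hypothetical and are not measurements from this case.

```python
from typing import List, Tuple

# One test run: (offered load in Mbps, packets sent, packets lost).
Run = Tuple[float, int, int]


def loss_fractions(runs: List[Run]) -> List[Tuple[float, float]]:
    """Return (offered Mbps, loss fraction) for every run that sent packets."""
    return [(mbps, lost / sent) for mbps, sent, lost in runs if sent > 0]


def is_load_independent(runs: List[Run], tolerance: float = 0.005) -> bool:
    """True if the loss fraction varies by no more than `tolerance` across runs.

    The tolerance is an arbitrary illustrative threshold, not a standard value.
    """
    fractions = [fraction for _, fraction in loss_fractions(runs)]
    return bool(fractions) and max(fractions) - min(fractions) <= tolerance


if __name__ == "__main__":
    # Made-up numbers: roughly 1% loss at every offered rate.
    flat = [(5.0, 10_000, 100), (20.0, 40_000, 410), (50.0, 100_000, 990)]
    # Made-up numbers: loss appears only at the higher rate.
    rising = [(5.0, 10_000, 0), (50.0, 100_000, 8_000)]
    print(is_load_independent(flat))
    print(is_load_independent(rising))
```

A flat result like the first sample points toward something that damages a fixed share of packets, such as a faulty link, rather than a queue that overflows under load.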

After discovering that the loss was occurring on the Atlanta to Ann Arbor route, Matt tested various parts of the path. First, he checked the Abilene Weather Map, which displays loss/error status on internal links. Then, he checked the connection from Abilene to Ann Arbor (via the Merit GigaPoP); no problems on the Ann Arbor end. Matt verified loss from Indianapolis to the SoX GigaPoP.

At this point, Matt and Russ began researching the SoX GigaPoP end of the path. They looked at the interfaces on the Abilene Atlanta router and found that the interface to the SoX GigaPoP was showing CRC (cyclic redundancy check) errors on about 1% of packets, in the inbound direction only, since the counters were last reset some time before. The router computes a checksum over each packet to verify that it arrived intact; in this case, it was discarding 1% of the packets as corrupt.
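To illustrate the kind of check described above, here is a minimal sketch using Python's `zlib.crc32`. It stands in for whatever frame check the Atlanta interface actually used, which the case study does not specify. The framing (payload followed by a 4-byte big-endian CRC), the function names, and the way the two counters are combined are assumptions made for illustration.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload (hypothetical framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    if len(frame) < 4:
        return False
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate line corruption by inverting a single bit of the frame."""
    damaged = bytearray(frame)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def inbound_error_rate(crc_errors: int, good_packets: int) -> float:
    """Share of inbound packets dropped for CRC errors.

    Assumes the two counters are disjoint; real router counters may differ.
    """
    total = crc_errors + good_packets
    return crc_errors / total if total else 0.0


if __name__ == "__main__":
    frame = add_crc(b"video payload")
    print(crc_ok(frame))                # intact frame passes the check
    print(crc_ok(flip_bit(frame, 3)))   # a single flipped bit fails it
    print(inbound_error_rate(1, 99))
```

A receiver that finds a mismatch has no way to repair the frame, so it drops it, which is why a steady rate of CRC errors on one interface shows up to applications as a steady rate of packet loss in that direction.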

Matt spoke with the Abilene NOC staff and found that this was a known problem with some optical gear through this GigaPoP; the SoX GigaPoP had been working on the problem. Matt spoke with one of the SoX GigaPoP network staff, who indicated that, although they knew of the problem, there was not enough time to fix the problem (or money in the budget to expedite the correction).

While all the testing was going on, Jon and Chris managed the netcast of the conference. Presentations were transmitted live; the packet loss caused major disruption in the quality and effectiveness of the netcast. Sections of presentations lost both audio and video signals at times; as such, the netcast could be deemed a failure.

By 5:00 p.m. on October 3, Russ contacted Jon to inform him that the problem was known and could not be fixed in time: the SoX GigaPoP had a problem with their fiber connection and fixing it would be time-consuming. The SoX GigaPoP did not communicate to the outside world that the problem existed – but there is no method, as yet, by which they could have done so. If the conference planners (especially their networking/video engineers) had known of it, they could have tried to reroute the traffic, changed the conference site, or warned users that H.323 (and any other real-time) applications were too sensitive for the network.



Recommendations

In the discussion that ensued after Jon posted the problem to the E2Eperf Interest Group, several recommendations were made and issues raised:

  • While developers/users of applications want the network to run perfectly – high speed and zero-loss – network engineers feel that application designers need to realize that the network never will be perfect; applications need to be more robust to withstand common network errors.
     
  • Application users, on the other hand, want network engineers to understand that there are families of applications that require near-zero packet loss but are not speed-driven (so systematic loss is significant).
     
  • Users and network engineers must keep each other informed of changes if the network is to run smoothly. Engineers should think of how a modification or system failure could affect users and inform them of the change so that they can make adjustments, where possible; users should inform network staff when they plan to use a new type of application to ensure that the network is capable of supporting it.
     
  • Communication is important but education is essential – both users and network engineers need to understand that different types of applications have different network needs; users should be able to tell network engineers the type of application they will be running and have that convey information about the necessary speed and packet loss-resistance of the application family so a usable path is available.
     
  • In response to requests from several members of the E2Eperf Interest Group after this problem occurred, more links from the GigaPoPs have been added to the Abilene Weather Map; users need to double-click on a node to see activity at the GigaPoPs that connect to that node.
     
  • Members of the E2Eperf Interest Group felt that campuses and GigaPoPs should create a posting location to alert people to known or long-standing problems.
     
  • Knowing the tolerance for an application is important. There has been an ongoing discussion of creating information on the network requirements and tolerances of specific types of applications.
     
  • In solving this problem, Russ and Matt had to use the router proxy, through which users can issue commands to show interface statistics and get that information back. Matt and Russ were able to get the Atlanta Abilene router to show them the interface statistics, and that was where they initially pinpointed the 1% loss. As a result of this problem (and its solution), this information is now available on the Weather Map.


  • Keep track of loss/error statistics: Matt suspects that Jon’s ping tests didn’t show loss because they weren’t run long enough to catch a loss rate of 1 in 100.
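
The arithmetic behind that last suspicion can be sketched as follows. Assuming each packet is lost independently with probability 1% (a simplifying assumption; the case study does not describe the loss pattern), the chance that n probes all get through is 0.99 raised to the power n. The function names and probe counts are illustrative.

```python
import math


def prob_no_loss_seen(loss_rate: float, probes: int) -> float:
    """Probability that `probes` independent probes all succeed."""
    return (1.0 - loss_rate) ** probes


def probes_needed(loss_rate: float, confidence: float) -> int:
    """Smallest probe count that observes at least one loss with the
    given confidence, under the independent-loss assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


if __name__ == "__main__":
    for n in (10, 100, 500):
        print(n, round(prob_no_loss_seen(0.01, n), 3))
    print(probes_needed(0.01, 0.95))
```

Under this model, a run of 10 probes shows no loss about 90% of the time and a run of 100 still shows none about 37% of the time; roughly 300 probes are needed to be 95% sure of seeing at least one loss. If the real loss was bursty rather than independent, a short test could miss it even more easily.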
     



    © 1996 - 2008 Internet2 - All rights reserved
    1000 Oakbrook Drive, Suite 300, Ann Arbor MI 48104 | Phone: +1-734-913-4250