
Internet2 E2Epi Roadmap

 

Abstract

 

The Internet2 End-to-End Performance Initiative (E2Epi) has completed three years of work organizing, documenting and enabling solutions to address the “end-to-end” problem in networking. This document provides a context for the current state of the effort, outlines and discusses the relevant issues related to the “end-to-end” problem, and, based upon this, provides an analysis of what the roadmap should be for the coming three years.

 

Executive summary

 

Despite the availability of a lightly loaded high-speed national network infrastructure, users at many Internet2 sites are able to effectively use only a fraction of the available capacity. This makes users understandably frustrated with the empirically poor performance they experience. Another aspect of this situation is the perception from site management that the university may be purchasing more network capacity than is necessary. Perhaps a “right sized” network link would be more economical, especially in times of financial constraints.

 

The Internet2 End-to-End Performance Initiative (E2Epi) has made great strides in completing the tasks defined in the original vision paper “Fat Pipes Are Not Enough” ( http://e2epi.internet2.edu/papers/End-to-End-Perf-Design-Paper.pdf ). Through a series of workshops, tool development and integration efforts, and case studies with Internet2 member institutions, the initiative has identified several major problems that cause poor application performance. One of the fundamental problems is the lack of good empirical data that can be used to evaluate the performance of campus and national backbone networks. Internet2 has therefore spent a large portion of its time developing and deploying the measurement infrastructure needed for this task. With good data, anecdotal reports can be validated and the source of problems may be found and corrected. In addition, others have tackled other aspects of the broader problem, such as the Web100 project working on host issues and the NLANR DAST team working on a diagnostic interface for the novice user. Some of these projects were inspired by the original whitepaper and by the discussions and presentations that surrounded it.

 

With the basic measurement infrastructure now in place on the backbone, progress can be made on reaching the ultimate goal of identifying and correcting both network and application performance problems that artificially suppress network utilization. In this paper the Technical Advisory Group for the E2Epi has identified the next set of challenges that must be met to make further progress in this area. This paper lays out the open issues, defines the steps needed to move forward on these issues, and finally prioritizes the tasks to be accomplished in the next 3 years.

 

The roadmap laid out in this paper describes 3 major goals. The first goal is to continue to improve and extend the measurement infrastructure needed to quantify existing network performance and to identify specific problems with the basic network infrastructure. In particular, this includes the deployment of interoperable infrastructures both throughout the US and internationally. The second goal is to improve the measurement tools and diagnostic functions to automate the detection and analysis of poor performance. The third goal is to engage the application developer and user communities in this process to ensure the experiences learned are transferred to the Internet2 community at large and are not simply known to a few network operations or research staff.

THE PROBLEM

Despite large, lightly loaded circuits with negligible packet loss, users at many Internet2 sites are able to effectively use only a fraction of the capacity available to them.

 

Consider a typical end-user scenario:

  A faculty member has a switched 100 Mbps Fast Ethernet connection to her desktop,

  …which is connected via a lightly loaded gigabit campus backbone network,

  …to an OC-12 Abilene connection with plenty of available capacity,

…yet she may still see only 1 or 2 Mbps throughput when transferring a dataset with ftp.

 

The user understandably becomes frustrated with the empirically poor performance she receives from a network that should be able to deliver far greater throughput.

 

Or consider the problem from the macroscopic level: CIOs and network architects at Internet2 sites look at their 622 Mbps or faster connections, see low average utilization, and wonder whether Internet2 “makes sense” for them, or whether they should instead be purchasing lower capacity connections more appropriate to their apparent requirements.

 

These are the sorts of issues that the End-to-End Performance Initiative (E2Epi) is designed to actively address. We want to make sure end users are able to fully use the network resources made available to them, and we want to make sure sites see resource utilization levels consistent with the investments they’ve made.

History and Current Status

In early 2001, after realizing that high performance networking was not enabled by high-speed circuits alone, Internet2 started the E2Epi to find the causes of poor performance on high performance networks and to work toward everyone getting the full capabilities of the network on a regular basis. The vision of the required work was documented in the paper, “Fat Pipes Are Not Enough” ( http://e2epi.internet2.edu/papers/End-to-End-Perf-Design-Paper.pdf ).

 

In the past three years, the initiative has made significant progress. E2Epi has documented various network performance assessment efforts, reviewed tools to measure network and end-host performance levels, and written case studies on performance problems and their solutions. This information has been made available on the E2Epi web pages ( http://e2epi.internet2.edu/ ).

 

E2Epi held three workshops (sponsored by Internet2 and NSF) to bring together the various research communities including:

 

  DoE-, NSF-, and DARPA-funded projects; <URL?>

  NASA, NIH, and USGS internal projects; <URL?>

  High-Energy Nuclear Physics (HENP); <URL?>

  Very Long Baseline Interferometry (VLBI); <URL?>

  Global Grid Forum (GGF) communities. <URL?>

 

E2Epi brought its expertise in network operations and data collection techniques to this broad research community. E2Epi also shared the workshop results with its community of university researchers and campus network administrators, ensuring that research results will directly benefit the Internet2 user community. E2Epi has also increased community awareness of the problems and complexities of end-to-end performance.

 

One outcome of these workshops is that the Internet2 E2Epi staff have established a leadership role in the GGF. This forum is developing common schemas for representing measurement requests and data responses. The workshops have also produced a better understanding of the overall measurement infrastructure required to understand performance problems, and have identified which projects are working on the missing components of that infrastructure. E2Epi has provided staff time to fill important gaps in the infrastructure that were not being covered by other efforts.

 

A primary E2Epi accomplishment is its key role in the creation of the Abilene Measurement Infrastructure (AMI) to monitor the Abilene backbone links. The AMI allows a better understanding of the limitations to performance and helps to locate performance problems. This infrastructure consists of measurement servers located at all 11 Abilene router hubs. Both throughput and one-way delay measurements are performed on a regularly scheduled basis. The collected data is then archived and published on an open web site, giving Internet2 users and site administrators access to the current Abilene statistics. This greatly aids these users and administrators in determining whether problems exist in the Abilene backbone.

 

Each measurement point contains several physical measurement servers, each currently dedicated to a specific task. Separate servers are used to ensure that sufficient resources (i.e., CPU, NIC, memory) are available to perform the active measurement. At the present time the AMI performs the following active measurements:

  Throughput with BWCTL (BandWidth Control). The BWCTL server performs throughput measurements between the pairs of measurement points. The NLANR/DAST Iperf tool is used to perform the throughput measurement, while the BWCTL program schedules the tests and uploads the data into the archive. The AMI performs a full mesh of BWCTL tests (110 tests taking approximately 20 minutes) every hour.

  Latency with OWAMP (One-Way Active Measurement Protocol). The OWAMP server performs one-way measurements to measure the packet loss, delay, and jitter statistics of the Abilene backbone. Due to the low overhead associated with the OWAMP probe packets, a full mesh of tests is performed once a minute.

  Routing with Traceroute. The Traceroute server performs periodic measurements of the Abilene network looking for routing changes. The data is archived and route changes are displayed in a web based table. The goal is to correlate these route changes with other measurements to understand the impact of these changes.

  An end user’s view with NDT (Network Diagnostic Tool). The NDT provides first/last mile testing and analysis services to the Internet2 user. The AMI has a federation of NDT servers that coordinate their actions to allow automatic server selection. Each NDT server downloads a Java applet to the client’s desktop/laptop computer. This applet communicates with the server’s testing and analysis engine to probe the path for a set of pre-defined problems.
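
The BWCTL test count quoted in the list above follows from the size of the mesh. As a quick check, and assuming each of the 11 measurement points tests toward every other point in each direction, the arithmetic can be sketched as below. The per-test figure additionally assumes the tests run strictly one after another, which the text does not state.

```python
from itertools import permutations

HUBS = 11            # Abilene router hubs, from the text
MESH_MINUTES = 20    # approximate duration of one full mesh, from the text

# A directed full mesh has one test per ordered (source, destination) pair.
pairs = list(permutations(range(HUBS), 2))
tests_per_mesh = len(pairs)

# Assumption: tests run sequentially, so the mesh time is shared evenly.
seconds_per_test = MESH_MINUTES * 60 / tests_per_mesh

print(tests_per_mesh)               # 110
print(round(seconds_per_test, 1))   # 10.9
```

Under these assumptions each throughput test has roughly eleven seconds, including any setup time.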

 

 

In addition to the scheduled tests, the AMI also supports on-demand testing from Internet2 sites, GigaPoP peering points, and other national/international peer networks. This allows these sites to monitor their own performance, and perform “divide and conquer” testing if a problem is detected. The tools to support these tests are the product of both internal development efforts and collaborations with external research teams. They are constantly being refined to improve their reliability and usefulness.

 

As will be described below, the first major goal of this proposed roadmap is the creation of a global measurement infrastructure that can quickly and easily be extended and expanded on an as needed basis.

 

Finally, effort has been placed on gathering specific use cases that either define a problem faced by a specific Internet2 user or, better yet, describe in detail how to fix a specific problem. These use cases are collected and archived on an Internet2 web site, where they are available for review by other Internet2 users and site administrators. Continuous efforts are underway to ensure that this information is available to and used by this community.

 

 

 

Roadmap – Next Steps for the E2Epi

As noted above, the E2Epi team has made significant progress in developing the infrastructure needed to answer the fundamental user question “What’s wrong with the network?” While much has been accomplished, major challenges still remain. This section identifies the major tasks that must be accomplished in order to continue making progress on this project.

 

Appendix A of this paper identifies numerous tasks and open issues that must be addressed in order to completely solve the E2E performance problem. From this list we have identified three major goals that we believe the E2Epi team has the time and resources to tackle. The first goal is to continue to improve and extend the measurement infrastructure needed to quantify existing network performance and to identify problems with the basic network infrastructure. The second goal is to improve the measurement tools and diagnostic functions to automate the detection and analysis of poor performance. The third goal is to engage the application developer and user communities in this process to ensure the experiences learned are transferred to the Internet2 community at large and are not simply known to a few network operations or research staff.

 

Global Network Measurement Infrastructure (GNMI)

As noted above, a typical Internet2 user is unable to achieve a significant fraction of her or his local network capacity. The challenges are many and complex, but the fundamental issue is that raw data must be collected to determine what specific problems exist on the network path or in the user’s host. It is also noted that in today’s collaborative science environment the majority of communications take place across the global Internet, not within a single administrative network domain. Thus the path between peer hosts is composed of network links owned and managed by several autonomous administrative domains.

 

To address this issue, a global infrastructure needs to be created and deployed that can capture and process raw data from numerous administrative domains. This GNMI may consist of passive sensors that monitor the health and operation of the network links, switches, and routers. It may contain active measurement tools that probe the network to establish base-line operating parameters. It may also contain diagnostic tools and algorithms that aid in problem identification and resolution strategies. Finally, host issues such as CPU utilization, interrupt handling concepts, and disk I/O (input/output) operations must be measured and monitored to determine how applications actually perform.

 

The AMI tools now deployed on the Abilene network are being extended to other sites and networks. As described above, the AMI consists of measurement tools located at all 11 Abilene router hubs. These E2Epi-developed tools are currently available, under open source software license agreements, via the E2Epi web site. Over the next three years, efforts will be made to expand the number of Internet2 sites that use these tools on a regular basis.

 

While it is important to accomplish these deployment tasks, it must also be recognized that there is not just one GNMI to measure, archive, analyze, and report, that will serve the needs of all communities at all places and times. Other GNMIs are already in existence; for example, the NLANR AMP project and SLAC’s IEPM-PingER. It is not feasible or practical to expect that these projects will abandon their existing infrastructures to deploy the tools used in the Abilene network.

 

A better approach is to develop common schemas to gather and store the raw test results, schemas to request test data and results from a peer domain, and general algorithms to aid in identifying common problems. Fortunately, the Global Grid Forum (GGF) provides a ready home for these activities. The GGF Network Measurement Working Group (NMWG) has begun to define schemas that will allow different systems to share results and provide a more complete overall picture of the Internet.
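
To make the idea of a shared result format concrete, the following is a minimal sketch of one measurement record that two domains could exchange. Everything in it is hypothetical: the field names, the JSON encoding, the host names, and the sample value are illustrative assumptions and do not represent the schemas being defined by the GGF working group.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MeasurementRecord:
    """Hypothetical record layout, for illustration only."""
    tool: str          # e.g. "bwctl" or "owamp"
    metric: str        # e.g. "throughput_mbps" or "one_way_delay_ms"
    source: str        # measurement point that sent the probe traffic
    destination: str   # measurement point that received it
    timestamp: str     # ISO 8601, UTC
    value: float

def to_wire(record: MeasurementRecord) -> str:
    """Serialize a record to JSON with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

def from_wire(text: str) -> MeasurementRecord:
    """Rebuild a record from its JSON form."""
    return MeasurementRecord(**json.loads(text))

# Made-up sample values; not real measurement data.
record = MeasurementRecord(
    tool="bwctl", metric="throughput_mbps",
    source="hub-a.example.net", destination="hub-b.example.net",
    timestamp="2004-01-01T00:00:00Z", value=940.0,
)
restored = from_wire(to_wire(record))
print(restored == record)   # True
```

The point of the sketch is only that a peer domain can parse a result without knowing which tool or archive produced it, as long as both sides agree on the field names and units.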

 

Thus it will be possible to build a federation of inter-working GNMIs that can provide end-to-end performance data across a network path. As noted above, the GGF NMWG has been identified as the international forum where these interoperability issues can be addressed. The E2Epi team has been a major contributor to this effort, co-chairing this working group. Efforts will continue to ensure that this group develops documents that have broad support from all GNMI development teams. In addition, we will develop one-on-one relationships with the various teams, allowing the testing of proposed standards.

 

Some of the challenges that must be overcome to expand this measurement infrastructure include:

  Secure access to the measurement servers. Throughput servers may be capable of generating a large amount of network traffic. If used unwisely or maliciously, a server could become a security threat to sites and user hosts. Methods must be developed and deployed that make it difficult or impossible to misuse these measurement servers.

  Scheduled access to the measurement servers. Every measurement server has limited resources (CPU, network bandwidth, memory, etc.). Ad-hoc scheduling mechanisms can lead to situations where multiple tests are conducted simultaneously on a single server, exhausting some or all of these resources. In that case, the measurement may reflect the resource limit of some unexpected component rather than the network behavior it was meant to capture. System resources must be scheduled to ensure that measurements are reliable and accurate.

  Cross-domain testing. Each management domain that operates a measurement server has a specific clientele that they are willing to serve. Establishing the trust relationships needed to make cross-domain tests a common occurrence is a challenging task.

  Generation of common measurement schema. Different tools deployed in different administrative domains generate raw data in differing amounts and formats. This data is archived for later use in analyzing the network performance of a specific administrative domain. Peer domains need access to some or all of this data on an automated basis. Thus common data formats and request/response schemas need to be developed to allow this exchange of data.
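
The scheduling challenge in the list above can be illustrated with a small sketch. It assumes a single server that accepts a test only if the requested time slot does not overlap one already granted. The class and method names are hypothetical, and this is not a description of how BWCTL or any other deployed tool schedules its tests.

```python
from dataclasses import dataclass, field

@dataclass
class SlotScheduler:
    """Grants non-overlapping time slots on one measurement server."""
    reserved: list = field(default_factory=list)   # (start, end) in seconds

    def request(self, start: float, duration: float) -> bool:
        """Reserve [start, start + duration) if no granted slot overlaps it."""
        end = start + duration
        for s, e in self.reserved:
            if start < e and s < end:   # the two half-open intervals overlap
                return False
        self.reserved.append((start, end))
        return True

scheduler = SlotScheduler()
print(scheduler.request(0, 10))    # True: the server is idle
print(scheduler.request(5, 10))    # False: overlaps the first test
print(scheduler.request(10, 10))   # True: starts as the first test ends
```

A real scheduler would also need to authenticate the requester and coordinate with the server at the far end of the test, which ties this challenge to the secure-access and cross-domain items.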

 

All of these challenges will be addressed in the upcoming project cycle. The existing efforts have made significant progress in identifying these challenges. Preliminary proposals to address some of them have already been developed. Formalizing and refining the specific procedures is an on-going task.

 

Analysis and Diagnosis

As stated above, much effort has been put into gathering raw information about the state of a network, but it still takes an expert to analyze the data and determine what it means. Since there are not enough experts to go around, techniques and tools that do what the experts do are needed. The E2Epi is also seeking funding for a “wizard workshop” (or series of workshops) to increase the number of experts and to establish a method by which more “wizards” can be trained.

 

While it is important to train a new set of experts, in the long run it will be just as important to provide an ‘expert system’ that can analyze a network for a novice user. The tools and algorithms that can automate these analysis functions are currently in a very primitive state. Appendix C contains a list of 10 problems that the E2Epi has identified as known causes of poor performance. These are a combination of configuration issues (e.g., setting the proper TCP buffer size for the network path) and fault conditions (e.g., duplex mismatch on the first hop).
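
The TCP buffer size example can be made concrete with the bandwidth-delay product. A sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window size divided by round-trip time. The sketch below uses assumed values: a 70 ms cross-country round-trip time and a 64 KiB socket buffer are illustrative, not measurements from this paper.

```python
def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_ms / 1000 / 8

def max_throughput_bps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on throughput for a given window and round-trip time."""
    return window_bytes * 8 * 1000 / rtt_ms

LINK_BPS = 100e6      # 100 Mbps desktop connection
RTT_MS = 70           # assumed cross-country round-trip time
WINDOW = 64 * 1024    # assumed 64 KiB socket buffer

print(round(bdp_bytes(LINK_BPS, RTT_MS)))                     # 875000
print(round(max_throughput_bps(WINDOW, RTT_MS) / 1e6, 2))     # 7.49
```

Under these assumptions the host would need a buffer of roughly 875 KB to fill a 100 Mbps path, while the 64 KiB buffer caps a single connection at about 7.5 Mbps regardless of how lightly loaded the network is.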

 

Work on analyzing the first/last mile of the network path has begun with the development of the NDT. However, more work needs to be done to improve this tool. New detection algorithms need to be developed that can accurately and reliably detect a wide range of potential problems. The list in Appendix C will be expanded as more common problems are uncovered.

 

With regularly scheduled tests, data is being produced all the time. On a well running network, the data should show that the network is running well, but how can you tell from the data when it is not? Better yet, how can the network tell you that it has a problem? The amount of data being generated by these regularly scheduled tests, and the graphs they produce, already swamp the network operations staff at an Internet2 site. So much data is being generated that the “interesting” events are being lost in the noise. These events must be extracted and displayed in a way that makes it easy to understand what is happening in real time.

 

One simple answer is to set parameters for a specific measurement that define a “normal” range. When the measurements go outside this range, an event is generated, warning the network operators. The hard part is defining the “normal” range and detecting anomalies. What is normal or acceptable in some cases may be abnormal in others. Different applications make different demands on the network, and not all applications degrade gracefully when these demands aren’t met. Any anomaly detection algorithm must take these differences into account when raising operator alarms.
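
A minimal sketch of the “normal range” idea follows. It assumes the range is derived from a baseline of earlier samples as the mean plus or minus three standard deviations. That rule, the function names, and the sample delay values are illustrative assumptions, and choosing a suitable rule is exactly the hard part described above.

```python
from statistics import mean, stdev

def normal_range(baseline, k=3.0):
    """Return (low, high) as mean +/- k standard deviations of the baseline."""
    m, s = mean(baseline), stdev(baseline)
    return m - k * s, m + k * s

def find_events(samples, low, high):
    """Return (index, value) for every sample outside [low, high]."""
    return [(i, v) for i, v in enumerate(samples) if not low <= v <= high]

# Made-up one-way delay samples in milliseconds, for illustration only.
baseline = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
low, high = normal_range(baseline)
print(find_events([20.0, 20.2, 35.4, 19.9], low, high))   # [(2, 35.4)]
```

A fixed statistical band like this ignores time-of-day patterns and application requirements, so it should be read as a starting point rather than a recommended detector.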

 

Finally, what is the best operator interface for displaying these anomalies? Does each administrative domain need to constantly monitor every network infrastructure deployed around the globe? Tools need to be created that automate the data collection process and allow the network operations staff and/or individual users to track down problems at remote locations.

 


1.1.1  Creation of Analysis and Diagnostic Tools

 

As described in this section, the primary tasks with respect to tools that we will accomplish in the next 3 years are:

  We will extend the tools currently deployed in the AMI to improve their specific functions and capabilities.

  We will extend the analysis tools to improve their reliability and accuracy.

  We will continue to evaluate tools that can extract anomalous behavior and events from the large stream of data being collected.

 

The E2Epi can also serve as a focal point where individuals developing new tools can exchange ideas and use the E2Epi community to test them. Both existing and new tools will serve different user groups, including end users, network operations staff, network engineers, and researchers.

 

Applications Support

One thing that we have learned is that applications developers are not necessarily network experts, and often the applications developers and the network experts speak different languages when describing requirements or problems. An application that works across the lab may not work across the country. Applications developers could use some help to make their network applications work better.

 


1.1.1  Application Network Use Characteristics

For some applications, it may be that packet loss is the key network characteristic that controls performance; for another application, it may be latency. How does an applications designer know this, if he is not a network expert? Developing classes of applications and identifying the required network characteristics would help both the designer and the users of the application. Will the application work? Check for the required network parameters based on the type of application and the expected deployment environment. Some work has been done in this area but more is needed and it needs to be made available to those that need it.

 

Applications can and should be more robust. If the network can’t provide a specific service (for example, the hosts are on different continents, so sub-millisecond RTTs are not going to happen), then the application needs to hide this delay. For example, when we talk about audio/video applications, we need to be clear that different uses may require different resources. A broadcast video application can hide delay/jitter with an appropriate play-out buffer. This buffer will introduce a startup delay, but will then hide any run time delays or jitter. The same technique may not be suitable for a 2-way conference call, as human factors also come into play and they can’t be ignored. It is important to take the usage context into account, as well as the application itself.
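
The play-out buffer trade-off can be sketched with a small model. It assumes packets are sent at a fixed interval, each experiences its own network delay, and play-out begins a fixed buffer time after the first packet arrives. The function name and the delay values are made up for illustration.

```python
def late_packets(delays_ms, interval_ms, buffer_ms):
    """Count packets that arrive after their scheduled play-out time.

    Packet i is sent at i * interval_ms and arrives delays_ms[i] later.
    Play-out starts buffer_ms after the first packet arrives and then
    consumes one packet per interval.
    """
    start = delays_ms[0] + buffer_ms
    late = 0
    for i, delay in enumerate(delays_ms):
        arrival = i * interval_ms + delay
        deadline = start + i * interval_ms
        if arrival > deadline:
            late += 1
    return late

# Made-up network delays with jitter, in milliseconds.
delays = [80, 95, 82, 130, 85, 110, 81, 90]
print(late_packets(delays, interval_ms=20, buffer_ms=0))    # 7
print(late_packets(delays, interval_ms=20, buffer_ms=50))   # 0
```

In this model a 50 ms buffer absorbs all of the jitter at the cost of 50 ms of added startup delay, which is acceptable for a broadcast but adds directly to the conversational delay in a 2-way call.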

 


1.1.2  Guidelines for Design of Network-Based Applications

Do applications designers know when cross-country latency will be important? Do they know what packet loss will do to their application performance? Do they know how to make a network application easy to debug? Often, they do not. Guidelines for applications design can help designers not only avoid the pitfalls of network use, but also could help them design applications that can adapt to the current network conditions and not completely fail when things are not optimal.

 


1.1.3  Common Libraries for Network Operations

Should applications designers reinvent the solutions for good network performance every time? They wouldn’t have to if there were a common library of routines that they could use for many of the network functions that applications need. Need to get bulk data transfers done? Pull up the library. Need a real-time stream? Pull up the library. Standard routines would also help network experts since they would generate traffic that the network experts could recognize.

 


1.1.4  Tools for Applications Performance Debugging

So far, E2Epi has primarily looked at tools that a network expert would use, but what about the applications designer? Having a list of tools and instructions on when to use them in debugging applications problems might be just the thing. We also need to support novice end users. They need help in determining when a real problem exists and when the application isn’t suitable for their specific needs. For example, they may need help determining when a ‘chatty’ protocol that works fine in the LAN will fail miserably in the WAN.
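
The ‘chatty’ protocol case lends itself to a back-of-the-envelope estimate. The sketch below assumes a protocol that waits for a reply after every request, so completion time is the number of exchanges times the round-trip time plus the time to push the bytes onto the link. The exchange count, payload size, and round-trip times are assumed values chosen for illustration.

```python
def transfer_time_s(round_trips, rtt_ms, payload_bytes, bandwidth_bps):
    """Estimate completion time as round-trip waits plus serialization time."""
    wait = round_trips * rtt_ms / 1000
    send = payload_bytes * 8 / bandwidth_bps
    return wait + send

EXCHANGES = 1000        # assumed request/response pairs
PAYLOAD = 1_000_000     # assumed total bytes moved
RATE = 100e6            # 100 Mbps

lan = transfer_time_s(EXCHANGES, rtt_ms=0.5, payload_bytes=PAYLOAD, bandwidth_bps=RATE)
wan = transfer_time_s(EXCHANGES, rtt_ms=70, payload_bytes=PAYLOAD, bandwidth_bps=RATE)
print(round(lan, 2))   # 0.58
print(round(wan, 2))   # 70.08
```

With these numbers the same exchange takes well under a second in the LAN and more than a minute in the WAN, even though the link speed is identical in both cases.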


Application-Network SWAT Team

While a measurement infrastructure and a derived 'finger-pointing tool' are essential elements for identifying network performance problems, they are not solutions. Solutions to performance problems caused by defects in the interior of the network are fairly well understood, and there are organized staff who have the access, capabilities, and (most importantly) mandate to fix these problems. Solutions to performance problems caused by defects in end-user applications are not as clear, and most often there are few or no staff with the mandate, much less the capability, to fix the problem.

 

It is our opinion that a 'finger-pointing tool' deployed in the current network infrastructure will most often point at either the last hop, or the user application. This is still within the E2Epi domain, and the E2Epi needs to have a coherent plan on how to address these issues.

 

A collection of user application "how I did it" stories, cookbooks, online course offerings, and other documentation is a good start. Many more case studies are always needed, but they are difficult to collect from users who do not have time to document their every move while they are optimizing an application. Here is a suggestion on how they could be collected: the E2Epi would do them first-hand.

 

A small "team" would be assembled with the express purpose of helping to tune applications to run efficiently over the network. Internet2 is in a unique position to solicit applications from its member institutions. The service provided would be direct, and would increase community knowledge through the 'after-action' stories published and presented. The purpose of the team would *not* be to directly help everyone who needs assistance (this does not scale, obviously) but to directly assist *some* applications, and to collect and publicize methods that could be used at any institution. An adjunct goal might be to encourage the creation of similar "teams" at the institutional level, which would be backed by the knowledge & experience gained by the E2Epi team.

 

"Getting the message out" is a strength of Internet2 that is essential to successful dissemination of the above collected wisdom.

 

Measurement Architecture

E2Epi has held a few workshops to bring the community to agreement on what constitutes a GNMI and what parts are needed. The picture is still not complete, however. Further work is needed to fill in the details so that various GNMIs can not only work together but may eventually have interchangeable parts. Opportunities for GNMI developers to work together are needed to complete this work.

 

Security and Performance

We all know that many of the recent network security measures have been at odds with performance. Firewalls block ports or limit bandwidth. Router access lists can have similar effects. While not strictly security, NAT boxes can make certain applications break or degrade. The Internet2 Security at Line Speed Working Group ( S@LSA ) will be addressing these issues and E2Epi will follow these efforts closely.

 

In conclusion, not all “network problems” are caused by the network infrastructure. The basic TCP/IP protocols used in the Internet rely on every host playing an active role in managing the traffic it injects into the network. Thus interactions between the TCP/IP protocol stack, the network infrastructure, and the application program can have major performance implications. For example, an application that exchanges numerous short messages will function differently in LAN and WAN environments. Determining when these difficulties are caused by the application and when they are caused by an infrastructure problem remains an open question.

 

To address these issues, the E2Epi will define a series of “events” that indicate when something out of the ordinary happens. The Internet2 Middleware Diagnostics Working Group is working on a system of events that will allow an application to determine if there are problems within the different sub-systems (e.g., disk I/O, CPU utilization, network access) the application uses. Other events may be caused by external systems used by an application, including DNS, authentication, and authorization infrastructures. The specific network events that should be logged need to be defined.

 

 

  We will work with other user and developer communities to explore ways to correlate events from multiple sources.

 


 

Operational Issues

While the preceding sections describe fundamental tasks that must be performed, completing them is not sufficient to declare the E2Epi project a success. The final step of transferring this knowledge and experience to the novice network user must also be accomplished. Thus the components and services developed under this proposal must be institutionalized, such that they are an expected and common component of the Internet. Users must be able to rely on getting both the raw and analyzed data that describes their network use.

 

Moving to the Global Internet

Part of the mission of Internet2 is technology transfer. The developments of E2Epi are no exception. For a GNMI to be truly useful in solving end-to-end problems, the infrastructure must also be adopted by the commercial ISPs and non-U.S. networks. Commercial ISPs may have different concerns than the R&E community and, therefore, need to become involved in the development of the GNMI structure as soon as possible. In the past year, the E2Epi has focused on collaboration with international partners (such as GEANT and the Global Grid Forum) to ensure the interoperability of data and tools. These efforts will need to continue and expand.

 


1.1.5  Top Level Coordination

Even though systems are designed for each operational domain to handle a majority of the responsibility of supporting its part of the infrastructure, there is always a need for top-level coordination of any system that spans the entire Internet. A system that looks at an end-to-end path, by nature, needs to span the entire Internet. Planning for this support needs to be done well in advance of its eventual need because there are technical and political implications.

 


1.1.6  Sharing Experience

One strength of the E2Epi project has been the sharing of experiences. In this way, everyone can benefit from what is learned by one individual. Sharing can occur in a number of areas, including via various case studies, resource lists, and tool reviews.

 


1.1.7  Campus Network Case Studies

One problem all Internet2 campuses face is designing, building, operating, and upgrading their campus network. Many lessons have been learned by campuses already, sometimes the hard way, sometimes through success. This series of case studies provides examples of how campuses of different size and focus solved their network issues.

 


1.1.8  Problem Resolution Case Studies

Every time there is a performance problem and experts are brought in to solve it, there is a story to be told that can inform a wider audience. Unfortunately, experts typically move immediately on to the next problem and, while they learn from the experience, others do not. E2Epi has started collecting these experiences in a series of problem case studies, so that all may learn from them.

 


1.1.9  Searchable Repository of Problems and Solutions

While collecting case studies of problems is useful, what is really needed is a way to find particular case studies that fit your current problem. A searchable repository of case studies can be built as the number of case studies grows. However, it also needs to be easy enough for people to enter their own case studies that they will actually do so.

 


1.1.10  Tools and Ratings

There are many different tools available for testing performance, both free and commercial, but which one should you use? E2Epi has a tools list on its web page, but a rating system in which users can rate and comment on each of the tools would make it more useful.

 

Conclusion

While it would be nice to accomplish all the items above, it would require many more resources than our community has available. Therefore, the E2Epi TAG ranked the items to determine the top priority items on which to focus. These items are:

 

1)  Continue work on the GNMI with the expansion of the active measurement systems but also work toward including information from Layer 2, SNMP, and passive measurement systems.

 

2)  Work on diagnostic end-user tools that make use of the GNMI to report on paths of particular interest to the user.

 

3)  Create a searchable repository of problems and solutions (case studies). Find ways to encourage people to share their experiences on a regular basis and to have those experiences added to the repository. Suggestions included allowing comments on repository entries and providing a forum or email list where people can discuss performance problems.

 

4)  Support networked applications by better understanding the classes of applications that share common network requirements and by defining the network characteristics needed for good performance. Create guidelines or, better yet, software libraries for getting good performance from networked applications based on their classification.

 

 


A.  Appendix A: Descriptions of Possible Areas of Work for Internet2 E2Epi

 

1.1  Expansion of the GNMI

While good progress has been made in the last three years in defining and developing an end-to-end measurement infrastructure, implementation has only just started. At this time, systems to collect the raw data have been created. Even with these systems in place, refinement of the actual measurements is needed, based on what is found to be useful in the analysis and diagnosis work. For Internet2, the piPEs system needs to be expanded beyond Abilene to other backbones, GigaPoPs, regional networks, and, especially, campuses. Coordination with other GNMIs will be important to provide the best possible coverage.

 

1.2  Test Refinement and New Tests

As more is learned from the current tests in use, some changes to the tests may occur. In some cases, we may want to design new tests based on lessons learned. Tests will evolve as the GNMI evolves.

 

1.3  Common Schema

Work needs to be continued on the common schemas for GNMI data records, information requests, and responses. While work has been done on initial definitions, real world experience is needed to verify the workability of these schemas. This work, started in the GGF, should continue in that venue. Internet2 will continue to provide leadership in this area by co-chairing the relevant GGF working group.

 

1.4  Inter-Domain Coordination

It is envisioned that numerous operational domains will own and operate their own internal measurement infrastructure. To have a complete picture of the end-to-end network path, however, there needs to be data for each of the participating operational domains. Many issues need to be resolved before this complex environment can be properly analyzed.

 

1.5  Layer 2 Information

Most of the information collected by the current GNMIs is at Layer 3 of the protocol stack, the IP layer. Layer 2 data are often necessary when more details are required to diagnose a problem. For example, Ethernet switches could indicate their status to determine if an Ethernet duplex mismatch exists, but this information is not generally available. Layer 2 devices also contain buffers and other features that can affect overall performance. The challenge is in accessing these devices to retrieve the necessary data. Polling at regular intervals would create a large amount of archived data that would not often be used. On-demand access would require administrative access that most sites would not make available to the general user community. Lastly, the collected data must be correlated with data from Layers 3-7 to ensure that it is meaningful.

 

1.6  IPv6

It is often assumed that IPv6 has the same issues as IPv4. Is this true, however? More experience and investigation are needed to determine whether any IPv6-specific issues affect performance.

 

1.7  Passive Measurement Data

Much of the current work on the piPEs GNMI has been with active tests of the network and computer infrastructure. Too many active tests can consume the very resources they measure and hinder real use of those resources. Passive testing uses existing traffic to determine the state of the network or computer. Thus, one would think that passive measurements would be preferred; however, passive measurements are hard for several reasons. Taps to collect the data, distillation of high-volume data flows, and privacy issues must all be resolved to make passive data collection useful. To save network resources, passive data collection techniques should be advanced. [Joe: do you want to provide a pointer to CAIDA’s Coral Reef work here]

 

1.8  SNMP Data

One form of passive data that has not been well coordinated with the piPEs GNMI effort is the SNMP data that is available from many of the devices that form the end-to-end path. Here, again, some of the information is available on web pages, such as traffic or error graphs, but it is not readily usable to help diagnose performance problems, other than having an expert look at the graphs (if the expert knows that they exist!). SNMP data needs to be routinely included in an overall GNMI design and implementation. [Joe: would it make more sense to talk about trying to actually get direct SNMP access rather than derivative/summarized data access? How will the all-too-common changes to port assignments/network topology be handled?]

 

1.9  Host/OS Information

At the start of E2Epi, the Pittsburgh Supercomputing Center (PSC) Web100 project was identified as an important component in providing information to determine the performance issues on a computer and the associated operating system (OS). During the project’s development period, the PSC group modified the Linux 2.4.x kernel to show the benefit of this information. While there has been some progress in getting such variables into other operating systems (such as the planned inclusion of similar variables in the upcoming release of Windows), little use of the variables has been made, other than using them on diagnostic servers to guess at what is going on with a host at the other end of the connection. Direct access to the information on the target computer is generally not available. This type of information is needed directly from more of the computers involved in performance diagnosis. [Joe: Are all the correct operating systems/kernel versions currently available? Have we made progress getting Web100 diagnostics included in default distributions?]

 

In addition to Web100 variables, we still need a simple tool that runs on local hosts regardless of operating system. It should provide fine-grained local monitoring, integrate network monitoring tools (e.g. Iperf), and report all values to a common monitoring server (e.g. a MonALISA service). It should also be able to test connectivity and determine the most appropriate network and host settings to optimize the connection. Today’s traffic analysis of Internet2 shows that most users (90%) are still at 10 Mbps or less. A useful end-host test tool will help to reduce the huge “wizard’s gap” in achieved performance.

1.10  Data Repositories

As mentioned above, most data collected is either presented only to the person requesting the test or displayed on a web page. A system of data repositories that collects the results from scheduled (or on-demand) tests is needed so that tests for the same data need not be repeated. Scheduled tests covering operational domains also help with a “divide and conquer” approach to problem isolation. Though the idea of and need for data repositories have been agreed upon, the work to define standards and an implementation is yet to be done. The common schema work fits prominently into the implementation of these repositories.
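
One way to picture how a repository lets a test result be reused rather than repeated is sketched below. This is only an illustration under stated assumptions: the class name `ResultRepository`, the key of (source, destination, test name), and the `max_age_s` freshness limit are all hypothetical and are not part of any piPEs or GNMI specification.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical key: (source host, destination host, test name).
Key = Tuple[str, str, str]


class ResultRepository:
    """Keeps the latest result per path and test; illustrative only."""

    def __init__(self, max_age_s: float = 3600.0) -> None:
        # Results older than max_age_s are treated as stale (assumed policy).
        self.max_age_s = max_age_s
        self._results: Dict[Key, Tuple[float, float]] = {}

    def store(self, src: str, dst: str, test: str, value: float,
              now: Optional[float] = None) -> None:
        """Record a result together with the time it was measured."""
        timestamp = time.time() if now is None else now
        self._results[(src, dst, test)] = (timestamp, value)

    def lookup(self, src: str, dst: str, test: str,
               now: Optional[float] = None) -> Optional[float]:
        """Return a fresh stored result, or None if the test must be run."""
        entry = self._results.get((src, dst, test))
        if entry is None:
            return None
        timestamp, value = entry
        current = time.time() if now is None else now
        return value if current - timestamp <= self.max_age_s else None
```

A caller would check `lookup` first and schedule a new test only when it returns `None`, storing the fresh result afterwards so that later requests for the same path can reuse it.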

1.11  New Optical Networks

As networking explores the world of direct use of optical networking, measurement and performance issues will arise that were not of previous concern. Unlike the current situation of having to “fit in” measurement after the design of the network, this work in optical networking provides the opportunity to “design in” the capabilities of measurement and quality control. These issues need to be kept in front of the designers so that they are not “left until later.”

Appendix B: List of Possible Areas of Work for Internet2 E2Epi

Expansion of the GNMI

Test Refinement and New Tests

Common Schema

Inter-Domain Coordination

Authentication/Authorization

Layer 2 Information

IPv6

Passive Measurement Data

SNMP Data

Host/OS Information

Data Repositories

New Optical Networks

 

Analysis and Diagnosis

Standard Operating Parameters (SOPs)

Definition of Events

Creation of Diagnostic Tools

 

Operational Issues

Moving to the Global Internet

Top Level Coordination

 

Sharing Experience

Campus Network Case Studies

Problem Resolution Case Studies

Searchable Repository of Problems and Solutions

Tools and Ratings

 

Applications Support

Application Network Use Characteristics

Guidelines for design of network based applications (adaptive apps)

Common Libraries for Network Operations

Tools for applications performance debugging

 

Measurement Architecture

 

Security and Performance

 

 

B.  Appendix C. 10 problems with local links

 

 

0. Unknown

 

No condition below triggers. (Important to know when to give up.)

 

1. TCP window too small

 

capacity*rtt > c*window

 

2. Ethernet duplex mismatch

 

for a concurrent bidirectional CBR test with a small rate (e.g., c*tcp_throughput, where 0<c<1), jitter=0, loss_1=0, loss_2!=0.

 

3. Bad cables (close to host)

 

CBR: BER!=0, jitter=0

 

4. Congestion

 

jitter>0 or (delay_{with_tcp}=delay_{without_tcp} and loss!=0)

 

5. WiFi

 

??? Very important to detect somehow (perhaps by asking the user, but preferably automatically), but we can offer no test right now.

 

6. Excessive delays

 

delay>>0

 

7. Half-duplex Ethernet connection

 

CBR: jitter=0, loss_1!=0!=loss_2

 

8. Middleboxes

 

Try to look at IP numbers. See if TCP_MAXSEG gets echoed back correctly.

 

9. MTU mismatch.

 

Send UDP packets of many sizes, look for sizes where loss=100%.

 

10. Asymmetric capacity.

 

capacity_1!=capacity_2 (must have estimates for more important item 1; might come from packet train timing).
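
The conditions in this list can be read as a small rule table. The sketch below is one possible encoding, not a tested diagnostic tool, and every name in it is hypothetical. It rests on several assumptions:

- Item 1 uses the standard bandwidth-delay-product relationship (a window smaller than capacity times RTT limits throughput), with the undefined constant c taken as 1.
- Item 2 accepts loss in either single direction.
- The “loss” in item 4 is taken to be loss seen during the TCP test.
- “delay>>0” is replaced by an arbitrary threshold.
- Items 5 and 8 are omitted because the list gives no measurable condition for them.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LinkObservation:
    """Measurements for one local link. Every field name here is illustrative."""
    capacity_1_bps: float = 0.0    # estimated capacity, direction 1 (bits/s)
    capacity_2_bps: float = 0.0    # estimated capacity, direction 2 (bits/s)
    rtt_s: float = 0.0             # round-trip time (seconds)
    window_bytes: float = 0.0      # TCP window in use (bytes)
    jitter_s: float = 0.0          # jitter seen in the CBR test
    loss_1: float = 0.0            # CBR loss fraction, direction 1
    loss_2: float = 0.0            # CBR loss fraction, direction 2
    bit_error_rate: float = 0.0    # BER seen in the CBR test
    delay_s: float = 0.0           # delay with no TCP flow running
    delay_with_tcp_s: float = 0.0  # delay while a TCP flow is running
    tcp_loss: float = 0.0          # loss during the TCP test (assumed meaning)
    udp_loss_by_size: Dict[int, float] = field(default_factory=dict)


def diagnose(obs: LinkObservation, excessive_delay_s: float = 0.5,
             rel_tol: float = 0.05) -> List[str]:
    """Return the name of every rule that triggers, or ["Unknown"] if none does."""
    found = []
    # 1. Window below the bandwidth-delay product (c assumed to be 1).
    if obs.window_bytes < obs.capacity_1_bps * obs.rtt_s / 8:
        found.append("TCP window too small")
    # 2. No jitter and loss in exactly one direction.
    if obs.jitter_s == 0 and (obs.loss_1 > 0) != (obs.loss_2 > 0):
        found.append("Ethernet duplex mismatch")
    # 3. Bit errors without jitter.
    if obs.bit_error_rate > 0 and obs.jitter_s == 0:
        found.append("Bad cables")
    # 4. Jitter, or loss while delay is unchanged by the TCP flow.
    same_delay = math.isclose(obs.delay_s, obs.delay_with_tcp_s, rel_tol=rel_tol)
    if obs.jitter_s > 0 or (same_delay and obs.tcp_loss > 0):
        found.append("Congestion")
    # 6. Delay above an arbitrary threshold (stand-in for "delay>>0").
    if obs.delay_s > excessive_delay_s:
        found.append("Excessive delays")
    # 7. No jitter and loss in both directions.
    if obs.jitter_s == 0 and obs.loss_1 > 0 and obs.loss_2 > 0:
        found.append("Half-duplex Ethernet connection")
    # 9. At least one UDP payload size is always lost.
    if any(loss >= 1.0 for loss in obs.udp_loss_by_size.values()):
        found.append("MTU mismatch")
    # 10. Capacity estimates differ by more than the tolerance.
    if not math.isclose(obs.capacity_1_bps, obs.capacity_2_bps, rel_tol=rel_tol):
        found.append("Asymmetric capacity")
    # 0. Nothing triggered.
    return found or ["Unknown"]
```

As a worked case for item 1, a 1 Gbps path with a 70 ms round-trip time has a bandwidth-delay product of about 8.75 MB, so a 64 KB window triggers the first rule. Several rules may trigger at once; the function reports all of them rather than picking one.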

 

[Joe: I’d add:

-- get a vendor to ship Internet2-optimized systems out of the box (no need for expert tuning, just plug and play). How are vendors currently doing? Who sucks? Who doesn’t?

-- build, test and provide a recipe for emerging high performance hosts (e.g., what can you get from a dual opteron with a couple dozen serial ATA disks such as those available from Rackmountpro?) People need to be able to see and touch affordable high performance hosts to get an idea of what’s possible. (And looking at the specs for a cache appliance that a vendor who shall not be named recently proposed to us, I’m convinced folks still don’t have a clue)

 

-- recognize that end-to-end performance is a dynamic concept, and during long duration flows, routes can and do change, performance may improve or degrade, etc. Good throughput is not a permanent condition, it is a constantly flowing river, subject to changing circumstances (some as simple as contention from other users). Routes change, and those changes can impact performance. Stability should be part of what’s measured.

-- learn from P2P applications how they are optimizing use of network resources; in some cases the best way to improve throughput is through reducing latency: cache where it makes sense, and select from among those caches based on empirically observed performance

-- be prepared for the continued tension between security and performance (trivial example: sites that block all ICMP, thereby causing huge issues for pMTU and jumbo frame-related work). Speaking of jumbo frames, continue waving the flag to get folks to make progress toward becoming jumbo clean end-to-end.

-- recognize the practical utility of multiple parallel streams concurrently attacking a problem; if you need more aggressive throughput, multiple parallel streams may allow you to achieve it (but how many applications have been parallelized?)

-- acknowledge that we have work to do when it comes to securing tools, and making those encrypted tools perform as fast as unencrypted tools

-- I suspect that at the same time we’re worrying about bulk throughput, there are other folks for whom jitter is a bigger issue, and still others for whom lowering latencies is king. We should not ignore them. Latencies may be a particularly big deal in IPv6 space where network topologies are often distinctly suboptimal, leading to increased latency. (e.g., hypothetically assume traffic from the West Coast traveled to Chicago before going to a destination in Asia).

-- Be prepared to address architectural questions such as: “Should I handle a host that needs high performance by building a special network that bypasses my network core to accommodate it, or should I beef up my core and be ready not just for that one special host, but generic high performance requirements?”

-- As users begin to actually get good performance, be ready to deal with the political issues associated with resources actually getting used. (Remember the lessons of P2P: build it, they’ll come, they’ll use it, and we’ll step on them for using it too well). This is particularly relevant now that pricing for commodity transit from TheQuilt is beginning to test/challenge pricing from Internet2 on a per Mbps basis (at least for lower bandwidth connections). [My prediction: soon you’ll be hearing renewed cries to “Make Internet2 just a research network, keep that commodity-ish traffic off it, even if it is going from an I2 school to an I2 school!”]

-- Recognize that bad performance usually involves two different parties (at a minimum). Think about how bad performance metrics will get shared with data sources (as well as data retrieving parties) as an example. Instrument the Apache web server, maybe? Write performance data to the syslog? Build a job that grinds the logs daily to provide a performance summary for system administrators?

 

-- Provide a recipe for local flow-tools users so that they can identify bulk transfers that appear to be getting poor empirical throughput?
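
As a rough idea of what such a recipe might compute, the sketch below flags large flows whose achieved rate is low. It assumes the flow records have already been exported into plain Python objects. The `FlowRecord` fields, the 100 MB size cutoff, and the 10 Mbps rate cutoff are all hypothetical choices for illustration and do not reflect any flow-tools format or default.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FlowRecord:
    """One exported flow; field names are hypothetical, not a flow-tools format."""
    src: str
    dst: str
    octets: int      # bytes carried by the flow
    start_s: float   # flow start time (seconds)
    end_s: float     # flow end time (seconds)


def slow_bulk_flows(flows: Iterable[FlowRecord],
                    min_octets: int = 100_000_000,
                    min_mbps: float = 10.0) -> List[Tuple[FlowRecord, float]]:
    """Return (flow, achieved Mbps) for large flows below min_mbps, slowest first."""
    flagged = []
    for flow in flows:
        duration = flow.end_s - flow.start_s
        # Skip small flows and records with no usable duration.
        if flow.octets < min_octets or duration <= 0:
            continue
        mbps = flow.octets * 8 / duration / 1e6
        if mbps < min_mbps:
            flagged.append((flow, mbps))
    return sorted(flagged, key=lambda pair: pair[1])
```

The size cutoff keeps short interactive flows out of the report, since their average rate says little about bulk-transfer performance.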

 
