University of Michigan Task A High Energy Physics
  Supplemental Request for Computing
      1. Overview
      2. Background
      2.1 QoS
      2.2 Grid Computing
      3. Gigabit Test/Production Network
      4. Hardware Specifics
      5. Technical Support
      6. Advantages for CDF and D0
      6.1 D0
      6.2 CDF
      7. Summary
      References

University of Michigan Task A High Energy Physics

 



Supplemental Request for Computing


 

 

1. Overview

 

We are requesting herewith a sum of $80,243 in supplemental support to the University of Michigan DOE High Energy Physics ATLAS task (Task A). The equipment we are requesting will facilitate our research on high-performance networking (QoS) and our work on distributed (grid) computing in support of ATLAS. It will also enable us to connect to a new high-performance test network now being constructed at the University, one that goes beyond the ordinary infrastructure provided by the University. Access to the test network will also give us a high-speed, low-latency network connection to Internet2 and improve our links to CERN, BNL, and other ATLAS sites on Internet2. An added benefit is that it will improve the network link to Fermilab for two of our Tevatron experiments, CDF and D0. Funds from this request would also be used to provide partial staff support for our network and grid computing work, and for our approved collaboratory R&D efforts.

 

We have internal resources available for approximately half of what we need. Our request is therefore aimed at covering the other half of the cost of equipment and staff support.

 

2. Background

 

The University of Michigan ATLAS group is active in a number of software and computing projects in support of the ATLAS project. Among these are:

All of these activities will benefit from higher network bandwidth, but the last two deserve further comment because they will be directly enabled by the equipment we wish to purchase. QoS and grid computing are both expected to be very important to the success of ATLAS.

 

2.1 QoS

 

ATLAS computing will require significantly better networking capabilities than those provided by the existing Internet. Two complementary approaches are being pursued to obtain improvements in network performance. One is simply to build specialized portions of the network that are capable of very high performance and that may be open to relatively few users. However, the benefits of such over-provisioning can be lost if the number of users or network applications grows in proportion without some sort of management. The other approach, therefore, is to implement different levels of service that may be requested by an application, so that time-sensitive traffic can be given preference over other traffic, much as first-class mail is given preference over parcel post. The general name for such differentiated service is “Quality of Service” (QoS). In contrast, existing Internet protocols only give “best effort” data delivery service, where all network traffic is given the same priority, whether it contains time-sensitive or “important” data or just a fragment of a popular song.

 

The Michigan ATLAS group has recognized the importance of QoS to future networking and network computing, and so has already begun developing a working implementation of QoS. This work is being done in collaboration with U-M’s Information Technology Division (ITD) and Center for Information Technology Integration (CITI), Merit Network, Inc., and the University Corporation for Advanced Internet Development (UCAID), which is responsible for Internet2. Implementing a working QoS system is a daunting task. The ability to create a static preferred channel has only recently been demonstrated [1], but a number of significant challenges remain. The Michigan group has begun by focusing on one of the most important steps, authenticated QoS signaling [2]. In a dynamic network environment bandwidth allocations would have to be created and released as needed, by making a request to a “Bandwidth Broker” (BB), which is a program that controls the differentiated services of QoS-enabled routers. Separate bandwidth brokers will control different QoS domains, so some means of coordinating bandwidth allocations across separate QoS domains is needed. Abuse of preferred services could cause serious problems, so authentication and authorization to and between Bandwidth Brokers will be an important component of any practical QoS system.
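
The request-and-release cycle described above, including coordination across domains, can be illustrated with a minimal sketch. This is a toy model under stated assumptions, not the GARA software or its API: the class BandwidthBroker, the function reserve_path, and the simple set of authorized user names are all hypothetical, and real authorization would rest on certificates rather than a name lookup.

```python
from dataclasses import dataclass, field


@dataclass
class BandwidthBroker:
    """Toy broker for a single QoS domain (hypothetical; not the GARA API)."""
    capacity_mbps: float                          # bandwidth the domain can promise
    authorized_users: set = field(default_factory=set)
    reservations: dict = field(default_factory=dict)
    _next_id: int = 1

    def reserved_mbps(self) -> float:
        """Total bandwidth currently promised to reservations."""
        return sum(mbps for _, mbps in self.reservations.values())

    def reserve(self, user: str, mbps: float) -> int:
        """Grant a reservation, or raise if unauthorized or over capacity."""
        if user not in self.authorized_users:
            raise PermissionError(f"{user} is not authorized in this domain")
        if mbps <= 0:
            raise ValueError("requested bandwidth must be positive")
        if self.reserved_mbps() + mbps > self.capacity_mbps:
            raise RuntimeError("insufficient bandwidth available")
        rid = self._next_id
        self._next_id += 1
        self.reservations[rid] = (user, mbps)
        return rid

    def release(self, rid: int) -> None:
        """Give back the bandwidth held by reservation `rid`."""
        del self.reservations[rid]


def reserve_path(brokers, user, mbps):
    """Reserve in every domain along a path; undo partial grants on failure."""
    granted = []
    try:
        for broker in brokers:
            granted.append((broker, broker.reserve(user, mbps)))
    except (PermissionError, RuntimeError):
        for broker, rid in granted:
            broker.release(rid)
        return None
    return granted
```

The rollback in reserve_path reflects the point made above: a reservation is only useful if every domain along the path grants it, so a refusal in one domain should release what the others have already promised.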

 

The current project at Michigan will construct an authenticated QoS signaling system using the Akenti Access Control System to provide Public Key (PK) based authorization decisions, the Lightweight Directory Access Protocol (LDAP) for storing and retrieving directory data, and the GARA bandwidth broker software from Argonne National Laboratory. Because the University of Michigan relies heavily on Kerberos for authentication and authorization, an important additional component is the KX509 software from CITI, which creates and signs short-term PK certificates based on valid Kerberos authentication of the requester, thus joining the Kerberos and PK systems. Integration of KX509 with GARA will result in the ability to use Kerberos as a bandwidth broker authentication method.

It is important to emphasize that this project deals only with signaling between bandwidth brokers, without actually making bandwidth reservations, and so does not require a high-performance network for testing. The next logical step, however, is to have one or more bandwidth brokers making and releasing bandwidth reservations on a working high-performance network. Access to the new campus gigabit test network (described below) will provide us with an excellent environment to test our system under realistic circumstances. (Experience shows that some things that work on an isolated test network do not work in a real-world production environment.) Once we have tested the system at the end-point level we will want to test QoS between U-M and CERN and between U-M and other ATLAS sites. The support we are requesting will also enable us to connect to the Internet2 Abilene network, which will enable us to make realistic tests of QoS over Internet2 across QoS domains.

Before connecting to the gigabit network there are some basic tests we will perform to verify that our new equipment is QoS capable and to learn how to configure QoS services. Once we have learned how to configure QoS manually we can instruct GARA how to do so automatically when a bandwidth reservation is needed. This kind of testing has been performed by CITI [5] to successfully demonstrate both Police and Drop functionality and bandwidth reservation on a Cisco 7500 router, and Police and Drop functionality on a Torrent IP9000 router. The basic idea of these tests is that the router is connected to three separate computers, two of which send high-bandwidth signals destined for the third using specialized software for unicast traffic generation. The level of traffic from both senders can be adjusted at will, and can be increased until it exceeds the level possible on the outbound link to the receiving computer. When the usual “Best Effort” policy is in use packets are dropped from both input signals, more or less equally. With Police and Drop enabled it is possible to cause packets to be dropped from one signal when it reaches a specific threshold. With Differentiated Services enabled it is possible to cause one signal to pass through to the receiver more or less unscathed while packets are dropped preferentially from the other signal. Demonstrating that these both work as expected and learning how to configure the router to do so are important steps before connecting the router to an outside network. The CITI tests were performed using Fast Ethernet, but we will test our router at gigabit speeds.
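
The expected outcome of the three-computer test can be sketched with a simple fluid model of the outbound link. This is an illustration only, not router configuration and not the CITI test software: the function names, the proportional-sharing rule used for best effort, and the sample rates are all assumptions made for the sketch.

```python
def best_effort(offered, capacity):
    """Share an overloaded link in proportion to each flow's offered rate."""
    total = sum(offered.values())
    if total <= capacity:
        return dict(offered)          # no congestion, nothing is dropped
    scale = capacity / total
    return {name: rate * scale for name, rate in offered.items()}


def police_and_drop(offered, capacity, policed, threshold):
    """Cap one flow at `threshold`, then treat what remains as best effort."""
    capped = dict(offered)
    capped[policed] = min(offered[policed], threshold)
    return best_effort(capped, capacity)


def prioritized(offered, capacity, preferred):
    """Serve the preferred flow first; the other flows share what is left."""
    first = min(offered[preferred], capacity)
    rest = {n: r for n, r in offered.items() if n != preferred}
    delivered = best_effort(rest, capacity - first)
    delivered[preferred] = first
    return delivered


# Two senders each offering 80 Mbits/sec into a 100 Mbits/sec outbound link
# (sample numbers chosen for illustration).
offered = {"A": 80.0, "B": 80.0}
print(best_effort(offered, 100.0))
print(police_and_drop(offered, 100.0, policed="B", threshold=20.0))
print(prioritized(offered, 100.0, preferred="A"))
```

In this model best effort delivers 50 Mbits/sec from each sender, while both of the other policies let sender A through intact at 80 Mbits/sec and hold sender B to 20 Mbits/sec. A real router drops individual packets rather than scaling rates, so measured results should only approximate these figures.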

 

2.2 Grid Computing

 

The Michigan ATLAS group has also recognized the future importance of distributed “grid” computing for ATLAS development and data analysis. As the ATLAS detector becomes operational it will generate, on average, several terabytes of data per day (on the order of a petabyte per year), and grid computing is seen as possibly the only way that meaningful physics can be extracted by dispersed researchers from such a large volume of data. The basic idea behind grid computing is that of a distributed collection of computers and other resources tied together with software that can track and allocate computing or storage resources within the collection in a way that is transparent to the user. The user needs only to submit a job to the grid and it will be executed, either locally or remotely, on either a single machine or on many machines, as necessary. The user need not worry about the details; the results just appear as quickly as the available resources allow. This is thought of as being analogous to plugging an electric appliance into the wall. The electric power is provided by a connection to the national power grid, which provides as much or as little power as needed, without the user having to know whether it was generated by a local coal-fired power plant or was imported from a hydroelectric facility in Quebec.

 

Michigan is part of the ATLAS Grid Testbed, along with Argonne National Laboratory, Brookhaven National Laboratory, Boston University, Indiana University, and Lawrence Berkeley Laboratory. ATLAS grid researchers are also participating in the GriPhyN (Grid Physics Network) Project [3], a collection of IT research groups working with four large-scale physics experiments to implement the first petabyte-scale computational grid. (Besides ATLAS, the other three experiments are LIGO, CMS, and the Sloan Digital Sky Survey.) The ATLAS testbed will provide a platform for testing grid concepts and computing models and will provide feedback to grid software developers. We expect that the testbed will expose unforeseen weaknesses that will allow us to better plan infrastructure upgrades, and that it will help us identify and specify application interfaces. It will give developers, administrators, and users experience with grid computing well before it is needed for production computations. It will allow us to develop realistic test cases, which we can make available as a portable test suite. It will allow us to prepare our networks for mock data challenges. The first practical use of the ATLAS grid is expected to be a distributed Monte Carlo related to studying the ATLAS trigger.

 

The software being deployed for the ATLAS grid testbed is the Globus toolkit [4]. Network connectivity will be critical for ATLAS grid computing, and so will the need for guaranteed high-bandwidth connections. This is one of the principal reasons for our interest in QoS. The dataset sizes for initial grid tests will be around a few hundred gigabytes, but will scale to about a terabyte by the end of 2001. A faster connection to the new campus gigabit network will allow us to perform meaningful testing of our grid system locally, and it will enable us to participate in grid testbed activities with the other members of the grid testbed group.

 

3. Gigabit Test/Production Network

 

The University of Michigan has started a program to define and test the future of high-bandwidth networking. As a part of this program the University is in the process of deploying a new network of high-performance layer 3 switching routers capable of supporting multiple gigabit interfaces and providing differentiated services (QoS). This network is currently being referred to informally as the “triangle”, because it will initially link three main points on campus: the School of Education Data Center on the south edge of central campus, the Cooley building on the north campus, and the Arbor Lakes Data Center, approximately 6 km northeast of the north campus. In addition to providing a gigabit (1000 Mbits/sec) link between these three points, it will also be connected to Internet2 via OC12 (622 Mbits/sec).

 

Direct high-speed access to the “triangle” would facilitate our work on QoS and grid computing and would improve our connection to Internet2. The ATLAS group can currently connect to both the Internet and the triangle only through Fast Ethernet (100 Mbits/sec). Even if we fix this immediate bottleneck, we would then encounter another limitation at OC3 (155 Mbits/sec). Fixing both of these bottlenecks by just upgrading the speed of interface cards would still leave us without differentiated services along the path to the triangle. Because of our needs for high bandwidth and our work on QoS we have entered into detailed discussions with the University about how best to solve our problems. We are being given the opportunity to join the campus test triangle directly from physics via a gigabit connection to the School of Education Data Center. To take advantage of this opportunity requires that we purchase our own layer 3 switching router, along with some additional supporting equipment.
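
To give a feel for what these link speeds mean in practice, the following sketch estimates the ideal time needed to move a one-terabyte dataset (the size anticipated for grid tests by the end of 2001) over each link type named in this request. It assumes a decimal terabyte and a dedicated link with no protocol overhead or competing traffic, so real transfers would take longer; the function name transfer_hours is hypothetical.

```python
# Nominal link rates in Mbits/sec, as quoted in this request.
LINK_RATES_MBPS = {
    "Fast Ethernet": 100,
    "OC3": 155,
    "OC12": 622,
    "Gigabit Ethernet": 1000,
}

TERABYTE = 1e12  # bytes; decimal units are an assumption


def transfer_hours(size_bytes: float, rate_mbps: float) -> float:
    """Ideal time in hours to move size_bytes at rate_mbps (no overhead)."""
    seconds = size_bytes * 8 / (rate_mbps * 1e6)
    return seconds / 3600


for name, rate in LINK_RATES_MBPS.items():
    print(f"{name:>16}: {transfer_hours(TERABYTE, rate):5.1f} h")
```

Under these idealized assumptions a terabyte takes roughly 22 hours over Fast Ethernet and a little over 2 hours at gigabit speed.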

 

It is important to emphasize the unique nature of this new gigabit network. It is on the one hand a test network, above and beyond the existing network infrastructure provided to the rest of the University community, and without the service guarantees of the existing network. On the other hand, current plans are for it to be run as much as possible as a production network, by the same people who run the existing campus network, with the goal of providing a stable level of service consistent with a production network. It is likely that this network will someday be incorporated into the campus backbone, though we see no reason to wait for that to happen when we can access the test network now. The triangle will not be a sterile test environment that is isolated from the outside world, but will instead be connected directly to the campus backbone via OC12 from each corner. These connections to the campus backbone also provide a connection at OC12 to the Michigan GigaPop and from there to the Internet2 Abilene network. Although the test nature of the network means that we will have to tolerate occasional outages or other problems on the triangle, the production nature of the network means that it should be available for us much of the time. Furthermore, it is possible that a connection to Internet2 at OC12 may still be available to us via the School of Education router even if the rest of the triangle is down, though this is not guaranteed.

 

The details of how adding a layer 3 switching router will connect us to the gigabit triangle and to Internet2 are best understood by referring to Figure 1, which shows the existing campus backbone layout, and Figure 2, which shows the network after the addition of the gigabit triangle. These diagrams are complicated and difficult to read, so a certain amount of explanation is in order. One thing to keep in mind is that these kinds of diagrams show the topology of the connections between network elements (routers and switches) but distort the actual physical layout of these elements (though in some cases the lengths of specific connections are shown in the diagrams).

 
Figure 1. The existing campus backbone.

 

Figure 1 shows the existing campus network. In the center is the ITD backbone, consisting of four ATM switches interconnected at OC12 (622 Mbits/sec). Running around the outside of the backbone is the older FDDI (Fiber Distributed Data Interface) ring (100 Mbits/sec, with a dual ring for redundancy). Almost all of the campus routers on the FDDI ring connect directly into the ATM backbone at OC3 (155 Mbits/sec). Connections to the separate backbones for the College of Engineering or the Medical campus are made from the campus ATM backbone at OC12. The campus backbone is connected to the outside world via OC12 and OC3 connections to routers maintained by Merit Network, Inc. The upper right corner of the diagram shows several layer 3 switches at the Arbor Lakes facility, which are connected to the campus network via multiple Fast Ethernet connections (100 Mbits/sec each). The diagram also shows two of these switches connected via Gigabit Ethernet (1000 Mbits/sec). This is a small gigabit test network which will form the nucleus for the larger gigabit triangle.

 

The present connection of the ATLAS group to the Internet is via Fast Ethernet (100 Mbits/sec) to a router in Randall Laboratory, labeled c-RNDL in the lower right corner of the FDDI ring in Figure 1. A connection to CITI, for example, would go from c-RNDL at OC3 to the c-atm-LSA switch and then to the c-ARG router. A connection to the Internet or to Internet2 would go from c-atm-LSA to c-atm-ARBL at OC12 and then at OC12 to Merit. The limit on our bandwidth is the 100 Mbits/sec to c-RNDL, but even if that were upgraded we would be limited by the OC3 connection to the backbone. One should also keep in mind that the campus network is very busy and it is unlikely that we could ever make use of the full OC3 bandwidth available since this is only a “Best Effort” network.
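
The bottleneck argument above amounts to taking the minimum link rate along the path. The sketch below applies that to the present route to Merit using the hop names and rates given in the text. The second path is hypothetical: it assumes the first hop alone were upgraded to gigabit speed, a figure chosen for illustration, and the function name bottleneck is likewise an assumption.

```python
def bottleneck(path):
    """Return the (hop, rate) pair for the slowest link on a path."""
    return min(path, key=lambda link: link[1])


# Present route from ATLAS to Merit; rates in Mbits/sec.
current = [
    ("ATLAS -> c-RNDL (Fast Ethernet)", 100),
    ("c-RNDL -> c-atm-LSA (OC3)", 155),
    ("c-atm-LSA -> c-atm-ARBL (OC12)", 622),
    ("c-atm-ARBL -> Merit (OC12)", 622),
]

# Hypothetical: only the first hop is upgraded, here assumed to gigabit.
first_hop_upgraded = [("ATLAS -> c-RNDL (gigabit, assumed)", 1000)] + current[1:]

for label, path in [("current", current), ("first hop upgraded", first_hop_upgraded)]:
    hop, rate = bottleneck(path)
    print(f"{label}: limited to {rate} Mbits/sec by {hop}")
```

The nominal limit moves from 100 to only 155 Mbits/sec when the first hop is upgraded, and on a busy best-effort network the usable share of that OC3 link would be lower still.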

 
Figure 2. Campus backbone with addition of gigabit test network.

 

Figure 2 shows how the network will look after the initial deployment of the gigabit triangle. The ATM backbone is represented schematically as the cloud in the middle. The gigabit network is represented by the thick line running around the outside. It looks like a square in this diagram, but it’s currently being referred to as a triangle because it connects the three separate facilities. (The three layer 3 switches in the upper right corner are all at the Arbor Lakes facility and are considered to be only one site.) The connection from the gigabit network to the outside world is via OC12 to the campus ATM backbone and then OC12 to Merit and the Michigan GigaPop.

The connection the ATLAS group would have to the gigabit network is not shown in Figure 2, but it is easy to imagine how it would be added. A layer 3 routing switch like those already on the gigabit network would be added in the lower right corner of the diagram. It would not be connected in-line with the gigabit network; instead there would be a separate gigabit connection from the new ATLAS switch to the layer 3 switch at the School of Education Data Center. To maintain a more direct link to other machines in the physics department there would also be a Fast Ethernet connection (100 Mbits/sec) from the new switch to the c-RNDL router. ATLAS computers used for QoS and grid research would be connected directly to the new switch at gigabit speed via optical fiber. Other ATLAS machines would connect to a layer 2 switch via Fast Ethernet, and this switch would connect to the layer 3 switch via gigabit fiber. The layer 3 switch would have 18 gigabit ports, though only a few of these would be used immediately by ATLAS. A few of these ports could be made available to CDF and D0 to give them high-speed access to Internet2 and Fermilab.

 

4. Hardware Specifics

 

The principal piece of equipment we need is a layer 3 switching router like the Cisco 6506. This is sold as separate components for flexibility, so we need to purchase separately a chassis, a power supply, a supervisor engine, and the operating system. This basic configuration has only 2 gigabit ports, and since we will need more than this we also need an additional interface for more gigabit ports. The interface we want has 16 additional ports, so the entire switch will have 18 gigabit ports, which will be more than enough for the immediate future. We will also add another interface card to give us a number of 10/100 Mbits/sec ports.

In addition to the layer 3 switch we also need another layer 2 switch like the Cisco 3512XL. This will be used as a distribution point to extend connections from the layer 3 switch. Using an additional switch will be easier than pulling additional gigabit fiber.
Figure 3. Initial configuration using Layer 3 switching router.

 

Although the final configuration may change, Figure 3 shows the initial configuration we intend to use to connect the layer 3 switch to the gigabit test network, the department network, our present grid machines (linat1 and atgrid) and the other layer 2 switches connecting to our other machines.

 

5. Technical Support

 

As part of our QoS initiative we have hired Eric Myers, a physicist with extensive knowledge of Unix configuration and security issues. He has a strong background in computational physics and parallel computing, and he also understands network security and has worked as a security consultant to the Swedish Research Council for Engineering Sciences (Teknikvetenskapliga ForskningsRådet, TFR) and the Swedish Natural Science Research Council (Naturvetenskapliga ForskningsRådet, NFR). His current efforts on QoS are approximately 50% of his time. We would like to take advantage of the remaining 50% by directing his efforts toward:

1)   System configuration and management, especially with regard to security

2)   Grid software installation, testing and tuning

3)   HEP software maintenance and installation

4)   High-performance network deployment, testing and tuning

 

As a physicist, Myers has a special insight into our requirements for computing. Managing, maintaining, upgrading and securing computer systems is very time consuming and requires extensive knowledge and experience. This type of expertise is usually unavailable to physics departments, and even if a person possessing such skills is available at the salary we can pay, they usually lack the perspective of what computing in physics requires. Myers represents a unique opportunity to get the computing help we need for our US ATLAS grid testbed, our network testing and deployment, and our general computing infrastructure.

 

6. Advantages for CDF and D0

 

While the principal goal of this request is to advance ATLAS computing and networking, it will also offer some significant benefits for two experiments at Fermilab, D0 and CDF. Both experiments are preparing for Run II at the Tevatron, and both would find the extra bandwidth available via Internet2 most helpful to their efforts. Both experiments deal with large volumes of data: the raw data is processed at Fermilab into datasets that are quite large, on the order of a terabyte or two. This large size makes it difficult to transport these datasets to other sites, but with the enhanced bandwidth of Internet2 it would be much easier to copy them to Ann Arbor so that more U-M resources could be used in the analysis. This has the further benefit that it can free up computing resources at Fermilab so that they can be used for other things.

6.1 D0

The Michigan D0 group has three faculty, two research fellows, and five graduate students. While research fellows are always resident at Fermilab, faculty members and some of the students are based in Ann Arbor. Due to limited computing resources in Ann Arbor, D0 physics analyses for the previous Tevatron run were based at Fermilab. This separation by a slow network connection between staff and resources imposes constraints on faculty participation in the experiment, and it has also limited the group’s ability to get graduate students involved in physics early on. The D0 group’s efforts in the experiment would greatly benefit from the increased bandwidth provided by Internet2. It would allow them to maintain a sub-dataset in Ann Arbor, and to obtain the latest calibration and alignment constants. It would also allow them to utilize large Monte Carlo farm facilities at Fermilab or other sites. A significant increase in bandwidth between Ann Arbor and Fermilab would make it possible for more analyses to be performed in Ann Arbor, either in place of or in addition to work done at Fermilab.

6.2 CDF

The increased bandwidth available via Internet2 would be especially useful to CDF. Michigan is one of the largest university-based collaborating institutions on the CDF experiment, with four faculty, six postdoctoral fellows, and eight graduate students. Most of these scientists are resident in Ann Arbor. With a lengthy Tevatron collider run scheduled to begin in March 2001, there are several areas in which high-bandwidth networking, with or without QoS, will contribute to the scientific mission of CDF:

The current connection to Fermilab, through the physics department and campus backbone, is expected to worsen as the student population at the university discovers more and more network-intensive forms of entertainment. As the CDF experiment moves rapidly in the next year into an extended period of high-rate data taking, the university-based group at Michigan will benefit tremendously from a correspondingly improved network connection to Fermilab.

 



7. Summary

 

We are requesting support to aid in the purchase of a layer 3 routing switch and related hardware, to use for advanced networking research and to provide ATLAS, D0, and CDF with high-bandwidth access to Internet2. We are also asking for partial staff support to help make the most of this new hardware. The funding of this request by DOE would represent a highly leveraged investment to greatly expand the computing capabilities of the two Michigan Tevatron experiments in the immediate future, as well as the long-term networking developments required for ATLAS. The work planned will involve the top networking experts in the U.S. and Europe, at Michigan, Argonne, CERN, Internet2, and Merit.

 

References

 

[1]   “Internet2, Stanford and DOE Demonstrate End-to-End Internet Quality of Service at SC2000 Conference”, http://mail.internet2.edu/guest/archives/I2-NEWS/log0011/msg00009.html

 

[2]   UM CITI Project: “Authenticated Quality of Service”, http://www.citi.umich.edu/projects/qos/

 

[3]   The GriPhyN Project, http://www.phys.ufl.edu/~avery/griphyn/

 

[4]   The Globus Project, http://www.globus.org/

 

[5]   “University of Michigan and Merit Internet2 Qbone Testbed Results”, http://www.citi.umich.edu/projects/qbone/qbone_test1/qbone_test1.html

 

 
