ATLAS Grid Requirements

Draft Zero

December 1, 2000

Rich Baker

Brookhaven National Laboratory

rbaker@bnl.gov


This very preliminary document lays out one view of the ATLAS Grid requirements and is intended to draw comments and suggestions that will help clarify and improve the list. Ultimately, the ATLAS collaboration needs to formally specify its requirements for Grid services, and that specification will be an ongoing process. There seems to be a good general notion of what we need from the Grid, but it is difficult to define specific requirements without more experience using Grid services as they become available. It will be very important to start using Grid services in production environments as soon as possible and to incorporate the lessons learned into new iterations of the requirements.

 

To start, I have divided Grid services into five functional areas:

 

  1. Data Management
  2. Resource Management
  3. Security and Authentication
  4. Resource Use Accounting
  5. Task Execution Control

 

  1. Data Management

The Grid must provide a global data space that abstracts data from its physical storage or source. Data includes files, collections of files, databases, containers within databases and almost anything else that a user might create as a “named data object.” The data manager must be compatible with several different underlying physical storage formats, including various file systems, mass storage systems and databases. The data must be organized and presented to users through straightforward, intuitive interfaces. One possible user interface would resemble a Unix directory/file structure:

 

/gds/ATLAS/username/analysis/myntuple

 

Data must be protected with user and group level access control, much like the familiar Unix file system. Users will need commands to move, copy, replicate and delete data, among other operations. Note that some of these commands imply binding to specific physical storage, and operations may be performed on specific replicas. The Grid must manage replicas, both automatically and under user control. Replica management should understand the difference between read-only and dynamic data. Synchronization of replicas of dynamic data must be tunable (e.g. changes must be propagated within x hours) and robust against network outages and other causes of site unavailability. Data management within a site must be autonomous. The global replica manager should know only that a replica exists at a particular site. Specific properties of that replica (such as access latency) would be obtained by querying the Grid replica manager at that site.
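
The split between the global replica manager and the autonomous site managers can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class names, the method names and the "latency_s" property are hypothetical and do not describe any existing Grid software. The global catalog records only which sites hold a replica; properties such as access latency are obtained by asking the site.

```python
from dataclasses import dataclass, field


@dataclass
class SiteReplicaManager:
    """Hypothetical per-site manager; owns all local details of its replicas."""
    site: str
    # logical name -> local properties (here only an access latency in seconds)
    _local: dict = field(default_factory=dict)

    def store(self, logical_name, latency_s):
        self._local[logical_name] = {"latency_s": latency_s}

    def properties(self, logical_name):
        return self._local[logical_name]


class GlobalReplicaCatalog:
    """Hypothetical global catalog; records only which sites hold a replica."""

    def __init__(self):
        self._sites = {}     # site name -> SiteReplicaManager
        self._replicas = {}  # logical name -> set of site names

    def add_site(self, manager):
        self._sites[manager.site] = manager

    def register(self, logical_name, site, latency_s):
        # The site stores the details; the catalog keeps only the site name.
        self._sites[site].store(logical_name, latency_s)
        self._replicas.setdefault(logical_name, set()).add(site)

    def locate(self, logical_name):
        return sorted(self._replicas.get(logical_name, ()))

    def fastest_site(self, logical_name):
        # Properties come from querying each site, not from the catalog.
        return min(
            self.locate(logical_name),
            key=lambda s: self._sites[s].properties(logical_name)["latency_s"],
        )
```

A user-level "replicate" command would then amount to a data transfer followed by a register() call, while a replica-aware "copy" could use fastest_site() to choose its source. Access control and synchronization of dynamic data are deliberately left out of this sketch.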

 

  2. Resource Management

The Grid will keep track of various computing resources, including CPU, storage, local network and WAN bandwidth. The primary purpose of this service is to present information about the status of available resources to users and to decision-making algorithms, whether those are part of the Grid software or application-specific. Resource management must be fault tolerant and follow a local-registration, global-discovery model. The list of resources will be very dynamic and must support real-time updates. For example, a script might define a new storage resource in one line (e.g. a disk on the user’s laptop), transfer data to the new resource and then “undefine” the resource to disconnect it from the Grid.
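
The laptop example can be sketched in Python as follows. This is only an illustration under stated assumptions: LocalRegistry, GlobalDiscovery, define(), undefine() and the "kind" attribute are hypothetical names, not an existing interface. Resources are registered locally, and the discovery service reads the live state of each registry, so a definition or removal is visible on the next query.

```python
class LocalRegistry:
    """Hypothetical per-site registry where resources are defined."""

    def __init__(self, site):
        self.site = site
        self.resources = {}  # resource name -> attribute dict

    def define(self, name, **attrs):
        self.resources[name] = attrs

    def undefine(self, name):
        # Removing an unknown resource is treated as a no-op.
        self.resources.pop(name, None)


class GlobalDiscovery:
    """Hypothetical discovery service; queries registries at request time."""

    def __init__(self):
        self.registries = []

    def attach(self, registry):
        self.registries.append(registry)

    def find(self, kind):
        # Reading live registry state means updates show up immediately.
        return sorted(
            (reg.site, name)
            for reg in self.registries
            for name, attrs in reg.resources.items()
            if attrs.get("kind") == kind
        )


# The one-line definition from the text, followed by the disconnect.
laptop = LocalRegistry("laptop")
grid = GlobalDiscovery()
grid.attach(laptop)
laptop.define("disk0", kind="storage", free_gb=10)
visible = grid.find("storage")
laptop.undefine("disk0")
```

Fault tolerance is not modelled here; a real service would also have to cope with registries that become unreachable.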

 

  3. Security and Authentication

The Grid must allow task execution and resource utilization across administrative domains. I define an administrative domain as a collection of computing resources with a single access control point and a single access policy. There may be multiple administrative domains (perhaps on the order of 10) per site. Users should be authenticated once, locally (a single logon), and their credentials passed along with remote requests. The credentials must be sufficient to identify the user, the associated group access privileges and the source of the credentials, and must also provide any information required for accounting functions. The Grid must be compatible with various local security policies. User creation and authentication must scale reasonably to tens of thousands of users and thousands of administrative domains.
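
As an illustration of what the credentials need to carry, here is a minimal Python sketch. All names (Credential, AdministrativeDomain, authorize, the field names) are hypothetical, and the sketch contains no cryptography at all: it only shows the information content listed above and how each domain could apply its own local policy to the same credential.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    """Hypothetical credential passed along with every remote request."""
    user: str
    groups: frozenset  # group access privileges
    issuer: str        # source of the credentials
    account: str       # information needed for accounting


class AdministrativeDomain:
    """Hypothetical domain with a single access policy."""

    def __init__(self, name, trusted_issuers, allowed_groups):
        self.name = name
        self.trusted_issuers = set(trusted_issuers)
        self.allowed_groups = set(allowed_groups)

    def authorize(self, cred):
        # Local policy: trust the issuer and require one allowed group.
        return (
            cred.issuer in self.trusted_issuers
            and bool(cred.groups & self.allowed_groups)
        )
```

The user obtains one Credential at logon, and every domain evaluates it independently, which keeps the access decision under local control.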

 

  4. Resource Use Accounting

The Grid must provide user and group level accounting of the use of computing resources, sufficient to serve as input to resource allocation algorithms and access control decisions. The accounting system must give local administrators considerable flexibility in setting accounting policies.
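
A minimal Python sketch of such accounting follows. It assumes, purely for illustration, that a local accounting policy can be expressed as a weight per resource kind; UsageLedger, record() and within_quota() are hypothetical names.

```python
from collections import defaultdict


class UsageLedger:
    """Hypothetical ledger that charges usage to both a user and a group."""

    def __init__(self, weights):
        # Local policy (assumed form): cost weight per resource kind.
        self.weights = weights
        self.by_user = defaultdict(float)
        self.by_group = defaultdict(float)

    def record(self, user, group, kind, amount):
        # Kinds without an explicit weight are charged at 1.0 per unit.
        charge = self.weights.get(kind, 1.0) * amount
        self.by_user[user] += charge
        self.by_group[group] += charge
        return charge

    def within_quota(self, group, quota):
        # Example of accounting data feeding an access control decision.
        return self.by_group[group] <= quota
```

A resource allocation algorithm could read by_user and by_group directly, while a site administrator changes only the weights to adjust local policy.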

 

  5. Task Execution Control

The Grid must allow submission, monitoring and control of tasks in a distributed computing environment. Various levels of user control should be provided to allow for specification of retry options, default input sources and output destinations, and probably dozens of things I haven’t thought of. The API for the task manager will require extensive coordination between the user community and the Grid developers. Some support for output collection and merging should also be supplied.
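
To make the retry and output-merging ideas concrete, here is a minimal Python sketch. It is not a proposed task manager API: TaskSpec, its fields and default values, run_with_retries() and merge_outputs() are all hypothetical, and the task runs in-process rather than on remote resources.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical task description with user-tunable options."""
    name: str
    max_retries: int = 0
    input_source: str = "default-input"
    output_destination: str = "default-output"


def run_with_retries(spec, action):
    """Call action(spec) until it succeeds or the retries are exhausted."""
    attempts = 0
    last_error = None
    while attempts <= spec.max_retries:
        attempts += 1
        try:
            output = action(spec)
            return {"status": "done", "attempts": attempts, "output": output}
        except Exception as exc:  # any failure counts as a failed attempt
            last_error = exc
    return {"status": "failed", "attempts": attempts, "error": str(last_error)}


def merge_outputs(results):
    """Collect the outputs of the successful tasks into one list."""
    return [r["output"] for r in results if r["status"] == "done"]
```

The returned dictionaries double as a crude monitoring record, since they show the final status and the number of attempts for each task.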

 
