ATLAS Grid Requirements
Draft Zero December 1, 2000
Rich Baker
Brookhaven National Laboratory
rbaker@bnl.gov
This very preliminary document is intended to lay out one view of ATLAS grid
requirements and to draw comments and suggestions to help clarify and improve
the list of requirements. Ultimately, the ATLAS collaboration needs to
formally specify our requirements for Grid services. This requirements
specification will be an ongoing process. There seems to be a good general
notion of what we need from the Grid, but it is difficult to define specific
requirements without more experience using Grid services as they become
available. It will be very important to start using Grid services in
production environments as soon as possible and incorporate lessons learned
into new iterations of our requirements.
To start, I have divided Grid services into five functional areas: data
management, resource management, security, accounting and task management.
The Grid must provide a global data space that abstracts data from the physical
storage or source of the data. Data includes files, collections of files,
databases, containers within databases and almost anything that a user might
create as a named data object. The data manager must be compatible
with several different underlying physical storage formats, including various
file systems, mass storage and databases. The data must be organized and
presented to users with straightforward, intuitive interfaces. One possible
user interface would resemble a Unix directory/file structure:
/gds/ATLAS/username/analysis/myntuple
Data must be protected with user and group level access control, much like the
familiar Unix file system. Users will need commands to move, copy,
replicate and delete data, among others. Note that some of these commands imply binding to
specific physical storage, and operations may be performed on specific
replicas. The Grid must manage replicas, both automatically and under user
control. The replica management should understand the difference between read
only and dynamic data. Synchronization of replicas of dynamic data must be
tunable (i.e. changes must be propagated within x hours) and robust in the case
of network outages or other causes of site unavailability. Data management
within a site must be autonomous. The global replica manager should only know
that a replica exists at a particular site. Specific properties of that
replica (such as access latency) would be obtained by querying the Grid replica
manager at that site.
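The split between a global replica catalog and autonomous site managers can be sketched as follows. This is only an illustration under my own assumptions: the class names, method names and physical paths are hypothetical and do not correspond to any existing Grid software.

```python
from dataclasses import dataclass, field


@dataclass
class SiteReplicaManager:
    """Site-local manager: owns the physical details of each replica."""
    site: str
    # logical name -> properties known only at this site
    replicas: dict = field(default_factory=dict)

    def add(self, logical_name, physical_path, latency_s):
        self.replicas[logical_name] = {
            "physical_path": physical_path,
            "access_latency_s": latency_s,
        }

    def describe(self, logical_name):
        return self.replicas[logical_name]


class GlobalReplicaCatalog:
    """Global catalog: records only which sites hold a replica."""

    def __init__(self):
        self.sites = {}      # site name -> SiteReplicaManager
        self.locations = {}  # logical name -> set of site names

    def announce(self, logical_name, manager):
        # The site tells the catalog that a replica exists, nothing more.
        self.sites[manager.site] = manager
        self.locations.setdefault(logical_name, set()).add(manager.site)

    def sites_for(self, logical_name):
        return sorted(self.locations.get(logical_name, set()))

    def fastest_site(self, logical_name):
        # Replica properties are fetched by querying each site on demand.
        return min(
            self.sites_for(logical_name),
            key=lambda s: self.sites[s].describe(logical_name)["access_latency_s"],
        )


# Hypothetical usage: two sites hold the same logical data object.
catalog = GlobalReplicaCatalog()
tape_site = SiteReplicaManager("BNL")
disk_site = SiteReplicaManager("CERN")
name = "/gds/ATLAS/username/analysis/myntuple"
tape_site.add(name, "/mass-storage/myntuple", latency_s=30.0)
disk_site.add(name, "/disk/myntuple", latency_s=0.5)
catalog.announce(name, tape_site)
catalog.announce(name, disk_site)
```

The point of the design is that the global catalog never stores physical paths or latencies, so each site remains free to move or restage its replicas without notifying anyone.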
The Grid will keep track of various computing resources, including CPU,
storage, local network, WAN bandwidth, etc. The primary purpose of this
service is to present information about the status of available resources to
users and to decision-making algorithms, whether built into the Grid software or
application-specific. Resource management must be fault tolerant and follow a
local-registration, global-discovery model. The list of resources will be very
dynamic and must support real-time updates. For example, a script might define a
new storage resource in one line (e.g. a disk on the user's laptop),
transfer data to the new resource and then undefine the resource to
disconnect it from the Grid.
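A minimal sketch of the local-registration, global-discovery model, including the laptop example, might look like this. All names here (LocalRegistry, GlobalDiscovery, the online flag, the resource attributes) are assumptions made for illustration, not an existing API.

```python
class LocalRegistry:
    """Per-site registry where resources are defined and undefined."""

    def __init__(self, site):
        self.site = site
        self.online = True   # stands in for network reachability
        self.resources = {}

    def define(self, name, kind, **attrs):
        self.resources[name] = {"kind": kind, **attrs}

    def undefine(self, name):
        self.resources.pop(name, None)


class GlobalDiscovery:
    """Answers queries by asking each reachable local registry."""

    def __init__(self, registries):
        self.registries = list(registries)

    def find(self, kind):
        found = []
        for reg in self.registries:
            if not reg.online:
                continue  # an unreachable site must not break discovery
            for name, info in reg.resources.items():
                if info["kind"] == kind:
                    found.append((reg.site, name))
        return sorted(found)
```

With these pieces the laptop script reduces to a call to define("laptop-disk", "storage") on the laptop's own registry, a data transfer, and a call to undefine("laptop-disk"). Discovery sees the resource only in between.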
The Grid must allow task execution and resource utilization across
administrative domains. I define an administrative domain as a collection of
computing resources with a single access control point and single access
policy. There may be multiple administrative domains (perhaps on the order of
10) per site. Users should be authenticated once locally (a single logon), with
credentials passed along with remote requests. The credentials must be sufficient
to identify the user, group access privileges, the source of the credentials
and also provide any information required for accounting functions. The Grid
must be compatible with various local security policies. User creation and
authentication must scale reasonably to tens of thousands of users and
thousands of administrative domains.
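To make the credential contents concrete, here is one possible shape for a credential and a local policy check. The field names and the DomainPolicy class are hypothetical, and the sketch deliberately omits the cryptographic verification a real system would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    user: str              # identifies the user
    groups: frozenset      # group access privileges
    issuer: str            # source of the credential
    accounting_id: str     # assumed field carrying accounting information


@dataclass
class DomainPolicy:
    """Local policy of one administrative domain."""
    trusted_issuers: set
    allowed_groups: set

    def admits(self, cred):
        # Each domain decides for itself which issuers and groups it accepts.
        return cred.issuer in self.trusted_issuers and bool(
            cred.groups & self.allowed_groups
        )
```

The same credential can be presented to many domains, and each domain applies its own policy to it, which is what allows a single logon to coexist with differing local security policies.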
The Grid must provide user and group level accounting of the use of computing
resources sufficient to serve as input to resource allocation algorithms and
access control decisions. Local administrators will need considerable flexibility
to set accounting policies.
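As a rough illustration, user and group level accounting with locally chosen charge rates could be sketched as below. The class, the resource kinds and the rates are all assumptions chosen for the example.

```python
from collections import defaultdict


class SiteAccounting:
    """Usage ledger for one site; charge rates are a local policy choice."""

    def __init__(self, rates):
        # rates: resource kind -> charge per unit, set by the local administrator
        self.rates = dict(rates)
        self.by_user = defaultdict(float)
        self.by_group = defaultdict(float)

    def record(self, user, group, kind, amount):
        charge = self.rates[kind] * amount
        self.by_user[user] += charge
        self.by_group[group] += charge
        return charge

    def within_allocation(self, group, allocation):
        # Usable as input to an access control or scheduling decision.
        return self.by_group[group] <= allocation
```

Because the rates table belongs to the site, two sites can charge the same usage differently while still reporting totals in a form that allocation algorithms can consume.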
The Grid must allow submission, monitoring and control of tasks in a
distributed computing environment. Various levels of user control should be
provided to allow for specification of retry options, default input sources and
output destinations and probably dozens of things I haven't thought of.
The API for the task manager will require extensive coordination between the
user community and the Grid developers. Some support for output collection and
merging should be supplied.
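One way a task specification with retry options and a default output destination might look is sketched below. The TaskSpec fields, the run_with_retries function and the output path are hypothetical, and a local callable stands in for a remotely executed job.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskSpec:
    command: Callable[[], object]   # stand-in for a remote job
    max_retries: int = 2            # retries allowed after the first attempt
    output_destination: str = "/gds/ATLAS/username/output"  # assumed default


def run_with_retries(spec):
    """Run the task, retrying on failure up to spec.max_retries times."""
    attempts = 0
    last_error = None
    while attempts <= spec.max_retries:
        attempts += 1
        try:
            result = spec.command()
        except Exception as exc:
            last_error = exc
            continue
        return {
            "status": "done",
            "attempts": attempts,
            "result": result,
            "output": spec.output_destination,
        }
    return {"status": "failed", "attempts": attempts, "error": str(last_error)}
```

The returned dictionary doubles as a crude monitoring record. A real task manager would also need cancellation, status queries while the task runs, and the output merging mentioned above.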