Architecture

Overall Architecture

An InteGrade grid is structured as a set of clusters, each comprising from one to approximately one hundred computers, which can be shared workstations or machines dedicated to the grid. Clusters are arranged in a hierarchy, allowing a single InteGrade grid to encompass potentially millions of machines. The hierarchy can be organized in whatever manner the system administrators find convenient.

The figure below depicts the major types of components in an InteGrade cluster. The Cluster Resource Manager node represents one or more nodes responsible for managing the cluster and communicating with managers in other clusters. A Resource Provider node, typically a workstation, is one that exports part of its resources, making them available to grid users.

As depicted in the figure above, InteGrade nodes contain one or more of the following modules:

Cluster Resource Manager:

  • AR (Application Repository): a repository for grid applications, data, and meta-data.
  • ARSM (Application Repository Security Manager): manages the security in a grid cluster, providing support for digital signatures, data encryption, authentication, and authorization.
  • CDRM (Cluster Data Repository Manager): manages the checkpoint repositories of its cluster and distributes encoded checkpoint fragments among the available repositories (see the sketch after this list).
  • EM (Execution Manager): maintains execution information, such as saved checkpoints and the nodes on which each application is running, for the applications running in an InteGrade cluster. It also coordinates the reinitialization and migration of applications.
  • GRM (Global Resource Manager): manages the computing resources in a cluster of machines connected by a local area network (typically from 1 to 100 machines) and provides scheduling mechanisms based on its knowledge of dynamic resource availability and application requirements.
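
The way the CDRM spreads checkpoint data can be pictured with a small erasure-coding sketch in C. This is illustrative only and is not InteGrade's actual coding scheme; the function name split_with_parity is hypothetical. It splits a checkpoint buffer into k data fragments plus one XOR parity fragment, so that any single lost fragment can be rebuilt from the remaining ones:

    #include <stdlib.h>

    /* Illustrative sketch, not InteGrade's real coding scheme: split `len`
     * bytes of checkpoint data into k data fragments plus one XOR parity
     * fragment.  XOR-ing the k surviving fragments rebuilds a missing one. */
    static unsigned char **split_with_parity(const unsigned char *ckp,
                                             size_t len, int k, size_t *flen)
    {
        *flen = (len + k - 1) / k;                /* fragment size, padded */
        unsigned char **frag = malloc((k + 1) * sizeof *frag);
        for (int i = 0; i <= k; i++)
            frag[i] = calloc(*flen, 1);
        for (size_t b = 0; b < len; b++) {
            frag[b / *flen][b % *flen] = ckp[b];  /* scatter data bytes */
            frag[k][b % *flen] ^= ckp[b];         /* accumulate parity */
        }
        return frag;                              /* k data + 1 parity */
    }

Each fragment would then be shipped to a different checkpoint repository selected by the CDRM.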

Resource Provider:

  • ARSC ([[Application Repository Security Client]]): helps local InteGrade components interact with remote components in a secure way.
  • ASCT ([[Application Submission and Control Tool]]): user interface for submitting applications to the grid and controlling their execution.
  • BSPLib ([[BSP Library]]): library implementing the API of the Oxford BSPlib, allowing the execution of BSP applications on InteGrade.
  • CkpRep ([[Checkpoint Repository]]): stores checkpoint data in a distributed fashion. Each resource provider that runs an LRM is a potential host for a checkpoint repository.
  • LRM ([[Local Resource Manager]]): manages the resources of a single machine and runs applications submitted via the ASCT.
  • LUPA ([[Local Usage Pattern Analyser]]): gathers information about resource usage patterns on a single machine and predicts future resource utilization using machine-learning techniques. (This component is still under development.)
  • NCC ([[Node Control Center]]): user interface for defining the portion of the local resources that will be available to grid applications. (This component is still under development.)

Distributed Communication

All distributed communication is performed via CORBA, using [[JacORB]] in the servers and [[OiL]], our dynamic, lightweight ORB, in the clients.

Parallel Computing

For parallel computing, we support:

  • The BSP model, via our own BSPLib: the InteGrade BSP Library can run any BSP application written in C/C++ and follows the same API as the Oxford [[BSPLib]] (see the example after this list).
  • The [[MPI model]], via a modified version of the MPICH2 library: the modification allows InteGrade users to run any MPI application written in C/C++, Fortran 77, or Fortran 90. MPICH2, maintained by the Argonne National Laboratory, is one of the most popular implementations of the MPI-2 specification; our modifications consist of some software interfaces that can easily be applied to any MPICH2 version (see the example after this list).
  • Parametric, bag-of-tasks applications.
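
To make the BSP support concrete, here is an ordinary BSPlib program in C. It uses only standard Oxford BSPlib calls, which InteGrade's BSPLib follows; nothing in it is InteGrade-specific:

    #include <stdio.h>
    #include "bsp.h"   /* Oxford BSPlib API */

    int main(void)
    {
        bsp_begin(bsp_nprocs());           /* start SPMD execution */
        int left = -1;
        bsp_push_reg(&left, sizeof left);  /* register `left` for remote writes */
        bsp_sync();                        /* superstep boundary */

        int pid = bsp_pid();
        /* write my pid into the `left` variable of my right neighbour */
        bsp_put((pid + 1) % bsp_nprocs(), &pid, &left, 0, sizeof pid);
        bsp_sync();                        /* communication becomes visible */

        printf("process %d of %d: left neighbour is %d\n",
               pid, bsp_nprocs(), left);
        bsp_pop_reg(&left);
        bsp_end();
        return 0;
    }

Likewise, since the modified MPICH2 keeps the standard MPI interface, a plain MPI program needs no InteGrade-specific code:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* reduce the sum of all ranks onto rank 0 */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d processes, sum of ranks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }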

Fault Tolerance

We provide portable checkpointing for sequential, parametric, and BSP parallel applications. A pre-compiler instruments the application source code to periodically send its state to a checkpointing library, which generates checkpoints. The checkpoints are then stored in a distributed repository, allowing applications to be restarted if grid nodes fail.
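
A rough sketch of what the instrumented code amounts to is shown below. The ckp_* names are hypothetical stand-ins for the checkpointing library's actual interface, and the stub bodies only hint at what the real library would do:

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for the checkpointing library (the real
     * API may differ); stubs make the sketch self-contained. */
    static void ckp_register(const char *name, void *addr, size_t size)
    {
        printf("tracking %s (%zu bytes)\n", name, size);
        (void)addr;  /* the real library would remember this address */
    }

    static void ckp_checkpoint(void)
    {
        /* the real library would periodically serialize the registered
         * state here and hand it to a checkpoint repository */
    }

    int main(void)
    {
        long acc = 0;
        int i, n = 100;

        /* calls of this kind are what the pre-compiler inserts */
        ckp_register("acc", &acc, sizeof acc);
        ckp_register("i", &i, sizeof i);

        for (i = 0; i < n; i++) {
            acc += i;
            ckp_checkpoint();  /* inserted safe point inside the loop */
        }
        printf("result = %ld\n", acc);
        return 0;
    }

On restart after a failure, the library would reload the most recent checkpoint from the repository and resume execution from the saved values of acc and i.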

Applications

  • Multiplication of sequences of matrices
  • String matching
  • 3D real-time video generation
  • Parallel Pointwise Unconstrained Minimization Approach