Executing long-running parallel applications in Opportunistic Grid
environments composed of heterogeneous, shared user workstations, is a
daunting task. Machines may fail, become unaccessible, or may switch
from idle to busy unexpectedly, compromising the execution of
applications. A mechanism for fault-tolerance that supports these
heterogeneous architectures is an important requirement for such a
system.
Besides, we need to deal with the large amounts of data generated
by checkpointing data intensive parallel applications. The classical
approach is to employ high-throughput checkpoint servers connected to
the computational nodes by high speed networks. In the case of
Opportunistic Grid Computing, we do not want to be forced to rely on
such dedicated hardware. Instead, we want to use the shared Grid nodes
to store application data in a distributed fashion.
In this work, we provide fault-tolerant for the execution of
BSP parallel applications on heterogeneous, shared workstations [1,2]. A
precompiler instruments application source code to save state
periodically into checkpoint files. Generated checkpoints are
portable and can be recovered in a machine of different architecture.
We implemented a monitoring and recovering infrastructure in the
InteGrade Grid middleware.
To deal with the storage problem, we are evaluating several strategies to store checkpoints on
distributed non-dedicated repositories [3]. We consider the tradeoff among
computational overhead, storage overhead, and degree of fault-tolerance
of these strategies. We compare the use of replication, parity
information, and information dispersal (IDA).