AN EFFICIENT COORDINATED CHECKPOINTING APPROACH FOR DISTRIBUTED COMPUTING SYSTEMS WITH RELIABLE CHANNELS

Lalit K. Awasthi, Manoj Misra, and Ramesh C. Joshi

Keywords

Checkpoint, coordinated checkpointing, checkpoint interval, induced checkpoint, intrusive checkpointing protocols, consistent global state

Abstract

In distributed systems, the likelihood of failure increases with the number of processes, and a single failure often renders the entire system state useless. Checkpointing and rollback recovery is a common technique for increasing system reliability against anticipated and unanticipated failures. Checkpointing can be independent, quasi-synchronous, or coordinated. Coordinated checkpointing can be blocking or non-blocking. In addition, either all processes in the distributed system may need to checkpoint, or only a minimum number of processes may be required to do so. Minimizing the number of processes that checkpoint may introduce blocking. Non-blocking checkpointing protocols incur the overhead of piggybacking some information on messages to achieve non-intrusiveness. Minimizing this piggybacked information is the objective of our work. We have designed a non-blocking coordinated checkpointing protocol for distributed systems with reliable communication channels that minimizes the information piggybacked on each message.
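To illustrate what piggybacking means in a non-blocking coordinated scheme, the following is a minimal generic sketch, not the protocol proposed in this paper. It assumes each message carries one integer, the sender's checkpoint sequence number, and that a receiver takes an induced checkpoint before processing a message whose number is higher than its own. The names `Message`, `Process`, and `csn` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    csn: int  # piggybacked checkpoint sequence number of the sender (assumed format)


@dataclass
class Process:
    pid: int
    csn: int = 0  # sequence number of the latest local checkpoint
    state: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)

    def take_checkpoint(self, csn: int) -> None:
        # Save a copy of the local state under the given sequence number.
        self.csn = csn
        self.checkpoints[csn] = list(self.state)

    def send(self, payload: str) -> Message:
        # Every outgoing message carries the sender's current csn.
        return Message(payload, self.csn)

    def receive(self, msg: Message) -> None:
        # A higher csn means the sender checkpointed before sending.
        # Checkpoint first, so the receive is not recorded in a checkpoint
        # whose matching send is missing from the sender's checkpoint.
        if msg.csn > self.csn:
            self.take_checkpoint(msg.csn)
        self.state.append(msg.payload)


# Hypothetical usage: p checkpoints, then sends to q, which has not yet checkpointed.
p, q = Process(0), Process(1)
p.state.append("a")
p.take_checkpoint(1)
q.receive(p.send("m1"))
```

In this sketch `q` saves its state before delivering `m1`, so the checkpoints numbered 1 on both processes form a consistent pair without either process blocking. The per-message cost is the single piggybacked integer, which is the kind of overhead the abstract refers to.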
