Lalit K. Awasthi, Manoj Misra, and Ramesh C. Joshi
Checkpoint, coordinated checkpointing, checkpoint interval, induced checkpoint, intrusive checkpointing protocols, consistent global state
In distributed systems, the likelihood of failure increases with the number of processes, and a single failure often renders the entire system state useless. Checkpointing and rollback recovery is a common technique for increasing system reliability against various anticipated and unanticipated failures. Checkpointing can be independent, quasi-synchronous, or coordinated, and coordinated checkpointing can be either blocking or non-blocking. Moreover, either all the processes in the distributed system may need to take a checkpoint, or only a minimum number of processes may be required to do so. Minimizing the number of processes that checkpoint may introduce blocking. Non-blocking checkpointing protocols, in turn, incur the overhead of piggybacking some information on messages in order to remain non-intrusive. Minimizing this piggybacked information is the objective of our work. We have designed a non-blocking coordinated checkpointing protocol for distributed systems with reliable communication channels that minimizes the information piggybacked on each message.
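To make the idea of piggybacked information and induced checkpoints concrete, here is a minimal Python sketch of a generic non-blocking coordinated scheme. It is not the authors' protocol, whose details are not given here. It assumes that each message piggybacks a single integer checkpoint sequence number (the hypothetical field `csn`) and that every process eventually checkpoints in each round, rather than a minimal set. A receiver that sees a larger `csn` than its own takes an induced checkpoint *before* delivering the message. This prevents the message from becoming an orphan, meaning its receipt would be recorded while its send is not. The class and method names are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    payload: str
    csn: int  # hypothetical piggybacked field: sender's checkpoint sequence number


@dataclass
class Process:
    pid: int
    csn: int = 0                                    # index of latest checkpoint round
    state: list = field(default_factory=list)       # application state (delivered payloads)
    checkpoints: list = field(default_factory=list) # saved (csn, state copy, reason) tuples

    def take_checkpoint(self, new_csn: int, reason: str) -> None:
        # Record a copy of the current state and advance to the new round.
        self.csn = new_csn
        self.checkpoints.append((new_csn, list(self.state), reason))

    def initiate(self) -> int:
        # The initiator starts round csn + 1 without blocking anyone.
        self.take_checkpoint(self.csn + 1, "initiated")
        return self.csn

    def on_request(self, round_csn: int) -> None:
        # Checkpoint request from the initiator; a no-op if an induced
        # checkpoint for this round was already taken.
        if round_csn > self.csn:
            self.take_checkpoint(round_csn, "requested")

    def send(self, payload: str) -> Message:
        # Only one integer is piggybacked on each application message.
        return Message(self.pid, payload, self.csn)

    def receive(self, msg: Message) -> None:
        # The sender already checkpointed in a newer round, so checkpoint
        # before delivery; otherwise this receipt would be an orphan.
        if msg.csn > self.csn:
            self.take_checkpoint(msg.csn, "induced")
        self.state.append(msg.payload)
```

For example, if `p0` initiates round 1 and then sends `"x"` to `p1` before `p1` receives the checkpoint request, `p1` checkpoints with an empty state and only then delivers `"x"`. The later request for round 1 is then ignored. Reducing what is piggybacked below this single counter, or checkpointing only a minimal set of processes, is the kind of refinement the abstract refers to.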