Improving checkpointing intervals by considering individual job failure probabilities

Abstract

Checkpointing is a popular resilience method in HPC, and its efficiency depends strongly on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for large, long-running jobs that fail with high probability, but they cannot minimize checkpointing overhead for jobs with a low or medium probability of failing. Our analysis of batch traces from four HPC systems shows that such jobs are extremely common. We therefore propose an iterative checkpointing algorithm that computes efficient intervals for jobs with a medium risk of failure. The method also supports large, long-running jobs by converging to the results of several traditional methods in that regime. We validated our algorithm using batch system simulations based on traces from four HPC systems and compared it to five alternative checkpointing methods. The evaluation shows checkpoint savings of up to 40% for individual jobs, while reducing the checkpointing costs of complete HPC systems by between 2.8% and 24.4% compared to the best alternative approach.
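For context, the standard analytical approaches the abstract contrasts against typically derive a fixed interval from the checkpoint cost and the mean time between failures (MTBF). A minimal sketch of one such classic result, Young's first-order approximation, is shown below; the function name and parameter values are illustrative, and this is not the paper's proposed algorithm.

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: sqrt(2 * C * MTBF), where C is the checkpoint cost.
    Assumes failures are rare relative to the interval length."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative values: a 60 s checkpoint cost and a 24 h MTBF
# yield an interval of roughly 54 minutes.
interval = young_interval(60.0, 24 * 3600.0)
```

Such formulas implicitly assume the job runs long enough to fail with high probability, which is exactly the assumption the paper relaxes for low- and medium-risk jobs.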

Publication
IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Reza Salkhordeh