Efficient Fault Tolerance through Dynamic Node Replacement

Prabhakaran, Suraj ; Neumann, Marcel ; Wolf, Felix (2018)
Efficient Fault Tolerance through Dynamic Node Replacement.
18th International Symposium on Cluster, Cloud and Grid Computing (CCGrid). Washington DC, USA (01.-04.05.2018)
doi: 10.1109/CCGRID.2018.00031
Conference publication, Bibliography

Abstract

The mean time between failures of upcoming exascale systems is expected to be one hour or less. To be able to successfully complete execution of applications in such scenarios, several improved checkpoint/restart mechanisms such as multi-level checkpointing are being developed. Today, resource management systems handle job interruptions due to node failures by restarting the affected job from a checkpoint on a fresh set of nodes. This method, however, will add non-negligible overhead and will not allow taking full advantage of multi-level checkpointing in future systems. Alternatively, some spare nodes can be allocated for each job so that only processes that die on the failed nodes need to be restarted on spare nodes. However, given the magnitude of the expected failure rates, the number of spare nodes to be allocated for each job would be high, causing significant resource wastage. This work proposes a dynamic way of handling node failures by enabling on-the-fly replacement of failed nodes with healthy ones. We propose a dynamic node replacement algorithm that finds replacement nodes by utilizing the flexibility of moldable and malleable jobs. Our evaluation with a simulator shows that this approach can maintain high throughput even when a system is experiencing frequent node failures, thereby making it a perfect technique to complement multi-level checkpointing.
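
The abstract does not spell out the replacement algorithm, so the following is only an illustrative sketch of the general idea, not the method from the paper. It assumes a hypothetical `Job` record and a hypothetical `find_replacement` helper: when a node fails, an idle node is preferred, and otherwise a running malleable job that holds more nodes than its minimum gives one up. The use of moldable (queued, resizable at start) jobs is omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Job:
    """Hypothetical job record; all field names are assumptions for illustration."""
    name: str
    nodes: Set[int]          # ids of nodes currently held by the job
    min_nodes: int           # smallest allocation the job can run on
    malleable: bool = False  # True if the job can release nodes while running


def find_replacement(failed_node: int, victim: Job,
                     idle_nodes: Set[int],
                     running_jobs: List[Job]) -> Optional[int]:
    """Replace `failed_node` in `victim` with a healthy node, if one can be found.

    Returns the id of the replacement node, or None when neither an idle node
    nor a shrinkable malleable job is available.
    """
    victim.nodes.discard(failed_node)

    if idle_nodes:
        # Case 1: an idle node exists, so no other job is disturbed.
        node = min(idle_nodes)
        idle_nodes.remove(node)
    else:
        # Case 2: shrink a malleable job that holds more than its minimum.
        donor = next((j for j in running_jobs
                      if j is not victim and j.malleable
                      and len(j.nodes) > j.min_nodes), None)
        if donor is None:
            return None
        node = max(donor.nodes)
        donor.nodes.remove(node)

    victim.nodes.add(node)
    return node
```

In this sketch only the processes of the failed node would need to be restarted on the returned node; a real resource manager would also have to coordinate with the checkpointing layer and the donor job's runtime.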

Item type: Conference publication
Published: 2018
Author(s): Prabhakaran, Suraj ; Neumann, Marcel ; Wolf, Felix
Entry type: Bibliography
Title: Efficient Fault Tolerance through Dynamic Node Replacement
Language: English
Publication date: 16 July 2018
Publisher: IEEE
Book title: Proceedings: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Event title: 18th International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Event location: Washington DC, USA
Event date: 01.-04.05.2018
DOI: 10.1109/CCGRID.2018.00031
Department(s)/Research area(s): 20 Department of Computer Science
20 Department of Computer Science > Parallel Programming
Date deposited: 20 Apr 2018 12:24
Last modified: 01 Mar 2024 10:04