Prabhakaran, Suraj ; Neumann, Marcel ; Wolf, Felix (2018)
Efficient Fault Tolerance through Dynamic Node Replacement.
18th International Symposium on Cluster, Cloud and Grid Computing (CCGrid). Washington DC, USA (1-4 May 2018)
doi: 10.1109/CCGRID.2018.00031
Conference publication, Bibliography
Abstract
The mean time between failures of upcoming exascale systems is expected to be one hour or less. To be able to successfully complete execution of applications in such scenarios, several improved checkpoint/restart mechanisms such as multi-level checkpointing are being developed. Today, resource management systems handle job interruptions due to node failures by restarting the affected job from a checkpoint on a fresh set of nodes. This method, however, will add non-negligible overhead and will not allow taking full advantage of multi-level checkpointing in future systems. Alternatively, some spare nodes can be allocated for each job so that only processes that die on the failed nodes need to be restarted on spare nodes. However, given the magnitude of the expected failure rates, the number of spare nodes to be allocated for each job would be high, causing significant resource wastage. This work proposes a dynamic way of handling node failures by enabling on-the-fly replacement of failed nodes with healthy ones. We propose a dynamic node replacement algorithm that finds replacement nodes by utilizing the flexibility of moldable and malleable jobs. Our evaluation with a simulator shows that this approach can maintain high throughput even when a system is experiencing frequent node failures, thereby making it a perfect technique to complement multi-level checkpointing.
Item type: | Conference publication |
---|---|
Published: | 2018 |
Author(s): | Prabhakaran, Suraj ; Neumann, Marcel ; Wolf, Felix |
Type of entry: | Bibliography |
Title: | Efficient Fault Tolerance through Dynamic Node Replacement |
Language: | English |
Date of publication: | 16 July 2018 |
Publisher: | IEEE |
Book title: | Proceedings: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) |
Event title: | 18th International Symposium on Cluster, Cloud and Grid Computing (CCGrid) |
Event location: | Washington DC, USA |
Event dates: | 1-4 May 2018 |
DOI: | 10.1109/CCGRID.2018.00031 |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Parallel Programming |
Date deposited: | 20 Apr 2018 12:24 |
Last modified: | 18 Jun 2024 07:08 |
PPN: | 519209206 |