Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Böhme, David ; Geimer, Markus ; Wolf, Felix ; Arnold, Lukas (2016)
Identifying the Root Causes of Wait States in Large-Scale Parallel Applications.
In: ACM Transactions on Parallel Computing, 3 (2)
doi: 10.1145/2934661
Article, Bibliography

Abstract

Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira, Jr., et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances, even for runs with hundreds of thousands of processes.
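The forward/backward idea in the abstract can be illustrated with a toy, serial sketch. It is not the authors' algorithm, which replays real event traces in parallel. The message tuple layout, the function names, and the simple blame-transfer rule are all assumptions made here for illustration. The forward pass measures how long each receiver waits for a late sender. The backward pass walks the messages in reverse time order and moves cost from a delayed process to the process that delayed it.

```python
from collections import defaultdict

# Hypothetical, simplified trace, ordered by time: one record per message.
# (sender rank, receiver rank, time sender enters send, time receiver enters receive)
messages = [
    (0, 1, 5.0, 2.0),  # rank 1 waits for rank 0
    (1, 2, 6.0, 2.5),  # rank 2 waits for rank 1, which was itself delayed
]

def forward_pass(msgs):
    """Late-sender wait per message: the receiver blocks until the send starts."""
    return [max(0.0, send_t - recv_t) for _, _, send_t, recv_t in msgs]

def attribute_costs(msgs):
    """Walk messages backward and charge each wait to its original cause.

    Assumption of this sketch: a process that waited before sending late
    passes on the blame for that late send, up to the length of its own wait.
    """
    waits = forward_pass(msgs)
    blame = defaultdict(float)
    for (sender, receiver, _, _), wait in zip(reversed(msgs), reversed(waits)):
        # Part of the blame sitting on the receiver is explained by its own wait.
        moved = min(blame[receiver], wait)
        blame[receiver] -= moved
        # The sender is charged the direct wait plus the propagated part.
        blame[sender] += wait + moved
    return dict(blame)

costs = attribute_costs(messages)
```

For this toy trace, rank 0 is charged 6.0 time units (3.0 of direct waiting on rank 1 plus 3.0 propagated to rank 2) and rank 1 is charged the remaining 0.5, so the total matches the 6.5 units of waiting found in the forward pass.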

Item type: Article
Published: 2016
Author(s): Böhme, David ; Geimer, Markus ; Wolf, Felix ; Arnold, Lukas
Type of entry: Bibliography
Title: Identifying the Root Causes of Wait States in Large-Scale Parallel Applications
Language: English
Date of publication: 20 July 2016
Publisher: ACM
Journal or publication title: ACM Transactions on Parallel Computing
Volume: 3
Issue number: 2
Book title: Proc. of
Event title: Proc. of the 39th International Conference on Parallel Processing (ICPP), San Diego, CA, USA
DOI: 10.1145/2934661

Additional information:

Article no.: 11

Divisions: 20 Department of Computer Science
20 Department of Computer Science > Parallel Programming
Date deposited: 20 Apr 2018 09:35
Last modified: 17 May 2024 07:13
PPN: 518391043