Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics

Mazaheri, Arya ; Wolf, Felix ; Jannesari, Ali (2018)
Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics.
47th International Conference on Parallel Processing (ICPP). Eugene, USA (13.08.2018-16.08.2018)
doi: 10.1145/3225058.3225142
Conference publication, Bibliography

Abstract

A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events and mainly focus on analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance in Rodinia benchmarks by up to 56%. Furthermore, by varying the input size we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.

Item type: Conference publication
Published: 2018
Author(s): Mazaheri, Arya ; Wolf, Felix ; Jannesari, Ali
Type of entry: Bibliography
Title: Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics
Language: English
Date of publication: 13 August 2018
Publisher: ACM
Book title: ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
Event title: 47th International Conference on Parallel Processing (ICPP)
Event location: Eugene, USA
Event dates: 13.08.2018-16.08.2018
DOI: 10.1145/3225058.3225142
Uncontrolled keywords: LOEWE|SF4.0, BMBF|01IH16008D, DoE|DE-SC0015524, KTS|00.253.2014
Additional information:

Art.No.: 6

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Parallel Programming
Date deposited: 31 Oct 2018 08:06
Last modified: 04 Jun 2024 07:42
PPN: 518806855