Mazaheri, Arya ; Wolf, Felix ; Jannesari, Ali (2018)
Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics.
47th International Conference on Parallel Processing (ICPP). Eugene, USA (13.08.2018-16.08.2018)
doi: 10.1145/3225058.3225142
Conference publication, Bibliography
Abstract
A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events and mainly focus on analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance in Rodinia benchmarks by up to 56%. Furthermore, by varying the input size we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.
Item type: | Conference publication |
---|---|
Published: | 2018 |
Author(s): | Mazaheri, Arya ; Wolf, Felix ; Jannesari, Ali |
Entry type: | Bibliography |
Title: | Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics |
Language: | English |
Publication date: | 13 August 2018 |
Publisher: | ACM |
Book title: | ICPP '18: Proceedings of the 47th International Conference on Parallel Processing |
Event title: | 47th International Conference on Parallel Processing (ICPP) |
Event location: | Eugene, USA |
Event dates: | 13.08.2018-16.08.2018 |
DOI: | 10.1145/3225058.3225142 |
Uncontrolled keywords: | LOEWE|SF4.0, BMBF|01IH16008D, DoE|DE-SC0015524, KTS|00.253.2014 |
Additional information: | Art. No.: 6 |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Parallel Programming |
Date deposited: | 31 Oct 2018 08:06 |
Last modified: | 04 Jun 2024 07:42 |
PPN: | 518806855 |