Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning

Ritter, Marcus ; Wolf, Felix (2023)
Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning.
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. Denver, USA (12-17 November 2023)
doi: 10.1145/3624062.3624204
Conference publication, Bibliography

Abstract

With the rapidly increasing size and complexity of DNNs, equally sophisticated methods are needed to train them efficiently, including distributed training and various model/hybrid parallelism approaches. Even though developers heavily rely on state-of-the-art frameworks such as PyTorch and TensorFlow, these provide little insight into an application’s training behavior at scale, leading to latent performance bottlenecks and inefficient training configurations. We propose Extra-Deep, an automated empirical performance modeling approach for distributed deep learning to model performance metrics, such as the training time, as a function of the application’s configuration parameters. We leverage the created models to analyze a training task’s performance, scalability, efficiency, and cost. Gathering empirical measurements of full training runs is very laborious and costly. Therefore, we employ an efficient sampling strategy that reduces the profiling time for the required empirical measurements by, on average, about 94.9%. Using our sampling strategy, we can analyze the performance behavior and identify cost-effective training configurations even for large-scale and long-running applications. We evaluated our approach on three parallelization strategies, with four DNN models and five datasets. The results show that Extra-Deep has an average prediction accuracy of 93.6% when compared to empirical results.
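The underlying idea of empirical performance modeling, fitting an analytical scaling model to a small set of measurements and extrapolating to unmeasured configurations, can be illustrated with a minimal Python sketch. This is not the Extra-Deep implementation; the model hypothesis t(p) = c + a * p^b and all measurement values below are illustrative assumptions.

    # Minimal sketch of empirical performance modeling (illustrative only,
    # not the Extra-Deep implementation): fit a simple scaling hypothesis
    # to a few measured training-step times, then extrapolate to larger
    # GPU counts. The model form t(p) = c + a * p^b is an assumption.
    import numpy as np
    from scipy.optimize import curve_fit

    def hypothesis(p, c, a, b):
        # c: constant overhead share; a * p^b: term scaling with GPU count p
        return c + a * np.power(p, b)

    # Hypothetical measurements: (GPU count, seconds per training step)
    gpus = np.array([2.0, 4.0, 8.0, 16.0])
    times = np.array([9.8, 5.3, 3.1, 2.0])

    # Fit the free coefficients c, a, b to the measured data.
    params, _ = curve_fit(hypothesis, gpus, times, p0=(1.0, 10.0, -1.0))
    c, a, b = params
    print(f"t(p) = {c:.2f} + {a:.2f} * p^({b:.2f})")

    # Predict configurations that were never measured.
    for p in (32, 64, 128):
        print(f"predicted step time on {p} GPUs: {hypothesis(p, *params):.2f} s")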

Item type: Conference publication
Published: 2023
Author(s): Ritter, Marcus ; Wolf, Felix
Type of entry: Bibliography
Title: Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning
Language: English
Date of publication: 12 November 2023
Publisher: ACM
Book title: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
Event title: SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
Event location: Denver, USA
Event dates: 12-17 November 2023
DOI: 10.1145/3624062.3624204
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Parallel Programming
Central facilities
Central facilities > University Computing Centre (HRZ)
Central facilities > University Computing Centre (HRZ) > High-Performance Computer
Date deposited: 13 Feb 2024 15:18
Last modified: 30 Apr 2024 07:34
PPN: 517666901