Ritter, Marcus; Wolf, Felix (2023)
Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning.
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. Denver, USA (12-17 November 2023)
DOI: 10.1145/3624062.3624204
Conference publication, Bibliography
Abstract
With the rapidly increasing size and complexity of DNNs, equally sophisticated methods are needed to train them efficiently, including distributed training and various model/hybrid parallelism approaches. Even though developers heavily rely on state-of-the-art frameworks such as PyTorch and TensorFlow, these provide little insight into an application’s training behavior at scale, leading to latent performance bottlenecks and inefficient training configurations. We propose Extra-Deep, an automated empirical performance modeling approach for distributed deep learning to model performance metrics, such as the training time, as a function of the applications’ configuration parameters. We leverage the created models to analyze a training task’s performance, scalability, efficiency, and cost. Gathering empirical measurements of full training runs is very laborious and costly. Therefore, we employ an efficient sampling strategy that reduces the profiling time for the required empirical measurements by, on average, about 94.9%. Using our sampling strategy, we can analyze the performance behavior and identify cost-effective training configurations even for large-scale and long-running applications. We evaluated our approach on three parallelization strategies, with four DNN models and five datasets. The results show that Extra-Deep has an average prediction accuracy of 93.6% when compared to empirical results.
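The abstract's core idea, modeling a performance metric such as training time as a function of a configuration parameter from a small set of empirical measurements, can be illustrated with a minimal sketch. The measurements, candidate model family, and exponent sets below are hypothetical illustrations in the spirit of Extra-P-style performance model normal forms; they are not the authors' actual Extra-Deep implementation.

```python
import itertools
import numpy as np

# Hypothetical measurements: per-epoch training time (s) at
# increasing GPU counts p (the configuration parameter being modeled).
p = np.array([2, 4, 8, 16, 32], dtype=float)
t = np.array([410.0, 221.0, 128.0, 81.0, 59.0])

# Candidate single-term models t(p) ~ c0 + c1 * p^i * log2(p)^j,
# a simplified stand-in for a performance model normal form.
exponents_i = [-1.0, -0.5, 0.5, 1.0]
exponents_j = [0, 1]

best = None
for i, j in itertools.product(exponents_i, exponents_j):
    term = p**i * np.log2(p)**j
    A = np.column_stack([np.ones_like(p), term])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    rss = np.sum((A @ coef - t) ** 2)  # residual sum of squares
    if best is None or rss < best[0]:
        best = (rss, i, j, coef)

rss, i, j, (c0, c1) = best
print(f"t(p) ~ {c0:.2f} + {c1:.2f} * p^{i} * log2(p)^{j}  (RSS={rss:.2f})")

# Extrapolate to an unmeasured scale, e.g. p = 64
p_new = 64.0
print("predicted t(64) ~", c0 + c1 * p_new**i * np.log2(p_new)**j)
```

With real measurements, one would also cross-validate the candidate models and, as the abstract notes, restrict profiling to a few training steps rather than full runs to keep measurement cost low.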
| Type of entry | Conference publication |
|---|---|
| Published | 2023 |
| Author(s) | Ritter, Marcus; Wolf, Felix |
| Entry type | Bibliography |
| Title | Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning |
| Language | English |
| Date of publication | 12 November 2023 |
| Publisher | ACM |
| Book title | Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis |
| Event title | SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis |
| Event location | Denver, USA |
| Event dates | 12-17 November 2023 |
| DOI | 10.1145/3624062.3624204 |
| Department(s)/area(s) | 20 Department of Computer Science; 20 Department of Computer Science > Parallel Programming; Central Facilities; Central Facilities > University Computing Center (HRZ); Central Facilities > University Computing Center (HRZ) > High-Performance Computer |
| Date deposited | 13 Feb 2024 15:18 |
| Last modified | 30 Apr 2024 07:34 |
| PPN | 517666901 |