Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan (2024)
IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency.
In: Journal of Systems Research, 4 (1)
doi: 10.5070/SR34163500
Article, Bibliography
Abstract
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.
Item type: | Article |
---|---|
Published: | 2024 |
Author(s): | Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan |
Type of entry: | Bibliography |
Title: | IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency |
Language: | English |
Date of publication: | April 2024 |
Publisher: | University of Texas |
Title of journal, newspaper, or series: | Journal of Systems Research |
Journal volume: | 4 |
Issue number: | 1 |
DOI: | 10.5070/SR34163500 |
Department(s)/field(s): | 20 Department of Computer Science ; 20 Department of Computer Science > Telecooperation |
TU projects: | DFG|SFB1053|SFB1053 TPA01 Mühlhä DFG|SFB1053|SFB1053 TPB02 Mühlhä |
Date deposited: | 30 Apr 2024 09:32 |
Last modified: | 03 Sep 2024 09:28 |
PPN: | 521061164 |