
IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency

Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan (2024)
IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency.
In: Journal of Systems Research, 4 (1)
doi: 10.5070/SR34163500
Article, Bibliography

Abstract

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.
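To make the optimization problem concrete, the following is a minimal toy sketch of the kind of variant-selection problem the abstract describes: pick one model variant per pipeline stage to maximize a weighted accuracy-minus-cost objective subject to an end-to-end latency SLA. It uses brute-force enumeration instead of the Integer Programming formulation IPA actually uses, and all variant names and numbers (`yolo-s`, `resnet18`, their accuracy/latency/cost values) are hypothetical illustrations, not figures from the paper.

```python
from itertools import product

# Hypothetical model variants per pipeline stage: (name, accuracy, latency_ms, cost).
# These numbers are illustrative only, not measurements from the paper.
STAGES = {
    "detect":   [("yolo-s", 0.70, 30, 1.0), ("yolo-l", 0.85, 90, 3.0)],
    "classify": [("resnet18", 0.75, 20, 1.0), ("resnet152", 0.90, 60, 2.5)],
}

def best_config(sla_ms, cost_weight):
    """Pick one variant per stage maximizing accuracy - cost_weight * cost
    under an end-to-end latency SLA. IPA solves an analogous (richer) problem,
    also covering batch size and replication, with Integer Programming."""
    best, best_score = None, float("-inf")
    for combo in product(*STAGES.values()):
        latency = sum(v[2] for v in combo)
        if latency > sla_ms:
            continue  # violates the latency SLA
        accuracy = sum(v[1] for v in combo) / len(combo)  # mean stage accuracy
        cost = sum(v[3] for v in combo)
        score = accuracy - cost_weight * cost
        if score > best_score:
            best, best_score = combo, score
    return best

# A tight SLA forces the cheaper, faster variants; a loose SLA admits the
# more accurate (and more expensive) ones.
tight = best_config(sla_ms=60, cost_weight=0.01)
loose = best_config(sla_ms=200, cost_weight=0.01)
print([v[0] for v in tight], [v[0] for v in loose])
# → ['yolo-s', 'resnet18'] ['yolo-l', 'resnet152']
```

The brute-force search is exponential in the number of stages; an ILP solver keeps the same decision structure tractable at scale, which is why IPA casts the problem that way.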

Entry type: Article
Published: 2024
Author(s): Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan
Type of entry: Bibliography
Title: IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency
Language: English
Year of publication: April 2024
Publisher: University of Texas
Journal, newspaper, or series title: Journal of Systems Research
Volume: 4
Issue: 1
DOI: 10.5070/SR34163500

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Telecooperation
TU projects: DFG|SFB1053|SFB1053 TPA01 Mühlhä
DFG|SFB1053|SFB1053 TPB02 Mühlhä
Date deposited: 30 Apr 2024 09:32
Last modified: 03 Sep 2024 09:28
PPN: 521061164