
IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency

Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan (2024)
IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency.
In: Journal of Systems Research, 4 (1)
doi: 10.5070/SR34163500
Article, Bibliography

Abstract

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.
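To make the optimization problem concrete, the following is a minimal toy sketch of the kind of variant-selection problem the abstract describes: pick one model variant per pipeline stage to maximize a weighted accuracy-minus-cost objective subject to an end-to-end latency SLA. It uses brute-force enumeration instead of the Integer Programming formulation IPA actually uses, and all variant names and numbers (`yolo-s`, `resnet18`, their accuracy/latency/cost values) are hypothetical illustrations, not figures from the paper.

```python
from itertools import product

# Hypothetical model variants per pipeline stage: (name, accuracy, latency_ms, cost).
# These numbers are illustrative only, not measurements from the paper.
STAGES = {
    "detect":   [("yolo-s", 0.70, 30, 1.0), ("yolo-l", 0.85, 90, 3.0)],
    "classify": [("resnet18", 0.75, 20, 1.0), ("resnet152", 0.90, 60, 2.5)],
}

def best_config(sla_ms, cost_weight):
    """Pick one variant per stage maximizing accuracy - cost_weight * cost
    under an end-to-end latency SLA. IPA solves an analogous (richer) problem,
    also covering batch size and replication, with Integer Programming."""
    best, best_score = None, float("-inf")
    for combo in product(*STAGES.values()):
        latency = sum(v[2] for v in combo)
        if latency > sla_ms:
            continue  # violates the latency SLA
        accuracy = sum(v[1] for v in combo) / len(combo)  # mean stage accuracy
        cost = sum(v[3] for v in combo)
        score = accuracy - cost_weight * cost
        if score > best_score:
            best, best_score = combo, score
    return best

# A tight SLA forces the cheaper, faster variants; a loose SLA admits the
# more accurate (and more expensive) ones.
tight = best_config(sla_ms=60, cost_weight=0.01)
loose = best_config(sla_ms=200, cost_weight=0.01)
print([v[0] for v in tight], [v[0] for v in loose])
# → ['yolo-s', 'resnet18'] ['yolo-l', 'resnet152']
```

The brute-force search is exponential in the number of stages; an ILP solver keeps the same decision structure tractable at scale, which is why IPA casts the problem that way.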

Entry type: Article
Published: 2024
Author(s): Ghafouri, Saeid ; Razavi, Kamran ; Salmani, Mehran ; Sanaee, Alireza ; Lorido-Botran, Tania ; Wang, Lin ; Doyle, Joseph ; Jamshidi, Pooyan
Type of entry: Bibliography
Title: IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency
Language: English
Year of publication: April 2024
Publisher: University of Texas
Journal, newspaper, or series title: Journal of Systems Research
Volume: 4
Issue: 1
DOI: 10.5070/SR34163500

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Telecooperation
TU projects: DFG|SFB1053|SFB1053 TPA01 Mühlhä
DFG|SFB1053|SFB1053 TPB02 Mühlhä
Date deposited: 30 Apr 2024 09:32
Last modified: 03 Sep 2024 09:28
PPN: 521061164