I can't believe it's not (only) software!: bionic distributed storage for Parquet files

Kuhring, Lucas ; István, Zsolt (2019)
I can't believe it's not (only) software!: bionic distributed storage for Parquet files.
In: Proceedings of the VLDB Endowment, 12 (12)
doi: 10.14778/3352063.3352079
Article, Bibliography

Abstract

There is a steady increase in the size of data stored and processed as part of data science applications, leading to bottlenecks and inefficiencies at various layers of the stack. One way of reducing such bottlenecks and increasing energy efficiency is by tailoring the underlying distributed storage solution to the application domain, using resources more efficiently. We explore this idea in the context of a popular column-oriented storage format used in big data workloads, namely Apache Parquet. Our prototype uses an FPGA-based storage node that offers high bandwidth data deduplication and a companion software library that exposes an API for Parquet file access. This way the storage node remains general purpose and could be shared by applications from different domains, while, at the same time, benefiting from deduplication well suited to Apache Parquet files and from selective reads of columns in the file. In this demonstration we show, on the one hand, that by relying on the FPGA's dataflow processing model, it is possible to implement in-line deduplication without increasing latencies significantly or reducing throughput. On the other hand, we highlight the benefits of implementing the application-specific aspects in a software library instead of FPGA circuits and how this enables, for instance, regular data science frameworks running in Python to access the data on the storage node and to offload filtering operations.
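
The in-line deduplication mentioned in the abstract can be illustrated with a minimal software sketch. It assumes fixed-size blocks, SHA-256 fingerprints, and an in-memory dictionary; the class `DedupStore` and its methods are hypothetical names chosen for illustration and do not reflect the FPGA design or the API of the authors' library.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps each distinct fixed-size block once.

    Illustrative assumption only: this is not the authors' implementation.
    """

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into blocks and store only blocks not seen before."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # no-op for duplicates
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes physically held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


# Two small "files" that share the block b"BBBB"; "a" also repeats b"AAAA".
store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBAAAA")
store.put("b", b"BBBBCCCC")
```

In this toy run the two files total 20 logical bytes, while the store holds only the three distinct 4-byte blocks. A real system would also need persistence, reference counting for deletes, and a strategy for hash collisions, none of which is sketched here.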

Item type: Article
Published: 2019
Author(s): Kuhring, Lucas ; István, Zsolt
Type of entry: Bibliography
Title: I can't believe it's not (only) software!: bionic distributed storage for Parquet files
Language: English
Date of publication: August 2019
Publisher: VLDB Endowment
Journal or series title: Proceedings of the VLDB Endowment
Volume: 12
Issue number: 12
DOI: 10.14778/3352063.3352079
Department(s)/Division(s): 20 Department of Computer Science
20 Department of Computer Science > Distributed and Networked Systems
Date deposited: 23 Jan 2023 09:55
Last modified: 31 Mar 2023 07:23
PPN: 506507653