Kuhring, Lucas ; István, Zsolt (2019)
I can't believe it's not (only) software!: bionic distributed storage for Parquet files.
In: Proceedings of the VLDB Endowment, 12 (12)
doi: 10.14778/3352063.3352079
Article, Bibliography
Abstract
There is a steady increase in the size of data stored and processed as part of data science applications, leading to bottlenecks and inefficiencies at various layers of the stack. One way of reducing such bottlenecks and increasing energy efficiency is by tailoring the underlying distributed storage solution to the application domain, using resources more efficiently. We explore this idea in the context of a popular column-oriented storage format used in big data workloads, namely Apache Parquet. Our prototype uses an FPGA-based storage node that offers high bandwidth data deduplication and a companion software library that exposes an API for Parquet file access. This way the storage node remains general purpose and could be shared by applications from different domains, while, at the same time, benefiting from deduplication well suited to Apache Parquet files and from selective reads of columns in the file. In this demonstration we show, on the one hand, that by relying on the FPGA's dataflow processing model, it is possible to implement in-line deduplication without increasing latencies significantly or reducing throughput. On the other hand, we highlight the benefits of implementing the application-specific aspects in a software library instead of FPGA circuits and how this enables, for instance, regular data science frameworks running in Python to access the data on the storage node and to offload filtering operations.
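The two storage-side ideas named in the abstract, deduplication and selective reads, can be illustrated with a minimal software sketch. This is not the authors' FPGA design or their library API: the `DedupStore` class, its method names, the fixed-size blocks, and the use of SHA-256 fingerprints are all assumptions made for illustration only.

```python
import hashlib


class DedupStore:
    """Toy content-addressed block store (hypothetical, illustration only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes; each unique block kept once
        self.files = {}   # file name -> ordered list of block fingerprints

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op for a duplicate block
            digests.append(digest)
        self.files[name] = digests

    def get(self, name, offset=0, length=None):
        """Read a byte range, touching only the blocks that overlap it."""
        digests = self.files[name]
        total = sum(len(self.blocks[d]) for d in digests)
        end = total if length is None else min(offset + length, total)
        if end <= offset:
            return b""
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        data = b"".join(self.blocks[d] for d in digests[first:last + 1])
        skip = offset - first * self.block_size
        return data[skip:skip + (end - offset)]

    def stored_bytes(self):
        """Physical bytes held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    store.put("a", b"AAAABBBBAAAA")
    store.put("b", b"BBBBCCCC")
    print(len(store.blocks), store.stored_bytes())
    print(store.get("a", offset=2, length=6))
```

A range read such as `get(name, offset, length)` is the kind of primitive a client-side library could use to fetch a single column chunk of a Parquet file instead of the whole file, since Parquet's footer records where each column chunk is located.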
Item type: | Article |
---|---|
Published: | 2019 |
Author(s): | Kuhring, Lucas ; István, Zsolt |
Entry type: | Bibliography |
Title: | I can't believe it's not (only) software!: bionic distributed storage for Parquet files |
Language: | English |
Publication date: | August 2019 |
Publisher: | VLDB Endowment |
Journal, newspaper, or series title: | Proceedings of the VLDB Endowment |
Volume: | 12 |
Issue number: | 12 |
DOI: | 10.14778/3352063.3352079 |
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Distributed and Networked Systems |
Date deposited: | 23 Jan 2023 09:55 |
Last modified: | 31 Mar 2023 07:23 |
PPN: | 506507653 |