Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication

Kuhring, Lucas ; István, Zsolt (2019)
Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication.
29th International Conference on Field Programmable Logic and Applications. Barcelona, Spain (9-13 September 2019)
doi: 10.1109/FPL.2019.00073
Conference publication, bibliography

Abstract

Distributed storage in the cloud needs to offer both low-latency, high-bandwidth access to data and efficient use of storage capacity in order to keep up with emerging big data workloads. Deduplication has been successfully used to help with the latter requirement, but it is often at odds with low-latency data access. Deduplication ratios can be significantly increased if the storage nodes are aware of the file format and the ways clients interact with it, but implementing different file-type-specific parsing on FPGAs for multiple tenants can be infeasible due to area constraints. We show the benefits of making the storage system aware of the application through the example of Parquet files, a columnar format used in machine learning and big data frameworks to store and transfer datasets. We achieve high deduplication ratios by using a companion software library that allows Parquet files to be stored in a "divided" way. This makes deduplication more efficient and enables clients to access individual columns or metadata fields selectively. At the same time, the storage nodes remain general-purpose and can store and deduplicate arbitrary data. This work paves the way for in-storage processing for Parquet files and other columnar formats because the different columns can be accessed in a streaming fashion and their processing requires no specialized logic on the FPGA.
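
As a rough illustration of why storing a columnar file in a "divided" way can help deduplication, the toy sketch below compares storing two versions of a dataset as monolithic blobs with storing each column as its own object. It is not the paper's companion library or its FPGA design: `DedupStore`, `store_whole`, and `store_divided` are hypothetical names, plain byte strings stand in for Parquet column data, and whole-object SHA-256 hashing is an assumed stand-in for whatever deduplication scheme the actual system uses.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical objects are kept once."""

    def __init__(self):
        self.blocks = {}        # digest -> stored bytes
        self.logical_bytes = 0  # bytes the clients asked to store

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.logical_bytes += len(data)
        self.blocks.setdefault(digest, data)  # keep only the first copy
        return digest

    def get(self, digest: str) -> bytes:
        return self.blocks[digest]

    @property
    def physical_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


def store_whole(store: DedupStore, table: dict) -> str:
    """Store all columns as one blob (stand-in for an undivided file)."""
    blob = b"".join(name.encode() + b"\0" + data
                    for name, data in sorted(table.items()))
    return store.put(blob)


def store_divided(store: DedupStore, table: dict) -> dict:
    """Store each column as its own object; returns a column -> digest manifest."""
    return {name: store.put(data) for name, data in table.items()}


# Two versions of a dataset: the "city" column is unchanged, "price" differs.
v1 = {"city": b"Berlin,Paris,Rome" * 100, "price": b"10,20,30" * 100}
v2 = {"city": b"Berlin,Paris,Rome" * 100, "price": b"11,21,31" * 100}

whole = DedupStore()
store_whole(whole, v1)
store_whole(whole, v2)

divided = DedupStore()
manifest_v1 = store_divided(divided, v1)
manifest_v2 = store_divided(divided, v2)

# Selective access: fetch a single column without reading the rest.
price_v2 = divided.get(manifest_v2["price"])

print("whole:  ", whole.logical_bytes, "logical,", whole.physical_bytes, "physical")
print("divided:", divided.logical_bytes, "logical,", divided.physical_bytes, "physical")
```

In this sketch the two monolithic blobs differ, so nothing is shared between them, whereas the divided layout stores the unchanged `city` column only once and lets a client retrieve `price` on its own.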

Item type: Conference publication
Published: 2019
Author(s): Kuhring, Lucas ; István, Zsolt
Entry type: Bibliography
Title: Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication
Language: English
Publication date: 7 November 2019
Publisher: IEEE
Book title: Proceedings: 29th International Conference on Field-Programmable Logic and Applications (FPL 2019)
Event title: 29th International Conference on Field Programmable Logic and Applications
Event location: Barcelona, Spain
Event dates: 9-13 September 2019
DOI: 10.1109/FPL.2019.00073

Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Distributed and Networked Systems
Date deposited: 23 Jan 2023 10:03
Last modified: 31 Mar 2023 07:05
PPN: 506507157