Kuhring, Lucas ; István, Zsolt (2019)
Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication.
29th International Conference on Field Programmable Logic and Applications. Barcelona, Spain (09.09.2019-13.09.2019)
doi: 10.1109/FPL.2019.00073
Conference publication, Bibliography
Abstract
Distributed storage in the cloud needs to offer both low latency and high bandwidth access to data and efficient use of storage capacity in order to keep up with emerging big data workloads. Deduplication has been successfully used to help with the latter requirement but it is often at odds with low latency data access. Deduplication ratios can be significantly increased if the storage nodes are aware of the file format and the ways clients interact with it - but implementing different file-type specific parsing on FPGAs for multiple tenants can be unfeasible due to area constraints. We show the benefits of making the storage system aware of the application through the example of Parquet files, a columnar format used in machine learning and big data frameworks to store and transfer datasets. We achieve high deduplication ratios by using a companion software library that allows Parquet files to be stored in a "divided" way. This makes deduplication more efficient and enables clients to access individual columns or meta-data fields selectively. At the same time, the storage nodes remain general purpose and can store and deduplicate arbitrary data. This work paves the way for in-storage processing for Parquet files and other columnar formats because the different columns can be accessed in a streaming fashion and their processing requires no specialized logic on the FPGA.
Item type: | Conference publication |
---|---|
Published: | 2019 |
Author(s): | Kuhring, Lucas ; István, Zsolt |
Entry category: | Bibliography |
Title: | Storing Parquet Tile by Tile: Application-Aware Storage with Deduplication |
Language: | English |
Date of publication: | 7 November 2019 |
Publisher: | IEEE |
Book title: | Proceedings: 29th International Conference on Field-Programmable Logic and Applications (FPL 2019) |
Event title: | 29th International Conference on Field Programmable Logic and Applications |
Event location: | Barcelona, Spain |
Event dates: | 09.09.2019-13.09.2019 |
DOI: | 10.1109/FPL.2019.00073 |
Department(s)/Field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Distributed and Networked Systems |
Date deposited: | 23 Jan 2023 10:03 |
Last modified: | 31 Mar 2023 07:05 |
PPN: | 506507157 |