Hartle, Michael (2010)
A Formal, Declarative Approach to Data Format Description.
Technische Universität Darmstadt
Dissertation, Erstveröffentlichung
Kurzbeschreibung (Abstract)
The concept of data formats is central to information storage and exchange, as it coins the process of how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and of procedural definitions of format-compliant components, describing data format knowledge in a formal, declarative manner is necessary for making this knowledge machine-processible, enabling its flexible, automated application to format-compliant data. To that effect, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format consisting of such instances. In a survey of current State of the Art in Data Format Description, examined related work from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication show a lack suitable formalised models for universal applicability. As well, examined related work provides only a subset of the four necessary descriptive capabilities to describe data which may be primitive, structured, transcoded or fragmented. In the analysis, a formalisation is presented which is based on the research hypothesis that a data format defines a normative set of lossless information representations, where there exists a bijective mapping between interal representations of senders / receivers, and external representations that are exchanged as format-compliant data. The formalisation is universally applicable for arbitrary data formats, is suitable for both so-called lossless and lossy data formats, and leads to the notion of four elementary descriptive capabilities, which exactly match those used in the State of the Art survey. A valid Portable Network Graphics (PNG) raster image is given as ``litmus test'' for data format description, as its description exercises all four elementary descriptive capabilities. Based on the formalisation, it is shown that a universal approach to data format description is too powerful in computational terms as to be able to guarantee termination, that the tractability of bijective mapping functions and their inverses is neither given nor necessarily related, and that one-to-one correspondence of a bijective mapping function can be guaranteed using information-preserving, Turing-complete Reversible Turing Machines. Building on the formalisation given in the analysis, the thesis defines the Bitstream Segment Graph (BSG) model for describing arbitrary data format instances. For BSG instances, representations are defined both for visualisation as well as for storage and exchange through machine-processible, RDF-based representations. Incremental construction and modification of BSG instances is enabled through a closed set of operations, and the ``coverage'' of a BSG instance is defined as a measure of its completeness. Actual tool support for the construction, modification and exploration of BSG instances on arbitrary data is provided through the Apeiron BSG Editor. Applications of the BSG model are demonstrated through the description of the PNG image ``litmus test'' from the previous analysis, and for the description of an exploit in the context of IT Security. Building on the BSG model, the thesis defines the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred for a given bit sequence which is considered to be format-compliant data. The BSG Reasoning approach defines the representation of rulesets for storage and exchange. Applications of BSG Reasoning are demonstrated through the description of a PNG image file format subset and through an outlined approach for format-aware fuzzing of bitstreams in IT Security. The PNG image file format subset described through BSG Reasoning exercises all elementary descriptive capabilities previously identified in the analysis, and it is shown that the resulting set of logic rules, despite a low number of format-specific rules, already yields a high coverage of inferred BSG instances on a number of valid PNG images. The thesis closes with a retrospection, conclusions and an outlook on potential future research on the BSG model and the BSG Reasoning approach, focusing on aspects such as the computer-aided reverse-engineering of data format rules, or the use of reversible programming languages for the definition of lossless coding and transformation functions.
Typ des Eintrags: | Dissertation | ||||
---|---|---|---|---|---|
Erschienen: | 2010 | ||||
Autor(en): | Hartle, Michael | ||||
Art des Eintrags: | Erstveröffentlichung | ||||
Titel: | A Formal, Declarative Approach to Data Format Description | ||||
Sprache: | Englisch | ||||
Referenten: | Mühlhäuser, Prof. Dr. Max ; Rauber, Prof. Dr. Andreas | ||||
Publikationsjahr: | 31 August 2010 | ||||
Datum der mündlichen Prüfung: | 16 Juli 2010 | ||||
URL / URN: | urn:nbn:de:tuda-tuprints-22715 | ||||
Zugehörige Links: | |||||
Kurzbeschreibung (Abstract): | The concept of data formats is central to information storage and exchange, as it coins the process of how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and of procedural definitions of format-compliant components, describing data format knowledge in a formal, declarative manner is necessary for making this knowledge machine-processible, enabling its flexible, automated application to format-compliant data. To that effect, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format consisting of such instances. In a survey of current State of the Art in Data Format Description, examined related work from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication show a lack suitable formalised models for universal applicability. As well, examined related work provides only a subset of the four necessary descriptive capabilities to describe data which may be primitive, structured, transcoded or fragmented. In the analysis, a formalisation is presented which is based on the research hypothesis that a data format defines a normative set of lossless information representations, where there exists a bijective mapping between interal representations of senders / receivers, and external representations that are exchanged as format-compliant data. The formalisation is universally applicable for arbitrary data formats, is suitable for both so-called lossless and lossy data formats, and leads to the notion of four elementary descriptive capabilities, which exactly match those used in the State of the Art survey. A valid Portable Network Graphics (PNG) raster image is given as ``litmus test'' for data format description, as its description exercises all four elementary descriptive capabilities. Based on the formalisation, it is shown that a universal approach to data format description is too powerful in computational terms as to be able to guarantee termination, that the tractability of bijective mapping functions and their inverses is neither given nor necessarily related, and that one-to-one correspondence of a bijective mapping function can be guaranteed using information-preserving, Turing-complete Reversible Turing Machines. Building on the formalisation given in the analysis, the thesis defines the Bitstream Segment Graph (BSG) model for describing arbitrary data format instances. For BSG instances, representations are defined both for visualisation as well as for storage and exchange through machine-processible, RDF-based representations. Incremental construction and modification of BSG instances is enabled through a closed set of operations, and the ``coverage'' of a BSG instance is defined as a measure of its completeness. Actual tool support for the construction, modification and exploration of BSG instances on arbitrary data is provided through the Apeiron BSG Editor. Applications of the BSG model are demonstrated through the description of the PNG image ``litmus test'' from the previous analysis, and for the description of an exploit in the context of IT Security. Building on the BSG model, the thesis defines the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred for a given bit sequence which is considered to be format-compliant data. The BSG Reasoning approach defines the representation of rulesets for storage and exchange. Applications of BSG Reasoning are demonstrated through the description of a PNG image file format subset and through an outlined approach for format-aware fuzzing of bitstreams in IT Security. The PNG image file format subset described through BSG Reasoning exercises all elementary descriptive capabilities previously identified in the analysis, and it is shown that the resulting set of logic rules, despite a low number of format-specific rules, already yields a high coverage of inferred BSG instances on a number of valid PNG images. The thesis closes with a retrospection, conclusions and an outlook on potential future research on the BSG model and the BSG Reasoning approach, focusing on aspects such as the computer-aided reverse-engineering of data format rules, or the use of reversible programming languages for the definition of lossless coding and transformation functions. |
||||
Alternatives oder übersetztes Abstract: |
|
||||
Freie Schlagworte: | data formats, data format description, bitstream segment graph, bsg, bsg reasoning | ||||
Sachgruppe der Dewey Dezimalklassifikatin (DDC): | 000 Allgemeines, Informatik, Informationswissenschaft > 004 Informatik | ||||
Fachbereich(e)/-gebiet(e): | 20 Fachbereich Informatik > Telekooperation 20 Fachbereich Informatik |
||||
Hinterlegungsdatum: | 02 Sep 2010 05:33 | ||||
Letzte Änderung: | 05 Mär 2013 09:36 | ||||
PPN: | |||||
Referenten: | Mühlhäuser, Prof. Dr. Max ; Rauber, Prof. Dr. Andreas | ||||
Datum der mündlichen Prüfung / Verteidigung / mdl. Prüfung: | 16 Juli 2010 | ||||
Export: | |||||
Suche nach Titel in: | TUfind oder in Google |
Frage zum Eintrag |
Optionen (nur für Redakteure)
Redaktionelle Details anzeigen |