TU Darmstadt / ULB / TUbiblio

On Table Extraction from Text Sources with Markups

Weizsäcker, Lorenz ; Fürnkranz, Johannes (2008)
On Table Extraction from Text Sources with Markups.
Report, Bibliography

Abstract

Table extraction is the task of locating tables in documents and extracting their entries along with the arrangement of the entries inside the tables. The notion of tables applied in this work excludes any sort of metadata, i.e. only the content elements of the tables are to be extracted. We follow a simple unsupervised approach, selecting tables according to a score that measures in-column consistency as pairwise similarities of entries, where separator columns are also taken into account. Since the average is less reliable for smaller tables, this score demands a levelling in favor of larger tables, for which we make different propositions that are covered by experiments on a test set of HTML documents. In order to reduce the number of candidate tables, we use assumptions on the entry borders in terms of markup tags. They hold for only a part of the test set but allow us to evaluate any potential table without referring to the HTML syntax. The experiments show that the discriminative power of the in-column similarities is limited but also considerable given the simplicity of the applied similarity functions.
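
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the similarity function (character-class overlap combined with length closeness), the levelling formula `n / (n + levelling)`, and all function names are assumptions chosen for illustration, and separator columns are left out.

```python
from itertools import combinations
from statistics import mean


def char_classes(entry: str) -> set[str]:
    """Coarse character classes occurring in an entry (hypothetical feature)."""
    classes = set()
    for ch in entry:
        if ch.isdigit():
            classes.add("digit")
        elif ch.isalpha():
            classes.add("alpha")
        elif ch.isspace():
            classes.add("space")
        else:
            classes.add("other")
    return classes


def similarity(a: str, b: str) -> float:
    """Assumed similarity in [0, 1]: mean of class overlap and length ratio."""
    ca, cb = char_classes(a), char_classes(b)
    union = ca | cb
    jaccard = len(ca & cb) / len(union) if union else 1.0
    longest = max(len(a), len(b))
    length_ratio = min(len(a), len(b)) / longest if longest else 1.0
    return 0.5 * (jaccard + length_ratio)


def column_consistency(column: list[str]) -> float:
    """Average pairwise similarity of the entries in one column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0  # a single entry gives no evidence of consistency
    return mean(similarity(a, b) for a, b in pairs)


def table_score(rows: list[list[str]], levelling: float = 1.0) -> float:
    """Mean column consistency, shrunk for small tables (assumed levelling)."""
    columns = list(zip(*rows))
    if not columns:
        return 0.0
    raw = mean(column_consistency(list(col)) for col in columns)
    n_entries = len(rows) * len(columns)
    # Small tables are pulled towards 0 because their average is less reliable.
    return raw * n_entries / (n_entries + levelling)
```

With this sketch, a candidate whose columns hold entries of the same kind, such as `[["Alice", "23"], ["Bobby", "41"], ["Carol", "35"]]`, scores higher than one whose columns mix unrelated strings, and of two equally consistent candidates the larger one is preferred.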

Item type: Report
Published: 2008
Author(s): Weizsäcker, Lorenz ; Fürnkranz, Johannes
Entry type: Bibliography
Title: On Table Extraction from Text Sources with Markups
Language: English
Year of publication: 2008
URL / URN: http://www.ke.informatik.tu-darmstadt.de/publications/report...
ID number: TUD-KE-2008-05
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Knowledge Engineering
Date deposited: 24 Jun 2011 15:11
Last modified: 26 Aug 2018 21:26