
Domain-Specific Corpus Expansion with Focused Webcrawling

Remus, Steffen ; Biemann, Chris
Eds.: Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios (2016)
Domain-Specific Corpus Expansion with Focused Webcrawling.
Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia (23.05.2016-28.05.2016)
Conference publication, bibliography

Abstract

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks. As input, it needs only N-grams or plain texts of a predefined domain, plus seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain; the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open-source software.
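
The idea of scoring relatedness with an N-gram language model can be illustrated with a minimal sketch. It is not the authors' implementation: the bigram order, the add-one smoothing, the crude tokenizer, and the names `BigramModel` and `rank_by_relatedness` are all assumptions made for illustration. The sketch trains a model on in-domain text and treats lower perplexity as a sign that a candidate text is closer to the domain.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class BigramModel:
    """Hypothetical bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, domain_text):
        tokens = ["<s>"] + tokenize(domain_text)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        # One extra slot so that unseen words still receive some probability.
        self.vocab_size = len(self.unigrams) + 1

    def perplexity(self, text):
        """Per-token perplexity of `text`; lower means closer to the domain."""
        tokens = ["<s>"] + tokenize(text)
        if len(tokens) < 2:
            return float("inf")  # nothing to score
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(tokens) - 1))


def rank_by_relatedness(model, candidates):
    """Sort (url, text) pairs so the lowest-perplexity text comes first."""
    return sorted(candidates, key=lambda pair: model.perplexity(pair[1]))
```

A crawler built along these lines could score the text of each fetched page, or the text surrounding each outgoing link, and visit the lowest-perplexity candidates first. How the published crawler combines document and link scores is not specified in the abstract.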

Item type: Conference publication
Published: 2016
Editors: Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios
Author(s): Remus, Steffen ; Biemann, Chris
Type of entry: Bibliography
Title: Domain-Specific Corpus Expansion with Focused Webcrawling
Language: English
Publication date: May 2016
Place of publication: Paris
Publisher: European Language Resources Association (ELRA)
Book title: LREC 2016, Tenth International Conference on Language Resources and Evaluation : May 23-28, 2016, Grand Hotel Bernardin Conference Center, Portorož, Slovenia
Event title: Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Event location: Portorož, Slovenia
Event dates: 23.05.2016-28.05.2016
URL / URN: http://www.lrec-conf.org/proceedings/lrec2016/summaries/316....

Uncontrolled keywords: Knowledge Discovery in Scientific Literature
ID number: TUD-CS-2016-0064
Department(s)/field(s): 20 Department of Computer Science
20 Department of Computer Science > Language Technology
20 Department of Computer Science > Ubiquitous Knowledge Processing
DFG Research Training Groups
DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Date deposited: 23 Oct 2018 13:56
Last modified: 09 Feb 2024 13:31