Domain-Specific Corpus Expansion with Focused Webcrawling

Remus, Steffen and Biemann, Chris (2016):
Domain-Specific Corpus Expansion with Focused Webcrawling.
In: Proceedings Tenth International Conference on Language Resources and Evaluation (LREC 2016), ELRA, pp. 3607-3611, [Online-Edition: http://www.lrec-conf.org/proceedings/lrec2016/pdf/316_Paper....],
[Conference or Workshop Item]

Abstract

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks, and requires as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain; the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.
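
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that relatedness is measured as perplexity under an add-one smoothed bigram model, and that the crawl frontier is a priority queue ordered by that score. The names `BigramModel` and `Frontier`, the whitespace tokenizer, and the example URLs are hypothetical; fetching and link extraction are left out.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real crawler would need more).
    return text.lower().split()


class BigramModel:
    """Add-one smoothed bigram model, a stand-in for an N-gram language model."""

    def __init__(self, domain_text):
        tokens = ["<s>"] + tokenize(domain_text)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen words

    def perplexity(self, text):
        # Lower perplexity means the text looks more like the domain data.
        tokens = ["<s>"] + tokenize(text)
        if len(tokens) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(num / den)
        return math.exp(-log_prob / (len(tokens) - 1))


class Frontier:
    """Crawl frontier: URLs with the lowest perplexity are visited first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (score, url))

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, score

    def __len__(self):
        return len(self._heap)


# Usage: score two link contexts against a tiny in-domain sample.
model = BigramModel(
    "the crawler follows links to pages about language models and corpora"
)
frontier = Frontier()
frontier.push("http://example.org/a", model.perplexity("language models and corpora"))
frontier.push("http://example.org/b", model.perplexity("cheap flights and hotel deals"))
next_url, next_score = frontier.pop()  # the in-domain link is popped first
```

In this sketch the score of a link is computed from a short text attached to it; which text the actual crawler scores for weblinks is not stated in the abstract.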

Item Type: Conference or Workshop Item
Published: 2016
Creators: Remus, Steffen and Biemann, Chris
Title: Domain-Specific Corpus Expansion with Focused Webcrawling
Language: English
Title of Book: Proceedings Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Publisher: ELRA
Uncontrolled Keywords: Knowledge Discovery in Scientific Literature
Divisions: 20 Department of Computer Science
20 Department of Computer Science > Sprachtechnologie
20 Department of Computer Science > Ubiquitous Knowledge Processing
Date Deposited: 31 Dec 2016 09:42
Official URL: http://www.lrec-conf.org/proceedings/lrec2016/pdf/316_Paper....
Identification Number: TUD-CS-2016-0064