Domain-Specific Corpus Expansion with Focused Webcrawling

Remus, Steffen ; Biemann, Chris
Eds.: Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios (2016)
Domain-Specific Corpus Expansion with Focused Webcrawling.
Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia (May 23-28, 2016)
Conference publication, bibliography

URL / URN: http://www.lrec-conf.org/proceedings/lrec2016/summaries/316....

Abstract

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain, the second experiment demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.
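The idea in the abstract, ranking links by how well an in-domain N-gram language model predicts the text around them, can be illustrated with a minimal sketch. This is not the authors' released crawler: the class `BigramModel`, the function `focused_crawl`, the add-one smoothing, the use of anchor text as the link score, and the in-memory toy `web` dictionary (standing in for real HTTP fetching) are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model over whitespace tokens (illustrative only)."""

    def __init__(self, text):
        tokens = ["<s>"] + text.lower().split()
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # one extra slot for unseen words

    def perplexity(self, text):
        """Lower perplexity means the text looks more like the domain text."""
        tokens = ["<s>"] + text.lower().split()
        if len(tokens) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(tokens) - 1))


def focused_crawl(model, web, seeds, max_pages=10):
    """Visit pages in order of increasing anchor-text perplexity.

    `web` maps a URL to (page_text, [(anchor_text, target_url), ...]).
    It is a hypothetical stand-in for downloading and parsing pages.
    """
    frontier = [(0.0, url) for url in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        _text, links = web[url]
        visited.append(url)
        for anchor, target in links:
            if target in web and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (model.perplexity(anchor), target))
    return visited


# Toy domain text and toy link graph (both invented for this example).
domain_text = "the crawler downloads web pages and the language model scores web pages"
model = BigramModel(domain_text)
web = {
    "seed": ("web pages and the language model",
             [("cheap holiday flights", "travel"), ("language model scores", "lm")]),
    "lm": ("the language model scores web pages", []),
    "travel": ("cheap holiday flights to sunny beaches", []),
}
visited = focused_crawl(model, web, ["seed"], max_pages=2)
print(visited)
```

With a page budget of two, the sketch follows the in-domain link before the off-topic one, because its anchor text has lower perplexity under the domain model. A real crawler would also need fetching, HTML parsing, politeness rules, and a stronger language model.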

Item type:	Conference publication
Published:	2016
Editors:	Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios
Author(s):	Remus, Steffen ; Biemann, Chris
Entry type:	Bibliography
Title:	Domain-Specific Corpus Expansion with Focused Webcrawling
Language:	English
Publication date:	May 2016
Place:	Paris
Publisher:	European Language Resources Association (ELRA)
Book title:	LREC 2016, Tenth International Conference on Language Resources and Evaluation : May 23-28, 2016, Grand Hotel Bernardin Conference Center, Portorož, Slovenia
Event title:	Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Event location:	Portorož, Slovenia
Event date:	May 23-28, 2016
URL / URN:	http://www.lrec-conf.org/proceedings/lrec2016/summaries/316....
Uncontrolled keywords:	Knowledge Discovery in Scientific Literature
ID number:	TUD-CS-2016-0064
Department(s)/field(s):	20 Department of Computer Science ; 20 Department of Computer Science > Language Technology ; 20 Department of Computer Science > Ubiquitous Knowledge Processing ; DFG Research Training Groups ; DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources
Date deposited:	23 Oct 2018 13:56
Last modified:	09 Feb 2024 13:31