Remus, Steffen ; Biemann, Chris
Eds.: Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios (2016)
Domain-Specific Corpus Expansion with Focused Webcrawling.
Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia (23.05.2016-28.05.2016)
Conference publication, Bibliography
Abstract
This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks; as input it needs only N-grams or plain texts of a predefined domain, plus seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain; the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open-source software.
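The idea of scoring documents by their relatedness to a domain with an N-gram language model can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the N-gram order, the smoothing method, or the exact relatedness measure, so the add-one smoothed bigram model, the use of perplexity as the relatedness score, and the names `BigramModel` and `rank_links` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


class BigramModel:
    """Add-one smoothed bigram model trained on in-domain text (hypothetical)."""

    def __init__(self, domain_text):
        tokens = ["<s>"] + tokenize(domain_text)
        # Count how often each token occurs as a context (i.e. has a successor).
        self.contexts = Counter(tokens[:-1])
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        # Reserve one extra vocabulary slot for unseen words.
        self.vocab_size = len(set(tokens)) + 1

    def perplexity(self, text):
        """Lower perplexity means the text looks more like the domain."""
        tokens = ["<s>"] + tokenize(text)
        if len(tokens) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(tokens) - 1))


def rank_links(model, candidates):
    """Sort (url, text) pairs so the most in-domain candidate comes first."""
    return [url for url, text in
            sorted(candidates, key=lambda pair: model.perplexity(pair[1]))]
```

In a crawler, a ranking of this kind could decide which links to follow next, with the text around each link serving as the candidate text; how the published crawler actually prioritizes links is not specified in this record.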
Item type: | Conference publication |
---|---|
Published: | 2016 |
Editors: | Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Goggi, Sara ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Helene ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios |
Author(s): | Remus, Steffen ; Biemann, Chris |
Type of entry: | Bibliography |
Title: | Domain-Specific Corpus Expansion with Focused Webcrawling |
Language: | English |
Date of publication: | May 2016 |
Place of publication: | Paris |
Publisher: | European Language Resources Association (ELRA) |
Book title: | LREC 2016, Tenth International Conference on Language Resources and Evaluation : May 23-28, 2016, Grand Hotel Bernardin Conference Center, Portorož, Slovenia |
Event title: | Tenth International Conference on Language Resources and Evaluation (LREC 2016) |
Event location: | Portorož, Slovenia |
Event dates: | 23.05.2016-28.05.2016 |
URL / URN: | http://www.lrec-conf.org/proceedings/lrec2016/summaries/316.... |
Uncontrolled keywords: | Knowledge Discovery in Scientific Literature |
ID number: | TUD-CS-2016-0064 |
Division(s): | 20 Department of Computer Science ; 20 Department of Computer Science > Language Technology ; 20 Department of Computer Science > Ubiquitous Knowledge Processing ; DFG Research Training Groups ; DFG Research Training Groups > Research Training Group 1994 Adaptive Preparation of Information from Heterogeneous Sources |
Date deposited: | 23 Oct 2018 13:56 |
Last modified: | 09 Feb 2024 13:31 |