Leipzig Corpora Collection - Crawler FAQ

The Leipzig Corpora Collection (LCC) is a project of the Natural Language Processing Group of the University of Leipzig. The LCC offers access to monolingual dictionaries in more than 200 languages.

The crawler that visited your website is collecting data for this project. The crawled data are used for language documentation and language statistics, which are freely available on our website.

Crawling is restricted to text; audio and video material is excluded. If such items are fetched anyway due to technical limitations, they are never stored.

We use the Heritrix crawler (version 3.3.0); see this link for details. Heritrix was developed by the Internet Archive and is used by several institutions.

If you want to exclude crawlers from your website, the common practice is to use the robots.txt file of your domain. Most crawlers, including ours, respect the rules you specify there. Adding the following lines to your robots.txt excludes the LCC crawler from your entire domain. Please allow one day for changes to take effect.

User-agent: LCC
Disallow: /
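
If you would like to check that these rules behave as intended before publishing them, the following minimal Python sketch evaluates them with the standard library's robots.txt parser. It is only an illustration: the helper name is_allowed, the example URL, and the second user agent "SomeOtherBot" are hypothetical, and Python's parser may differ in detail from the matching logic used by Heritrix.

```python
from urllib.robotparser import RobotFileParser

# The two lines from this FAQ, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: LCC
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; replace it with a page on your own domain.
    url = "https://example.org/some/page.html"
    print("LCC allowed:", is_allowed(ROBOTS_TXT, "LCC", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

With only these two lines in robots.txt, the sketch should report that LCC is not allowed while other user agents remain unaffected, because the Disallow rule applies only to the named user agent.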

Alternatively, we can add your domain to our blacklist of websites that are not to be crawled. If you would like us to do so, please send us an e-mail.