Web Crawler Lazy Preservation

Attention Webmasters

We are currently downloading a variety of websites that have been indexed by dmoz.org. Our research involves reconstructing websites that have been lost in some catastrophe. We use the downloaded data for research purposes only; we will not make the downloaded pages available for any commercial purpose or use them for harvesting e-mail addresses.

Our crawler presents itself as

Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.cs.odu.edu/~fmccown/research/lazy/crawler.html)

in the user agent field and respects the Robots Exclusion Protocol (robots.txt). Our crawler attempts to download a website exactly as it appears to a web archiving crawler (like the one used by the Internet Archive). Therefore, it downloads all HTML, JavaScript, images, and other embedded resources.

Our downloads occur once a week and will continue for several months. If you feel we are downloading your site too frequently, or if you do not want us to crawl your site at all, please edit your robots.txt file to indicate that our crawler should not crawl your site (an example is shown below), or e-mail me at fmccown at cs dot odu dot edu.
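For example, assuming our crawler matches robots.txt rules against the "heritrix" token in its user-agent string, the following entry would exclude it from your entire site (list specific paths after Disallow instead of / to exclude only part of your site):

# Exclude the Lazy Preservation research crawler
User-agent: heritrix
Disallow: /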

Sites being crawled: dmoz_300_website_urls_sorted.txt
Sorted by PageRank: dmoz_300_website_urls_PR_lang_sorted.txt


Heritrix Settings

These are the settings used by Heritrix when crawling:

Modules

Submodules

Other settings
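As a rough sketch of where such settings live: a Heritrix 1.x crawl is configured through an order.xml file. The hypothetical, abbreviated fragment below (not our complete configuration; the surrounding structure is trimmed and the contact value is a placeholder) shows where the user-agent string quoted above and the robots.txt behavior are declared:

<crawl-order>
  <controller>
    ...
    <!-- "classic" type honors robots.txt exclusions -->
    <newObject name="robots-honoring-policy"
               class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
      <string name="type">classic</string>
    </newObject>
    <!-- headers sent with every request; these identify the crawler -->
    <map name="http-headers">
      <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.cs.odu.edu/~fmccown/research/lazy/crawler.html)</string>
      <!-- placeholder; our real contact address is given above -->
      <string name="from">CONTACT_EMAIL_ADDRESS</string>
    </map>
    ...
  </controller>
</crawl-order>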

