Attention Webmasters
We are currently downloading a variety of websites that have been indexed by
dmoz.org.
Our research involves reconstructing websites that have been lost due
to some catastrophe. We are using the downloaded data for research purposes only
and will not make the downloaded pages available for any commercial
purpose or use them for harvesting e-mail addresses.
Our crawler presents itself as
Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.cs.odu.edu/~fmccown/research/lazy/crawler.html)
in the User-Agent field and respects the
Robots Exclusion Protocol (robots.txt). Our crawler attempts to download a website exactly as it appears to a
web archiving crawler (like the one used by the
Internet Archive).
Therefore it downloads all HTML, JavaScript, images,
etc.
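Because the crawler always sends that exact User-Agent value, a site operator can recognize its
requests in server logs or request handlers. The small check below is only an illustration (it is
not part of the crawler); it simply tests a User-Agent string against the value quoted above.

    // Illustrative only: recognize requests from this crawler by the
    // User-Agent header value quoted above.
    public class CrawlerCheck {
        static boolean isLazyCrawler(String userAgent) {
            return userAgent != null
                    && userAgent.contains("heritrix/1.8.0")
                    && userAgent.contains("cs.odu.edu/~fmccown");
        }

        public static void main(String[] args) {
            System.out.println(isLazyCrawler(
                "Mozilla/5.0 (compatible; heritrix/1.8.0 "
                + "+http://www.cs.odu.edu/~fmccown/research/lazy/crawler.html)"));
            // prints: true
        }
    }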
Our downloads occur once a week and will run for several months.
If you feel that we are
downloading your site too frequently, or you do not want us to crawl your site at all,
please edit your robots.txt file to indicate that our crawler should not
crawl it. Or you may email me at
.
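For example, a minimal robots.txt entry that asks this crawler to stay away entirely might look
like the following. The "heritrix" token is an assumption based on the user-agent string above; a
User-agent: * record would instead apply to all crawlers that honor robots.txt.

    # Ask only this research crawler (the "heritrix" token) not to crawl anything
    User-agent: heritrix
    Disallow: /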
Sites being crawled: dmoz_300_website_urls_sorted.txt
Sorted by PageRank: dmoz_300_website_urls_PR_lang_sorted.txt
Heritrix Settings
These are the settings used by Heritrix when crawling:
Modules
- Crawl Scope: org.archive.crawler.deciderules.DecidingScope
- URI Frontier: org.archive.crawler.frontier.BdbFrontier
- Pre Processors:
- org.archive.crawler.prefetch.Preselector
- org.archive.crawler.prefetch.PreconditionEnforcer
- Fetchers:
- org.archive.crawler.fetcher.FetchDNS
- org.archive.crawler.fetcher.FetchHTTP
- Extractors:
- org.archive.crawler.extractor.ExtractorHTML
- Writers: org.archive.crawler.writer.ARCWriterProcessor
- Post Processors:
- org.archive.crawler.postprocessor.CrawlStateUpdater
- org.archive.crawler.postprocessor.LinksScoper
- org.archive.crawler.postprocessor.FrontierScheduler
- Statistics Tracking: org.archive.crawler.admin.StatisticsTracker
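The modules above form Heritrix's processor chains; each URI taken from the frontier passes
through them in order: pre-fetch checks, fetching, link extraction, ARC writing, and
post-processing. The sketch below is a simplified, hypothetical illustration of that ordering
only; it is not Heritrix code.

    // A simplified, hypothetical sketch of the order in which one URI moves
    // through the processor chains listed above. Illustrative only.
    import java.util.List;
    import java.util.function.Consumer;

    public class ProcessorChainSketch {
        public static void main(String[] args) {
            String uri = "http://example.com/";
            List<Consumer<String>> chain = List.of(
                u -> System.out.println("pre-fetch   : scope check and preconditions for " + u),
                u -> System.out.println("fetch       : DNS lookup, then HTTP GET of " + u),
                u -> System.out.println("extract     : pull links out of the fetched HTML of " + u),
                u -> System.out.println("write       : append the response to an ARC file"),
                u -> System.out.println("post-process: update crawl state, scope and schedule new links")
            );
            chain.forEach(stage -> stage.accept(uri));
        }
    }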
Submodules
- decide-rules:
- org.archive.crawler.deciderules.RejectDecideRule
- org.archive.crawler.deciderules.SurtPrefixedDecideRule
- org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule
- org.archive.crawler.deciderules.PathologicalPathDecideRule (max-repetitions: 2)
- org.archive.crawler.deciderules.TooManyHopsDecideRule (max-hops: 15)
- org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule (max-path-depth: 15)
- robots-honoring-policy: classic
- uri-canonicalization-rules:
- org.archive.crawler.url.canonicalize.LowercaseRule
- org.archive.crawler.url.canonicalize.StripUserinfoRule
- org.archive.crawler.url.canonicalize.StripWWWRule
- org.archive.crawler.url.canonicalize.StripSessionIDs
- org.archive.crawler.url.canonicalize.FixupQueryStr
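The canonicalization rules rewrite each discovered URI into a normal form before duplicate
detection. The sketch below is a rough, standalone approximation of their combined effect; the
regular expressions are simplified guesses, not the actual Heritrix implementations.

    // Rough approximation of the uri-canonicalization-rules listed above.
    // The patterns are simplified guesses, not Heritrix's own code.
    public class CanonicalizeSketch {
        static String canonicalize(String uri) {
            String u = uri.toLowerCase();                            // LowercaseRule
            u = u.replaceFirst("^(https?://)[^/@]+@", "$1");         // StripUserinfoRule: drop user:pass@
            u = u.replaceFirst("^(https?://)www\\.", "$1");          // StripWWWRule: drop leading "www."
            u = u.replaceAll(";jsessionid=[0-9a-z]+"
                    + "|[&?](phpsessid|sid)=[0-9a-z]+", "");         // StripSessionIDs (simplified)
            u = u.replaceFirst("\\?$", "");                          // FixupQueryStr: drop a trailing empty "?"
            return u;
        }

        public static void main(String[] args) {
            System.out.println(canonicalize(
                "http://User:Pass@WWW.Example.com/Page.html;jsessionid=1A2B3C?"));
            // prints: http://example.com/page.html
        }
    }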
Other settings
- HTTP - max-length-bytes: 10485760 (10 MB)
- frontier - delay-factor: 4.0 (multiple of the last fetch's elapsed time to wait before re-contacting the same server)
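For example, with a delay-factor of 4.0, if the last request to a given server took 500 ms to complete, the crawler would wait roughly 4.0 × 500 ms = 2 seconds before fetching the next URI from that server (subject to any configured minimum and maximum delays).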