you would like to save the crawled files in a file/directory format instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This Processor will store files in a directory structure that matches the crawled URIs. The files will be stored in the crawl job's mirror directory.
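As a rough sketch of the bean change described above, the fragment below shows how the warcWriter definition in the job's crawler-beans.cxml might look after the swap. It assumes a Heritrix 3 style Spring configuration in which the bean is declared with id "warcWriter"; only the class name comes from the text, and no processor properties are set, so the defaults apply.

```xml
<!-- Sketch only: assumes a Heritrix 3 crawler-beans.cxml where the writer
     bean is declared with id "warcWriter". The class attribute is changed
     from the WARC writer to MirrorWriterProcessor, as described above. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- No properties are overridden here; with the defaults, files are
       written under the crawl job's mirror directory. -->
</bean>
```

Keeping the bean id unchanged means any other beans that refer to warcWriter by name, such as the processor chain, should not need editing.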
Mirroring HTML Files Only
Reposted from sharehua.iteye.com/blog/1745554