Preface
The previous two posts basically covered Spider; of the four components, there are still three wrapper classes left to discuss. This post walks through OOSpider, which rounds out the discussion of Spider. I consider OOSpider one of WebMagic's most powerful features: it lets you write crawlers with annotations. Official docs:
http://webmagic.io/docs/zh/posts/ch5-annotation/README.html
Example
First, let's see how a crawler is written with annotations:
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
The first time I saw annotation-driven crawling, all I could think was "this is so damn Java". I really admire the author, so let's walk through how it works.
Prerequisites
First, you need to know what annotations are. No dependency-injection background is required, and you don't need to know Spring; you do need to understand reflection.
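If annotations and reflection are new to you, here is a minimal, self-contained sketch (the annotation `MyExtractBy` and class `Repo` are made up for illustration; they are not WebMagic's) showing the core trick OOSpider relies on: declaring an annotation on a field and reading its value back at runtime via reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class AnnotationDemo {

    // A toy annotation, analogous in spirit to WebMagic's @ExtractBy.
    // RUNTIME retention is what makes it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface MyExtractBy {
        String value();
    }

    static class Repo {
        @MyExtractBy("//h1/text()")
        private String name;
    }

    // Read the annotation value of a field at runtime via reflection
    static String xpathOf(Class<?> clazz, String fieldName) throws Exception {
        Field f = clazz.getDeclaredField(fieldName);
        MyExtractBy anno = f.getAnnotation(MyExtractBy.class);
        return anno == null ? null : anno.value();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(xpathOf(Repo.class, "name")); // prints //h1/text()
    }
}
```

This is essentially all PageModelExtractor does at setup time: scan the annotated class's fields and record each field's extraction rule.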
OOSpider
Let's start with the OOSpider class itself:
public class OOSpider<T> extends Spider {

    private ModelPageProcessor modelPageProcessor;

    private ModelPipeline modelPipeline;

    private PageModelPipeline pageModelPipeline;

    private List<Class> pageModelClasses = new ArrayList<Class>();

    protected OOSpider(ModelPageProcessor modelPageProcessor) {
        super(modelPageProcessor);
        this.modelPageProcessor = modelPageProcessor;
    }

    public OOSpider(PageProcessor pageProcessor) {
        super(pageProcessor);
    }

    /**
     * create a spider
     *
     * @param site site
     * @param pageModelPipeline pageModelPipeline
     * @param pageModels pageModels
     */
    public OOSpider(Site site, PageModelPipeline pageModelPipeline, Class... pageModels) {
        this(ModelPageProcessor.create(site, pageModels));
        this.modelPipeline = new ModelPipeline();
        super.addPipeline(modelPipeline);
        for (Class pageModel : pageModels) {
            if (pageModelPipeline != null) {
                this.modelPipeline.put(pageModel, pageModelPipeline);
            }
            pageModelClasses.add(pageModel);
        }
    }

    @Override
    protected CollectorPipeline getCollectorPipeline() {
        return new PageModelCollectorPipeline<T>(pageModelClasses.get(0));
    }

    public static OOSpider create(Site site, Class... pageModels) {
        return new OOSpider(site, null, pageModels);
    }

    public static OOSpider create(Site site, PageModelPipeline pageModelPipeline, Class... pageModels) {
        return new OOSpider(site, pageModelPipeline, pageModels);
    }

    public OOSpider addPageModel(PageModelPipeline pageModelPipeline, Class... pageModels) {
        for (Class pageModel : pageModels) {
            modelPageProcessor.addPageModel(pageModel);
            modelPipeline.put(pageModel, pageModelPipeline);
        }
        return this;
    }
}
This class extends Spider but changes one of the four components, the PageProcessor. That makes sense: the extraction rules are already declared on the class you define, so you no longer need to pass in a PageProcessor yourself.
As for the Pipeline, yours needs to implement PageModelPipeline. OOSpider holds a ModelPipeline as a member; the ModelPipeline runs first and then delegates to the PageModelPipeline you implemented.
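The ModelPipeline dispatch described above can be sketched without WebMagic at all. The following is a self-contained illustration (the class and field names are made up, not WebMagic's): a dispatcher maps each model class to the handler registered for it, mirroring how `modelPipeline.put(pageModel, pageModelPipeline)` is used in the constructor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// A toy sketch of ModelPipeline-style dispatch: each extracted model
// instance is routed to the handler registered for its class.
public class ModelDispatchDemo {

    private final Map<Class<?>, Consumer<Object>> handlers = new HashMap<>();

    // Analogous to ModelPipeline.put(pageModel, pageModelPipeline)
    <T> void put(Class<T> clazz, Consumer<Object> handler) {
        handlers.put(clazz, handler);
    }

    // Analogous to ModelPipeline.process: look up the handler by the
    // model's class and delegate to it.
    void process(Object model) {
        Consumer<Object> handler = handlers.get(model.getClass());
        if (handler != null) {
            handler.accept(model);
        }
    }

    static class GithubRepoModel {
        String name = "webmagic";
    }

    public static void main(String[] args) {
        ModelDispatchDemo dispatcher = new ModelDispatchDemo();
        dispatcher.put(GithubRepoModel.class,
                m -> System.out.println(((GithubRepoModel) m).name));
        dispatcher.process(new GithubRepoModel()); // prints webmagic
    }
}
```

The point is that your PageModelPipeline receives a typed, already-populated model object rather than raw ResultItems, which is what makes the annotation API pleasant to persist from.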
See the initialization code below.
Initialization
public OOSpider(Site site, PageModelPipeline pageModelPipeline, Class... pageModels) {
    this(ModelPageProcessor.create(site, pageModels));
    this.modelPipeline = new ModelPipeline();
    super.addPipeline(modelPipeline);
    for (Class pageModel : pageModels) {
        if (pageModelPipeline != null) {
            this.modelPipeline.put(pageModel, pageModelPipeline);
        }
        pageModelClasses.add(pageModel);
    }
}
Next, let's look at ModelPageProcessor. This class implements PageProcessor, so in the main loop it is the one that parses each Page (if you're not familiar with the main loop, read my first post). The actual extraction is delegated to PageModelExtractor; each annotated class corresponds to one PageModelExtractor.
Parsing
The key part is the process method of the PageProcessor interface:
class ModelPageProcessor implements PageProcessor {

    private List<PageModelExtractor> pageModelExtractorList = new ArrayList<PageModelExtractor>();

    private Site site;

    public static ModelPageProcessor create(Site site, Class... clazzs) {
        ModelPageProcessor modelPageProcessor = new ModelPageProcessor(site);
        for (Class clazz : clazzs) {
            modelPageProcessor.addPageModel(clazz);
        }
        return modelPageProcessor;
    }

    public ModelPageProcessor addPageModel(Class clazz) {
        PageModelExtractor pageModelExtractor = PageModelExtractor.create(clazz);
        pageModelExtractorList.add(pageModelExtractor);
        return this;
    }

    private ModelPageProcessor(Site site) {
        this.site = site;
    }

    @Override
    public void process(Page page) {
        for (PageModelExtractor pageModelExtractor : pageModelExtractorList) {
            extractLinks(page, pageModelExtractor.getHelpUrlRegionSelector(), pageModelExtractor.getHelpUrlPatterns());
            extractLinks(page, pageModelExtractor.getTargetUrlRegionSelector(), pageModelExtractor.getTargetUrlPatterns());
            Object process = pageModelExtractor.process(page);
            if (process == null || (process instanceof List && ((List) process).size() == 0)) {
                continue;
            }
            postProcessPageModel(pageModelExtractor.getClazz(), process);
            page.putField(pageModelExtractor.getClazz().getCanonicalName(), process);
        }
        if (page.getResultItems().getAll().size() == 0) {
            page.getResultItems().setSkip(true);
        }
    }

    private void extractLinks(Page page, Selector urlRegionSelector, List<Pattern> urlPatterns) {
        List<String> links;
        if (urlRegionSelector == null) {
            links = page.getHtml().links().all();
        } else {
            links = page.getHtml().selectList(urlRegionSelector).links().all();
        }
        for (String link : links) {
            for (Pattern targetUrlPattern : urlPatterns) {
                Matcher matcher = targetUrlPattern.matcher(link);
                if (matcher.find()) {
                    page.addTargetRequest(new Request(matcher.group(1)));
                }
            }
        }
    }

    protected void postProcessPageModel(Class clazz, Object object) {
    }

    @Override
    public Site getSite() {
        return site;
    }
}
The extractLinks method checks the URLs found on the Page against the @TargetUrl and @HelpUrl annotations; any link that matches is added as a target request. Target requests are then handed to the Scheduler in Spider's main loop as the next URLs to crawl, in this Spider method:
protected void extractAndAddRequests(Page page, boolean spawnUrl) {
    if (spawnUrl && CollectionUtils.isNotEmpty(page.getTargetRequests())) {
        for (Request request : page.getTargetRequests()) {
            addRequest(request);
        }
    }
}
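The pattern matching inside extractLinks can be reproduced in isolation. Note that the quoted code calls `matcher.group(1)`: presumably PageModelExtractor wraps the annotation's expression in a capturing group when compiling it, so group 1 is the matched URL. The sketch below (class and method names are mine, not WebMagic's) makes that assumption explicit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the link-matching step against a @TargetUrl-style regex.
public class LinkMatchDemo {

    static List<String> matchLinks(List<String> links, String targetUrlRegex) {
        // Wrap the whole expression in a capturing group, which is what
        // makes matcher.group(1) meaningful in the quoted extractLinks code.
        Pattern pattern = Pattern.compile("(" + targetUrlRegex + ")");
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            Matcher matcher = pattern.matcher(link);
            if (matcher.find()) {
                matched.add(matcher.group(1));
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://github.com/code4craft/webmagic",  // matches @TargetUrl
                "https://github.com/code4craft",           // user page: no match
                "https://example.com/about");              // unrelated: no match
        System.out.println(matchLinks(links, "https://github\\.com/\\w+/\\w+"));
        // prints [https://github.com/code4craft/webmagic]
    }
}
```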
Note that target URLs can repeat: if you are on a.html and that page is full of links to a.html, it is quite possible that every target URL of this page is a.html. As mentioned before, deduplication is the Scheduler's job!
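That deduplication boils down to a set-membership check; WebMagic's default Scheduler keeps crawled URLs in a hash set. A minimal sketch of the idea (names here are illustrative, not WebMagic's actual classes):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of hash-set-based URL dedup, the strategy behind the default
// Scheduler: a URL is only crawled the first time it is pushed.
public class DedupDemo {

    // Synchronized because the Spider pushes requests from multiple threads
    private final Set<String> seen = Collections.synchronizedSet(new HashSet<>());

    // Returns true if the URL has not been seen before and should be queued
    boolean shouldCrawl(String url) {
        return seen.add(url); // Set.add returns false for duplicates
    }

    public static void main(String[] args) {
        DedupDemo dedup = new DedupDemo();
        System.out.println(dedup.shouldCrawl("https://github.com/a.html")); // true
        System.out.println(dedup.shouldCrawl("https://github.com/a.html")); // false
    }
}
```

So even if a page yields the same target URL a hundred times, the Scheduler queues it at most once.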
Object process = pageModelExtractor.process(page);
This is the most important call in the process method; its return value is an instance of the annotated class. If you are not interested in how that instance is built, skip ahead to
page.putField(pageModelExtractor.getClazz().getCanonicalName(), process);
This call puts the process instance into the Page's ResultItems. The ResultItems is what gets passed into each Pipeline's process method, i.e. the concrete instance we can persist!
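The handoff via ResultItems is essentially a string-keyed map, with the model class's canonical name as the key. A self-contained sketch of that round trip (using a plain HashMap to stand in for ResultItems; the classes here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ResultItems handoff: the extracted model instance is
// stored under its class's canonical name, and the pipeline side
// retrieves it by the same key.
public class ResultItemsDemo {

    static class GithubRepo {
        String name = "webmagic";
    }

    public static void main(String[] args) {
        // putField side (as in ModelPageProcessor.process)
        Map<String, Object> resultItems = new HashMap<>();
        GithubRepo model = new GithubRepo();
        resultItems.put(GithubRepo.class.getCanonicalName(), model);

        // pipeline side: fetch the typed instance back by canonical name
        GithubRepo fetched =
                (GithubRepo) resultItems.get(GithubRepo.class.getCanonicalName());
        System.out.println(fetched.name); // prints webmagic
    }
}
```

Using the canonical class name as the key is what lets one ModelPipeline serve several model classes at once without collisions.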
How injection works
The actual injection code:
public Object process(Page page) {
    boolean matched = false;
    for (Pattern targetPattern : targetUrlPatterns) {
        if (targetPattern.matcher(page.getUrl().toString()).matches()) {
            matched = true;
        }
    }
    if (!matched) {
        return null;
    }
    if (objectExtractor == null) {
        return processSingle(page, null, true);
    } else {
        if (objectExtractor.multi) {
            List<Object> os = new ArrayList<Object>();
            List<String> list = objectExtractor.getSelector().selectList(page.getRawText());
            for (String s : list) {
                Object o = processSingle(page, s, false);
                if (o != null) {
                    os.add(o);
                }
            }
            return os;
        } else {
            String select = objectExtractor.getSelector().select(page.getRawText());
            Object o = processSingle(page, select, false);
            return o;
        }
    }
}
It first checks whether the current Page matches the targetUrl patterns; if so, it performs the actual injection:
private Object processSingle(Page page, String html, boolean isRaw) {
    Object o = null;
    try {
        o = clazz.newInstance();
        for (FieldExtractor fieldExtractor : fieldExtractors) {
            if (fieldExtractor.isMulti()) {
                List<String> value;
                switch (fieldExtractor.getSource()) {
                    case RawHtml:
                        value = page.getHtml().selectDocumentForList(fieldExtractor.getSelector());
                        break;
                    case Html:
                        if (isRaw) {
                            value = page.getHtml().selectDocumentForList(fieldExtractor.getSelector());
                        } else {
                            value = fieldExtractor.getSelector().selectList(html);
                        }
                        break;
                    case Url:
                        value = fieldExtractor.getSelector().selectList(page.getUrl().toString());
                        break;
                    default:
                        value = fieldExtractor.getSelector().selectList(html);
                }
                if ((value == null || value.size() == 0) && fieldExtractor.isNotNull()) {
                    return null;
                }
                if (fieldExtractor.getObjectFormatter() != null) {
                    List<Object> converted = convert(value, fieldExtractor.getObjectFormatter());
                    setField(o, fieldExtractor, converted);
                } else {
                    setField(o, fieldExtractor, value);
                }
            } else {
                String value;
                switch (fieldExtractor.getSource()) {
                    case RawHtml:
                        value = page.getHtml().selectDocument(fieldExtractor.getSelector());
                        break;
                    case Html:
                        if (isRaw) {
                            value = page.getHtml().selectDocument(fieldExtractor.getSelector());
                        } else {
                            value = fieldExtractor.getSelector().select(html);
                        }
                        break;
                    case Url:
                        value = fieldExtractor.getSelector().select(page.getUrl().toString());
                        break;
                    default:
                        value = fieldExtractor.getSelector().select(html);
                }
                if (value == null && fieldExtractor.isNotNull()) {
                    return null;
                }
                if (fieldExtractor.getObjectFormatter() != null) {
                    Object converted = convert(value, fieldExtractor.getObjectFormatter());
                    if (converted == null && fieldExtractor.isNotNull()) {
                        return null;
                    }
                    setField(o, fieldExtractor, converted);
                } else {
                    setField(o, fieldExtractor, value);
                }
            }
        }
        if (AfterExtractor.class.isAssignableFrom(clazz)) {
            ((AfterExtractor) o).afterProcess(page);
        }
    } catch (InstantiationException e) {
        logger.error("extract fail", e);
    } catch (IllegalAccessException e) {
        logger.error("extract fail", e);
    } catch (InvocationTargetException e) {
        logger.error("extract fail", e);
    }
    return o;
}
o = clazz.newInstance(); is where the concrete instance is created.
The setField method injects the extracted value into the instance. You don't need to write a setter yourself: the field is written directly via reflection (and if a setter does exist, it is invoked as well).
private void setField(Object o, FieldExtractor fieldExtractor, Object value) throws IllegalAccessException, InvocationTargetException {
    if (value == null) {
        return;
    }
    if (fieldExtractor.getSetterMethod() != null) {
        fieldExtractor.getSetterMethod().invoke(o, value);
    }
    fieldExtractor.getField().set(o, value);
}
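The reflection mechanics behind setField can be shown in a few lines. A private field can be written directly via Field.set once it has been made accessible, which is why the annotated model class needs no setters (the Repo class below is a stand-in for illustration):

```java
import java.lang.reflect.Field;

// Minimal demo of writing a private field via reflection, as setField does.
public class FieldSetDemo {

    static class Repo {
        private String name;
    }

    public static void main(String[] args) throws Exception {
        Repo repo = new Repo();
        Field field = Repo.class.getDeclaredField("name");
        field.setAccessible(true);   // allow writing the private field
        field.set(repo, "webmagic"); // inject the extracted value
        System.out.println(repo.name); // prints webmagic
    }
}
```

In WebMagic the setAccessible call happens once, when the PageModelExtractor is built, so each extraction only pays the cost of the set itself.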
This part of the code is somewhat involved because it also covers data extraction; if you're not interested in extraction you can stop here. FieldExtractor is the class that handles extraction for each individual field:
class FieldExtractor extends Extractor {

    private final Field field;

    private Method setterMethod;

    private ObjectFormatter objectFormatter;

    public FieldExtractor(Field field, Selector selector, Source source, boolean notNull, boolean multi) {
        super(selector, source, notNull, multi);
        this.field = field;
    }

    Field getField() {
        return field;
    }

    Selector getSelector() {
        return selector;
    }

    Source getSource() {
        return source;
    }

    void setSetterMethod(Method setterMethod) {
        this.setterMethod = setterMethod;
    }

    Method getSetterMethod() {
        return setterMethod;
    }

    boolean isNotNull() {
        return notNull;
    }

    ObjectFormatter getObjectFormatter() {
        return objectFormatter;
    }

    void setObjectFormatter(ObjectFormatter objectFormatter) {
        this.objectFormatter = objectFormatter;
    }
}
Wrapping up
That covers the flow specific to OOSpider; everything else works the same as Spider.