HtmlCleaner reports "Unknown function contains" when scraping a page

When HtmlCleaner looks up a node with XPath, calling the contains function in the expression fails with: "org.htmlcleaner.XPatherException: Unknown function contains".

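The root node built by HtmlCleaner cannot evaluate contains directly with HtmlCleaner's own XPath support; it has to be converted first, as follows: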

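import java.io.File
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.htmlcleaner.{CleanerProperties, DomSerializer, HtmlCleaner}

lazy val htmlCleaner = new HtmlCleaner
// '国际刊号' is the ISSN label text as it appears on the scraped page
lazy val ISSNXPath = "//div[@class='bdy4']//b[contains(text(), '国际刊号')]"
def extract(path: String): Unit = {
  // Parse the HTML file into HtmlCleaner's own TagNode tree
  val root = htmlCleaner.clean(new File(path))
  // Convert the TagNode tree into a standard org.w3c.dom.Document
  val doc = new DomSerializer(new CleanerProperties).createDOM(root)
  // Query with the JDK XPath engine, which supports contains()
  val xpath = XPathFactory.newInstance.newXPath
  val value = xpath.evaluate(ISSNXPath, doc, XPathConstants.NODE)
  println(value)
  // The text right after the matched <b> element is its next sibling (a text node)
  val next = value.asInstanceOf[org.w3c.dom.Node].getNextSibling
  println(next.getTextContent)
}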
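As the Scala code above shows: convert the root node created by HtmlCleaner into a standard W3C DOM node, build a standard javax.xml.xpath XPath, and run the query against the converted document.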

Text that follows the matched node and is not wrapped in an element of its own (that is, a text node) can be obtained with getNextSibling.
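
The getNextSibling step can be tried in isolation with nothing but the JDK. The following is a minimal sketch, not the original author's code: the object name NextSiblingDemo, the helper textAfterLabel, and the sample markup are all made up for illustration, and the sample is parsed directly as XML instead of going through HtmlCleaner. It also assumes the label contains no single quote, since the label is spliced into the XPath string without escaping.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.Node
import org.xml.sax.InputSource

object NextSiblingDemo {
  // Hypothetical sample markup: the label sits inside <b>,
  // the value is bare text immediately after it.
  val sample = "<div class='bdy4'><b>国际刊号:</b>1234-5678<br/></div>"

  // Returns the text node that directly follows the <b> whose text contains `label`,
  // or None if no such <b> exists or it has no following sibling.
  def textAfterLabel(xml: String, label: String): Option[String] = {
    val builder = DocumentBuilderFactory.newInstance.newDocumentBuilder
    val doc = builder.parse(new InputSource(new StringReader(xml)))
    val xpath = XPathFactory.newInstance.newXPath
    // Assumption: `label` contains no single quote (no escaping is done here)
    val expr = s"//div[@class='bdy4']//b[contains(text(), '$label')]"
    // evaluate(...) yields null when nothing matches, so wrap the result in Option
    val matched = xpath.evaluate(expr, doc, XPathConstants.NODE).asInstanceOf[Node]
    for {
      node    <- Option(matched)
      sibling <- Option(node.getNextSibling)
    } yield sibling.getTextContent.trim
  }

  def main(args: Array[String]): Unit =
    println(textAfterLabel(sample, "国际刊号"))
}
```

Wrapping both the match and the sibling in Option avoids the NullPointerException that a direct cast followed by getNextSibling would raise when the label is not on the page.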
Source: http://www.imilo.cn/findblog/28

Reposted from zhymin77.iteye.com/blog/1866199