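When you look up a node with XPath in HtmlCleaner, an expression that calls the contains function fails with: "org.htmlcleaner.XPatherException: Unknown function contains".
HtmlCleaner's built-in XPath evaluator does not support contains, so the root node it builds cannot be queried with that function directly. Convert it first, as follows: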
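import java.io.File
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.htmlcleaner.{CleanerProperties, DomSerializer, HtmlCleaner}
import org.w3c.dom.Node

lazy val htmlCleaner = new HtmlCleaner
// '国际刊号' is the label the page uses for the ISSN
lazy val ISSNXPath = "//div[@class='bdy4']//b[contains(text(), '国际刊号')]"
def extract(path: String): Unit = {
  // Parse the HTML file into HtmlCleaner's own TagNode tree
  val root = htmlCleaner.clean(new File(path))
  // Convert the TagNode tree into a standard org.w3c.dom.Document
  val doc = new DomSerializer(new CleanerProperties).createDOM(root)
  // The standard javax.xml.xpath engine supports contains()
  val xpath = XPathFactory.newInstance.newXPath
  val value = xpath.evaluate(ISSNXPath, doc, XPathConstants.NODE)
  println(value)
  // The value that follows the <b> label is its next sibling node
  val next = value.asInstanceOf[Node].getNextSibling
  println(next.getTextContent)
}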
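As the Scala code above shows: convert the root node created by HtmlCleaner into a standard W3C DOM node, create a standard javax.xml.xpath XPath object, and run the query against that DOM.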
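To check that contains really is handled by the standard XPath engine, independent of HtmlCleaner, the query can be tried on a small hand-written document. This is a minimal sketch using only the JDK. The object name ContainsDemo, the helper findLabel, and the sample markup (including the number 1234-5678) are made up for illustration and do not come from the original page.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.Node

object ContainsDemo {
  // Hypothetical, well-formed stand-in for a page already cleaned by HtmlCleaner.
  val sample: String =
    "<html><body><div class='bdy4'><p><b>国际刊号:</b>1234-5678</p></div></body></html>"

  // Parses the markup into a W3C DOM and returns the first node matching expr
  // (null if nothing matches, as javax.xml.xpath does for XPathConstants.NODE).
  def findLabel(xml: String, expr: String): Node = {
    val builder = DocumentBuilderFactory.newInstance.newDocumentBuilder
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val xpath = XPathFactory.newInstance.newXPath
    xpath.evaluate(expr, doc, XPathConstants.NODE).asInstanceOf[Node]
  }

  def main(args: Array[String]): Unit = {
    val expr = "//div[@class='bdy4']//b[contains(text(), '国际刊号')]"
    val node = findLabel(sample, expr)
    println(node.getTextContent)
  }
}
```

The only difference from the real case is where the DOM comes from: here it is parsed from a string, whereas with HtmlCleaner it is produced by DomSerializer.createDOM.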
Text that follows the matched node but is not wrapped in an element of its own (a plain text node) can be retrieved with getNextSibling.
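The getNextSibling step can be written a little more defensively, since both the XPath match and the sibling may be null. The following is a sketch under assumptions: SiblingTextDemo, textAfterLabel, and the sample markup are hypothetical names and data chosen for illustration, and the document is parsed with the JDK's own parser instead of HtmlCleaner to keep the example self-contained.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.Node

object SiblingTextDemo {
  // Hypothetical fragment: a <b> label followed by a bare text node.
  val sample: String = "<p><b>国际刊号:</b>1234-5678</p>"

  // Returns the trimmed text node that directly follows the matched <b> label,
  // or None when the label is missing or is not followed by a text node.
  def textAfterLabel(xml: String): Option[String] = {
    val doc = DocumentBuilderFactory.newInstance.newDocumentBuilder
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val label = XPathFactory.newInstance.newXPath
      .evaluate("//b[contains(text(), '国际刊号')]", doc, XPathConstants.NODE)
      .asInstanceOf[Node]
    for {
      b <- Option(label)                     // no match yields null
      next <- Option(b.getNextSibling)       // last child has no sibling
      if next.getNodeType == Node.TEXT_NODE  // skip element siblings
    } yield next.getTextContent.trim
  }
}
```

Wrapping the nullable results in Option avoids the NullPointerException that a direct value.asInstanceOf[Node].getNextSibling call would throw on pages where the label is absent.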
Source:
http://www.imilo.cn/findblog/28
HtmlCleaner reports contains as undefined when scraping a page
Reprinted from zhymin77.iteye.com/blog/1866199