Introduction

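Recently I needed to use Node.js to fetch web pages, parse the HTML documents, and extract data from them.
Because I am used to parsing HTML with XPath when writing crawlers in Python, I searched for a Node.js implementation of XPath parsing.
I found two third-party libraries:

xpath.js and xpath. xpath is a fork of the xpath.js project with further development; its latest commit is more recent than the original project's, and its API is more user-friendly.
Usage examples and documentation for xpath are available on the project home page.
We use goto100/xpath for HTML parsing.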

Dependencies

npm install xpath
npm install xmldom

The author of xpath recommends xmldom as the XML engine.

Examples

Example 1

const xpath = require('xpath')
const dom = require('xmldom').DOMParser

let xml = "<book><title>Harry Potter</title></book>"
let doc = new dom().parseFromString(xml)
let nodes = xpath.select("//title", doc)

console.log(nodes[0].localName + ": " + nodes[0].firstChild.data)
console.log("Node: " + nodes[0].toString())

Example 2

const fs = require('fs');
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const util = require('util');

// Read the file
let content = fs.readFileSync("123.html", {encoding: 'UTF-8'});
// Build the DOM
let doc = new dom().parseFromString(content, 'text/xml');

// Select the first set of elements
let titles = xpath.select("//title", doc);
// Print the number of matched elements
console.log(util.format("length: %s", titles.length));
// Iterate over the matches
titles.forEach(function (title, index) {
    console.log("title: " + title.toString())
});

// Select the second set of elements
let items = xpath.select("//div[@id='mac-data']/table/tr", doc);
// Print the number of matched elements
console.log(util.format("length: %s", items.length));
// Iterate over the matches
items.forEach(function (item, index) {
    // Evaluate an XPath expression relative to the current row
    let field = xpath.select("string(./td[2]/a)", item);
    console.log(field);
});

Problems encountered

The page at www.chinanpo.gov.cn/search/orgcx.html follows the XHTML standard, so xpath parses it according to XHTML rules.
The page contains the following element:

<html xmlns="http://www.w3.org/1999/xhtml">

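The xpath library therefore treats the page as XHTML.
The XHTML standard strictly defines how HTML tags must be closed, and this page contains two offending tags, so parsing fails and no data can be extracted with XPath.
Once xmlns="http://www.w3.org/1999/xhtml" is removed from <html xmlns="http://www.w3.org/1999/xhtml">, the page parses normally. One likely factor is that in XPath 1.0 an unprefixed name such as title only matches elements that are in no namespace, so with a default xmlns declaration on the root element an expression like //title selects nothing.
So we write a regular expression that filters the xmlns attribute out of the html tag: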

content = content.replace(/<html\s.*?>/g, "<html>");

After that, xpath parses the page normally.
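
Note that the regular expression above replaces the whole opening tag, so any other attributes on <html> (lang, for example) are discarded too, and because . does not match a newline it misses a tag that spans several lines. Below is a minimal sketch of a narrower variant, assuming only the default xmlns declaration needs to go. stripDefaultNamespace is a hypothetical helper name introduced here for illustration; it is not part of xpath or xmldom.

```javascript
// Hypothetical helper, not part of the xpath or xmldom packages.
// Removes only the default xmlns declaration from opening <html> tags
// and leaves every other attribute in place.
function stripDefaultNamespace(markup) {
    // Match each opening <html ...> tag, including tags spanning several lines.
    return markup.replace(/<html\b[^>]*>/gi, function (tag) {
        // Drop xmlns="..." or xmlns='...'; prefixed declarations such as
        // xmlns:foo="..." are not matched and therefore kept.
        return tag.replace(/\s+xmlns\s*=\s*("[^"]*"|'[^']*')/g, "");
    });
}
```

It would be called in place of the replace call above, for example content = stripDefaultNamespace(content); before passing content to parseFromString.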
