A Simple Web Crawler

superagent: sends HTTP or HTTPS requests from the server side (Node.js).
cheerio: parses a page and returns a function that works like a jQuery selector.

Usage:

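Install the two packages
- npm i superagent
- npm i cheerio
Import them
- const superagent = require('superagent')
- const cheerio = require('cheerio')
Start using them
- Use superagent to request the page you want to crawl
  - The callback passed to end() runs once the request has finished
- Parse the response with cheerio
  - Call cheerio.load(the content to parse)
  - Return value: a function that behaves like jQuery's $ function
- Extract the content you need
  - Prepare an array in advance
  - Push each extracted item into the array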

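const superagent = require('superagent')
const cheerio = require('cheerio')
const fs = require('fs')

// Collects one object per product found on the page
const goodsList = []

superagent
  .get('https://list.jd.com/list.html?cat=670%2C671%2C672&go=0')
  .end((err, res) => {
    if (err) return console.log('Failed to fetch the page:', err.message)

    // res.text is the full HTML of the page
    parseData(res.text)
  })

function parseData(page) {
  // $ is used like jQuery's $ function, but on the downloaded HTML
  const $ = cheerio.load(page)

  // Visit every product item and pull out the fields we need
  $('.gl-warp > .gl-item').each((index, item) => {
    const obj = {
      goods_img: $(item).find('img').prop('src'),
      goods_price: $(item).find('.p-price i').text(),
      goods_title: $(item).find('.p-name i').text(),
      goods_name: $(item).find('.p-name em').text()
    }

    goodsList.push(obj)
  })

  console.log(goodsList)

  // Save the result as JSON and report a failed write instead of ignoring it
  fs.writeFile('./goods_list.json', JSON.stringify(goodsList), err => {
    if (err) return console.log('Failed to write the file:', err.message)
    console.log('Write complete')
  })
}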
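
The final step, saving the array as JSON, can be checked on its own with a small round trip. This is a minimal sketch using only Node's built-in modules; saveList and loadList are hypothetical helper names, the sample data is invented, and the synchronous fs calls are chosen only to keep the example short.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: write the list as indented JSON so the file is easy to read
function saveList(filePath, list) {
  fs.writeFileSync(filePath, JSON.stringify(list, null, 2), 'utf8')
}

// Hypothetical helper: read the file back and turn it into an array again
function loadList(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'))
}

// Made-up sample data with the same field names as the crawler's objects
const sample = [
  { goods_name: 'Sample book A', goods_price: '99.00' },
  { goods_name: 'Sample book B', goods_price: '45.50' }
]

// Write to the system temp directory so the project folder stays untouched
const target = path.join(os.tmpdir(), 'goods_list_demo.json')
saveList(target, sample)

const restored = loadList(target)
console.log(restored.length, restored[0].goods_name)
```

In the crawler itself the asynchronous fs.writeFile is a reasonable choice, since it does not block while the file is being written.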
