Python 3 Web Scraping, Part 1

Part 1:

1. Reliable network connections:

Libraries used:

Python standard library: urllib

Third-party Python library: BeautifulSoup

Install: pip3 install beautifulsoup4

Import: import bs4
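The package is installed under the name beautifulsoup4 but imported as bs4, which is easy to mix up. A minimal sketch for checking that the import name resolves before running a scraper is given here; the helper name is_installed is made up for illustration and is not part of either library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    # find_spec returns None when no such top-level module is importable.
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name; "beautifulsoup4" is only the name used with pip.
print("bs4 importable:", is_installed("bs4"))
```

If this prints False, running pip3 install beautifulsoup4 with the same interpreter that runs the script is the usual fix.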

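$ cat scrapetest2.py
#!/usr/local/bin/python3

from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

def getTitle(url):
    # Return the first <h1> inside <body>, or None if anything goes wrong.
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server answered with an error status (e.g. 404 or 500).
        return None
    try:
        # No parser is named here, so bs4 emits the UserWarning shown in the
        # output below; passing "html.parser" as a second argument silences it.
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None when the page has no <body> tag.
        return None
    return title


x = 'http://pythonscraping.com/pages/page1.html'
title = getTitle(x)

if title is None:
    print('Title could not be found.')
else:
    print(title)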


####### Output #######
python3 scrapetest2.py
/usr/local/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 21 of the file scrapetest2.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
<h1>An Interesting Title</h1>
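The warning above goes away once the parser is named explicitly. The following is a sketch of one way to do that, not the original script: it assumes it is acceptable to split fetching from parsing, and the names title_from_markup and get_title are hypothetical. It also catches URLError, the base class of HTTPError, so an unreachable host is handled the same way as an HTTP error status.

```python
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup


def title_from_markup(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    # Naming the parser explicitly avoids the "No parser was explicitly
    # specified" UserWarning.
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body is None when the markup has no <body> tag.
        return None


def get_title(url):
    """Fetch url and return its <h1> tag, or None on a network or HTTP error."""
    try:
        with urlopen(url) as response:
            markup = response.read()
    except URLError:
        # URLError is the base class of HTTPError, so this covers both
        # unreachable hosts and HTTP error statuses.
        return None
    return title_from_markup(markup)


# Parsing can be tried without a network connection:
sample = "<html><body><h1>An Interesting Title</h1></body></html>"
print(title_from_markup(sample))
```

Keeping the parsing step in its own function means it can be exercised on a literal string, while get_title('http://pythonscraping.com/pages/page1.html') covers the network path.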


2. Advanced HTML parsing
