Python 3 Web Scraping (1)
Part 1:
1. Reliable network connections:
Libraries used:
Python standard library: urllib
Third-party Python library: BeautifulSoup
Install: pip3 install beautifulsoup4
Import: import bs4
cat scrapetest2.py
#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

def getTitle(url):
    # Return None if the server responds with an HTTP error (404, 500, ...).
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # Return None if the page has no <body> tag to look inside.
    try:
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

x = 'http://pythonscraping.com/pages/page1.html'
title = getTitle(x)
if title is None:
    print('Title could not be found.')
else:
    print(title)

####### Output #######
python3 scrapetest2.py
/usr/local/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 21 of the file scrapetest2.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
<h1>An Interesting Title</h1>
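The warning in the output above goes away once the parser is named explicitly, and the script can also be made a little more robust against servers that cannot be reached at all. What follows is only a minimal sketch, not the original script: the function names parse_title and get_title and the 10-second timeout are assumptions chosen for illustration. It relies on urllib.error.URLError, which urlopen raises when the connection itself fails.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def parse_title(markup):
    """Return the first <h1> tag inside <body>, or None if there is none."""
    # Naming the parser explicitly avoids the "No parser was explicitly
    # specified" UserWarning.
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body is None when the document has no <body> tag.
        return None


def get_title(url, timeout=10):
    """Fetch url and return its <h1> tag.

    Returns None if the request fails with an HTTPError or URLError.
    The timeout value is an arbitrary choice for this sketch.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            markup = response.read()
    except (HTTPError, URLError):
        # HTTPError: the server answered with an error status.
        # URLError: the server could not be reached at all.
        return None
    return parse_title(markup)


if __name__ == "__main__":
    title = get_title("http://pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found.")
    else:
        print(title)
```

Keeping the parsing step in its own function means it can be tried on a literal HTML string without any network access.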
2. Advanced HTML parsing