Learning the urllib.parse package

1. Preface

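I came across this package while crawling an entire website. Its main job is to break URLs into their components, which makes it very handy for any kind of URL processing.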

2. Overview

This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”

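In other words, the urllib.parse module provides a standard interface for three things: splitting a URL into its individual components, reassembling components into a URL string, and resolving a relative URL against a base URL to obtain an absolute URL.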
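The third capability, resolving a relative URL against a base URL, is handled by urljoin(). A minimal sketch, reusing the www.cwi.nl address from the example in this post; the relative paths "FAQ.html" and "/index.html" are chosen only for illustration:

```python
from urllib.parse import urljoin

# Base URL: the page on which the relative link was found.
base = "http://www.cwi.nl/%7Eguido/Python.html"

# A relative path replaces the last segment of the base path.
sibling = urljoin(base, "FAQ.html")

# A path starting with "/" is resolved from the site root.
rooted = urljoin(base, "/index.html")

print(sibling)  # http://www.cwi.nl/%7Eguido/FAQ.html
print(rooted)   # http://www.cwi.nl/index.html
```

When crawling a whole site, this is typically the step that turns the href values found on a page into absolute URLs that can be fetched.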

3. Functions

3.1. URL Parsing

The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
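The "combining" direction can be sketched with urlunparse(), which takes the six components and returns a URL string. In the example below, the host example.com and the component values in the tuple are made up for illustration:

```python
from urllib.parse import urlparse, urlunparse

# Split a URL, then put it back together unchanged.
parts = urlparse("http://www.cwi.nl:80/%7Eguido/Python.html")
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so _replace() returns a modified copy.
secure = parts._replace(scheme="https", netloc="www.cwi.nl")

# urlunparse() also accepts a plain 6-tuple:
# (scheme, netloc, path, params, query, fragment)
built = urlunparse(("https", "example.com", "/a/b", "", "q=1", "top"))

print(rebuilt)          # http://www.cwi.nl:80/%7Eguido/Python.html
print(secure.geturl())  # https://www.cwi.nl/%7Eguido/Python.html
print(built)            # https://example.com/a/b?q=1#top
```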

3.1.1. urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlparse() splits a URL into six components. For example:

>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

The six components are:

Attribute | Index | Value                                        | Value if not present
----------+-------+----------------------------------------------+---------------------
scheme    | 0     | URL scheme specifier (e.g. http or https)    | scheme parameter
netloc    | 1     | Network location part (host, optionally with port) | empty string
path      | 2     | Hierarchical path                            | empty string
params    | 3     | Parameters for last path element             | empty string
query     | 4     | Query component                              | empty string
fragment  | 5     | Fragment identifier                          | empty string
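Because the result is a named tuple, each component can be read by attribute name or by the index listed above. The following sketch illustrates this; the second URL, with the host example.com and its params, query, and fragment values, is invented so that all six fields are populated:

```python
from urllib.parse import urlparse

result = urlparse("http://www.cwi.nl:80/%7Eguido/Python.html")

# Index access and attribute access return the same value.
by_index = result[0]
by_name = result.scheme

# Components that are absent from the URL come back as empty strings.
missing = (result.params, result.query, result.fragment)

# A URL that uses all six components (hypothetical address).
full = urlparse("http://example.com/path;type=a?x=1#frag")

print(by_index, by_name)  # http http
print(missing)            # ('', '', '')
print(full.path, full.params, full.query, full.fragment)
# /path type=a x=1 frag
```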

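Parameters
urlstring: the URL string to parse.
scheme: the default scheme (for example http or https), used only when the URL itself does not specify one; defaults to an empty string.
allow_fragments: True by default. If set to False, the '#' part is no longer split off as a fragment; it stays in whichever component precedes it.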

If the allow_fragments argument is false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.
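The effect of the scheme and allow_fragments arguments can be checked with a short sketch. The URLs below are variations on the www.cwi.nl example, and the file name doc.html, the fragment "intro", and the host example.com are assumptions made for illustration:

```python
from urllib.parse import urlparse

# scheme is only a fallback: it applies when the URL has no scheme of its own.
no_scheme = urlparse("//www.cwi.nl/%7Eguido/Python.html", scheme="https")
has_scheme = urlparse("http://www.cwi.nl/%7Eguido/Python.html", scheme="https")

# Default behaviour: the text after '#' becomes the fragment.
default = urlparse("http://www.cwi.nl/doc.html#intro")

# allow_fragments=False: '#intro' stays in the path, fragment is empty.
kept_in_path = urlparse("http://www.cwi.nl/doc.html#intro", allow_fragments=False)

# If a query precedes the '#', the fragment text stays in the query instead.
kept_in_query = urlparse("http://example.com/p?x=1#frag", allow_fragments=False)

print(no_scheme.scheme, has_scheme.scheme)            # https http
print(default.path, default.fragment)                 # /doc.html intro
print(kept_in_path.path, repr(kept_in_path.fragment)) # /doc.html#intro ''
print(kept_in_query.query)                            # x=1#frag
```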

For more on urllib.parse, see the official Python documentation.

Reposted from blog.csdn.net/qq_38251616/article/details/80846555