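I
1. Set type: definition and operations
A set is written with {} and its elements are separated by commas; elements are unordered and unique. An empty set must be created with set(), because {} creates an empty dictionary.
Set operators:
S | T: union
S - T: difference
S & T: intersection
S ^ T: symmetric difference (elements that belong to exactly one of S and T)
S <= T, S < T: test whether S is a subset (proper subset) of T
S >= T, S > T: test whether S contains T (superset, proper superset)
S |= T: update S to the union of S and T
S -= T: update S by removing the elements of T
S &= T: update S to the intersection of S and T
S ^= T: update S to the symmetric difference of S and T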
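The operators above can be tried on two small sets. This is a minimal sketch; the variable names A, B and C are made up for illustration.

```python
A = {"p", "y", 123}
B = set("pypy123")          # the characters 'p', 'y', '1', '2', '3'

union = A | B               # everything in A or B
difference = A - B          # in A but not in B
intersection = A & B        # in both
sym_diff = A ^ B            # in exactly one of the two

is_subset = {"p"} <= A      # subset test
is_superset = A >= {"p", "y"}

C = {1, 2, 3}
C |= {3, 4}                 # augmented operators update C in place
C -= {1}
print(union, difference, intersection, sym_diff, C)
```

Because sets are unordered, the printed order of elements may differ from run to run.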
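Set methods and functions:
S.add(x) adds x to S if x is not already in S
S.discard(x) removes x from S; no error if x is not in S
S.remove(x) removes x from S; raises KeyError if x is not in S
S.clear() removes all elements of S
S.pop() removes and returns an arbitrary element of S; raises KeyError if S is empty
S.copy() returns a copy of S
len(S) returns the number of elements in S
x in S is True if x is an element of S
x not in S is True if x is not an element of S
set(x) converts a value x of another type to a set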
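A short sketch of these methods in use, mainly to show the difference between discard and remove and how pop can drain a set. The names S, T and seen are illustrative only.

```python
S = {"a", "b"}
S.add("c")
S.add("a")            # already present, so S does not change
S.discard("zzz")      # absent element: silently ignored

try:
    S.remove("zzz")   # absent element: raises KeyError
    removed_missing = True
except KeyError:
    removed_missing = False

T = S.copy()          # work on a copy so S stays intact
seen = []
while T:              # an empty set is falsy
    seen.append(T.pop())

print(sorted(seen), len(S), "a" in S)
```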
Typical uses of sets:
Comparing containment relationships
Removing duplicate elements
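Both uses can be sketched in a few lines. Note that converting to a set loses the original order; the dict.fromkeys variant is an alternative (not from the notes above) that keeps first-occurrence order on Python 3.7+.

```python
ls = ["p", "p", "y", "y", 123]

unique = list(set(ls))                     # duplicates removed, order not guaranteed
ordered_unique = list(dict.fromkeys(ls))   # duplicates removed, first-seen order kept

required = {"p", "y"}
has_all = required <= set(ls)              # containment check via subset test

print(unique, ordered_unique, has_all)
```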
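2. Sequence types and their operations
Definition: a one-dimensional vector of elements; the elements may have different types.
Strings are one example.
Sequence operators: x in s, x not in s, s + t, s * n, s[i], s[i:j:k]
Functions and methods: len(s), min(s), max(s); s.index(x) or s.index(x, i, j) returns the position of the first occurrence of x in s, searching from position i up to (but not including) position j; s.count(x) returns the total number of occurrences of x in s.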
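The same operators work on strings and lists alike. A minimal sketch with made-up sample values:

```python
s = "python"
ls = ["python", 123, ".io"]

reversed_ls = ls[::-1]                        # slicing with step -1 reverses
repeated = s[:2] * 3                          # repetition
joined = ls + [True]                          # concatenation builds a new list

first_o = s.index("o")                        # first position of "o"
o_in_range = "hello world".index("o", 5, 11)  # search only positions 5..10
count_l = "hello world".count("l")

print(reversed_ls, repeated, joined, first_o, o_in_range, count_l)
print(len(s), min(s), max(s), "py" in s)
```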
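Tuple type: cannot be changed once created; created with () or tuple().
List type: can be modified freely; created with [] or list().
ls[i] = x replaces element i of ls with x
ls[i:j:k] = lt replaces the selected slice of ls with the elements of lt
del ls[i], del ls[i:j:k] deletes an element or a slice
ls += lt appends the elements of lt to ls
ls *= n repeats the elements of ls n times, in place
ls.append(x) adds one element x to the end of ls
ls.clear() removes all elements of ls
ls.copy() returns a new list with the same elements
ls.insert(i, x) inserts x at position i
ls.pop(i) removes the element at position i and returns it
ls.remove(x) removes the first occurrence of x
ls.reverse() reverses the elements of ls in place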
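The list operations above, applied one after another to a sample list. The comments track the list contents after each step; the sample values are made up.

```python
ls = ["cat", "dog", "tiger", 1024]

ls[1:2] = [1, 2, 3, 4]   # ['cat', 1, 2, 3, 4, 'tiger', 1024]
del ls[::3]              # drops indices 0, 3, 6 -> [1, 2, 4, 'tiger']
ls.append("end")         # [1, 2, 4, 'tiger', 'end']
ls.insert(0, "start")    # ['start', 1, 2, 4, 'tiger', 'end']
popped = ls.pop(1)       # returns 1 -> ['start', 2, 4, 'tiger', 'end']
ls.remove("tiger")       # ['start', 2, 4, 'end']
ls.reverse()             # ['end', 4, 2, 'start']

t = tuple(ls)            # a tuple with the same elements
try:
    t[0] = "changed"     # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(ls, popped, t, tuple_changed)
```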
3. Typical uses of sequence types:
Iterating over the items
4. Sorting with sorted
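A small sketch of both points. The detail worth remembering is that sorted returns a new list and leaves its argument unchanged, so the result has to be assigned to a name.

```python
nums = [3, 1, 2]

ascending = sorted(nums)                  # new list; nums is not modified
descending = sorted(nums, reverse=True)
by_length = sorted(["ccc", "a", "bb"], key=len)

total = 0
for item in nums:                         # iterate over the items directly
    total += item

print(nums, ascending, descending, by_length, total)
```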
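II
Worked example: basic statistics
# CalStatisticsV1.py
def getNum():
    # Read numbers from the user until an empty line is entered
    nums = []
    iNumStr = input("Enter a number (press Enter to finish): ")
    while iNumStr != "":
        nums.append(float(iNumStr))   # float() instead of eval(): never evaluate raw user input
        iNumStr = input("Enter a number (press Enter to finish): ")
    return nums

def mean(numbers):
    # Arithmetic mean
    s = 0.0
    for num in numbers:
        s = s + num
    return s / len(numbers)

def dev(numbers, mean):
    # Sample standard deviation (divides by n - 1, so it needs at least two numbers)
    sdev = 0.0
    for num in numbers:
        sdev = sdev + (num - mean) ** 2
    return pow(sdev / (len(numbers) - 1), 0.5)

def median(numbers):
    # sorted() returns a new list, so the result must be assigned
    numbers = sorted(numbers)
    size = len(numbers)
    if size % 2 == 0:
        med = (numbers[size // 2 - 1] + numbers[size // 2]) / 2
    else:
        med = numbers[size // 2]
    return med

n = getNum()
m = mean(n)
print("Mean: {}, standard deviation: {:.2f}, median: {}".format(m, dev(n, m), median(n)))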
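One way to sanity-check functions like these is to compare them with the standard library's statistics module on unsorted data. The sketch below re-implements the three functions compactly (these compact versions and the sample data are assumptions for illustration, not the program above) and computes the differences.

```python
import statistics

def mean(numbers):
    return sum(numbers) / len(numbers)

def dev(numbers, mean_value):
    # Sample standard deviation: divide by n - 1
    squared = sum((x - mean_value) ** 2 for x in numbers)
    return (squared / (len(numbers) - 1)) ** 0.5

def median(numbers):
    ordered = sorted(numbers)          # sort a copy; the input is left alone
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

data = [9, 2, 7, 4]                    # deliberately unsorted
m = mean(data)

diff_mean = abs(m - statistics.mean(data))
diff_dev = abs(dev(data, m) - statistics.stdev(data))
diff_median = abs(median(data) - statistics.median(data))
print(diff_mean, diff_dev, diff_median)
```

Unsorted input matters here: a median function that forgets to keep the sorted result still looks correct on data that happens to be sorted already.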
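III. Dictionary type
A mapping is a correspondence between keys and values.
Dictionaries are created with {} or dict(); each key-value pair is written with a colon (key: value).
>>> d = {"C":"B","M":"H","F":"B"}
>>> d
{'C': 'B', 'M': 'H', 'F': 'B'}
>>> d["C"]
'B'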
del d[k] deletes the entry for key k from dictionary d
k in d tests whether key k is in the dictionary
d.keys() returns all keys of d
d.values() returns all values of d
d.items() returns all key-value pairs of d
>>> d = {"中国":"北京","美国":"华盛顿","法国":"巴黎"}
>>> "中国" in d
True
>>> d.keys()
dict_keys(['中国', '美国', '法国'])
>>> d.values()
dict_values(['北京', '华盛顿', '巴黎'])
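d.get(k, <default>) returns the value for key k if k exists, otherwise returns <default>
d.pop(k, <default>) removes key k and returns its value if k exists, otherwise returns <default>
d.popitem() removes one key-value pair from d and returns it as a tuple (the most recently inserted pair in Python 3.7 and later)
d.clear() deletes all key-value pairs
len(d) returns the number of entries in d
>>> d.get("中国","伊斯兰堡")
'北京'
>>> d.get("日本","伊斯兰堡")
'伊斯兰堡'
>>> d.popitem()
('法国', '巴黎')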
Define an empty dictionary: d = {}
Add two key-value pairs to d: d["a"] = 1; d["b"] = 2
Typical uses of dictionaries:
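The dictionary operations listed in this part can be combined in one short sketch. The keys and values below are sample data chosen for illustration.

```python
d = {}                                   # start from an empty dictionary
d["China"] = "Beijing"
d["USA"] = "Washington"
d["France"] = "Paris"

capital = d.get("China", "unknown")      # key exists: its value is returned
missing = d.get("Japan", "unknown")      # key absent: the default is returned

removed = d.pop("USA", None)             # removes the key and returns its value
absent = d.pop("USA", "not found")       # already gone: the default is returned

last = d.popitem()                       # Python 3.7+: most recently inserted pair

pairs = []
for key, value in d.items():             # iterate over the remaining pairs
    pairs.append((key, value))

print(capital, missing, removed, absent, last, pairs, len(d))
```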
IV. The jieba library
Installation from the Windows command prompt (pip 18.1, 32-bit Python 3.7). The first attempt stopped at 26% of the download with a read timeout; a retry after upgrading pip succeeded. The key lines of the console output:
C:\Users\ASUS>pip install jieba
Collecting jieba
  Downloading https://files.pythonhosted.org/packages/71/46/c6f9179f73b818d5827202ad1c4a94e371a29473b7f043b736b4dab6b8cd/jieba-0.39.zip (7.3MB)
socket.timeout: The read operation timed out
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
C:\Users\ASUS>python -m pip install --upgrade pip
Successfully installed pip-19.0.3
C:\Users\ASUS>pip install jieba
Successfully installed jieba-0.39
Exact mode: splits the text precisely, with no redundant words
Full mode: lists every possible word in the text, so the result contains redundancy
Search-engine mode: starts from exact mode and splits long words again
Functions:
jieba.lcut(s): exact mode; returns the segmentation result as a list
jieba.lcut(s, cut_all=True): full mode; returns the result as a list, with redundancy
jieba.lcut_for_search(s): search-engine mode; returns the result as a list, with redundancy
jieba.add_word(w): adds the new word w to the segmentation dictionary
>>> import jieba
>>> jieba.lcut("中国是一个伟大的国家")
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\ASUS\AppData\Local\Temp\jieba.cache
Loading model cost 1.139 seconds.
Prefix dict has been built succesfully.
['中国', '是', '一个', '伟大', '的', '国家']
>>> jieba.lcut("中国是一个伟大的国家")
['中国', '是', '一个', '伟大', '的', '国家']
>>> jieba.lcut("中国是一个伟大的国家",cut_all=True)
['中国', '国是', '一个', '伟大', '的', '国家']
>>> jieba.lcut_for_search("中华人民共和国是伟大的")
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']
V. Worked example: "text word-frequency statistics"
# CalHamletV1.py
def getText():
    # Read the text, lower-case it, and replace punctuation with spaces
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '~!@#$%^&*()_+{}:"<>?[];,./-=':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1    # count occurrences of each word
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for word, count in items[:10]:                # slicing is safe even with fewer than 10 words
    print("{0:<10}{1:>5}".format(word, count))
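The counting pattern can be tried without hamlet.txt by running it on a string held in memory. The sample sentence is made up, and collections.Counter is shown only as a standard-library alternative to the counts.get idiom, not as part of the program above.

```python
from collections import Counter

text = "To be, or not to be: that is the question."
for ch in ",.:;!?":
    text = text.replace(ch, " ")       # punctuation becomes whitespace
words = text.lower().split()

# Manual counting with dict.get, as in the program above
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# The sort is stable, so ties keep their first-seen order
top = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:2]

# Counter builds the same mapping in one call
counter = Counter(words)

for word, count in top:
    print("{0:<10}{1:>5}".format(word, count))
```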
# CalThreeKingdomsV1.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:       # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))
If you run into problems running this, just leave a comment.
# CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此","商议","如何","军士"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Every branch must assign rword; a misspelled name (for example reord)
    # would leave rword at its previous value and count the wrong word.
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相曰" or word == "丞相" or word == "主公":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)   # unlike del, no KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))

# CalThreeKingdomsV3.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
The two programs do not give the same results: they use different exclude sets, V2 also maps 丞相曰 and 主公 to 曹操, and V2 prints 15 entries where V3 prints 10.
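One way to make the alias handling less error-prone is to keep the aliases in a dictionary instead of a chain of elif branches, so there is a single assignment and no variable name to mistype. This is a sketch under assumptions: the word list stands in for the output of jieba.lcut and is invented for illustration, and the alias table simply mirrors V3.

```python
# Alias table mirroring the elif branches of V3
aliases = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
excludes = {"将军", "却说"}

# Hypothetical pre-segmented sample, standing in for jieba.lcut(txt)
words = ["孔明曰", "诸葛亮", "玄德", "将军", "曹操", "丞相", "的", "云长"]

counts = {}
for word in words:
    if len(word) == 1:                 # skip single-character tokens
        continue
    name = aliases.get(word, word)     # fall back to the word itself
    counts[name] = counts.get(name, 0) + 1

for word in excludes:
    counts.pop(word, None)             # no error if the word never occurred

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for name, count in items:
    print("{0:<10}{1:>5}".format(name, count))
```

Adding a new alias then means adding one entry to the table rather than editing the control flow.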