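pandas is a core library for data analysis. It combines data structures with methods for cleaning and analyzing data. Building on numpy, pandas adds three data types: Series, DataFrame, and Panel (Panel was removed in pandas 0.25).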
import numpy as np
import pandas as pd
A Series is an object similar to a one-dimensional array. It has two parts:
- values: the data (an ndarray)
- index: the labels associated with the data
# Import Series
from pandas import Series
There are two ways to create one:
(1) From a list or a numpy array
The default index is the integers 0 to N-1
nd = np.array([1,2,3,4])
nd
array([1, 2, 3, 4])
s = Series(nd) # no index given, so the default 0..N-1 is used
s
0 1 1 2 2 3 3 4 dtype: int32
s = Series([1,2,3,4,5],index=list("abcde"))
s
a 1 b 2 c 3 d 4 e 5 dtype: int64
s["a"]
1
s = Series([1,2,3,4,5,6],index=["A","A","B","B","A","C"])
s
A 1 A 2 B 3 B 4 A 5 C 6 dtype: int64
s["B"]
B 3 B 4 dtype: int64
(2) From a dictionary
s = Series({"a":1,"b":2,"c":3})
s
a 1 b 2 c 3 dtype: int64
s1=Series({"a":123,"b":431},index=list("ac"))
s1
a 123.0 c NaN dtype: float64
============================================
Exercise 1:
Create the following Series in several different ways and name it s1:
语文 150
数学 150
英语 150
理综 300
============================================
# From an array
nd = np.array([150,150,150,300])
s1 = Series(nd,index=["语文","数学","英语","理综"])
s1
语文 150 数学 150 英语 150 理综 300 dtype: int32
dic = {"语文":150,"数学":150,"英语":150,"理综":300}
s2 = Series(dic)
s2
数学 150 理综 300 英语 150 语文 150 dtype: int64
nd[0] = 1000
nd
array([1000, 150, 150, 300])
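s1 # here the Series built from the numpy array shares the array's memory (only the reference is copied, not the data), so the change to nd shows up in s1; this can depend on the pandas version and the copy argument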
语文 1000 数学 150 英语 150 理综 300 dtype: int32
dic["数学"] = 120
dic
{'数学': 120, '理综': 300, '英语': 150, '语文': 150}
s2 # creating a Series from a dict makes a copy of the data (a deep copy), so changing dic afterwards does not affect s2
数学 150 理综 300 英语 150 语文 150 dtype: int64
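Because sharing memory with the source array can vary with the pandas version and settings, code that relies on either behavior is fragile. A minimal sketch, assuming an independent copy is wanted: pass copy=True to the constructor. The names arr, labels and independent are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

arr = np.array([150, 150, 150, 300])
labels = ["语文", "数学", "英语", "理综"]

# copy=True asks pandas to copy the input data instead of referencing it
independent = pd.Series(arr, index=labels, copy=True)

arr[0] = 1000  # mutate the source array afterwards

print(independent["语文"])  # the Series keeps its original value
```

With an explicit copy the Series no longer follows later changes to the array, whatever the default behavior is.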
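Use square brackets with a single label to get one element (the result is a scalar), or with a list of labels to get several elements (the result is still a Series). Indexing is either explicit or implicit:
(1) Explicit indexing:
- Use the elements of index as the key
- Use .loc[] (recommended)
Note: label slices are closed intervals (both endpoints are included)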
s
a 1 b 2 c 3 dtype: int64
s.values
array([1, 2, 3], dtype=int64)
s.index # the index values are the explicit index
Index(['a', 'b', 'c'], dtype='object')
# Option 1
s["a"]
1
# Option 2 (recommended)
s.loc["a"]
1
s.loc["a","b"] # this form is not allowed; pass a list such as s.loc[["a","b"]] to select several labels
IndexingError: Too many indexers
s.loc[["a","b","a"]] # looking up with a list extracts a sub-Series from s
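(2) Implicit indexing:
- Use integer positions as the key
- Use .iloc[] (recommended)
Note: positional slices are half-open intervals (the end position is excluded)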
s.iloc[0]
s2.iloc[0]
150
s.iloc[0,1]
IndexingError: Too many indexers
s.iloc[[0,1]]
a 1 b 2 dtype: int64
(3) Slicing
# Explicit
s.loc["a":"c"] # closed interval
a 1 b 2 c 3 dtype: int64
# Implicit
s.iloc[0:2] # start included, end excluded
a 1 b 2 dtype: int64
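The difference between the two slice conventions is easy to overlook, so here is a small self-contained sketch. The variable names by_label and by_position are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = s.loc["a":"b"]    # label slice: the end label "b" is included
by_position = s.iloc[0:1]    # position slice: position 1 is excluded

print(list(by_label.index))     # ['a', 'b']
print(list(by_position.index))  # ['a']
```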
s = Series([1,2,3,4,5,6],index=["A","A","B","C","B","C"])
s
A 1 A 2 B 3 C 4 B 5 C 6 dtype: int64
s.loc["A":"C"]
# Label slicing is not recommended when the explicit index contains duplicate labels
KeyError: "Cannot get right slice bound for non-unique label: 'C'"
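The KeyError comes from slicing an index whose duplicate labels are not grouped together. One possible workaround, sketched here as an assumption rather than as part of the original notebook, is to sort the index first so that it becomes monotonic. The names ordered and subset are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=["A", "A", "B", "C", "B", "C"])

# Sorting groups equal labels together, which makes the index monotonic
ordered = s.sort_index()

# With a monotonic index, label slicing works even with duplicate labels
subset = ordered.loc["A":"C"]
print(subset)
```

Positional slicing with .iloc is another option when the original order must be preserved.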
============================================
Exercise 2:
Use several methods to index and slice the Series s1 created in Exercise 1:
Indexing: 数学 150
Slicing: 语文 150 数学 150 英语 150
============================================
s1.iloc[[1]]
s1.loc[["数学"]]
数学 150 dtype: int32
s1.loc["语文":"英语"]
语文 1000 数学 150 英语 150 dtype: int32
s1.iloc[0:3]
语文 1000 数学 150 英语 150 dtype: int32
A Series can be viewed as a fixed-length, ordered dictionary
Attributes such as shape, size, index, and values describe a Series
s.shape
(6,)
s.values.reshape((3,2)) # Series.reshape is deprecated (pandas warns: "Please use .values.reshape(...) instead"); reshaping is rarely applied to a Series because the result is a plain ndarray without the index
array([[1, 2], [3, 4], [5, 6]], dtype=int64)
s.size
6
s.index
Index(['A', 'A', 'B', 'C', 'B', 'C'], dtype='object')
s.index[[0,1]]
Index(['A', 'A'], dtype='object')
s.index[0:2]
Index(['A', 'A'], dtype='object')
s.index = list("abcdef")
s
a 1 b 2 c 3 d 4 e 5 f 6 dtype: int64
s.index[0] = "A" # individual index entries cannot be modified
TypeError: Index does not support mutable operations
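Since an Index is immutable, changing one label means building a new index. A minimal sketch, assuming only one label should change: Series.rename accepts a mapping from old labels to new ones. The name renamed is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list("abc"))

# rename returns a new Series; labels missing from the mapping stay unchanged
renamed = s.rename({"a": "A"})

print(list(renamed.index))  # ['A', 'b', 'c']
print(list(s.index))        # ['a', 'b', 'c'], the original is untouched
```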
Use head() and tail() to take a quick look at a Series (or DataFrame)
# Load the data
data = pd.read_csv("./titanic.txt")
data
| | row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1st | 1 | Allen, Miss Elisabeth Walton | 29.0000 | Southampton | St Louis, MO | B-5 | 24160 L221 | 2 | female |
1 | 2 | 1st | 0 | Allison, Miss Helen Loraine | 2.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
2 | 3 | 1st | 0 | Allison, Mr Hudson Joshua Creighton | 30.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | (135) | male |
3 | 4 | 1st | 0 | Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | 25.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
4 | 5 | 1st | 1 | Allison, Master Hudson Trevor | 0.9167 | Southampton | Montreal, PQ / Chesterville, ON | C22 | NaN | 11 | male |
5 | 6 | 1st | 1 | Anderson, Mr Harry | 47.0000 | Southampton | New York, NY | E-12 | NaN | 3 | male |
6 | 7 | 1st | 1 | Andrews, Miss Kornelia Theodosia | 63.0000 | Southampton | Hudson, NY | D-7 | 13502 L77 | 10 | female |
7 | 8 | 1st | 0 | Andrews, Mr Thomas, jr | 39.0000 | Southampton | Belfast, NI | A-36 | NaN | NaN | male |
8 | 9 | 1st | 1 | Appleton, Mrs Edward Dale (Charlotte Lamson) | 58.0000 | Southampton | Bayside, Queens, NY | C-101 | NaN | 2 | female |
9 | 10 | 1st | 0 | Artagaveytia, Mr Ramon | 71.0000 | Cherbourg | Montevideo, Uruguay | NaN | NaN | (22) | male |
10 | 11 | 1st | 0 | Astor, Colonel John Jacob | 47.0000 | Cherbourg | New York, NY | NaN | 17754 L224 10s 6d | (124) | male |
11 | 12 | 1st | 1 | Astor, Mrs John Jacob (Madeleine Talmadge Force) | 19.0000 | Cherbourg | New York, NY | NaN | 17754 L224 10s 6d | 4 | female |
12 | 13 | 1st | 1 | Aubert, Mrs Leontine Pauline | NaN | Cherbourg | Paris, France | B-35 | 17477 L69 6s | 9 | female |
13 | 14 | 1st | 1 | Barkworth, Mr Algernon H. | NaN | Southampton | Hessle, Yorks | A-23 | NaN | B | male |
14 | 15 | 1st | 0 | Baumann, Mr John D. | NaN | Southampton | New York, NY | NaN | NaN | NaN | male |
15 | 16 | 1st | 1 | Baxter, Mrs James (Helene DeLaudeniere Chaput) | 50.0000 | Cherbourg | Montreal, PQ | B-58/60 | NaN | 6 | female |
16 | 17 | 1st | 0 | Baxter, Mr Quigg Edmond | 24.0000 | Cherbourg | Montreal, PQ | B-58/60 | NaN | NaN | male |
17 | 18 | 1st | 0 | Beattie, Mr Thomson | 36.0000 | Cherbourg | Winnipeg, MN | C-6 | NaN | NaN | male |
18 | 19 | 1st | 1 | Beckwith, Mr Richard Leonard | 37.0000 | Southampton | New York, NY | D-35 | NaN | 5 | male |
19 | 20 | 1st | 1 | Beckwith, Mrs Richard Leonard (Sallie Monypeny) | 47.0000 | Southampton | New York, NY | D-35 | NaN | 5 | female |
20 | 21 | 1st | 1 | Behr, Mr Karl Howell | 26.0000 | Cherbourg | New York, NY | C-148 | NaN | 5 | male |
21 | 22 | 1st | 0 | Birnbaum, Mr Jakob | 25.0000 | Cherbourg | San Francisco, CA | NaN | NaN | (148) | male |
22 | 23 | 1st | 1 | Bishop, Mr Dickinson H. | 25.0000 | Cherbourg | Dowagiac, MI | B-49 | NaN | 7 | male |
23 | 24 | 1st | 1 | Bishop, Mrs Dickinson H. (Helen Walton) | 19.0000 | Cherbourg | Dowagiac, MI | B-49 | NaN | 7 | female |
24 | 25 | 1st | 1 | Bjornstrm-Steffansson, Mr Mauritz Hakan | 28.0000 | Southampton | Stockholm, Sweden / Washington, DC | NaN | D | male | |
25 | 26 | 1st | 0 | Blackwell, Mr Stephen Weart | 45.0000 | Southampton | Trenton, NJ | NaN | NaN | (241) | male |
26 | 27 | 1st | 1 | Blank, Mr Henry | 39.0000 | Cherbourg | Glen Ridge, NJ | A-31 | NaN | 7 | male |
27 | 28 | 1st | 1 | Bonnell, Miss Caroline | 30.0000 | Southampton | Youngstown, OH | C-7 | NaN | 8 | female |
28 | 29 | 1st | 1 | Bonnell, Miss Elizabeth | 58.0000 | Southampton | Birkdale, England Cleveland, Ohio | C-103 | NaN | 8 | female |
29 | 30 | 1st | 0 | Borebank, Mr John James | NaN | Southampton | London / Winnipeg, MB | D-21/2 | NaN | NaN | male |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1283 | 1284 | 3rd | 0 | Vestrom, Miss Hulda Amanda Adolfina | NaN | NaN | NaN | NaN | NaN | NaN | female |
1284 | 1285 | 3rd | 0 | Vonk, Mr Jenko | NaN | NaN | NaN | NaN | NaN | NaN | male |
1285 | 1286 | 3rd | 0 | Ware, Mr Frederick | NaN | NaN | NaN | NaN | NaN | NaN | male |
1286 | 1287 | 3rd | 0 | Warren, Mr Charles William | NaN | NaN | NaN | NaN | NaN | NaN | male |
1287 | 1288 | 3rd | 0 | Wazli, Mr Yousif | NaN | NaN | NaN | NaN | NaN | NaN | male |
1288 | 1289 | 3rd | 0 | Webber, Mr James | NaN | NaN | NaN | NaN | NaN | NaN | male |
1289 | 1290 | 3rd | 1 | Wennerstrom, Mr August Edvard | NaN | NaN | NaN | NaN | NaN | NaN | male |
1290 | 1291 | 3rd | 0 | Wenzel, Mr Linhart | NaN | NaN | NaN | NaN | NaN | NaN | male |
1291 | 1292 | 3rd | 0 | Widegren, Mr Charles Peter | NaN | NaN | NaN | NaN | NaN | NaN | male |
1292 | 1293 | 3rd | 0 | Wiklund, Mr Jacob Alfred | NaN | NaN | NaN | NaN | NaN | NaN | male |
1293 | 1294 | 3rd | 1 | Wilkes, Mrs Ellen | NaN | NaN | NaN | NaN | NaN | NaN | female |
1294 | 1295 | 3rd | 0 | Willer, Mr Aaron | NaN | NaN | NaN | NaN | NaN | NaN | male |
1295 | 1296 | 3rd | 0 | Willey, Mr Edward | NaN | NaN | NaN | NaN | NaN | NaN | male |
1296 | 1297 | 3rd | 0 | Williams, Mr Howard Hugh | NaN | NaN | NaN | NaN | NaN | NaN | male |
1297 | 1298 | 3rd | 0 | Williams, Mr Leslie | NaN | NaN | NaN | NaN | NaN | NaN | male |
1298 | 1299 | 3rd | 0 | Windelov, Mr Einar | NaN | NaN | NaN | NaN | NaN | NaN | male |
1299 | 1300 | 3rd | 0 | Wirz, Mr Albert | NaN | NaN | NaN | NaN | NaN | NaN | male |
1300 | 1301 | 3rd | 0 | Wiseman, Mr Phillippe | NaN | NaN | NaN | NaN | NaN | NaN | male |
1301 | 1302 | 3rd | 0 | Wittevrongel, Mr Camiel | NaN | NaN | NaN | NaN | NaN | NaN | male |
1302 | 1303 | 3rd | 1 | Yalsevac, Mr Ivan | NaN | NaN | NaN | NaN | NaN | NaN | male |
1303 | 1304 | 3rd | 0 | Yasbeck, Mr Antoni | NaN | NaN | NaN | NaN | NaN | NaN | male |
1304 | 1305 | 3rd | 1 | Yasbeck, Mrs Antoni | NaN | NaN | NaN | NaN | NaN | NaN | female |
1305 | 1306 | 3rd | 0 | Youssef, Mr Gerios | NaN | NaN | NaN | NaN | NaN | NaN | male |
1306 | 1307 | 3rd | 0 | Zabour, Miss Hileni | NaN | NaN | NaN | NaN | NaN | NaN | female |
1307 | 1308 | 3rd | 0 | Zabour, Miss Tamini | NaN | NaN | NaN | NaN | NaN | NaN | female |
1308 | 1309 | 3rd | 0 | Zakarian, Mr Artun | NaN | NaN | NaN | NaN | NaN | NaN | male |
1309 | 1310 | 3rd | 0 | Zakarian, Mr Maprieder | NaN | NaN | NaN | NaN | NaN | NaN | male |
1310 | 1311 | 3rd | 0 | Zenn, Mr Philip | NaN | NaN | NaN | NaN | NaN | NaN | male |
1311 | 1312 | 3rd | 0 | Zievens, Rene | NaN | NaN | NaN | NaN | NaN | NaN | female |
1312 | 1313 | 3rd | 0 | Zimmerman, Leo | NaN | NaN | NaN | NaN | NaN | NaN | male |
1313 rows × 11 columns
data.head(3)
| | row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1st | 1 | Allen, Miss Elisabeth Walton | 29.0 | Southampton | St Louis, MO | B-5 | 24160 L221 | 2 | female |
1 | 2 | 1st | 0 | Allison, Miss Helen Loraine | 2.0 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
2 | 3 | 1st | 0 | Allison, Mr Hudson Joshua Creighton | 30.0 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | (135) | male |
data.tail(10)
| | row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex |
---|---|---|---|---|---|---|---|---|---|---|---|
1303 | 1304 | 3rd | 0 | Yasbeck, Mr Antoni | NaN | NaN | NaN | NaN | NaN | NaN | male |
1304 | 1305 | 3rd | 1 | Yasbeck, Mrs Antoni | NaN | NaN | NaN | NaN | NaN | NaN | female |
1305 | 1306 | 3rd | 0 | Youssef, Mr Gerios | NaN | NaN | NaN | NaN | NaN | NaN | male |
1306 | 1307 | 3rd | 0 | Zabour, Miss Hileni | NaN | NaN | NaN | NaN | NaN | NaN | female |
1307 | 1308 | 3rd | 0 | Zabour, Miss Tamini | NaN | NaN | NaN | NaN | NaN | NaN | female |
1308 | 1309 | 3rd | 0 | Zakarian, Mr Artun | NaN | NaN | NaN | NaN | NaN | NaN | male |
1309 | 1310 | 3rd | 0 | Zakarian, Mr Maprieder | NaN | NaN | NaN | NaN | NaN | NaN | male |
1310 | 1311 | 3rd | 0 | Zenn, Mr Philip | NaN | NaN | NaN | NaN | NaN | NaN | male |
1311 | 1312 | 3rd | 0 | Zievens, Rene | NaN | NaN | NaN | NaN | NaN | NaN | female |
1312 | 1313 | 3rd | 0 | Zimmerman, Leo | NaN | NaN | NaN | NaN | NaN | NaN | male |
A DataFrame is made up of Series: each column is a Series
data["age"]
0 29.0000 1 2.0000 2 30.0000 3 25.0000 4 0.9167 5 47.0000 6 63.0000 7 39.0000 8 58.0000 9 71.0000 10 47.0000 11 19.0000 12 NaN 13 NaN 14 NaN 15 50.0000 16 24.0000 17 36.0000 18 37.0000 19 47.0000 20 26.0000 21 25.0000 22 25.0000 23 19.0000 24 28.0000 25 45.0000 26 39.0000 27 30.0000 28 58.0000 29 NaN ... 1283 NaN 1284 NaN 1285 NaN 1286 NaN 1287 NaN 1288 NaN 1289 NaN 1290 NaN 1291 NaN 1292 NaN 1293 NaN 1294 NaN 1295 NaN 1296 NaN 1297 NaN 1298 NaN 1299 NaN 1300 NaN 1301 NaN 1302 NaN 1303 NaN 1304 NaN 1305 NaN 1306 NaN 1307 NaN 1308 NaN 1309 NaN 1310 NaN 1311 NaN 1312 NaN Name: age, Length: 1313, dtype: float64
When an index label has no corresponding value, the data is missing and is shown as NaN (not a number)
s = Series({"a":123,"b":345,"c":685},index=list("abcdef"))
s
a 123.0 b 345.0 c 685.0 d NaN e NaN f NaN dtype: float64
s.index = list("abcdnm")
s
a 123.0 b 345.0 c 685.0 d NaN n NaN m NaN dtype: float64
Missing data can be detected with pd.isnull() and pd.notnull(), or with the Series methods isnull() and notnull()
pd.isnull(s)
a False b False c False d True n True m True dtype: bool
s.isnull()
a False b False c False d True n True m True dtype: bool
ind = s.isnull()
ind
a False b False c False d True n True m True dtype: bool
s[ind] # entries whose mask value is True are returned
# here: all the missing values
d NaN n NaN m NaN dtype: float64
s.notnull()
a True b True c True d False n False m False dtype: bool
s[s.notnull()]
a 123.0 b 345.0 c 685.0 dtype: float64
s1 = Series({"a":True,"b":False,"c":True,"d":False,"n":True,"m":True})
s1
a True b False c True d False m True n True dtype: bool
s[s1]
# A boolean Series with the same index labels as another Series can be used to index that Series;
# the result contains the elements whose labels map to True
a 123.0 c 685.0 n NaN m NaN dtype: float64
s
a 123.0 b 345.0 c 685.0 d NaN n NaN m NaN dtype: float64
s[s.isnull()] = 1000
s
a 123.0 b 345.0 c 685.0 d 1000.0 n 1000.0 m 1000.0 dtype: float64
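The mask assignment above modifies s in place. As a hedged alternative sketch, fillna and dropna return new Series and leave the original untouched. The names filled and present are illustrative.

```python
import pandas as pd

# "c" and "d" have no value in the dict, so they become NaN
s = pd.Series({"a": 123, "b": 345}, index=list("abcd"))

filled = s.fillna(1000)   # new Series with NaN replaced by 1000
present = s.dropna()      # new Series without the missing entries

print(filled.tolist())      # [123.0, 345.0, 1000.0, 1000.0]
print(list(present.index))  # ['a', 'b']
```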
Every Series has a name attribute
s.name = "python"
s # the name attribute becomes the column header when the Series is part of a DataFrame
a 123.0 b 345.0 c 685.0 d 1000.0 n 1000.0 m 1000.0 Name: python, dtype: float64
[Extension] Filter conditions
s = Series({"a":1,"b":2,"c":3})
s > 2
a False b False c True dtype: bool
nd = np.random.randint(0,20,size=10)
nd
array([12, 15, 13, 6, 5, 3, 1, 3, 2, 6])
a = nd > 10
a
array([ True, True, True, False, False, False, False, False, False, False], dtype=bool)
nd[a]
array([12, 15, 13])
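Boolean masks can also be combined. The sketch below assumes the element-wise operators & and | rather than Python's and / or, which do not operate element-wise on arrays or Series. The names mid_range and extremes are illustrative.

```python
import pandas as pd

s = pd.Series([12, 15, 13, 6, 5, 3], index=list("abcdef"))

# & and | combine masks element by element.
# Parentheses are required because & binds tighter than the comparisons.
mid_range = s[(s > 5) & (s < 15)]
extremes = s[(s <= 5) | (s >= 15)]

print(mid_range.tolist())  # [12, 13, 6]
print(extremes.tolist())   # [15, 5, 3]
```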
19. Convert a one-dimensional array into a matrix of binary representations. For example,
[1,2,3]
becomes
[[0,0,1],
[0,1,0],
[0,1,1]]
1 and 0
0
1 or 0
1
1 & 2
0
# 57 = 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
# 111001 & 2^3
#2 = 010
# 3 = 011
# 011
# 010
1 & 2**0
1 & 2**1
1 & 2**2
3 & 2**0
3 & 2**1
3 & 2**2
0
I = np.array([0,1,2,3,15,17,32,64,128])
I
array([ 0, 1, 2, 3, 15, 17, 32, 64, 128])
A = I.reshape((-1,1))
A
array([[ 0], [ 1], [ 2], [ 3], [ 15], [ 17], [ 32], [ 64], [128]])
B = 2**np.arange(8)
B
array([ 1, 2, 4, 8, 16, 32, 64, 128], dtype=int32)
M = A & B
M
array([[ 0, 0, 0, 0, 0, 0, 0, 0], [ 1, 0, 0, 0, 0, 0, 0, 0], [ 0, 2, 0, 0, 0, 0, 0, 0], [ 1, 2, 0, 0, 0, 0, 0, 0], [ 1, 2, 4, 8, 0, 0, 0, 0], [ 1, 0, 0, 0, 16, 0, 0, 0], [ 0, 0, 0, 0, 0, 32, 0, 0], [ 0, 0, 0, 0, 0, 0, 64, 0], [ 0, 0, 0, 0, 0, 0, 0, 128]], dtype=int32)
Broadcasting: the column vector A = [[0], [1], [2], [3], [15], [17], [32], [64], [128]] is combined with the row vector B = [1, 2, 4, 8, 16, 32, 64, 128] element by element, so row i of M is A[i] & B.
M != 0
array([[False, False, False, False, False, False, False, False], [ True, False, False, False, False, False, False, False], [False, True, False, False, False, False, False, False], [ True, True, False, False, False, False, False, False], [ True, True, True, True, False, False, False, False], [ True, False, False, False, True, False, False, False], [False, False, False, False, False, True, False, False], [False, False, False, False, False, False, True, False], [False, False, False, False, False, False, False, True]], dtype=bool)
M[M!=0] = 1
M
array([[0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1]], dtype=int32)
M[:,::-1]
array([[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
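The same result can be reached in fewer steps. This is an alternative sketch, not the notebook's method: shift each value right by every bit position and keep the lowest bit. The helper name to_bit_matrix and its width parameter are hypothetical.

```python
import numpy as np

def to_bit_matrix(values, width=8):
    """Return a (len(values), width) matrix of 0/1, most significant bit first."""
    values = np.asarray(values).reshape(-1, 1)   # column vector
    shifts = np.arange(width - 1, -1, -1)        # e.g. [2, 1, 0] for width=3
    return (values >> shifts) & 1                # broadcast, then keep the lowest bit

bits = to_bit_matrix([1, 2, 3], width=3)
print(bits)
```

For the input [1, 2, 3] with width=3 this gives the matrix from the problem statement.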
(1) Array operations that work on numpy arrays also work on Series
s
a 1 b 2 c 3 dtype: int64
s + 3
a 4 b 5 c 6 dtype: int64
s * 3
a 3 b 6 c 9 dtype: int64
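(2) Operations between Series
- Data with different indexes is aligned automatically during the operation
- Where a label has no counterpart, the result is NaN
- Note: to keep a value for every index label, use .add() with the fill_value argument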
s1 = Series(np.random.randint(0,150,size=4),index=list("ABCD"),name="数学")
s1
A 121 B 97 C 76 D 27 Name: 数学, dtype: int32
s2 = Series(np.random.randint(0,150,size=4),index=list("ABmn"),name="python")
s2
A 64 B 19 m 17 n 137 Name: python, dtype: int32
np.nan + 3
nan
s1 + s2
A 185.0 B 116.0 C NaN D NaN m NaN n NaN dtype: float64
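Alignment illustration (with illustrative values): each operand is first extended to the union of both indexes, with nan where a label is missing, and the rows are then added label by label:
A 142 + A 132
B 0 + B 107
C 6 + C nan
D 67 + D nan
m nan + m 50
n nan + n 5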
s1 * s2
A 7744.0 B 1843.0 C NaN D NaN m NaN n NaN dtype: float64
s1 / s2
A 1.890625 B 5.105263 C NaN D NaN m NaN n NaN dtype: float64
s1.add(s2)
A 185.0 B 116.0 C NaN D NaN m NaN n NaN dtype: float64
s1.subtract(s2)
A 57.0 B 78.0 C NaN D NaN m NaN n NaN dtype: float64
s1.pow(2)
A 14641 B 9409 C 5776 D 729 Name: 数学, dtype: int32
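Calling s1.add(s2) without extra arguments behaves like s1 + s2, as the output above shows. A minimal sketch of keeping every label, assuming missing entries should count as 0: pass fill_value. The values below reuse the ones displayed earlier, and the name total is illustrative.

```python
import pandas as pd

s1 = pd.Series([121, 97, 76, 27], index=list("ABCD"))
s2 = pd.Series([64, 19, 17, 137], index=list("ABmn"))

# fill_value=0 treats a label missing from one side as 0 instead of NaN
total = s1.add(s2, fill_value=0)
print(total)
```

The choice of fill value depends on the operation: 0 is neutral for addition, while multiplication would need 1.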
============================================
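Exercise 3:
Think about it: how do the rules for Series arithmetic differ from those for ndarray arithmetic?
Create another Series s2 whose index includes "文综", and perform several arithmetic operations between it and s1. Think about how to keep all the data.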
============================================
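A DataFrame is a [tabular] data structure that can be viewed as a [dictionary of Series] sharing one index. It consists of several columns of data arranged in a fixed order. It was designed to extend the use of Series from one dimension to multiple dimensions. A DataFrame has both a row index and a column index.
- Row index: index
- Column index: columns
- Values: values (a two-dimensional numpy array)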
from pandas import DataFrame
dic = {
"姓名":["狗蛋","刘德华","鹿晗","TF","张柏芝","谢霆锋"],
"语文":[120,130,140,110,110,100],
"数学":[130,110,100,100,100,120],
"外语":[120,130,123,125,112,143],
"综合":[230,240,254,234,100,100]
}
dic
{'外语': [120, 130, 123, 125, 112, 143], '姓名': ['狗蛋', '刘德华', '鹿晗', 'TF', '张柏芝', '谢霆锋'], '数学': [130, 110, 100, 100, 100, 120], '综合': [230, 240, 254, 234, 100, 100], '语文': [120, 130, 140, 110, 110, 100]}
df = DataFrame(dic,index=list("abcdef"))
# If no index is given, the row positions 0..N-1 are used as the default index
df
| | 外语 | 姓名 | 数学 | 综合 | 语文 |
---|---|---|---|---|---|
a | 120 | 狗蛋 | 130 | 230 | 120 |
b | 130 | 刘德华 | 110 | 240 | 130 |
c | 123 | 鹿晗 | 100 | 254 | 140 |
d | 125 | TF | 100 | 234 | 110 |
e | 112 | 张柏芝 | 100 | 100 | 110 |
f | 143 | 谢霆锋 | 120 | 100 | 100 |
DataFrame attributes: values, columns, index, shape
df.values # a two-dimensional array
array([[120, '狗蛋', 130, 230, 120], [130, '刘德华', 110, 240, 130], [123, '鹿晗', 100, 254, 140], [125, 'TF', 100, 234, 110], [112, '张柏芝', 100, 100, 110], [143, '谢霆锋', 120, 100, 100]], dtype=object)
df.index # an Index object
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
df.index[0] = "q" # a single index entry cannot be assigned
TypeError: Index does not support mutable operations
df.index = list("ABCDEF") # the index can only be replaced as a whole
df
| | 外语 | 姓名 | 数学 | 综合 | 语文 |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
C | 123 | 鹿晗 | 100 | 254 | 140 |
D | 125 | TF | 100 | 234 | 110 |
E | 112 | 张柏芝 | 100 | 100 | 110 |
F | 143 | 谢霆锋 | 120 | 100 | 100 |
df.columns # also an Index object
Index(['外语', '姓名', '数学', '综合', '语文'], dtype='object')
df.columns[0] = "python" # a single column label cannot be assigned
TypeError: Index does not support mutable operations
df.columns = ["python","java","h5","ruby","javascript"]
df
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
C | 123 | 鹿晗 | 100 | 254 | 140 |
D | 125 | TF | 100 | 234 | 110 |
E | 112 | 张柏芝 | 100 | 100 | 110 |
F | 143 | 谢霆锋 | 120 | 100 | 100 |
df1 = DataFrame(dic,columns=["姓名","java","h5","ruby","javascript"])
df1 # when creating from a dict, requested columns with no matching key are filled with NaN
| | 姓名 | java | h5 | ruby | javascript |
---|---|---|---|---|---|
0 | 狗蛋 | NaN | NaN | NaN | NaN |
1 | 刘德华 | NaN | NaN | NaN | NaN |
2 | 鹿晗 | NaN | NaN | NaN | NaN |
3 | TF | NaN | NaN | NaN | NaN |
4 | 张柏芝 | NaN | NaN | NaN | NaN |
5 | 谢霆锋 | NaN | NaN | NaN | NaN |
df2 = DataFrame(dic,columns=["姓名","语文","数学","外语","综合"])
df2
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
---|---|---|---|---|---|
0 | 狗蛋 | 120 | 130 | 120 | 230 |
1 | 刘德华 | 130 | 110 | 130 | 240 |
2 | 鹿晗 | 140 | 100 | 123 | 254 |
3 | TF | 110 | 100 | 125 | 234 |
4 | 张柏芝 | 110 | 100 | 112 | 100 |
5 | 谢霆锋 | 100 | 120 | 143 | 100 |
nd = np.random.randint(0,150,size=(4,4))
df3 = DataFrame(nd,index=list("abcd"),columns=list("ABCD"))
df3
| | A | B | C | D |
---|---|---|---|---|
a | 84 | 37 | 48 | 114 |
b | 122 | 31 | 4 | 35 |
c | 100 | 40 | 73 | 0 |
d | 138 | 111 | 101 | 110 |
df2["数学"][0] = 1000
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df2
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
---|---|---|---|---|---|
0 | 狗蛋 | 120 | 1000 | 120 | 230 |
1 | 刘德华 | 130 | 110 | 130 | 240 |
2 | 鹿晗 | 140 | 100 | 123 | 254 |
3 | TF | 110 | 100 | 125 | 234 |
4 | 张柏芝 | 110 | 100 | 112 | 100 |
5 | 谢霆锋 | 100 | 120 | 143 | 100 |
dic
{'外语': [120, 130, 123, 125, 112, 143], '姓名': ['狗蛋', '刘德华', '鹿晗', 'TF', '张柏芝', '谢霆锋'], '数学': [130, 110, 100, 100, 100, 120], '综合': [230, 240, 254, 234, 100, 100], '语文': [120, 130, 140, 110, 110, 100]}
df3["A"]["a"] = 2000
df3
| | A | B | C | D |
---|---|---|---|---|
a | 2000 | 37 | 48 | 114 |
b | 122 | 31 | 4 | 35 |
c | 100 | 40 | 73 | 0 |
d | 138 | 111 | 101 | 110 |
nd
array([[2000, 37, 48, 114], [ 122, 31, 4, 35], [ 100, 40, 73, 0], [ 138, 111, 101, 110]])
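[Note] A DataFrame created from a dict holds a copy of the data, so dic is unchanged above. In this run, a DataFrame created from an array references the array's memory, which is why nd changed. This can depend on the pandas version and the copy argument.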
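The chained assignments used above, such as df2["数学"][0] = 1000, index twice and can trigger SettingWithCopyWarning. A minimal sketch of the single-step form with .loc, using a shortened two-row version of the table as illustrative data:

```python
import pandas as pd

df2 = pd.DataFrame({"姓名": ["狗蛋", "刘德华"], "数学": [130, 110]})

# One indexing call: row label 0, column "数学"
df2.loc[0, "数学"] = 1000

print(df2)
```

Because the row and the column are given in one call, pandas writes to the original frame and not to an intermediate object.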
============================================
Exercise 4:
Create a DataFrame named df from the following exam score table:
张三 李四
语文 150 0
数学 150 0
英语 150 0
理综 300 0
============================================
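(1) Indexing columns
- With dictionary-style access
- With attribute access
A DataFrame column can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is already set to the column name.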
df
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
C | 123 | 鹿晗 | 100 | 254 | 140 |
D | 125 | TF | 100 | 234 | 110 |
E | 112 | 张柏芝 | 100 | 100 | 110 |
F | 143 | 谢霆锋 | 120 | 100 | 100 |
# Key form
df["java"]
A 狗蛋 B 刘德华 C 鹿晗 D TF E 张柏芝 F 谢霆锋 Name: java, dtype: object
# Attribute form
df.java
df2.数学
0 1000 1 110 2 100 3 100 4 100 5 120 Name: 数学, dtype: int64
# Indexing multiple columns
# df.java.python # attribute access cannot select multiple columns
df[["python","java"]] # indexing multiple columns returns a DataFrame
| | python | java |
---|---|---|
A | 120 | 狗蛋 |
B | 130 | 刘德华 |
C | 123 | 鹿晗 |
D | 125 | TF |
E | 112 | 张柏芝 |
F | 143 | 谢霆锋 |
df[["python"]]
| | python |
---|---|
A | 120 |
B | 130 |
C | 123 |
D | 125 |
E | 112 |
F | 143 |
df["python":"java"] # a slice inside [] acts on rows, so it cannot be used to slice columns
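Columns can still be sliced by label through .loc, with ":" in the row position. A minimal sketch using a shortened two-row version of the table as illustrative data; the name cols is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"python": [120, 130], "java": ["狗蛋", "刘德华"], "h5": [130, 110]},
    index=["A", "B"],
)

# ":" keeps every row; the label slice on the column axis includes both ends
cols = df.loc[:, "python":"java"]
print(list(cols.columns))  # ['python', 'java']
```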
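(2) Indexing rows
- Use .ix[] for row indexing (deprecated in pandas 0.20 and removed in 1.0; prefer .loc[] or .iloc[])
- Use .loc[] with index labels for row indexing
- Use .iloc[] with integer positions for row indexing
Selecting a single row likewise returns a Series, whose index is the original columns.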
df
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
C | 123 | 鹿晗 | 100 | 254 | 140 |
D | 125 | TF | 100 | 234 | 110 |
E | 112 | 张柏芝 | 100 | 100 | 110 |
F | 143 | 谢霆锋 | 120 | 100 | 100 |
df.loc["A"]
python 120 java 狗蛋 h5 130 ruby 230 javascript 120 Name: A, dtype: object
df.iloc[1]
python 130 java 刘德华 h5 110 ruby 240 javascript 130 Name: B, dtype: object
# Slicing
df.loc["A":"A"]
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
df.loc[["A","B"]]
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
df.iloc[0:3]
| | python | java | h5 | ruby | javascript |
---|---|---|---|---|---|
A | 120 | 狗蛋 | 130 | 230 | 120 |
B | 130 | 刘德华 | 110 | 240 | 130 |
C | 123 | 鹿晗 | 100 | 254 | 140 |
df.loc["A","python"]
120
df.iloc[1,1] # positional indexing works like indexing the underlying array
'刘德华'
df["python","A"] # column first, then row does not work inside one []; the tuple is looked up as a single column key
KeyError: ('python', 'A')
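(3) Ways to index a single element
- Use column indexing
- Use row indexing (iloc[3,1] passes two indexers, a row and a column; in iloc[[3,3]] the list [3,3] counts as one indexer)
- Use the values attribute (a two-dimensional numpy array)
# Indexing twice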
df["python"]["A"]
df.loc["A"]["java"]
df.loc["A":"C"]["java"]
A 狗蛋 B 刘德华 C 鹿晗 Name: java, dtype: object
df["python":"java"]["A"] # the slice selects rows, so the result is an empty frame, and "A" is not a column
KeyError: 'A'
df.iloc[0:3,4] # with iloc, the column selector must be positional as well
A    120
B    130
C    140
Name: javascript, dtype: int64
df.iloc[0:3]["h5"]
A    130
B    110
C    100
Name: h5, dtype: int64
[Note] With plain square brackets:
- a label selects a column
- a slice selects rows
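The rule above can be sketched on a small frame. The `scores` table below is a hypothetical stand-in (its values and labels are assumptions, not the `df` used earlier); it only illustrates which axis each form of indexing touches.

```python
import numpy as np
import pandas as pd

# Hypothetical score table: rows are students A-E, columns are subjects.
scores = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=list("ABCDE"),
    columns=["python", "java", "javascript", "h5"],
)

col = scores["java"]                    # single label -> one column (a Series)
rows = scores["A":"C"]                  # slice -> rows A through C, end label included
cols = scores.loc[:, "python":"java"]   # slicing columns requires .loc (or .iloc)
block = scores.iloc[0:3, 2]             # positional: first three rows, third column
```

To slice columns, go through `.loc[:, ...]` or `.iloc[:, ...]`; a bare bracket slice never reaches the column axis.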
============================================
Exercise 5:
Index and slice ddd in several different ways and compare how the results differ.
============================================
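(1) Operations between DataFrames
As with Series:
- data with different index labels is aligned automatically during the operation
- where a label has no counterpart, the result is NaN
Create DataFrame df1: each student's scores by subject, monthly exam 1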
df1 = DataFrame(np.random.randint(0,150,size=(4,4)),
columns=["python","线性代数","微积分","数理统计"],
index=list("abcd"))
df1
| | python | 线性代数 | 微积分 | 数理统计 |
|---|---|---|---|---|
| a | 101 | 33 | 31 | 87 |
| b | 98 | 69 | 139 | 104 |
| c | 12 | 59 | 40 | 136 |
| d | 8 | 120 | 92 | 91 |
Create DataFrame df2: each student's scores by subject, monthly exam 2
New students have transferred in
df2 = DataFrame(np.random.randint(0,150,size=(6,4)),
columns=["python","java","ruby","数理统计"],
index=list("abcdef"))
df2
| | python | java | ruby | 数理统计 |
|---|---|---|---|---|
| a | 47 | 51 | 134 | 116 |
| b | 91 | 27 | 133 | 79 |
| c | 123 | 83 | 0 | 54 |
| d | 50 | 10 | 31 | 143 |
| e | 84 | 117 | 17 | 132 |
| f | 103 | 87 | 62 | 23 |
df1 + df2
| | java | python | ruby | 微积分 | 数理统计 | 线性代数 |
|---|---|---|---|---|---|---|
| a | NaN | 148.0 | NaN | NaN | 203.0 | NaN |
| b | NaN | 189.0 | NaN | NaN | 183.0 | NaN |
| c | NaN | 135.0 | NaN | NaN | 190.0 | NaN |
| d | NaN | 58.0 | NaN | NaN | 234.0 | NaN |
| e | NaN | NaN | NaN | NaN | NaN | NaN |
| f | NaN | NaN | NaN | NaN | NaN | NaN |
df1.add(df2,fill_value=0)
| | java | python | ruby | 微积分 | 数理统计 | 线性代数 |
|---|---|---|---|---|---|---|
| a | 51.0 | 148.0 | 134.0 | 31.0 | 203.0 | 33.0 |
| b | 27.0 | 189.0 | 133.0 | 139.0 | 183.0 | 69.0 |
| c | 83.0 | 135.0 | 0.0 | 40.0 | 190.0 | 59.0 |
| d | 10.0 | 58.0 | 31.0 | 92.0 | 234.0 | 120.0 |
| e | 117.0 | 84.0 | 17.0 | NaN | 132.0 | NaN |
| f | 87.0 | 103.0 | 62.0 | NaN | 23.0 | NaN |
df2.add(df1,fill_value=0) # fill_value stands in for a value that is missing on one side only
# (Note) a cell that is missing from both frames (both its row and its column) is still NaN
| | java | python | ruby | 微积分 | 数理统计 | 线性代数 |
|---|---|---|---|---|---|---|
| a | 51.0 | 148.0 | 134.0 | 31.0 | 203.0 | 33.0 |
| b | 27.0 | 189.0 | 133.0 | 139.0 | 183.0 | 69.0 |
| c | 83.0 | 135.0 | 0.0 | 40.0 | 190.0 | 59.0 |
| d | 10.0 | 58.0 | 31.0 | 92.0 | 234.0 | 120.0 |
| e | 117.0 | 84.0 | 17.0 | NaN | 132.0 | NaN |
| f | 87.0 | 103.0 | 62.0 | NaN | 23.0 | NaN |
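For example, rows e and f exist only in df2, while the columns 微积分 and 线性代数 exist only in df1, so those cells are missing from both operands and stay NaN even with fill_value=0.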
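A minimal sketch of the difference between a one-sided gap and a two-sided gap. The frames `left` and `right` are hypothetical and much smaller than df1 and df2; they are chosen only so that every case shows up once.

```python
import numpy as np
import pandas as pd

# Two small hypothetical frames that overlap only in cell (a, x).
left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20], "z": [30, 40]}, index=["a", "c"])

plain = left + right                     # a cell missing on either side becomes NaN
filled = left.add(right, fill_value=0)   # a cell missing on one side is treated as 0

# (b, z) and (c, y) exist in neither frame, so they remain NaN in `filled`.
# If zeros are wanted there as well, fill the result afterwards:
no_gaps = filled.fillna(0)
```

Whether filling the remaining NaN with 0 is appropriate depends on the data; for exam scores it would turn "did not take the subject" into a score of zero.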
The table below maps Python operators to the corresponding pandas methods:
| Python Operator | Pandas Method(s) |
|---|---|
| + | add() |
| - | sub(), subtract() |
| * | mul(), multiply() |
| / | truediv(), div(), divide() |
| // | floordiv() |
| % | mod() |
| ** | pow() |
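As a quick check of the mapping, each operator can be compared with its method on two small frames. The frames `a` and `b` are hypothetical and share the same labels, so no NaN appears.

```python
import pandas as pd

# Two hypothetical frames with identical row and column labels.
a = pd.DataFrame({"x": [7, 8], "y": [9, 10]})
b = pd.DataFrame({"x": [2, 3], "y": [4, 5]})

# Each operator and its method counterpart produce equal frames.
same_add = (a + b).equals(a.add(b))
same_sub = (a - b).equals(a.sub(b))
same_mul = (a * b).equals(a.mul(b))
same_div = (a / b).equals(a.truediv(b))
same_floordiv = (a // b).equals(a.floordiv(b))
same_mod = (a % b).equals(a.mod(b))
same_pow = (a ** b).equals(a.pow(b))
```

The methods are worth using when the extra parameters are needed, such as `fill_value` or `axis`, which the operators cannot express.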
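(2) Operations between a Series and a DataFrame
[Important]
With Python operators: the Series index is matched against the DataFrame's column labels and the Series is applied to every row, so the Series must be row-shaped. (This resembles an operation between a 2-D array and a 1-D array in numpy, except that labels without a match produce NaN.)
With pandas methods:
- axis=0: the Series index is matched against the row labels and the Series is applied to every column, so the Series must be column-shaped.
- axis=1: the Series index is matched against the column labels and the Series is applied to every row, so the Series must be row-shaped.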
df2
| | python | java | ruby | 数理统计 |
|---|---|---|---|---|
| a | 47 | 51 | 134 | 116 |
| b | 91 | 27 | 133 | 79 |
| c | 123 | 83 | 0 | 54 |
| d | 50 | 10 | 31 | 143 |
| e | 84 | 117 | 17 | 132 |
| f | 103 | 87 | 62 | 23 |
s_row = df2.loc["c"]
s_row
python     123
java        83
ruby         0
数理统计     54
Name: c, dtype: int32
s_col = df2["python"]
s_col
a     47
b     91
c    123
d     50
e     84
f    103
Name: python, dtype: int32
df2.add(s_row,axis=1) # align on column labels (s_row is added to every row)
| | python | java | ruby | 数理统计 |
|---|---|---|---|---|
| a | 170 | 134 | 134 | 170 |
| b | 214 | 110 | 133 | 133 |
| c | 246 | 166 | 0 | 108 |
| d | 173 | 93 | 31 | 197 |
| e | 207 | 200 | 17 | 186 |
| f | 226 | 170 | 62 | 77 |
df2.add(s_col,axis=0) # align on row labels (s_col is added to every column)
| | python | java | ruby | 数理统计 |
|---|---|---|---|---|
| a | 94 | 98 | 181 | 163 |
| b | 182 | 118 | 224 | 170 |
| c | 246 | 206 | 123 | 177 |
| d | 100 | 60 | 81 | 193 |
| e | 168 | 201 | 101 | 216 |
| f | 206 | 190 | 165 | 126 |
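The NaN case mentioned for the Python operators is easy to hit when a column-shaped Series is used with `+`. The sketch below uses a small hypothetical `frame` rather than df2 and shows all three combinations side by side.

```python
import pandas as pd

# Hypothetical 2x2 frame; `row` and `col` are taken from it.
frame = pd.DataFrame({"python": [1, 2], "java": [3, 4]}, index=["a", "b"])
row = frame.loc["a"]       # Series indexed by column labels: python, java
col = frame["python"]      # Series indexed by row labels: a, b

by_row = frame + row                # operator: matches column labels, same as axis=1
by_col = frame.add(col, axis=0)     # method with axis=0: matches row labels
mismatch = frame + col              # operator tries to match a, b against python, java
```

In `mismatch` no label lines up, so the result has the union of the labels as columns and every cell is NaN. Passing `axis=0` to the method form is the way to add a column-shaped Series.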
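============================================
Exercise 6:
Suppose ddd holds the midterm scores and ddd2 the final scores. Create ddd2 however you like, add it to ddd, and compute the average of the midterm and final scores.
Suppose 张三 was caught cheating on the midterm math (数学) exam and must be given 0 for it. How would you do that?
李四 is rewarded for reporting 张三 with 100 extra points in every midterm subject. How would you do that?
The teacher later finds that one question was wrong and, to reassure the students, gives every student 10 extra points in every subject. How would you do that?
============================================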
dic = {
"姓名":["狗蛋","刘德华","鹿晗","TF","张三","李四"],
"语文":[120,130,140,110,110,100],
"数学":[130,110,100,100,100,120],
"外语":[120,130,123,125,112,143],
"综合":[230,240,254,234,100,100]
}
ddd = DataFrame(dic,columns=["姓名","语文","数学","外语","综合"])
ddd
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 120 | 130 | 120 | 230 |
| 1 | 刘德华 | 130 | 110 | 130 | 240 |
| 2 | 鹿晗 | 140 | 100 | 123 | 254 |
| 3 | TF | 110 | 100 | 125 | 234 |
| 4 | 张三 | 110 | 100 | 112 | 100 |
| 5 | 李四 | 100 | 120 | 143 | 100 |
dic = {
"姓名":["狗蛋","刘德华","鹿晗","TF","张三","李四"],
"语文":[121,131,140,110,110,100],
"数学":[130,110,100,120,100,120],
"外语":[120,30,123,125,132,143],
"综合":[230,40,254,34,100,60]
}
ddd2 = DataFrame(dic,columns=["姓名","语文","数学","外语","综合"])
ddd2
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 121 | 130 | 120 | 230 |
| 1 | 刘德华 | 131 | 110 | 30 | 40 |
| 2 | 鹿晗 | 140 | 100 | 123 | 254 |
| 3 | TF | 110 | 120 | 125 | 34 |
| 4 | 张三 | 110 | 100 | 132 | 100 |
| 5 | 李四 | 100 | 120 | 143 | 60 |
d = ddd[["语文","数学","外语","综合"]] + ddd2[["语文","数学","外语","综合"]]
d
| | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|
| 0 | 241 | 260 | 240 | 460 |
| 1 | 261 | 220 | 160 | 280 |
| 2 | 280 | 200 | 246 | 508 |
| 3 | 220 | 220 | 250 | 268 |
| 4 | 220 | 200 | 244 | 200 |
| 5 | 200 | 240 | 286 | 160 |
d = d/2
d["姓名"] = ddd["姓名"] # add the name (姓名) column back
d
| | 语文 | 数学 | 外语 | 综合 | 姓名 |
|---|---|---|---|---|---|
| 0 | 120.5 | 130.0 | 120.0 | 230.0 | 狗蛋 |
| 1 | 130.5 | 110.0 | 80.0 | 140.0 | 刘德华 |
| 2 | 140.0 | 100.0 | 123.0 | 254.0 | 鹿晗 |
| 3 | 110.0 | 110.0 | 125.0 | 134.0 | TF |
| 4 | 110.0 | 100.0 | 122.0 | 100.0 | 张三 |
| 5 | 100.0 | 120.0 | 143.0 | 80.0 | 李四 |
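The solution above relies on both frames listing the students in the same order, because the rows are matched by their 0..5 position labels. One alternative, sketched here with hypothetical frames `midterm` and `final` and English column names (all assumptions), is to make the name the index so that rows align by student.

```python
import pandas as pd

# Hypothetical midterm and final tables; note the different row order.
midterm = pd.DataFrame({"name": ["x", "y"], "math": [100, 80], "english": [90, 70]})
final = pd.DataFrame({"name": ["y", "x"], "math": [60, 90], "english": [50, 100]})

# With the name as index, addition aligns each student with themselves.
average = (midterm.set_index("name") + final.set_index("name")) / 2
```

This also keeps the name out of the arithmetic, so there is no need to select the numeric columns first and reattach the names afterwards.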
ddd.loc[4,"数学"] = 0 # assign through a single .loc call; chained indexing such as ddd["数学"][4] = 0 is not reliable for modification
ddd
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 120 | 130 | 120 | 230 |
| 1 | 刘德华 | 130 | 110 | 130 | 240 |
| 2 | 鹿晗 | 140 | 100 | 123 | 254 |
| 3 | TF | 110 | 100 | 125 | 234 |
| 4 | 张三 | 110 | 0 | 112 | 100 |
| 5 | 李四 | 100 | 120 | 143 | 100 |
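A short sketch of single-call assignment, using a hypothetical `grades` frame. Chained indexing performs two separate lookups, and the second one may act on a temporary object; depending on the pandas version this can raise a warning or leave the original frame unchanged, which is why it is avoided here.

```python
import pandas as pd

# Hypothetical frame with a default 0..N-1 index.
grades = pd.DataFrame({"name": ["u", "v"], "math": [100, 120]})

# A single indexing call addresses the cell in the original frame.
grades.loc[1, "math"] = 0

# .at is the scalar-only counterpart of .loc.
grades.at[0, "math"] = 50
```

Both forms take the row label first and the column label second, the same order used in the cell above.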
ddd.loc[5,["语文","数学","外语","综合"]] += 100
ddd
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 120 | 130 | 120 | 230 |
| 1 | 刘德华 | 130 | 110 | 130 | 240 |
| 2 | 鹿晗 | 140 | 100 | 123 | 254 |
| 3 | TF | 110 | 100 | 125 | 234 |
| 4 | 张三 | 110 | 0 | 112 | 100 |
| 5 | 李四 | 200 | 220 | 243 | 200 |
ddd[["语文","数学","外语","综合"]] += 10
ddd
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 130 | 140 | 130 | 240 |
| 1 | 刘德华 | 140 | 120 | 140 | 250 |
| 2 | 鹿晗 | 150 | 110 | 133 | 264 |
| 3 | TF | 120 | 110 | 135 | 244 |
| 4 | 张三 | 120 | 10 | 122 | 110 |
| 5 | 李四 | 210 | 230 | 253 | 210 |
ddd.iloc[:,1:5] +=10 # the same update by position, array-style (this cell runs after the previous one, so the scores rise by another 10)
ddd
| | 姓名 | 语文 | 数学 | 外语 | 综合 |
|---|---|---|---|---|---|
| 0 | 狗蛋 | 140 | 150 | 140 | 250 |
| 1 | 刘德华 | 150 | 130 | 150 | 260 |
| 2 | 鹿晗 | 160 | 120 | 143 | 274 |
| 3 | TF | 130 | 120 | 145 | 254 |
| 4 | 张三 | 130 | 20 | 132 | 120 |
| 5 | 李四 | 220 | 240 | 263 | 220 |