Python pandas DataFrame
ChinaHadoop (小象学院) Python AI study notes
http://www.chinahadoop.cn/bootcamp/course/1276
Code demo:
import pandas as pd
def transfer_odd_even(x):
    # A number is even when it is divisible by 2.
    if x % 2 == 0:
        return 'even'
    else:
        return 'odd'
arr = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
arr2 = arr.applymap(transfer_odd_even)
print(arr2)
Output:
      a     b
0   odd  even
1  even   odd
2   odd  even
Practice:
import pandas as pd
fruits_sold = pd.DataFrame({
    's1': [20.85746739, 48.69399627, 6.64139183, 43.97206466, 42.24557245, 6.59165588, 18.14644399, 0.51207489,
           4.07669522, 40.2318477],
    's2': [44.12131395, 21.25162529, 10.10155038, 18.26017017, 26.81437566, 22.6139889, 34.7631294, 12.53410469,
           23.19350524, 13.7823773],
    's3': [45.81103321, 6.06090098, 13.54856198, 26.20863107, 23.12541538, 45.61618866, 46.27495742, 14.26893644,
           24.02366712, 0.64639531],
    's4': [6.284111, 21.20035631, 20.64741436, 17.21031833, 1.14764087, 48.34064417, 43.86282243, 28.06504264,
           49.73980011, 24.48708205],
    's5': [33.1696067, 42.65221954, 10.76371411, 30.74131832, 0.37146529, 42.56511105, 42.08960913, 16.59236303,
           6.43504234, 45.85999181],
    's6': [42.05696716, 27.41229091, 11.66774126, 1.85674496, 3.32968606, 40.72835009, 28.8835585, 26.46075757,
           46.33800476, 35.4963476]},
    index=['5-1', '5-2', '5-3', '5-4', '5-5', '5-6', '5-7', '5-8', '5-9', '5-10'],
)
def sold_grade(amount):
    if amount >= 40:
        return 'A'
    elif amount >= 30:
        return 'B'
    else:
        return 'C'
fruits_sold_grade = fruits_sold.applymap(sold_grade)
print(fruits_sold_grade)
Output:
     s1 s2 s3 s4 s5 s6
5-1   C  A  A  C  B  A
5-2   A  C  C  C  A  C
5-3   C  C  C  C  C  C
5-4   A  C  C  C  B  C
5-5   A  C  C  C  C  C
5-6   C  C  A  A  A  A
5-7   C  B  A  A  A  C
5-8   C  C  C  C  C  C
5-9   C  C  C  A  C  A
5-10  A  C  C  C  A  B
Conclusion
The applymap method passes every element of a DataFrame through a function and returns a new DataFrame holding the results.
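As a minimal sketch of the element-wise behaviour summarized above: to my knowledge, pandas 2.1 and later offer the same operation under the name DataFrame.map and emit a deprecation warning for applymap. The helper name `elementwise` and the function `parity` are hypothetical and not part of the original notes; the helper simply picks whichever method the installed pandas provides.

```python
import pandas as pd


def elementwise(frame, func):
    """Apply func to every cell of frame (illustrative helper, not a pandas API)."""
    # Assumption: DataFrame.map exists on pandas >= 2.1; older versions only have applymap.
    method = getattr(frame, "map", None) or frame.applymap
    return method(func)


def parity(x):
    # Label a single integer as 'even' or 'odd'.
    return 'even' if x % 2 == 0 else 'odd'


arr = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
labels = elementwise(arr, parity)
print(labels)
```

The function receives one cell at a time, so it should be written for scalars rather than for a whole Series.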
Code demo
import pandas as pd
demo_arr = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
def compared_by_5_return_true_false(num):
    # An unnecessary conditional expression was originally used here:
    # return True if num > 5 else False
    return num >= 5
def sum_by_series(series):
    return series.sum()
# Check whether each value is not less than 5
print(demo_arr.apply(compared_by_5_return_true_false))
print(demo_arr.applymap(compared_by_5_return_true_false))
# Sum of each column
print(demo_arr.apply(sum_by_series))
# Sum of each row
print(demo_arr.apply(sum_by_series, axis=1))
Output
       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
0    12
1    15
2    18
dtype: int64
0     6
1    15
2    24
dtype: int64
Practice 1
import pandas as pd
df = pd.DataFrame({
's1': [27.93, 58.08, 38.67, 45.83, 70.26, 46.61, 49.73, 34.02, 56.64, 57.28],
's2': [28.18, 50.61, 31.73, 31.48, 55.96, 22.73, 40.47, 42.02, 31.39, 64.21],
's3': [29.39, 51.62, 57.91, 45.94, 53.81, 45.77, 69.13, 28.75, 43.43, 55.7],
's4': [40.52, 48.55, 59.24, 71.21, 58.48, 63.63, 55.16, 34.9, 54, 68.03],
's5': [26.26, 54.03, 49.08, 46.53, 43.23, 56.79, 58.71, 26.43, 44.97, 54.16]
}, index=['05-21', '05-22', '05-23', '05-24', '05-25', '05-26', '05-27', '05-28', '05-29', '05-30'])
def cell_larger_than_mean(bool_v):
    if bool_v:
        return 'A'
    else:
        return 'B'
def larger_than_mean(numbers):
    mean = numbers.mean()
    return numbers > mean
print(df.apply(larger_than_mean, axis=1).applymap(cell_larger_than_mean))
Output
      s1 s2 s3 s4 s5
05-21  B  B  B  A  B
05-22  A  B  B  B  A
05-23  B  B  A  A  A
05-24  B  B  B  A  B
05-25  A  B  B  A  B
05-26  B  B  B  A  A
05-27  B  B  A  A  A
05-28  A  A  B  A  B
05-29  A  B  B  A  B
05-30  B  A  B  A  B
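The same row-wise grading can also be written without apply. The sketch below is an alternative to the approach in Practice 1, not a replacement for it: it uses only the first three rows of the data to stay short, and the names `above_row_mean` and `grades` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    's1': [27.93, 58.08, 38.67],
    's2': [28.18, 50.61, 31.73],
    's3': [29.39, 51.62, 57.91],
    's4': [40.52, 48.55, 59.24],
    's5': [26.26, 54.03, 49.08],
}, index=['05-21', '05-22', '05-23'])

# Compare every cell with the mean of its own row;
# axis=0 aligns the Series of row means on the row index.
above_row_mean = df.gt(df.mean(axis=1), axis=0)

# np.where returns a plain array, so the labels are wrapped back into a DataFrame.
grades = pd.DataFrame(np.where(above_row_mean, 'A', 'B'),
                      index=df.index, columns=df.columns)
print(grades)
```

Whole-frame comparisons like this avoid calling a Python function once per row and once per cell, which tends to matter on larger tables.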
Practice 2
import pandas as pd
df = pd.DataFrame({
's1': [27.93, 58.08, 38.67, 45.83, 70.26, 46.61, 49.73, 34.02, 56.64, 57.28],
's2': [28.18, 50.61, 31.73, 31.48, 55.96, 22.73, 40.47, 42.02, 31.39, 64.21],
's3': [29.39, 51.62, 57.91, 45.94, 53.81, 45.77, 69.13, 28.75, 43.43, 55.7],
's4': [40.52, 48.55, 59.24, 71.21, 58.48, 63.63, 55.16, 34.9, 54, 68.03],
's5': [26.26, 54.03, 49.08, 46.53, 43.23, 56.79, 58.71, 26.43, 44.97, 54.16]
}, index=['05-21', '05-22', '05-23', '05-24', '05-25', '05-26', '05-27', '05-28', '05-29', '05-30'])
def gap_between_max_min(numbers):
    return numbers.idxmax() + '-' + numbers.idxmin() + '=' + str(numbers.max() - numbers.min())[:4]
print(df.apply(gap_between_max_min, axis=1))
Output
05-21    s4-s5=14.2
05-22    s1-s4=9.53
05-23    s4-s2=27.5
05-24    s4-s2=39.7
05-25    s1-s5=27.0
05-26    s4-s2=40.9
05-27    s3-s2=28.6
05-28    s2-s5=15.5
05-29    s1-s2=25.2
05-30    s4-s5=13.8
dtype: object
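The gap_between_max_min helper in Practice 2 shortens the difference with str(...)[:4], which cuts the string after four characters rather than rounding, so 14.26 becomes 14.2. A hedged variant using a format specification is sketched below; the name `gap_between_max_min_rounded` is hypothetical, and only the first two rows of the data are used.

```python
import pandas as pd

df = pd.DataFrame({
    's1': [27.93, 58.08],
    's2': [28.18, 50.61],
    's3': [29.39, 51.62],
    's4': [40.52, 48.55],
    's5': [26.26, 54.03],
}, index=['05-21', '05-22'])


def gap_between_max_min_rounded(numbers):
    """Variant of gap_between_max_min that rounds the gap to two decimals."""
    gap = numbers.max() - numbers.min()
    # idxmax/idxmin return the column labels because each row is passed as a Series.
    return f"{numbers.idxmax()}-{numbers.idxmin()}={gap:.2f}"


gaps = df.apply(gap_between_max_min_rounded, axis=1)
print(gaps)
```

The format specification also keeps the result readable when the gap has three or more digits before the decimal point, where a four-character slice would drop the decimals entirely.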
Conclusion
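What DataFrame apply does
- It passes each row or column of the DataFrame (a Series) through a function and turns it into a new row or column.
- It passes each row or column of the DataFrame (a Series) through a function and reduces it to a single value.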
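The two behaviours listed above can be seen side by side in a small sketch. The frame `demo` and the lambdas are made up for illustration; the shape of the result depends only on whether the function returns a Series or a scalar.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 60]})

# Case 1: the function returns a Series of the same length,
# so apply assembles the results into a DataFrame.
centered = demo.apply(lambda col: col - col.mean())

# Case 2: the function returns a scalar,
# so apply returns a Series with one value per column.
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1 passes each row instead of each column.
row_total = demo.apply(lambda row: row.sum(), axis=1)

print(centered)
print(col_range)
print(row_total)
```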
Code demo
import pandas as pd
arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a', 'b'])
print('arr_test:\n', arr_test)
print('s1:\n', arr_test['s1'])
print('a:\n', arr_test.loc['a'])
print('a:\n', arr_test.iloc[0])
Output
arr_test:
s1 s2 s3
a 1 10 100
b 2 20 200
s1:
a 1
b 2
Name: s1, dtype: int64
a:
s1 1
s2 10
s3 100
Name: a, dtype: int64
a:
s1 1
s2 10
s3 100
Name: a, dtype: int64
Conclusion
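In 'column_name': [a, b, c], column_name is the column label. For example, 'column1': [1, 2, 3] has the column label column1, and 1, 2, 3 are the elements of that column.
The 'column_name': [a, b, c] pairs must be defined inside {}, for example pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]}).
Use index=['a', 'b', 'c'] to define the row labels. For example, with arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a', 'b']), row a holds 1, 10, 100 and row b holds 2, 20, 200.
If the number of elements in index differs from the length of the column lists, an error like the following is raised: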
Traceback (most recent call last):
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1667, in create_block_manager_from_arrays
mgr = BlockManager(blocks, axes)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 114, in __init__
self._verify_integrity()
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 311, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1691, in construction_error
passed, implied))
ValueError: Shape of passed values is (2, 3), indices imply (1, 3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/code/python_project/AI_Learn/pandas_dataframe_define.py", line 3, in <module>
arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a'])
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py", line 61, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1671, in create_block_manager_from_arrays
construction_error(len(arrays), arrays[0].shape, axes, e)
File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1691, in construction_error
passed, implied))
ValueError: Shape of passed values is (2, 3), indices imply (1, 3)
Process finished with exit code 1
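A short sketch of how the mismatch can be detected or handled in code. It assumes only that pandas raises ValueError here, as in the traceback above; the message wording varies between pandas versions, so the example does not rely on it. The variable names are illustrative.

```python
import pandas as pd

data = {'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}
index = ['a']  # one label for two rows: deliberately wrong

# Number of rows implied by the column lists.
n_rows = len(next(iter(data.values())))

try:
    pd.DataFrame(data, index=index)
    error_message = None
except ValueError as exc:
    # Only the exception type is relied on, not the exact text.
    error_message = str(exc)

print('lengths match:', len(index) == n_rows)
print('error:', error_message)
```

Comparing len(index) with the column length before constructing the frame gives a clearer failure point than reading the full traceback.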
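Accessing a DataFrame
- A column can be accessed with name['column label'], e.g. array['s1']
- A row can be accessed by its label with loc, e.g. array.loc['a']
- A row can be accessed by its integer position with iloc, e.g. array.iloc[0]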
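The three access paths above can be checked against each other in a few lines. This is a minimal sketch reusing the arr_test frame from the demo; the last line, selecting a single cell with loc[row, column], goes one step beyond the notes and is included as an assumption about what a reader may want next.

```python
import pandas as pd

arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a', 'b'])

col_s1 = arr_test['s1']          # column by label: a Series indexed by the row labels
row_a = arr_test.loc['a']        # row by label
row_0 = arr_test.iloc[0]         # row by integer position
cell = arr_test.loc['b', 's2']   # single cell by row label and column label

print(col_s1.tolist(), row_a.tolist(), cell)
```

Because 'a' is the first row label, loc['a'] and iloc[0] return the same Series here.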