2. Pandas
- Series
-
Create a Series
-
Create from a list
s1 = pd.Series([1, 2, 3, 4])
----------------------------
0    1
1    2
2    3
3    4
dtype: int64
-
Create from an array
s2 = pd.Series(np.arange(10))
-----------------------------
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
-
Create from a dict (index labels can be specified)
# Create a Series from a dict; the keys become the index
s3 = pd.Series({'a':1, 'b':2, 'c':3})
-------------------------------------
a    1
b    2
c    3
dtype: int64
# Series with an explicit index (pass a list; a set has no defined order)
s4 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
--------------------------------------------------------
A    1
B    2
C    3
D    4
dtype: int64
-
-
Convert a Series to a dict
-
to_dict()
s4.to_dict()
------------
{'A': 1, 'B': 2, 'C': 3, 'D': 4}
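A minimal sketch of the round trip between a Series and a dict, under the assumption that the index is passed as a list so that label order is predictable. The names `labels`, `as_dict`, and `restored` are illustrative and do not appear in the notes.

```python
import pandas as pd

# A list keeps the labels in the order written; a set would not guarantee it.
labels = ['A', 'B', 'C', 'D']
s = pd.Series([1, 2, 3, 4], index=labels)

# Series -> dict: index labels become keys, values stay values.
as_dict = s.to_dict()

# dict -> Series: keys become the index again.
restored = pd.Series(as_dict)

print(as_dict)
print(restored.equals(s))
```

Because dicts preserve insertion order in current Python versions, the rebuilt Series has the same label order as the original.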
-
-
Changing the index
# Re-label with a new index; s5 is an existing Series labelled A-D (its definition is not shown in these notes)
# Pass the labels as a list; a set has no defined order
index_1 = ['A', 'B', 'C', 'D', 'E']
s6 = pd.Series(s5, index_1)
---------------------------
A    2.0
B    1.0
C    3.0
D    4.0
E    NaN
dtype: float64
-
Series element operations
-
Checking for missing values
pd.isnull(s6)  # or pd.notnull(s6)
----------------------------------
A    False
B    False
C    False
D    False
E     True
dtype: bool
-
Naming the Series and its index
s6.name = 'demo'
s6
----------------
A    2.0
B    1.0
C    3.0
D    4.0
E    NaN
Name: demo, dtype: float64
==========================
s6.index.name = 'demo index'
s6.index
----------------------------
Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
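The re-labelling step can be sketched as follows. Since `s5` is not defined in the notes, `base` is a hypothetical stand-in for it; `wider` and `missing` are likewise illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the undefined s5.
base = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])

# Building a Series from another Series plus an index re-aligns by label.
# 'E' has no source value, so it becomes NaN and the dtype widens to float64.
wider = pd.Series(base, index=['A', 'B', 'C', 'D', 'E'])

# Labels whose values are missing after re-alignment.
missing = wider[wider.isnull()].index.tolist()

wider.name = 'demo'
wider.index.name = 'demo index'
print(wider)
print(missing)
```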
-
-
- DataFrame
- Create a DataFrame
-
Create a DataFrame from the clipboard
# Open the page that holds the table to copy
import webbrowser
link = 'http://www.tiobe.com/tiobe-index'
webbrowser.open(link)
---------------------
True
=====================
# Copy the table in the browser, then build a DataFrame from the clipboard contents
df = pd.read_clipboard()
-
Get the column labels
df.columns
----------
Index(['Nov 2018', 'Nov 2017', 'Change', 'Programming Language', 'Ratings', 'Change.1'], dtype='object')
-
Get the values of one column
# Get the values of the Ratings column
df.Ratings
----------
0    16.746%
1    14.396%
2     8.282%
3     7.683%
4     6.490%
5     3.952%
6     2.655%
Name: Ratings, dtype: object
-
Get the values of several columns (filtering produces a new DataFrame)
# Pass the column names as a list; a set has no defined order
df_new = pd.DataFrame(df, columns=['Nov 2018', 'Programming Language'])
-----------------------------------------------------------------------
   Nov 2018 Programming Language
0         1                 Java
1         2                    C
2         3                  C++
3         4               Python
4         5    Visual Basic .NET
5         6                   C#
6         7           JavaScript
-
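Get values by column name with brackets (this works even when the name contains spaces); the result is a Series
df['Programming Language']
--------------------------
0                 Java
1                    C
2                  C++
3               Python
4    Visual Basic .NET
5                   C#
6           JavaScript
Name: Programming Language, dtype: object
=========================================
type(df['Programming Language'])
--------------------------------
pandas.core.series.Series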
-
If the filtered DataFrame names a column that the original does not have, pandas fills that column with NaN
df_new2 = pd.DataFrame(df, columns=['Nov 2018', 'Sep 2018', 'Programming Language'])
------------------------------------------------------------------------------------
   Nov 2018  Sep 2018 Programming Language
0         1       NaN                 Java
1         2       NaN                    C
2         3       NaN                  C++
3         4       NaN               Python
4         5       NaN    Visual Basic .NET
5         6       NaN                   C#
6         7       NaN           JavaScript
-
Filling the new column with data
-
From a range
df_new2['Sep 2018'] = range(0, 7)
-
From an array (arange)
df_new2['Sep 2018'] = np.arange(10, 17)
-
From a Series
df_new2['Sep 2018'] = pd.Series(np.arange(20, 27))
-
-
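Use a Series to fill only selected elements of a column
# Fill the elements of the new column at index 1 and 2; the other rows are left as NaN
df_new3['Sep 2018'] = pd.Series([100, 200], index=[1, 2])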
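The difference between the assignment styles above can be sketched on a small frame. The data and the column names `lang`, `rank`, and `bonus` are made up for illustration.

```python
import pandas as pd

# Hypothetical four-row frame.
df = pd.DataFrame({'lang': ['Java', 'C', 'C++', 'Python']})

# A range, list, or array is assigned by position and must match the row count.
df['rank'] = range(1, 5)

# A Series is aligned on index labels, so rows it does not mention become NaN.
df['bonus'] = pd.Series([100, 200], index=[1, 2])

print(df)
```

Alignment by label is why the Series form can fill only some rows, while the positional forms need exactly one value per row.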
-
-
- A closer look at Series and DataFrame
-
DataFrame
# data is a dict of columns (Country, Capital, Population); its definition is not shown in these notes
df1 = pd.DataFrame(data)
------------------------
   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasilia   207847528
=================================
# Each column of a DataFrame is a Series; a DataFrame is made up of several Series
type(df1['Country'])
--------------------
pandas.core.series.Series
=========================
# iterrows() returns a generator of (index, row) pairs; iterate over it with a for loop
df1.iterrows()
for row in df1.iterrows():
    print(row)
--------------
-
Create a DataFrame from Series
# Build three Series from data
s1 = pd.Series(data['Capital'])
s2 = pd.Series(data['Country'])
s3 = pd.Series(data['Population'])
# Create a DataFrame from a list of Series; each Series becomes a row
df_new = pd.DataFrame([s2, s1, s3], index=['Country', 'Capital', 'Population'])
df_new
------
                   0           1          2
Country      Belgium       India     Brazil
Capital     Brussels   New Delhi   Brasilia
Population  11190846  1303171035  207847528
===========================================
# Transpose the DataFrame to get one column per Series
df_new = df_new.T
-----------------
   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasilia   207847528
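A self-contained sketch of the two construction routes above. The `data` dict here is reconstructed from the printed output and is an assumption, since the notes never show its definition; `rows`, `by_rows`, and `restored` are illustrative names.

```python
import pandas as pd

# Assumed definition of data, inferred from the printed table.
data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
    'Population': [11190846, 1303171035, 207847528],
}
df = pd.DataFrame(data)

# iterrows() yields (index label, row as a Series) pairs.
rows = [(idx, row['Capital']) for idx, row in df.iterrows()]

# A list of Series builds the frame row by row ...
by_rows = pd.DataFrame([pd.Series(v) for v in data.values()],
                       index=list(data.keys()))
# ... and transposing turns each Series back into a column.
restored = by_rows.T

print(rows)
print(restored)
```

Note that the transposed frame holds mixed types per original row, so its columns come out with the generic object dtype rather than the integer dtype of `df['Population']`.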
-
- DataFrame IO
-
DataFrame and Clipboard (read data from the clipboard, write data to the clipboard)
# Write the data to the clipboard
df1.to_clipboard()
-
DataFrame and CSV:
index=False omits the index from the saved file
# Save the DataFrame as a CSV file without the index column on the left
df1.to_csv('df1.csv', index=False)
-
DataFrame and JSON
# to_json
df1.to_json()
-------------
# read_json
pd.read_json(df1.to_json())
-
DataFrame and HTML
# to_html
df1.to_html()
-
DataFrame and Excel
# to_excel
df1.to_excel('df1.xlsx')
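A small in-memory round trip for the CSV and JSON cases, as a sketch that avoids touching the file system. The frame contents are illustrative. Wrapping the text in `io.StringIO` is used here because recent pandas releases warn when a literal JSON string is passed to `read_json`.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India'],
                   'Population': [11190846, 1303171035]})

# With no path argument, to_csv returns the CSV text.
# index=False leaves the row labels out of the output.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# to_json returns a string; StringIO turns it into a file-like input.
json_text = df.to_json()
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(from_json)
```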
-
- DataFrame Selecting and Indexing
-
shape
# Read a CSV file into a DataFrame
imdb = pd.read_csv('J:/csv/movie_metadata.csv')
imdb.shape
----------
(5043, 28)
-
head, tail
head() and tail() return the first and the last 5 records by default
-
iloc: row and column selection by integer position, independent of the labels
# Rows at positions 10 through 19 (the end of an iloc slice is excluded), all columns
sub_df.iloc[10:20,:]
--------------------
     director_name                                  movie_title  imdb_score
10     Zack Snyder           Batman v Superman: Dawn of Justice         6.9
11    Bryan Singer                             Superman Returns         6.1
12    Marc Forster                            Quantum of Solace         6.7
13  Gore Verbinski   Pirates of the Caribbean: Dead Man's Chest         7.3
14  Gore Verbinski                              The Lone Ranger         6.5
15     Zack Snyder                                 Man of Steel         7.2
16  Andrew Adamson     The Chronicles of Narnia: Prince Caspian         6.6
17     Joss Whedon                                 The Avengers         8.1
18    Rob Marshall  Pirates of the Caribbean: On Stranger Tides         6.7
19           Barry                               Men in Black 3         6.8
-
loc: row and column selection by label, independent of the position
# Select by label; a loc slice includes both endpoints
sub_df.loc[15:17,'movie_title']
-------------------------------
15                                Man of Steel
16    The Chronicles of Narnia: Prince Caspian
17                                The Avengers
Name: movie_title, dtype: object
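The difference between position-based and label-based slicing is easiest to see when the labels do not start at 0. This is a minimal sketch with made-up data; `by_position` and `by_label` are illustrative names.

```python
import pandas as pd

# Labels run 10..15, positions run 0..5.
df = pd.DataFrame({'title': list('abcdef')}, index=[10, 11, 12, 13, 14, 15])

# iloc slices by position and excludes the end: positions 1 and 2.
by_position = df.iloc[1:3]

# loc slices by label and includes the end: labels 11, 12 and 13.
by_label = df.loc[11:13]

print(by_position)
print(by_label)
```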
-
- Reindexing Series and DataFrame
-
Series Reindex:
fill_value supplies the value for labels that have no data
s1 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
--------------------------------------------------------
A    1
B    2
C    3
D    4
dtype: int64
============
s1.reindex(index=['A', 'B', 'C', 'D', 'E'], fill_value=10)
----------------------------------------------------------
A     1
B     2
C     3
D     4
E    10
dtype: int64
============
s2 = pd.Series(['A', 'B', 'C'], index=[1, 5, 10])
-------------------------------------------------
1     A
5     B
10    C
dtype: object
=============
# method='ffill' carries the last earlier value forward: label 0 has nothing before it and stays NaN;
# 1-4 take the value at 1, 5-9 the value at 5, 10-14 the value at 10
s2.reindex(index=range(15), method='ffill')
-------------------------------------------
0     NaN
1       A
2       A
3       A
4       A
5       B
6       B
7       B
8       B
9       B
10      C
11      C
12      C
13      C
14      C
dtype: object
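A short sketch contrasting the fill options of `reindex`. The `bfill` variant is not in the notes and is added only for contrast; all variable names are illustrative.

```python
import pandas as pd

nums = pd.Series([1, 2], index=['A', 'B'])
# fill_value supplies the value for the new label 'C'.
filled = nums.reindex(['A', 'B', 'C'], fill_value=0)

s = pd.Series(['A', 'B', 'C'], index=[1, 5, 10])
# ffill: each new label takes the value of the nearest earlier label.
forward = s.reindex(range(12), method='ffill')
# bfill: each new label takes the value of the nearest later label.
backward = s.reindex(range(12), method='bfill')

print(filled)
print(forward)
print(backward)
```

With `ffill`, label 0 stays NaN because nothing precedes it; with `bfill`, label 11 stays NaN because nothing follows it.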
-
DataFrame Reindex
# Reindex the rows (index) and the columns of a DataFrame at the same time
df1.reindex(index=['A', 'B', 'C', 'D'], columns=['c1', 'c2', 'c3', 'c4'])
-------------------------------------------------------------------------
         c1        c2        c3  c4
A  0.282241  0.535411  0.257932 NaN
B  0.105177  0.011686  0.285663 NaN
C  0.084748  0.407965  0.484152 NaN
D       NaN       NaN       NaN NaN
-
Using reindex and drop to select a subset
-
Series
s1.reindex(index=['A', 'B'])
----------------------------
A    1
B    2
dtype: int64
-
DataFrame
df1.reindex(index=['A', 'B'])
-----------------------------
         c1        c2        c3
A  0.282241  0.535411  0.257932
B  0.105177  0.011686  0.285663
-
Drop
s1.drop('A')
------------
B    2
C    3
D    4
dtype: int64
============
# Drop a row
df1.drop('A', axis=0)
-
-
- NaN - Not a Number
-
Create a NaN with numpy
# Create a NaN with numpy
n = np.nan
type(n)
-------
float
-
Any arithmetic between a number and NaN yields NaN
# Any arithmetic between a number and NaN yields NaN
m = 1
m + n
-----
nan
-
NaN in Series
-
isnull / notnull test each element for NaN; the result is a Series of bool
s1.isnull()
-
dropna() removes the entries (rows) that contain NaN
s1.dropna()
-
-
NaN in DataFrame
-
isnull / notnull test each element for NaN; the result is a DataFrame of bool
dframe.isnull()
-
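dropna()
- axis
-
axis=0 examines rows; a row that meets the how/thresh condition is dropped
# how='all': drop only the rows in which every value is NaN
df1 = dframe.dropna(axis=0, how='all')
-
axis=1 examines columns; a column that meets the how/thresh condition is dropped
# how='all': drop only the columns in which every value is NaN
df2 = dframe.dropna(axis=1, how='all')
-
- how
- any: the default; drop the row or column if it contains any NaN
- all: drop the row or column only if all of its values are NaN
- thresh: the minimum number of non-NaN values a row or column needs in order to be kept
-
thresh=2 keeps rows that have at least 2 non-NaN values; rows with fewer are dropped
dframe2 = pd.DataFrame([[1, 2, 3], [np.nan, 5, 6], [7, np.nan, 9], [np.nan, np.nan, np.nan]])
---------------------------------------------------------------------------------------------
     0    1    2
0  1.0  2.0  3.0
1  NaN  5.0  6.0
2  7.0  NaN  9.0
3  NaN  NaN  NaN
================
# thresh=2: row 3 has no non-NaN values, so it is dropped; rows 1 and 2 each have two and are kept
df2 = dframe2.dropna(thresh=2)
------------------------------
     0    1    2
0  1.0  2.0  3.0
1  NaN  5.0  6.0
2  7.0  NaN  9.0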
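The three `dropna` conditions can be compared side by side on one frame. This sketch uses made-up data in which row 2 has only a single non-NaN value, so that `thresh=2` and `how='all'` give different results.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 2, 3],
                      [np.nan, 5, 6],
                      [7, np.nan, np.nan],
                      [np.nan, np.nan, np.nan]])

# how='any' (the default): drop every row that contains at least one NaN.
kept_any = frame.dropna()

# how='all': drop only rows that are entirely NaN.
kept_all = frame.dropna(how='all')

# thresh=2: keep rows that have at least 2 non-NaN values.
kept_thresh = frame.dropna(thresh=2)

print(kept_any.index.tolist())
print(kept_all.index.tolist())
print(kept_thresh.index.tolist())
```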
-
-
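fillna() fills NaN entries. It returns a new DataFrame and leaves the original unchanged
-
value: the value used to replace NaN
# With a dict, each key is a column label and its value fills the NaN entries of that column
df2.fillna(value={0:0, 1:-1, 2:-2})
-----------------------------------
     0    1    2
0  1.0  2.0  3.0
1  0.0  5.0  6.0
2  7.0 -1.0  9.0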
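A minimal sketch of per-column filling that also checks the original frame is left untouched. The column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})

# Each dict key names a column; its value replaces the NaN entries there.
filled = frame.fillna(value={'x': 0, 'y': -1})

print(filled)
print(frame.isnull().sum().sum())  # NaN count in the original frame
```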
-
-
-
- Multi-level index
- Series
-
Multi-level Series
s1 = pd.Series(np.random.randn(6), index=[['1', '1', '1', '2', '2', '2'], ['a', 'b', 'c', 'a', 'b', 'c']])
----------------------------------------------------------------------------------------------------------
1  a    0.227699
   b   -0.137033
   c   -0.233315
2  a    0.201417
   b    0.683764
   c    0.693293
dtype: float64
==============
s1['1']
-------
a    0.227699
b   -0.137033
c   -0.233315
dtype: float64
==============
s1['1']['a']
------------
0.22769876479819515
===================
s1[:, 'a']
----------
1    0.227699
2    0.201417
dtype: float64
-
Converting between a multi-level Series and a DataFrame:
unstack()
# Multi-level Series to DataFrame
df1 = s1.unstack()
------------------
          a         b         c
1  0.227699 -0.137033 -0.233315
2  0.201417  0.683764  0.693293
===============================
# DataFrame to multi-level Series; unstack() puts the column labels in the outer level
s1 = df1.unstack()
# Transpose first to get the original level order back
s2 = df1.T.unstack()
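The level ordering after each conversion can be checked with fixed values in place of random ones. This is a sketch; `wide`, `col_major`, and `row_major` are illustrative names.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=[['1', '1', '1', '2', '2', '2'],
                     ['a', 'b', 'c', 'a', 'b', 'c']])

# The inner index level becomes the columns.
wide = s.unstack()

# Unstacking the frame puts the column labels in the outer level: (a,1), (a,2), (b,1), ...
col_major = wide.unstack()

# Transposing first restores the original order: (1,a), (1,b), (1,c), (2,a), ...
row_major = wide.T.unstack()

print(wide)
print(col_major)
print(row_major)
```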
-
- DataFrame
-
Multi-level DataFrame (multi-level index and multi-level columns)
# Multi-level DataFrame
df = pd.DataFrame(np.arange(16).reshape([4, 4]), index=[['a','a','b','b'], [1,2,1,2]], columns=[['BJ','BJ','SH','GZ'], ['A','B','C','D']])
-------------------------------------------------------------------------------------------------------------------------------------------
     BJ      SH  GZ
      A   B   C   D
a 1   0   1   2   3
  2   4   5   6   7
b 1   8   9  10  11
  2  12  13  14  15
===================
df['BJ']
--------
      A   B
a 1   0   1
  2   4   5
b 1   8   9
  2  12  13
===========
df['BJ']['A']
-------------
a  1     0
   2     4
b  1     8
   2    12
Name: A, dtype: int32
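As an alternative to chaining `df['BJ']['A']`, a tuple key addresses both levels in a single lookup. The sketch below rebuilds the same frame; `col` and `cell` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['BJ', 'BJ', 'SH', 'GZ'], ['A', 'B', 'C', 'D']])

# One tuple per axis level: the column under BJ / A.
col = df[('BJ', 'A')]

# Row (a, 2) and column (SH, C) give a single cell.
cell = df.loc[('a', 2), ('SH', 'C')]

print(col)
print(cell)
```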
-
- Mapping and Replace
-
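DataFrame Mapping
# Create a DataFrame
df1 = pd.DataFrame({"City": ["Beijing", "Shanghai", "Guangzhou"], "Population": [1000, 2000, 1500]})
----------------------------------------------------------------------------------------------------
        City  Population
0    Beijing        1000
1   Shanghai        2000
2  Guangzhou        1500
========================
# Adding a GDP column from a Series relies on the default index 0, 1, 2; if the DataFrame index
# differs, the Series must be given the same index or the values will not line up
# df1['GDP'] = pd.Series([1000, 2000, 1500])
# Add the column with map instead: each city is looked up in the dict
gdp_map = {
    "Beijing": 1000,
    "Shanghai": 2000,
    "Guangzhou": 1500
}
df1['GDP'] = df1['City'].map(gdp_map)
-------------------------------------
        City  Population   GDP
0    Beijing        1000  1000
1   Shanghai        2000  2000
2  Guangzhou        1500  1500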
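The reason `map` is safer than assigning a default-indexed Series shows up once the frame has a non-default index. This is a sketch with made-up data; the `x`/`y`/`z` index and the `misaligned` column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Beijing', 'Shanghai', 'Guangzhou'],
                   'population': [1000, 2000, 1500]},
                  index=['x', 'y', 'z'])

# Guangzhou is deliberately left out of the lookup table.
gdp_map = {'Beijing': 1000, 'Shanghai': 2000}

# map looks up each value of the column; keys not found become NaN.
df['gdp'] = df['city'].map(gdp_map)

# A Series with the default 0, 1, 2 index shares no labels with x, y, z,
# so every value in the new column ends up NaN.
df['misaligned'] = pd.Series([1, 2, 3])

print(df)
```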
-
Series Replace
# replace in a Series
s1 = pd.Series(np.arange(10))
-----------------------------
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
============
# Replace a single value
s1.replace(1, np.nan)
---------------------
0    0.0
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64
==============
# Replace using a dict
s1.replace({2:-2})
------------------
0    0
1    1
2   -2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
============
# Replace several values
s1.replace([7,8,9], [-7,-8,-9])
-------------------------------
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7   -7
8   -8
9   -9
dtype: int64
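A compact sketch of the three call forms of `replace`, which also checks that each call returns a new Series rather than modifying the original. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(10))

# Single value: 1 becomes NaN, which forces a float dtype.
single = s.replace(1, np.nan)

# Dict form: each key is replaced by its value.
mapped = s.replace({2: -2})

# Two lists: values are paired up by position (7 -> -7, 8 -> -8, 9 -> -9).
many = s.replace([7, 8, 9], [-7, -8, -9])

print(single.dtype)
print(mapped.tolist())
print(many.tolist())
```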
-