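2.3 Concatenating Along an Axis
Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function does this for NumPy arrays.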
import numpy as np
import pandas as pd
arr=np.arange(12).reshape((3,4))
np.concatenate([arr,arr],axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
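To make the role of the axis argument concrete, here is a minimal sketch that concatenates the same array along both axes and compares the shapes. The variable names stacked_rows and stacked_cols are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 appends the second array below the first (more rows).
stacked_rows = np.concatenate([arr, arr], axis=0)
# axis=1 places the second array to the right of the first (more columns).
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```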
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1,s2,s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd.concat([s1,s2,s3],axis=1) # axis=1 produces a DataFrame.
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
pd.concat([s1,s2,s3],axis=1,join='inner') # intersection of the indexes; the result is empty because the three Series share no labels.
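The inner join is easier to see when the inputs do share labels. The sketch below assumes an extra Series s4 (not in the notes above, built by concatenating s1 and s3) so that the intersection with s1 is non-empty.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# s4 is an illustrative helper: its index is a, b, f, g.
s4 = pd.concat([s1, s3])

# Only the labels present in both s1 and s4 survive: a and b.
inner = pd.concat([s1, s4], axis=1, join='inner')
print(inner)

# With no shared labels at all, the inner join has zero rows.
empty = pd.concat([s1, s3], axis=1, join='inner')
print(empty.shape)
```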
result=pd.concat([s1,s1,s3],keys=['one','two','three']) # creates a hierarchical index along the concatenation axis
result
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64
result.unstack() # pivots the inner index level into columns, so the hierarchical index is gone.
a b f g
one 0.0 1.0 NaN NaN
two 0.0 1.0 NaN NaN
three NaN NaN 5.0 6.0
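As a small sketch of what the keys buy you, the hierarchical index lets each original piece be pulled back out by its key. Nothing here goes beyond the objects defined above.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

# The concatenated Series carries a two-level index: (key, original label).
print(result.index.nlevels)

# Selecting an outer key returns the piece that was passed under that key.
piece = result.loc['three']
print(piece)

# A full (key, label) tuple addresses a single value.
value = result.loc[('one', 'b')]
```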
pd.concat([s1,s1,s3],keys=['one','two','three'],axis=1) # with axis=1, the keys become the DataFrame's column headers.
   one  two  three
a  0.0  0.0    NaN
b  1.0  1.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
# The same logic extends to DataFrame objects.
df1=pd.DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df2=pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
df1
one two
a 0 1
b 2 3
c 4 5
df2
three four # 5+np.arange(4) >>> array([5, 6, 7, 8]): the scalar is added to every element, which is a neat trick.
a 5 6
c 7 8
pd.concat([df1,df2],axis=1,keys=['level1','level2'])
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat({'level1':df1,'level2':df2},axis=1,names=['upper','lower'])
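# When a dict of objects is passed instead of a list, the dict's keys are used for the keys option. Passing names additionally labels the levels of the resulting axis.
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0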
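A short sketch of how the named column levels can be used afterwards: the outer key selects one of the original frames back out of the combined result. The variable names combined and level2 are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

combined = pd.concat({'level1': df1, 'level2': df2},
                     axis=1, names=['upper', 'lower'])

# The names passed to concat label the two column levels.
print(list(combined.columns.names))

# Indexing by an outer key returns that frame's columns,
# reindexed to the combined row index (so row 'b' is all NaN).
level2 = combined['level2']
print(level2)
```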
df1=pd.DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df2=pd.DataFrame(np.random.randn(2,3),columns=['b','d','a'])
pd.concat([df1,df2],ignore_index=True) # discards the original row index and renumbers; by default the original index is kept.
a b c d
0 0.930013 -0.097318 -1.574589 -0.172548
1 0.478817 -0.453965 -0.510146 0.921774
2 -0.076038 -0.279683 1.101760 0.675676
3 -1.227738 0.439467 NaN -0.059227
4 -0.856187 -0.468771 NaN 0.900543
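To see what ignore_index changes, the sketch below runs the same concatenation with and without it. It swaps the random data for constant values (an assumption made only so the result is reproducible).

```python
import numpy as np
import pandas as pd

# Constant values instead of random ones, purely for reproducibility.
df1 = pd.DataFrame(np.zeros((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((2, 3)), columns=['b', 'd', 'a'])

kept = pd.concat([df1, df2])                      # default: row labels are kept
fresh = pd.concat([df1, df2], ignore_index=True)  # row labels are renumbered

print(list(kept.index))   # labels 0 and 1 appear twice
print(list(fresh.index))  # a clean 0..4 range

# Column 'c' is missing from df2, so its two rows are NaN in both results.
print(fresh['c'].isna().sum())
```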
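The join_axes parameter no longer exists in current pandas (it was deprecated in 0.25 and removed in 1.0), so the book is presumably not a fake, just older than my install.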
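With join_axes unavailable, one way to get the same effect is to concatenate first and then reindex to the labels you want. This is a sketch under that assumption; s4 and the wanted list are illustrative names, not something from the notes above.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])  # illustrative helper with index a, b, f, g

# The row labels that join_axes would have specified.
wanted = ['a', 'c', 'b', 'e']

# Outer-join concat, then keep exactly the wanted labels in that order.
# Labels absent from the data ('c' and 'e') become all-NaN rows.
result = pd.concat([s1, s4], axis=1).reindex(wanted)
print(result)
```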
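2.4 Combining Data with Overlap
There is another data combination scenario that is neither a merge nor a concatenation: two datasets whose indexes overlap in full or in part, where values from one are used to fill the gaps in the other.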
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.,np.nan,2.,np.nan,9.,5.],
index=['f', 'e', 'd', 'c', 'b', 'a'])
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 2. , 3.5, 4.5, 5. ])
b.combine_first(a)
f 0.0
e 2.5
d 2.0
c 3.5
b 9.0 # wherever b is not null, its own value is kept.
a 5.0
dtype: float64
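One way to read combine_first is "keep the caller's values, patch its holes from the argument". The sketch below checks that reading against two other spellings of the same idea, where and fillna. The comparison sorts by label first so that it does not depend on the order of the result index.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0., np.nan, 2., np.nan, 9., 5.],
              index=['f', 'e', 'd', 'c', 'b', 'a'])

patched = b.combine_first(a)

# Same idea spelled two other ways: keep b where it is present, else take a.
via_where = b.where(b.notna(), a)
via_fillna = b.fillna(a)

print(patched.sort_index().equals(via_where.sort_index()))
print(patched.sort_index().equals(via_fillna.sort_index()))
```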
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
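Two details of the DataFrame result are easy to miss, so here is a small sketch that checks them: the row index is the union of both frames, and a cell stays NaN only when both frames lack a value there.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

result = df1.combine_first(df2)

# Row 4 exists only in df2, so the result has five rows.
print(result.shape)

# 'b' at row 0 is NaN in both frames, so nothing can fill it.
print(pd.isna(result.loc[0, 'b']))

# 'c' exists only in df1, so the extra row 4 has no value for it,
# and the column is float rather than integer.
print(result['c'].dtype)
```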