Using Pandas (Part 5)

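Contents

5-5 Indexes and hierarchical indexing

Viewing the index
Reindexing
Setting the index
Returning the unique values of the index
Hierarchical indexing
Selecting and slicing with a hierarchical index
Swapping index levels

5-6 Time series

Time series: introduction
Time series basics
Generating a time series index
Indexing and selecting time series data
Time series with duplicate index values
Shifting dates

5-7 Resampling

Introduction to resampling
Exercise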

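5-5 Indexes and hierarchical indexing

Viewing the index

df.index
- Returns the index
- Note: a single index entry cannot be assigned on its own; the index can only be replaced as a whole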

In [6]: import pandas as pd

In [7]: import numpy as np

In [8]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [9]: df
Out[9]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [10]: # view the index

In [11]: df.index
Out[11]: Index(['a', 'b', 'c'], dtype='object')

In [12]: # a single index entry cannot be assigned in place

In [13]: df.index[0] = 'e'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-57fd5743f906> in <module>
----> 1 df.index[0] = 'e'

d:\python3.6.5\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4258
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [14]: # the index can only be changed by assigning a complete new index

In [16]: df.index = list('nms')

In [17]: df
Out[17]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11
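
When only one label needs to change, replacing the whole index by hand is awkward. A minimal sketch of two common alternatives, assuming the same 3x4 frame as above: `rename` with a mapping, or editing a plain list of labels and assigning it back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))

# rename returns a new DataFrame in which only the mapped label is changed
renamed = df.rename(index={'a': 'e'})
print(renamed.index.tolist())

# Alternative: copy the labels to a list, edit the list, assign the whole index back
labels = df.index.tolist()
labels[0] = 'e'
df.index = labels
print(df.index.tolist())
```

Both approaches leave the Index object itself immutable; they build a new index rather than modifying the old one.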

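Reindexing

df.reindex()
- Labels in the new index that have no matching row are filled with NaN by default
- Passing fewer labels than the original index behaves like selecting a subset of rows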

In [22]: df
Out[22]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11

In [23]: # reindex df

In [24]: df.reindex(list('nma'))
Out[24]:
     q    w    e    r
n  0.0  1.0  2.0  3.0
m  4.0  5.0  6.0  7.0
a  NaN  NaN  NaN  NaN

In [25]: # labels in the new index that have no matching row show up as NaN

In [26]: # when the new index lists fewer labels, the result is a subset of the rows

In [27]: df.reindex(list('ns'))
Out[27]:
   q  w   e   r
n  0  1   2   3
s  8  9  10  11
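
The NaN fill also explains why the integer columns were displayed as floats in the first `reindex` call. As a small sketch, assuming the same frame with index `n`, `m`, `s`, the `fill_value` parameter of `reindex` supplies a different placeholder for missing labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('nms'), columns=list('qwer'))

# Default behaviour: the unknown label 'a' becomes a row of NaN
default = df.reindex(list('nma'))

# With fill_value, the unknown label gets the given placeholder instead
filled = df.reindex(list('nma'), fill_value=0)

print(default)
print(filled)
```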

Setting the index

df.set_index()
- Converts a DataFrame column into the row index

In [29]: df
Out[29]:
   q  w   e   r
n  0  1   2   3
m  4  5   6   7
s  8  9  10  11

In [30]: # set_index turns a DataFrame column into the row index

In [31]: df.set_index('q')
Out[31]:
   w   e   r
q
0  1   2   3
4  5   6   7
8  9  10  11

In [32]: # set_index has a drop parameter

In [33]: # drop : defaults to True; with drop=False the column used as the index is also kept as a regular column

In [34]: df.set_index('q', drop=False)
Out[34]:
   q  w   e   r
q
0  0  1   2   3
4  4  5   6   7
8  8  9  10  11

Returning the unique values of the index

df.set_index('q').index.unique()
- df.set_index('q').index : the resulting Index
- unique : drops duplicate index values

In [48]: df
Out[48]:
   q  w   e   r
n  0  1   2   3
m  8  5   6   7
s  8  9  10  11

In [49]: # unique returns the distinct index values, which helps check whether a field is unique

In [50]: df.set_index('q').index.unique()
Out[50]: Int64Index([0, 8], dtype='int64', name='q')

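Hierarchical indexing

Hierarchical indexing is an important pandas feature: it lets you have multiple (two or more) index levels on a single axis.

In [52]: # repeated values in an outer level are displayed as blanks; pass a list of columns to build a multi-level index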
In [53]: df.set_index(['q','w'])
Out[53]:
      e   r
q w
0 1   2   3
8 5   6   7
  9  10  11
    

In [55]: df1 = pd.DataFrame({'a': range(7),'b':range(7,0,-1),'c':['one','one','one','two','two',
    ...: 'two','two'],'d':list('hjklmno')})

In [56]: df1
Out[56]:
   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o

In [57]: df2 = df1.set_index(['c','d'])

In [58]: df2
Out[58]:
       a  b
c   d
one h  0  7
    j  1  6
    k  2  5
two l  3  4
    m  4  3
    n  5  2
    o  6  1

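Selecting and slicing with a hierarchical index

Both label-based and position-based selection work on a hierarchical index:

- loc
- iloc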
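
The original gives no example for this part, so here is a minimal sketch using the `df2` frame built above (index levels `c` and `d`). It shows selection by outer label, by a full `(outer, inner)` tuple, by label slice, and by position.

```python
import pandas as pd

df1 = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                    'c': ['one'] * 3 + ['two'] * 4, 'd': list('hjklmno')})
df2 = df1.set_index(['c', 'd'])

# Outer label only: every row under 'one', indexed by the inner level d
outer = df2.loc['one']

# Full (outer, inner) tuple: a single row, returned as a Series
row = df2.loc[('two', 'm')]

# Tuple plus column label: a single value
cell = df2.loc[('two', 'm'), 'a']

# Label slice between two tuples; sorting first keeps the slice well defined
sliced = df2.sort_index().loc[('one', 'j'):('two', 'l')]

# iloc ignores the labels and selects purely by position
first_two = df2.iloc[:2]

print(outer, row, cell, sliced, first_two, sep='\n\n')
```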

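Swapping index levels

The levels being swapped are the inner and outer levels of the index.

df.swaplevel(i=level1, j=level2)
- Swaps the inner and outer index levels created by set_index
- level1 and level2 identify the levels to swap

In [21]: # create a 2-D DataFrame

In [22]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [23]: # set a multi-level index

In [24]: df
Out[24]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11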
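
The session above does not show the `swaplevel` call itself. The following is an assumed reconstruction, not the author's exact steps: it sets a two-level index on `q` and `w` and then swaps the levels, which yields a frame indexed by `w` (outer) and `q` (inner).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))

# Build a two-level index: q is the outer level, w the inner level
df1 = df.set_index(['q', 'w'])

# Swap the two levels; levels can be given by name or by position (0, 1)
swapped = df1.swaplevel(i='q', j='w')

print(list(df1.index.names))
print(list(swapped.index.names))
print(swapped)
```

`swaplevel` only reorders the levels; it does not sort the rows, so a `sort_index()` call may still be needed afterwards.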

A hierarchical index can also be sorted

sort_index(ascending=True)
- ascending : defaults to True (ascending order); set it to False for descending order

In [32]: df1
Out[32]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [33]: df1.sort_index()
Out[33]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [33]: # view the signature and docstring
In [34]: df1.sort_index??
Signature:
df1.sort_index(
    axis=0,
    level=None,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',
    sort_remaining=True,
    by=None,
)
Docstring:
Sort object by labels (along an axis).

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    The axis along which to sort.  The value 0 identifies the rows,
    and 1 identifies the columns.
level : int or level name or list of ints or list of level names
    If not None, sort on values in specified index level(s).
ascending : bool, default True
    Sort ascending vs. descending.
inplace : bool, default False
    If True, perform operation in-place.
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    Choice of sorting algorithm. See also ndarray.np.sort for more
    information.  `mergesort` is the only stable algorithm. For
    DataFrames, this option is only applied when sorting on a single
    column or label.
na_position : {'first', 'last'}, default 'last'
    Puts NaNs at the beginning if `first`; `last` puts NaNs at the end.
    Not implemented for MultiIndex.
sort_remaining : bool, default True
    If True and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level.

Returns
-------
sorted_obj : DataFrame or None

In [35]: df1.sort_index(ascending=False)
Out[35]:
      e   r
w q
9 8  10  11
5 4   6   7
1 0   2   3

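In [36]: # the data is already in ascending order, so sort_index() alone shows no visible change

In [37]: # so we sort in descending order instead

In [38]: # sort_index()

In [39]: # has a parameter named ascending

In [40]: # it defaults to True (ascending); setting it to False sorts in descending order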

Aggregation functions

Aggregations such as mean and sum can be applied to the whole frame or per index level.

In [53]: df1
Out[53]:
      e   r
w q
1 0   2   3
5 4   6   7
9 8  10  11

In [54]: df1.sum()
Out[54]:
e    18
r    21
dtype: int64

# level=1 selects the inner index level, so the aggregation is computed per inner-level value
In [55]: df1.sum(level=1)
Out[55]:
    e   r
q
0   2   3
4   6   7
8  10  11
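
Note that the `level` argument of `DataFrame.sum` was deprecated in pandas 1.3 and removed in pandas 2.0. A sketch of the equivalent that works on both old and new versions, assuming the same `df1` indexed by `w` and `q`, is to group by the index level first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))
df1 = df.set_index(['w', 'q'])

# Group by the inner level (position 1), then aggregate each group
by_position = df1.groupby(level=1).sum()

# The level can also be referred to by name
by_name = df1.groupby(level='q').sum()

print(by_position)
```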

Moving index levels back into columns

reset_index()

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('qwer'))

In [4]: df
Out[4]:
   q  w   e   r
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [5]: # set a multi-level index

In [6]: df1 = df.set_index(['q','w','r'])

In [7]: df1
Out[7]:
         e
q w r
0 1 3    2
4 5 7    6
8 9 11  10

In [8]: # reset_index turns the index levels back into regular columns

In [9]: df1 = df1.reset_index()

In [10]: df1
Out[10]:
   q  w   r   e
0  0  1   3   2
1  4  5   7   6
2  8  9  11  10
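
`reset_index` does not have to move every level. As a sketch on the same three-level frame, the `level` parameter restores only the named level, and `drop=True` discards the index instead of turning it into columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('qwer'))
df1 = df.set_index(['q', 'w', 'r'])

# Move only level r back into the columns; q and w stay in the index
partial = df1.reset_index(level='r')

# Throw the whole index away and fall back to the default 0..n-1 index
dropped = df1.reset_index(drop=True)

print(partial)
print(dropped)
```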

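5-6 Time series

Time series: introduction

Time series data is an important form of structured data in many fields, such as finance, ecology, and physics. Data observed at multiple points in time forms a time series. A time series can have a fixed frequency or be irregular.

Creating a time series index from plain datetime objects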

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from datetime import datetime

In [4]: dates = [datetime(2020,5,18),datetime(2020,5,19),datetime(2020,5,20)]

In [5]: Sr = pd.Series(np.random.randint(20,40, size=3), index=dates)

In [6]: Sr
Out[6]:
2020-05-18    34
2020-05-19    33
2020-05-20    33
dtype: int32

In [7]: Sr.index
Out[7]: DatetimeIndex(['2020-05-18', '2020-05-19', '2020-05-20'], dtype='datetime64[ns]', freq=None)

In [8]: # take a subset of the data for a calculation

In [9]: Sr[::2]
Out[9]:
2020-05-18    34
2020-05-20    33
dtype: int32

In [10]: Sr1 = Sr[::2]

In [11]: # arithmetic aligns on the index automatically; a label missing from either operand yields NaN
    
In [12]: Sr + Sr1
Out[12]:
2020-05-18    68.0
2020-05-19     NaN
2020-05-20    66.0
dtype: float64

In [13]: # the index dtype has nanosecond resolution

In [14]: Sr.index
Out[14]: DatetimeIndex(['2020-05-18', '2020-05-19', '2020-05-20'], dtype='datetime64[ns]', freq=None)

In [15]: Sr.index.dtype
Out[15]: dtype('<M8[ns]')
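
The NaN produced by index alignment can be avoided when a missing operand should count as zero. A small sketch, using fixed values in place of the random ones above so the result is reproducible: `Series.add` accepts a `fill_value` for labels present in only one operand.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 18), datetime(2020, 5, 19), datetime(2020, 5, 20)]
sr = pd.Series([34, 33, 33], index=dates)
sr1 = sr[::2]  # keeps 2020-05-18 and 2020-05-20 only

# Plain + aligns on the index; 2020-05-19 exists in sr only, so it becomes NaN
total = sr + sr1

# add() with fill_value treats the missing side as 0 instead
filled = sr.add(sr1, fill_value=0)

print(total)
print(filled)
```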

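Time series basics

About time series

The basic kind of time series in pandas is a Series indexed by timestamps, which outside pandas are usually represented as Python strings or datetime objects.

Notes

- datetime objects can be used as the index; pandas turns them into a DatetimeIndex
- The <M8[ns] dtype is a timestamp with nanosecond resolution
- Each element of the DatetimeIndex is a Timestamp object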
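
A quick way to check these notes is to inspect one element of the index. The sketch below uses fixed values and only relies on `pd.Timestamp` being a subclass of the standard `datetime`:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 18), datetime(2020, 5, 19), datetime(2020, 5, 20)]
sr = pd.Series([34, 33, 33], index=dates)

first = sr.index[0]

# The plain datetime objects were converted to a DatetimeIndex of Timestamps
print(type(sr.index).__name__)
print(type(first).__name__)

# A Timestamp still behaves like a datetime
print(isinstance(first, datetime))
print(first.year, first.month, first.day)
```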

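Generating a time series index

pd.date_range(start=None,end=None,periods=None,freq=None,tz=None,normalize=False,name=None,closed=None)
- start : start time
- end : end time
- periods : number of timestamps to generate
- freq : date offset (frequency)
  - h : hour
  - min : minute
  - s : second
  - D : day
  - W : week (anchored on Sunday by default)
  - M : month (month end)
  - Y : year (year end)
- normalize : normalize the start/end timestamps to midnight (00:00:00)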

In [40]: dt = pd.date_range(start='20200101', end='20200520',freq='1h')

In [41]: dt
Out[41]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
               '2020-01-01 02:00:00', '2020-01-01 03:00:00',
               '2020-01-01 04:00:00', '2020-01-01 05:00:00',
               '2020-01-01 06:00:00', '2020-01-01 07:00:00',
               '2020-01-01 08:00:00', '2020-01-01 09:00:00',
               ...
               '2020-05-19 15:00:00', '2020-05-19 16:00:00',
               '2020-05-19 17:00:00', '2020-05-19 18:00:00',
               '2020-05-19 19:00:00', '2020-05-19 20:00:00',
               '2020-05-19 21:00:00', '2020-05-19 22:00:00',
               '2020-05-19 23:00:00', '2020-05-20 00:00:00'],
              dtype='datetime64[ns]', length=3361, freq='H')

In [42]: # with minutes in the frequency

In [43]: dt = pd.date_range(start='20200101', end='20200520',freq='1h30min')

In [44]: dt
Out[44]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:30:00',
               '2020-01-01 03:00:00', '2020-01-01 04:30:00',
               '2020-01-01 06:00:00', '2020-01-01 07:30:00',
               '2020-01-01 09:00:00', '2020-01-01 10:30:00',
               '2020-01-01 12:00:00', '2020-01-01 13:30:00',
               ...
               '2020-05-19 10:30:00', '2020-05-19 12:00:00',
               '2020-05-19 13:30:00', '2020-05-19 15:00:00',
               '2020-05-19 16:30:00', '2020-05-19 18:00:00',
               '2020-05-19 19:30:00', '2020-05-19 21:00:00',
               '2020-05-19 22:30:00', '2020-05-20 00:00:00'],
              dtype='datetime64[ns]', length=2241, freq='90T')

In [45]: # with seconds in the frequency

In [46]: dt = pd.date_range(start='20200101', end='20200520',freq='1h30min30s')

In [47]: dt
Out[47]:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:30:30',
               '2020-01-01 03:01:00', '2020-01-01 04:31:30',
               '2020-01-01 06:02:00', '2020-01-01 07:32:30',
               '2020-01-01 09:03:00', '2020-01-01 10:33:30',
               '2020-01-01 12:04:00', '2020-01-01 13:34:30',
               ...
               '2020-05-19 09:29:00', '2020-05-19 10:59:30',
               '2020-05-19 12:30:00', '2020-05-19 14:00:30',
               '2020-05-19 15:31:00', '2020-05-19 17:01:30',
               '2020-05-19 18:32:00', '2020-05-19 20:02:30',
               '2020-05-19 21:33:00', '2020-05-19 23:03:30'],
              dtype='datetime64[ns]', length=2228, freq='5430S')

In [48]: # daily frequency

In [49]: dt = pd.date_range(start='20200101', end='20200520',freq='1D')

In [50]: dt
Out[50]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2020-05-11', '2020-05-12', '2020-05-13', '2020-05-14',
               '2020-05-15', '2020-05-16', '2020-05-17', '2020-05-18',
               '2020-05-19', '2020-05-20'],
              dtype='datetime64[ns]', length=141, freq='D')

In [51]: # weekly frequency

In [52]: dt = pd.date_range(start='20200101', end='20200520',freq='1W')

In [53]: dt
Out[53]:
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
               '2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17'],
              dtype='datetime64[ns]', freq='W-SUN')

In [54]: # monthly frequency

In [55]: dt = pd.date_range(start='20200101', end='20200520',freq='1M')

In [56]: dt
Out[56]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30'], dtype='datetime64[ns]', freq='M')

# periods=5 generates 5 timestamps
# when end is omitted, periods determines how many timestamps are generated; when freq is omitted, the default 'D' (daily) is used
In [57]: dt = pd.date_range(start='20200101',periods=5)

In [58]: dt
Out[58]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05'],
              dtype='datetime64[ns]', freq='D')

In [21]: # periods fixes the number of timestamps

In [22]: # normalize=True normalizes the timestamps to midnight

In [23]: df = pd.date_range(start='2020-05-21', periods=5, normalize=True)

In [24]: df
Out[24]:
DatetimeIndex(['2020-05-21', '2020-05-22', '2020-05-23', '2020-05-24',
               '2020-05-25'],
              dtype='datetime64[ns]', freq='D')
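
The `normalize` example above starts at midnight already, so the parameter has no visible effect there. A sketch with a start value that carries a time of day makes the difference visible, and also shows that `end` plus `periods` works as well as `start` plus `periods`. The dates are illustrative.

```python
import pandas as pd

# end + periods counts backwards from the end date
by_end = pd.date_range(end='2020-05-20', periods=3)

# Without normalize, the time of day of start is kept
raw = pd.date_range(start='2020-05-21 13:45', periods=2)

# With normalize=True, every timestamp is moved to midnight
midnight = pd.date_range(start='2020-05-21 13:45', periods=2, normalize=True)

print(by_end)
print(raw)
print(midnight)
```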

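Indexing and selecting time series data

- Values are selected from a time series with []
- The year, month, and day in the date string can be separated by spaces
- They can also be joined with -
- loc and iloc are supported as well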

In [25]: ts = pd.Series(np.random.randint(20,50,size=100),index=pd.date_range(start='20200521',periods=100))

In [26]: ts
Out[26]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [27]: # periods is the number of timestamps; since end is not given, freq defaults to D, i.e. one day

In [28]: # time series indexing

In [29]: # select the data for 2020

In [30]: ts['2020']
Out[30]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [31]: # select the data for May 2020

In [33]: ts['2020 05']
Out[33]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32

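In [34]: # year, month, and day are separated by spaces here

In [35]: # select 2020-05-01 through 2020-05-10 (the result is empty because the series starts on 2020-05-21)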

In [36]: ts['2020 05 01' : '2020 05 10']
Out[36]: Series([], Freq: D, dtype: int32)

In [37]: ts
Out[37]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
              ..
2020-08-24    34
2020-08-25    25
2020-08-26    44
2020-08-27    23
2020-08-28    41
Freq: D, Length: 100, dtype: int32

In [38]: # select all of the May 2020 data (2020-05-21 through 2020-05-31)

In [39]: ts['2020 05 21':'2020 05 31']
Out[39]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32
        
In [40]: ts.loc['2020-05']
Out[40]:
2020-05-21    48
2020-05-22    49
2020-05-23    23
2020-05-24    26
2020-05-25    30
2020-05-26    29
2020-05-27    27
2020-05-28    32
2020-05-29    40
2020-05-30    38
2020-05-31    35
Freq: D, dtype: int32
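
The selections above can be reproduced with deterministic values, which makes it easier to see what each form returns. This sketch replaces the random data with `range(100)`, which is an assumption for illustration; note that a label slice on a DatetimeIndex includes both endpoints.

```python
import pandas as pd

ts = pd.Series(range(100), index=pd.date_range(start='2020-05-21', periods=100))

# Partial string: every row in May 2020
may = ts.loc['2020-05']

# Label slice: both endpoints are included
window = ts.loc['2020-06-01':'2020-06-03']

# A full date that matches one daily entry returns a single value
one_day = ts.loc['2020-05-21']

# iloc selects by position, independent of the dates
first_three = ts.iloc[:3]

print(len(may), window.tolist(), one_day, first_three.tolist())
```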

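Time series with duplicate index values

df.index.is_unique
- Checks whether the index contains duplicate values
- True means there are no duplicate index values
- False means there are duplicate index values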

In [51]: dates = [datetime(2020, 5, 21),datetime(2020, 5, 21),datetime(2020, 5, 22),datetime(2020, 5, 23)]

In [52]: dates
Out[52]:
[datetime.datetime(2020, 5, 21, 0, 0),
 datetime.datetime(2020, 5, 21, 0, 0),
 datetime.datetime(2020, 5, 22, 0, 0),
 datetime.datetime(2020, 5, 23, 0, 0)]

In [53]: st = pd.Series(np.random.randint(20,30,size=4),index=dates)

In [54]: st
Out[54]:
2020-05-21    23
2020-05-21    28
2020-05-22    29
2020-05-23    24
dtype: int32

In [55]: # check whether the index has duplicates

In [56]: # False means there are duplicates

In [57]: # True means there are no duplicates

In [59]: st.index.is_unique
Out[59]: False
    
In [61]: # selecting by a duplicated label does not raise an error; it returns all matching rows

In [62]: st.loc['2020-05-21']
Out[62]:
2020-05-21    23
2020-05-21    28
dtype: int32

Grouping on a duplicated index

In [70]: dates = [datetime(2020, 5, 21),datetime(2020, 5, 21),datetime(2020, 5, 22),datetime(2020, 5, 22)]

In [71]: st = pd.Series(np.random.randint(20,30,size=4),index=dates)

In [72]: st
Out[72]:
2020-05-21    29
2020-05-21    20
2020-05-22    29
2020-05-22    25
dtype: int32

In [73]: # group by the duplicated index values, then sum each group

In [74]: st = st.groupby(level=0).sum()

In [75]: st
Out[75]:
2020-05-21    49
2020-05-22    54
dtype: int32
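
Summing is only one way to collapse duplicates. A sketch with the same four values as the session above (29, 20, 29, 25), showing a per-date mean and, alternatively, keeping only the first row for each date via `Index.duplicated`:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 21), datetime(2020, 5, 21),
         datetime(2020, 5, 22), datetime(2020, 5, 22)]
st = pd.Series([29, 20, 29, 25], index=dates)

# Any aggregation can follow groupby(level=0), not just sum
means = st.groupby(level=0).mean()

# duplicated() marks repeated labels; negating the mask keeps the first occurrence
first_only = st[~st.index.duplicated(keep='first')]

print(means)
print(first_only)
```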

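Shifting dates

"Shifting" means moving data backward or forward through time. Both Series and DataFrame have a shift method for simple forward or backward shifts that leave the index unchanged.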

In [77]: import pandas as pd

In [78]: import numpy as np

In [79]: st = pd.Series(np.random.randint(20,30,size=100),index=pd.date_range(start='20200521',periods=100))

In [80]: st
Out[80]:
2020-05-21    27
2020-05-22    25
2020-05-23    21
2020-05-24    23
2020-05-25    23
              ..
2020-08-24    25
2020-08-25    21
2020-08-26    27
2020-08-27    21
2020-08-28    25
Freq: D, Length: 100, dtype: int32

In [81]: # shift the values forward by 2 periods; the first 2 positions have no data, so they are filled with NaN

In [82]: st.shift(2)
Out[82]:
2020-05-21     NaN
2020-05-22     NaN
2020-05-23    27.0
2020-05-24    25.0
2020-05-25    21.0
              ...
2020-08-24    21.0
2020-08-25    28.0
2020-08-26    25.0
2020-08-27    21.0
2020-08-28    27.0
Freq: D, Length: 100, dtype: float64

In [83]: # the values can also be shifted backward

In [84]: st.shift(-2)
Out[84]:
2020-05-21    21.0
2020-05-22    23.0
2020-05-23    23.0
2020-05-24    22.0
2020-05-25    22.0
              ...
2020-08-24    27.0
2020-08-25    21.0
2020-08-26    25.0
2020-08-27     NaN
2020-08-28     NaN
Freq: D, Length: 100, dtype: float64

Use case

Computing the growth rate
- (current day - previous day) / previous day
- current day / previous day - 1
- Series.pct_change()

In [85]: st.pct_change()
Out[85]:
2020-05-21         NaN
2020-05-22   -0.074074
2020-05-23   -0.160000
2020-05-24    0.095238
2020-05-25    0.000000
                ...
2020-08-24   -0.107143
2020-08-25   -0.160000
2020-08-26    0.285714
2020-08-27   -0.222222
2020-08-28    0.190476
Freq: D, Length: 100, dtype: float64
            
# the same result can be obtained with shift
In [86]: st/st.shift(1)-1
Out[86]:
2020-05-21         NaN
2020-05-22   -0.074074
2020-05-23   -0.160000
2020-05-24    0.095238
2020-05-25    0.000000
                ...
2020-08-24   -0.107143
2020-08-25   -0.160000
2020-08-26    0.285714
2020-08-27   -0.222222
2020-08-28    0.190476
Freq: D, Length: 100, dtype: float64
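
A short sketch with fixed values (the first four of the series above: 27, 25, 21, 23) confirms that the two growth-rate forms agree. It also shows the `freq` argument of `shift`, which moves the index labels instead of the values, so no NaN is introduced.

```python
import numpy as np
import pandas as pd

st = pd.Series([27, 25, 21, 23], index=pd.date_range(start='2020-05-21', periods=4))

# Growth rate via the built-in method and via an explicit shift
pct = st.pct_change()
manual = st / st.shift(1) - 1

# With freq, the dates move by 2 days and the values stay attached to them
moved = st.shift(2, freq='D')

print(pct)
print(manual)
print(moved)
```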

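5-7 Resampling

Introduction to resampling

Resampling is the process of converting a time series from one frequency to another. Converting higher-frequency data to a lower frequency is called downsampling; converting lower-frequency data to a higher frequency is called upsampling.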

In [87]: import pandas as pd

In [88]: import numpy as np

In [89]: df = pd.DataFrame(np.random.randint(20,30,size=10),index=pd.date_range(start='20200521',periods=10))

In [90]: df
Out[90]:
             0
2020-05-21  25
2020-05-22  25
2020-05-23  23
2020-05-24  23
2020-05-25  20
2020-05-26  22
2020-05-27  23
2020-05-28  28
2020-05-29  23
2020-05-30  22

In [91]: # resample takes the target frequency as its argument

In [92]: df.resample('d').mean()
Out[92]:
             0
2020-05-21  25
2020-05-22  25
2020-05-23  23
2020-05-24  23
2020-05-25  20
2020-05-26  22
2020-05-27  23
2020-05-28  28
2020-05-29  23
2020-05-30  22

# resample by week
In [93]: df.resample('w').mean()
Out[93]:
             0
2020-05-24  24
2020-05-31  23
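
The session above only downsamples. As a sketch covering both directions, using a Series with the same ten values: weekly means for downsampling, and a 12-hour grid for upsampling, where the new slots are empty until they are filled (here with `ffill`, one choice among several).

```python
import pandas as pd

values = [25, 25, 23, 23, 20, 22, 23, 28, 23, 22]
s = pd.Series(values, index=pd.date_range(start='2020-05-21', periods=10))

# Downsampling: daily -> weekly, each bin labelled by the Sunday that ends it
weekly = s.resample('W').mean()

# Upsampling: daily -> every 12 hours; the new midday slots have no data yet
sparse = s.iloc[:3].resample('12h').asfreq()

# Forward fill copies the last known value into the empty slots
filled = s.iloc[:3].resample('12h').ffill()

print(weekly)
print(sparse)
print(filled)
```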

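Exercise

Given air quality data for five cities (Beijing, Shanghai, Guangzhou, Shenzhen, and Shenyang), plot how Beijing's PM2.5 changes over time.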

# @Time : 2020/5/21 14:25 
# @Author : SmallJ 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Read the CSV file
df = pd.read_csv('PM2.5/BeijingPM20100101_20151231.csv')

# Show all rows and columns when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Look at the first row
df.head(1)

# PeriodIndex represents time spans; build hourly periods from the date columns
datetime = pd.PeriodIndex(year=df.year, month=df.month, day=df.day, hour=df.hour, freq="h")

# Add the periods as a new column
df['datetime'] = datetime

# Use the datetime column as the index, modifying df in place
df.set_index(df.datetime, inplace=True)

# The data is hourly; resample it to 7-day bins and average each bin.
# The PM2.5 column is selected first so that only this numeric column is averaged,
# and bins without any reading are dropped.
data = df['PM_US Post'].resample('7D').mean().dropna()

# Prepare the plot data

x = data.index
y = data.values

# Font setting so that the Chinese title renders
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)

# Set the figure size
plt.figure(figsize=(15, 8), dpi=80)

# Set the title ("PM2.5 in Beijing")
plt.title('北京的PM2.5天气情况')

# Draw the line chart
# x cannot be used directly here because its dtype is period[7D]
plt.plot(range(len(x)), y, color='blue')

# Set the x-axis ticks
# ticks=None, labels=None
# ticks : tick positions
# labels : tick labels
plt.xticks(ticks=range(0, len(x))[::10], labels=x[::10], rotation=45)

# Save the figure
plt.savefig('beijingpm.png')

# Show the figure
plt.show()
