Pandas数据操作详解-总结

pandas简介

pandas 是基于NumPy 的一种工具，该工具是为解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。pandas 是 Python 的核心数据分析支持库，提供了快速、灵活、明确的数据结构，旨在简单、直观地处理关系型、标记型数据。

1.数据读取

首先，pip install pandas 安装Pandas库。

引用pandas库，通常简称为pd，如下：

import pandas as pd

1.1获取样本数据-以波士顿房价数据为例

从sklearn.datasets数据集中下载波士顿房价数据：

from sklearn.datasets import load_boston
boston = load_boston()
# 输出对boston数据集的描述
print("波士顿房价的数据集描述是\n", boston.DESCR)

运行结果：

波士顿房价的数据集描述是
 .. _boston_dataset:

Boston house prices dataset
---------------------------
**Data Set Characteristics:**  

    :Number of Instances: 506 
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

波士顿房价数据集的特征共有14种，分别是CRIM（城镇人均犯罪率）、ZN（占地面积超过25000平方英尺的住宅用地比例）、INDUS（非零售商业用地占比）、CHAS（是否临河）、NOX（氮氧化物浓度）、RM（房屋房间数）、AGE（房屋年龄）、DIS（和就业中心的距离）、RAD（是否容易上高速路）、TAX（税率）、PTRATTO（学生人数比老师人数）、B（城镇黑人比例计算的统计值）、LSTAT（低收入人群比例）和MEDV（房价中位数）。原文链接：https://blog.csdn.net/f18896984569/article/details/127759937。

这个数据下载到哪里了呢？我们可以通过打印boston获取位置信息（print(boston)),这里列出部分信息：位置在：D:\\pythonProject\\venv\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv

 per $10,000\n        - PTRATIO  pupil-teacher ratio by town\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town\n        - LSTAT    % lower status of the population\n        - MEDV     Median value of owner-occupied homes in $1000's\n\n    :Missing Attribute Values: None\n\n    :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems.   \n     \n.. topic:: References\n\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n", 'filename': 'D:\\pythonProject\\venv\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv'}

Process finished with exit code 0

我们打开路径可以看到：

显示时间不是当前时间，说明之前已经下载过。

打开数据如下，显示前面11行：

CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0.00632	18	2.31	0	0.538	6.575	65.2	4.09	1	296	15.3	396.9	4.98	24
0.02731	0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	396.9	9.14	21.6
0.02729	0	7.07	0	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
0.03237	0	2.18	0	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
0.06905	0	2.18	0	0.458	7.147	54.2	6.0622	3	222	18.7	396.9	5.33	36.2
0.02985	0	2.18	0	0.458	6.43	58.7	6.0622	3	222	18.7	394.12	5.21	28.7
0.08829	12.5	7.87	0	0.524	6.012	66.6	5.5605	5	311	15.2	395.6	12.43	22.9
0.14455	12.5	7.87	0	0.524	6.172	96.1	5.9505	5	311	15.2	396.9	19.15	27.1
0.21124	12.5	7.87	0	0.524	5.631	100	6.0821	5	311	15.2	386.63	29.93	16.5
0.17004	12.5	7.87	0	0.524	6.004	85.9	6.5921	5	311	15.2	386.71	17.1	18.9
0.22489	12.5	7.87	0	0.524	6.377	94.3	6.3467	5	311	15.2	392.52	20.45	15

第一行显示数据有506行记录，13个变量，最后一列为房价中位数。我们将第一行删除掉便于数据操作。把文件复制到当前路径下与操作，另存为一份Excel格式。

excel文件读取

def read_excel(io: {engine, parse},
               sheet_name: int = 0,
               header: int = 0,
               names: Any = None,
               index_col: Any = None,
               usecols: Any = None,
               squeeze: bool = False,
               dtype: Any = None,
               engine: {__ne__} = None,
               converters: Any = None,
               true_values: Any = None,
               false_values: Any = None,
               skiprows: Any = None,
               nrows: Any = None,
               na_values: Any = None,
               keep_default_na: bool = True,
               na_filter: bool = True,
               verbose: bool = False,
               parse_dates: bool = False,
               date_parser: Any = None,
               thousands: Any = None,
               comment: Any = None,
               skipfooter: int = 0,
               convert_float: bool = True,
               mangle_dupe_cols: bool = True,
               storage_options: Optional[Dict[str, Any]] = None)

示例：读取excel文件数据，默认读取所有数据：

df=pd.read_excel('boston_house_prices.xls')
print(df)

csv文件读取

read_csv函数中参数更多：

def read_csv(filepath_or_buffer: PathLike[str],
             sep: Any = lib.no_default,
             delimiter: Any = None,
             header: str = "infer",
             names: Any = None,
             index_col: Any = None,
             usecols: Any = None,
             squeeze: bool = False,
             prefix: Any = None,
             mangle_dupe_cols: bool = True,
             dtype: Any = None,
             engine: Any = None,
             converters: Any = None,
             true_values: Any = None,
             false_values: Any = None,
             skipinitialspace: bool = False,
             skiprows: Any = None,
             skipfooter: int = 0,
             nrows: Any = None,
             na_values: Any = None,
             keep_default_na: bool = True,
             na_filter: bool = True,
             verbose: bool = False,
             skip_blank_lines: bool = True,
             parse_dates: bool = False,
             infer_datetime_format: bool = False,
             keep_date_col: bool = False,
             date_parser: Any = None,
             dayfirst: bool = False,
             cache_dates: bool = True,
             iterator: bool = False,
             chunksize: Any = None,
             compression: str = "infer",
             thousands: Any = None,
             decimal: str = ".",
             lineterminator: Any = None,
             quotechar: str = '\"',
             quoting: int = csv.QUOTE_MINIMAL,
             doublequote: bool = True,
             escapechar: Any = None,
             comment: Any = None,
             encoding: Any = None,
             dialect: Any = None,
             error_bad_lines: bool = True,
             warn_bad_lines: bool = True,
             delim_whitespace: bool = False,
             low_memory: Optional[bool] = _c_parser_defaults["low_memory"],
             memory_map: bool = False,
             float_precision: Any = None,
             storage_options: Optional[Dict[str, Any]] = None)

示例：读取csv数据，默认读取前5行：

df = pd.read_csv(
    # 该参数为数据在电脑中的路径，可以不填写
    filepath_or_buffer='boston_house_prices.csv',
    # 该参数代表数据的分隔符，csv文件默认是逗号。其他常见的是'\t'
    sep=',',
    # 该参数代表跳过数据文件的的第1行不读入
    # skiprows=1,
    # nrows，只读取前n行数据，若不指定，读入全部的数据
    nrows=5,
)

2.数据保存

excel文件保存,需要import xlwt

df.to_excel('boston_part.xls')

csv文件保存

df.to_csv('boston_part.csv')

3.数据指定位置读取与切片

可通过iloc方法来实现

newdf=df.iloc[:,:] ，索引从0开始

示例：读取指定位置数据，比如第5行第5列数据

df = pd.read_csv('boston_house_prices.csv')
df=df.iloc[4,4]

读取5行5列数据：

df = pd.read_csv('boston_house_prices.csv')
df=df.iloc[:5,:5]
print(df)

结果如下：

      CRIM    ZN  INDUS  CHAS    NOX
0  0.00632  18.0   2.31     0  0.538
1  0.02731   0.0   7.07     0  0.469
2  0.02729   0.0   7.07     0  0.469
3  0.03237   0.0   2.18     0  0.458
4  0.06905   0.0   2.18     0  0.458

读取指定位置5行数据所有列：

df = pd.read_csv('boston_house_prices.csv')
df=df.iloc[10:15,:]
print(df)

运行结果：

       CRIM    ZN  INDUS  CHAS    NOX  ...  TAX  PTRATIO       B  LSTAT  MEDV
10  0.22489  12.5   7.87     0  0.524  ...  311     15.2  392.52  20.45  15.0
11  0.11747  12.5   7.87     0  0.524  ...  311     15.2  396.90  13.27  18.9
12  0.09378  12.5   7.87     0  0.524  ...  311     15.2  390.50  15.71  21.7
13  0.62976   0.0   8.14     0  0.538  ...  307     21.0  396.90   8.26  20.4
14  0.63796   0.0   8.14     0  0.538  ...  307     21.0  380.02  10.26  18.2

同样的，读取指定列所有行也是一样的。

4.数据合并连接

pd.concat([df1,df2],axis=1) 横向合并数据

df = pd.read_csv('boston_house_prices.csv')
df1=df.iloc[:,:13]
df2=df.iloc[:,13]
print(df1,df2)
df3=pd.concat([df1,df2],axis=1)
print(df3)

纵向合并数据：

df = pd.read_csv('boston_house_prices.csv')
df1=df.iloc[:5,:]
df2=df.iloc[5:10,:]
print(df1,df2)
df3=pd.concat([df1,df2],axis=0)
print(df3)

5.根据条件读取数据

只选择中位数房价大于30的数据。df['MEDV']>30

df = pd.read_csv('boston_house_prices.csv')
df=df[df['MEDV']>30]
print(df)

6.根据条件删除数据

删除房价大于30的数据：

indexname=df[df['MEDV']>30].index
df.drop(index,Inplace=True)

7.统计函数

df = pd.read_csv('boston_house_prices.csv')
print(df['MEDV'].mean())  # 求一整列的均值，返回一个数。会自动排除空值。
print(df[['MEDV', 'LSTAT']].mean())  # 求两列的均值，返回两个数，Series
print(df[['MEDV', 'LSTAT']])
print(df[['MEDV', 'LSTAT']].mean(axis=1))  # 求两列的均值，返回DataFrame。axis=0或者1要搞清楚。
#axis=1，代表对整几列进行操作。axis=0（默认）代表对几行进行操作。实际中弄混很正常，到时候试一下就知道了。
print(df['MEDV'].max())  # 最大值
print(df['MEDV'].min())  # 最小值
print(df['MEDV'].std())  # 标准差
print(df['MEDV'].count())  # 非空的数据的数量
print(df['MEDV'].median())  # 中位数
print(df['MEDV'].quantile(0.25))  # 25%分位数

8.数据排序

8.1 按索引排序

函数：sort_index（）是 pandas 中按索引排序的函数，默认情况下， sort_index 是按行索引升序排序。

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],#设置该属性为索引列
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_index()
print('sort_index:')
print(df1)

运行结果：

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_index:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622

默认索引就是从小到达排序的.我们反序排列：

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_index(ascending=False)
print('sort_index:')
print(df1)

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_index:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900

8.2按数值排序

sort_values() 中设置单个列的列名称，可以对单个列进行排序，通过设置参数 ascending 可以设置升序或降序排列,默认升序排序。

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_values('NOX')
print('sort_values:')
print(df1)

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_values:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900

9.修改数值

根据条件修改某个指定字段的数值。

df1.loc[df1["NOX"]>0.50,"DIS"] = 0  #NOX大于0.50的数据0修改为0