1. Handling duplicate values
- DataFrame.duplicated: flags, row by row, whether each row is a duplicate
DataFrame.duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')
- DataFrame.drop_duplicates: removes duplicate rows
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] ='first', inplace: bool = False, ignore_index: bool = False)
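To make the `keep` parameter in the two signatures above concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and not taken from the motorcycle dataset). It shows how 'first', 'last', and False change which rows get flagged.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({"make": ["BMW", "Honda", "BMW"],
                    "price": [3872.0, 5000.0, 3872.0]})

first_flags = toy.duplicated(keep="first").tolist()  # later copies are flagged
last_flags = toy.duplicated(keep="last").tolist()    # earlier copies are flagged
all_flags = toy.duplicated(keep=False).tolist()      # every copy is flagged

print(first_flags)  # [False, False, True]
print(last_flags)   # [True, False, False]
print(all_flags)    # [True, False, True]

# drop_duplicates keeps exactly the rows that duplicated() leaves unflagged.
kept_index = toy.drop_duplicates(keep="last").index.tolist()
print(kept_index)   # [1, 2]
```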
What follows is my complete record of working through the video, only lightly tidied, kept for later reference.
import pandas as pd
import numpy as np
import os
Change to the directory that contains the data file
os.chdir(r'C:\代码和数据')
# Without the r prefix, every single backslash \ in the path must be written as a double backslash \\
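As a small illustration of the note about the r prefix, the sketch below compares the two spellings of a path. The path itself is hypothetical and is never opened; it only serves to show that both literals produce the same string.

```python
# Hypothetical Windows-style path, used only to compare the two spellings.
raw_path = r'C:\data\new'        # raw string: backslashes are kept literally
escaped_path = 'C:\\data\\new'   # normal string: each backslash is doubled

print(raw_path == escaped_path)  # True
print(len(raw_path))             # 11
```

Without either form, a sequence such as `\n` inside the path would be read as a newline character rather than as a backslash followed by `n`.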
Read the file
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
# Treat cells containing 'Na' as missing values; note the parameter is na_values, not na_value
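The effect of `na_values` is easier to see on a tiny in-memory CSV. The sketch below uses made-up text (`csv_text` is hypothetical) and reads it twice, once without and once with `na_values='Na'`.

```python
import io
import pandas as pd

# Hypothetical two-column CSV; 'Na' stands in for a missing value.
csv_text = "Make,Mileage\nBMW,Na\nHonda,1350\n"

plain = pd.read_csv(io.StringIO(csv_text))
with_na = pd.read_csv(io.StringIO(csv_text), na_values='Na')

# Without na_values, 'Na' is kept as text and the whole column stays non-numeric.
print(plain['Mileage'].tolist())            # ['Na', '1350']

# With na_values, 'Na' becomes NaN and the column is parsed as numbers.
print(with_na['Mileage'].isna().tolist())   # [True, False]
```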
View the first three rows
df.head(3)
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Used | mint!!! very low miles | $11,412 | McHenry, Illinois, United States | 2013.0 | 16,000 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | $17,200 | Fort Recovery, Ohio, United States | 2016.0 | 60 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | $3,872 | Chicago, Illinois, United States | 1970.0 | 25,763 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 rows × 22 columns
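Define a helper function that strips the non-numeric characters from Price and Mileage and converts what remains to float
def f(x):
    if '$' in str(x):  # Price: remove the '$' and the ','
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:  # Mileage: remove the ','
        x = str(x).replace(',', '')
    return float(x)
Apply the helper f to the Price and Mileage columns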
df['Price']=df['Price'].apply(f)
df['Mileage']=df['Mileage'].apply(f)
Check the values after cleaning
df[['Price','Mileage']].head(3)
| | Price | Mileage |
---|---|---|
0 | 11412.0 | 16000.0 |
1 | 17200.0 | 60.0 |
2 | 3872.0 | 25763.0 |
# Check the column dtypes after cleaning
df[['Price','Mileage']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 2 columns):
Price 7493 non-null float64
Mileage 7467 non-null float64
dtypes: float64(2)
memory usage: 117.2 KB
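The cleaning step above goes through `apply` one value at a time. As an alternative sketch (not the approach used in the video), the same result can be reached with vectorized string methods and `pd.to_numeric`. The sample values in `raw` are hypothetical, chosen to resemble the Price and Mileage columns.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the Price and Mileage columns.
raw = pd.Series(['$11,412', '60', '25,763', None])

# Remove '$' and ',' as literal characters, then convert to numbers.
# errors='coerce' turns anything unparseable into NaN instead of raising.
stripped = raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
cleaned = pd.to_numeric(stripped, errors='coerce')

print(cleaned.tolist())  # [11412.0, 60.0, 25763.0, nan]
```

One difference worth keeping in mind: `f` raises a ValueError on a value it cannot convert, while `errors='coerce'` silently produces NaN, so the vectorized form can hide unexpected values.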
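df.duplicated() returns a boolean Series with one entry per row: True when the row repeats an earlier row, False otherwise. Rows are always what gets compared (the method takes no axis argument), and with the default keep='first' the first occurrence is marked False.
any(df.duplicated())  # True as soon as df contains at least one duplicate row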
True
df[df.duplicated()].head(3)  # show the rows of df that are duplicates
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
57 | Used | NaN | 4050.0 | Gilberts, Illinois, United States | 2006.0 | 6650.0 | Black | Harley-Davidson | Vehicle does NOT have an existing warranty | Softail | ... | NaN | FALSE | NaN | 7< | 58 | Private Seller | Clear | True | TRUE | 3.0 |
63 | Used | NaN | 7300.0 | Rolling Meadows, Illinois, United States | 1997.0 | 20000.0 | Black | Harley-Davidson | Vehicle does NOT have an existing warranty | Sportster | ... | NaN | TRUE | 100 | 5< | 111 | Private Seller | Clear | False | TRUE | NaN |
64 | Used | Dent and scratch free. Paint and chrome in exc... | 5000.0 | South Bend, Indiana, United States | 2003.0 | 1350.0 | Black | Harley-Davidson | Vehicle does NOT have an existing warranty | Sportster | ... | NaN | FALSE | 100 | 14 | 37 | Private Seller | Clear | False | TRUE | NaN |
3 rows × 22 columns
np.sum(df.duplicated())  # count the duplicate rows
1221
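The check and the count above can also be written with the Series methods `.any()` and `.sum()`, and the count can be cross-checked against the number of rows that `drop_duplicates()` removes. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0, and row 3 repeats row 2.
toy = pd.DataFrame({"model": ["Touring", "Touring", "Softail", "Softail", "Sportster"],
                    "price": [11412.0, 11412.0, 4050.0, 4050.0, 5000.0]})

has_dupes = bool(toy.duplicated().any())   # same question as any(df.duplicated())
n_dupes = int(toy.duplicated().sum())      # same count as np.sum(df.duplicated())

# Every flagged row is one that drop_duplicates() would remove.
n_removed = len(toy) - len(toy.drop_duplicates())

print(has_dupes, n_dupes, n_removed)  # True 2 2
```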
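drop_duplicates() removes duplicate rows
df.drop_duplicates().head(3)  # drop the duplicate rows and return a new DataFrame (a copy, not a view)
# The original data is modified only when inplace=True
# Note that the method is drop_duplicates, not drop_duplicated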
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 rows × 22 columns
Check the number of rows and columns (unchanged, because the call above did not use inplace=True)
df.shape
(7493, 22)
List the column names
df.columns
Index(['Condition', 'Condition_Desc', 'Price', 'Location', 'Model_Year',
'Mileage', 'Exterior_Color', 'Make', 'Warranty', 'Model', 'Sub_Model',
'Type', 'Vehicle_Title', 'OBO', 'Feedback_Perc', 'Watch_Count',
'N_Reviews', 'Seller_Status', 'Vehicle_Tile', 'Auction', 'Buy_Now',
'Bid_Count'],
dtype='object')
Drop the rows that are duplicates with respect to the columns 'Condition', 'Condition_Desc', 'Price', and 'Location'
df.drop_duplicates(subset=['Condition', 'Condition_Desc', 'Price', 'Location'],inplace=True)
Check the shape again: the number of rows has dropped noticeably. Without inplace=True, df would have been left unchanged.
df.shape
(5356, 22)
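The interaction between `subset`, `inplace`, and `ignore_index` can be sketched on a small frame. The `toy` data below is hypothetical; it is built so that two rows agree on the subset columns but differ elsewhere.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 agree on 'condition' and 'price' only.
toy = pd.DataFrame({"condition": ["Used", "Used", "New"],
                    "price": [4050.0, 4050.0, 7300.0],
                    "location": ["Chicago", "Gilberts", "Chicago"]})

# No row is a full duplicate, so nothing is dropped without subset.
full = toy.drop_duplicates()

# With subset, only the listed columns are compared; the result is a new frame.
by_subset = toy.drop_duplicates(subset=["condition", "price"], ignore_index=True)

print(full.shape, by_subset.shape, toy.shape)  # (3, 3) (2, 3) (3, 3)
print(by_subset.index.tolist())                # [0, 1]

# inplace=True modifies toy itself and returns None.
result = toy.drop_duplicates(subset=["condition", "price"], inplace=True)
print(result, toy.shape)                       # None (2, 3)
print(toy.index.tolist())                      # [0, 2]
```

Because `inplace=True` returns None, assigning its result back to the frame would lose the data; reassigning the return value of a call without `inplace` is the usual alternative.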
A quick test: take the first two rows and use a boolean list to keep only the first one
df.head(2)[[True,False]]
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 rows × 22 columns
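The test above works for the same reason that `df[df.duplicated()]` does: a sequence of booleans, one per row, acts as a row filter. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0.
toy = pd.DataFrame({"make": ["BMW", "Honda", "BMW"],
                    "price": [3872.0, 5000.0, 3872.0]})

# A plain list of booleans, one per row, keeps the rows marked True.
first_only = toy.head(2)[[True, False]]
print(first_only.index.tolist())  # [0]

# duplicated() produces the same kind of mask as a boolean Series.
mask = toy.duplicated()
print(toy[mask].index.tolist())   # [2]
print(toy[~mask].index.tolist())  # [0, 1]
```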