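Handling Duplicate Values

Data cleaning usually starts with duplicate values and missing values.
Duplicates are usually handled by deleting them.
Some duplicates must not be deleted, however, for example rows in order-detail or transaction-detail data.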
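
As a rough illustration of why detail-level duplicates can be legitimate, the sketch below uses a made-up order table. The column names `order_id`, `product`, and `qty` are hypothetical and are not part of the dataset used in this post. Two identical rows may simply mean the same item was bought twice, so it is safer to flag them first and compare totals before deleting anything.

```python
import pandas as pd

# Hypothetical order-detail data: rows 0 and 1 are identical but may both be valid.
orders = pd.DataFrame({
    'order_id': [1001, 1001, 1002],
    'product': ['helmet', 'helmet', 'gloves'],
    'qty': [1, 1, 2],
})

# Flag duplicates without deleting anything.
dup_mask = orders.duplicated()
n_dups = int(dup_mask.sum())

# Compare the total quantity with and without the "duplicate" row.
qty_all = int(orders['qty'].sum())
qty_dedup = int(orders.drop_duplicates()['qty'].sum())

print(n_dups, qty_all, qty_dedup)
```

If the two totals differ, dropping duplicates changes the business figures, which is usually a sign that the rows should be kept.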

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
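
The `na_values='Na'` argument matters because the string `Na` is not one of the markers pandas treats as missing by default. A minimal sketch with an in-memory CSV (the two-column content is invented for illustration) shows the difference:

```python
import io
import pandas as pd

# Invented two-row CSV; 'Na' stands in for a missing mileage.
csv_text = "Make,Mileage\nBMW,Na\nHonda,1200\n"

# Without na_values, 'Na' stays a plain string and the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values='Na', the cell becomes NaN and the column is parsed as float.
parsed = pd.read_csv(io.StringIO(csv_text), na_values='Na')

print(raw['Mileage'].tolist())
print(parsed['Mileage'].tolist())
```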

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	$11,412	McHenry, Illinois, United States	2013.0	16,000	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	$17,200	Fort Recovery, Ohio, United States	2016.0	60	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	$3,872	Chicago, Illinois, United States	1970.0	25,763	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	$6,575	Green Bay, Wisconsin, United States	2009.0	33,142	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	$10,000	West Bend, Wisconsin, United States	2012.0	17,800	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

def f(x):
    # Strip a leading/trailing '$' and remove thousands separators, then convert to float.
    # Missing values pass through: str(NaN) is 'nan', and float('nan') is NaN.
    return float(str(x).strip('$').replace(',', ''))

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)
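
The same cleanup can also be written without `apply`, using pandas string methods. The sketch below is an alternative to the function `f`, not the approach used in this post, and it runs on a small stand-in Series rather than the real `Price` column. Note one difference: `str.replace` removes every `$`, whereas `strip('$')` only removes it at the ends of the string.

```python
import pandas as pd

# Toy stand-in for the Price column; values mimic the format shown above.
prices = pd.Series(['$11,412', '$17,200', '3,872', None])

# Remove '$' and ',' as literal characters (regex=False), then convert to numbers.
cleaned = (
    prices.str.replace('$', '', regex=False)
          .str.replace(',', '', regex=False)
)

# errors='coerce' turns anything that still cannot be parsed into NaN.
numeric = pd.to_numeric(cleaned, errors='coerce')

print(numeric.tolist())
```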

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	6575.0	Green Bay, Wisconsin, United States	2009.0	33142.0	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	10000.0	West Bend, Wisconsin, United States	2012.0	17800.0	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null float64
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7467 non-null float64
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5108 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
dtypes: float64(4), object(18)
memory usage: 1.3+ MB

any(df.duplicated())

True

# Show the duplicated rows
# df[df.duplicated()]
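
By default `duplicated()` uses `keep='first'`, so the first occurrence of each group is not marked. To inspect every copy side by side, `keep=False` can be passed instead. A small sketch on invented data:

```python
import pandas as pd

# Invented data: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    'Make': ['BMW', 'BMW', 'Honda', 'BMW'],
    'Price': [3872.0, 3872.0, 5000.0, 3872.0],
})

# Default keep='first': only the later copies are marked as duplicates.
later_copies = toy[toy.duplicated()]

# keep=False: every member of a duplicated group is marked.
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```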

# Count the duplicated rows
np.sum(df.duplicated())

# Drop the duplicated rows
df.drop_duplicates(inplace=True)
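
`drop_duplicates` keeps the original index labels, which leaves gaps after rows are removed. One possible variant, shown here on invented data, returns a new DataFrame instead of modifying in place, resets the index, and records how many rows were dropped:

```python
import pandas as pd

# Invented data: rows 0 and 1 are identical; rows 2 and 3 differ in column 'b'.
toy = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['x', 'x', 'y', 'z']})

rows_before = len(toy)

# keep='first' is the default; reset_index(drop=True) renumbers rows from 0.
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)

rows_removed = rows_before - len(deduped)
print(rows_removed, deduped.index.tolist())
```

Keeping the original `toy` untouched makes it easy to compare before and after, at the cost of holding two copies in memory.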

df.columns

Index(['Condition', 'Condition_Desc', 'Price', 'Location', 'Model_Year',
       'Mileage', 'Exterior_Color', 'Make', 'Warranty', 'Model', 'Sub_Model',
       'Type', 'Vehicle_Title', 'OBO', 'Feedback_Perc', 'Watch_Count',
       'N_Reviews', 'Seller_Status', 'Vehicle_Tile', 'Auction', 'Buy_Now',
       'Bid_Count'],
      dtype='object')

# Identify duplicates using only the specified columns
df.drop_duplicates(subset=['Condition', 'Condition_Desc', 'Price', 'Location'], inplace=True)
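
Deduplicating on a subset of columns can remove rows that differ in the columns left out, so it may help to count the affected rows first and to choose deliberately which copy to keep. The sketch below uses invented data with a hypothetical two-column key:

```python
import pandas as pd

# Invented data: rows 0 and 1 share Condition and Price but differ in Mileage.
toy = pd.DataFrame({
    'Condition': ['Used', 'Used', 'New'],
    'Price': [3872.0, 3872.0, 9000.0],
    'Mileage': [25763.0, 30000.0, 10.0],
})

key = ['Condition', 'Price']  # hypothetical key columns for this example

# No full-row duplicates, but one duplicate when judged by the key only.
n_full = int(toy.duplicated().sum())
n_key = int(toy.duplicated(subset=key).sum())

# keep controls which copy survives within each key group.
kept_first = toy.drop_duplicates(subset=key, keep='first')
kept_last = toy.drop_duplicates(subset=key, keep='last')

print(n_full, n_key)
print(kept_first['Mileage'].tolist(), kept_last['Mileage'].tolist())
```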

# The duplicates have already been removed
np.sum(df.duplicated())

若尘
