String Data Processing
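- pandas provides string functions, but they work only on string-valued (object dtype) columns
- These functions are reached through the str accessor of a Series
- Through that accessor, the usual string methods can be applied element-wise to clean the data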
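| Function | Description |
| --- | --- |
| contains() | Returns a boolean for each string indicating whether it contains the given pattern |
| replace() | Replaces occurrences of a pattern with another string |
| lower() | Returns a copy of each string with all letters converted to lowercase |
| upper() | Returns a copy of each string with all letters converted to uppercase |
| split() | Splits each string on a delimiter (whitespace by default) and returns the list of pieces |
| strip() | Removes leading and trailing whitespace |
| join() | Joins the strings in each element's list into a single string, using the given separator |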
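The methods in the table can be tried on a small Series before touching the real file. The following is a minimal sketch on made-up values (the Series `s` is hypothetical and not read from MotorcycleData.csv); it only shows what each call returns per element.

```python
import pandas as pd

# Hypothetical toy data, not taken from MotorcycleData.csv
s = pd.Series(["  Harley-Davidson ", "BMW", None])

stripped = s.str.strip()                                # drop leading/trailing whitespace
has_bmw = stripped.str.contains("BMW")                  # boolean per element, missing stays missing
replaced = stripped.str.replace("-", " ", regex=False)  # literal (non-regex) replacement
lowered = stripped.str.lower()                          # all letters lowercase
uppered = stripped.str.upper()                          # all letters uppercase
parts = stripped.str.split("-")                         # list of pieces per element
joined = parts.str.join("_")                            # join each list back into one string

print(pd.DataFrame({"stripped": stripped, "has_bmw": has_bmw, "joined": joined}))
```

Missing values pass through every step unchanged, so no explicit null check is needed in this sketch.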
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据转换'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk')
df.head(5)
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Used | mint!!! very low miles | $11,412 | McHenry, Illinois, United States | 2013.0 | 16,000 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | $17,200 | Fort Recovery, Ohio, United States | 2016.0 | 60 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | $3,872 | Chicago, Illinois, United States | 1970.0 | 25,763 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | $6,575 | Green Bay, Wisconsin, United States | 2009.0 | 33,142 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | $10,000 | West Bend, Wisconsin, United States | 2012.0 | 17,800 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition 7493 non-null object
Condition_Desc 1656 non-null object
Price 7493 non-null object
Location 7491 non-null object
Model_Year 7489 non-null float64
Mileage 7468 non-null object
Exterior_Color 6778 non-null object
Make 7489 non-null object
Warranty 5109 non-null object
Model 7370 non-null object
Sub_Model 2426 non-null object
Type 6011 non-null object
Vehicle_Title 268 non-null object
OBO 7427 non-null object
Feedback_Perc 6611 non-null object
Watch_Count 3517 non-null object
N_Reviews 7487 non-null object
Seller_Status 6868 non-null object
Vehicle_Tile 7439 non-null object
Auction 7476 non-null object
Buy_Now 7256 non-null object
Bid_Count 2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB
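The info output shows that Model_Year and Bid_Count are float64 while the remaining columns are object. As a hedged illustration of the rule that the str accessor works only on string values, the sketch below uses a hypothetical float Series standing in for a column like Model_Year. It assumes no missing values; the real column has some, and the integer cast would fail on them.

```python
import pandas as pd

# Hypothetical values mimicking a float column such as Model_Year (no NaN here)
years = pd.Series([2013.0, 2016.0, 1970.0])

try:
    years.str[:2]
except AttributeError as exc:
    # pandas refuses the .str accessor on non-string data
    error_message = str(exc)
    print("AttributeError:", error_message)

# Casting to str first makes the accessor available
century = years.astype(int).astype(str).str[:2]
print(century.tolist())
```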
df['Price'].str[1:3].head(5)
0 11
1 17
2 3,
3 6,
4 10
Name: Price, dtype: object
df['价格'] = df['Price'].str.strip('$')
df['价格'].head(5)
0 11,412
1 17,200
2 3,872
3 6,575
4 10,000
Name: 价格, dtype: object
df['价格'] = df['价格'].str.replace(',', '')
df['价格'].head(5)
0 11412
1 17200
2 3872
3 6575
4 10000
Name: 价格, dtype: object
df['价格'] = df['价格'].astype(float)
df['价格'].head(5)
0 11412.0
1 17200.0
2 3872.0
3 6575.0
4 10000.0
Name: 价格, dtype: float64
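The three steps above (strip the dollar sign, remove the thousands separator, convert to float) can also be chained. The sketch below is one possible way to wrap them; `clean_price` is a hypothetical helper name, the sample values are partly made up, and using `pd.to_numeric` with `errors="coerce"` is a swapped-in alternative to `astype(float)` that turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

def clean_price(prices: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$', drop ',' and convert to float."""
    cleaned = (
        prices.str.strip("$")                      # remove the currency symbol at either end
              .str.replace(",", "", regex=False)   # remove thousands separators literally
    )
    # errors="coerce" maps strings that are not numbers to NaN
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

# First three values match the head() output above; the last two are made up
sample = pd.Series(["$11,412", "$17,200", "$3,872", None, "call"])
result = clean_price(sample)
print(result)
```

Whether silent coercion is desirable depends on the data; with `astype(float)` a malformed price raises immediately, which can be the safer choice while exploring.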
df.dtypes
Condition object
Condition_Desc object
Price object
Location object
Model_Year float64
Mileage object
Exterior_Color object
Make object
Warranty object
Model object
Sub_Model object
Type object
Vehicle_Title object
OBO object
Feedback_Perc object
Watch_Count object
N_Reviews object
Seller_Status object
Vehicle_Tile object
Auction object
Buy_Now object
Bid_Count float64
价格 float64
dtype: object
df['Location'].str.split(',').str[0].head(5)
0 McHenry
1 Fort Recovery
2 Chicago
3 Green Bay
4 West Bend
Name: Location, dtype: object
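Taking `.str[0]` after the split keeps only the city. If the other pieces are wanted as well, `split` accepts `expand=True` and returns a DataFrame. The sketch below assumes every location has exactly three comma-separated parts, which may not hold for the full file; the column names are hypothetical.

```python
import pandas as pd

# Two values copied from the head() output above, plus one missing entry
locations = pd.Series([
    "McHenry, Illinois, United States",
    "Fort Recovery, Ohio, United States",
    None,
])

# expand=True returns a DataFrame with one column per piece
parts = locations.str.split(",", expand=True)
parts.columns = ["city", "state", "country"]      # hypothetical column names
parts = parts.apply(lambda col: col.str.strip())  # drop the space left after each comma

print(parts)
```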
df['Location'].str.len().head(5)
0 32.0
1 34.0
2 32.0
3 35.0
4 35.0
Name: Location, dtype: float64
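The lengths come back as float64 rather than integers because Location contains missing values (7491 non-null out of 7493), and NaN forces a float result. A small sketch of that behaviour, with an optional cast to the nullable Int64 dtype; the two-element Series is a made-up stand-in for the column.

```python
import pandas as pd

# One real-looking value and one missing entry (hypothetical stand-in for df['Location'])
locations = pd.Series(["Chicago, Illinois, United States", None])

lengths = locations.str.len()        # float result, because NaN marks the missing row
nullable = lengths.astype("Int64")   # nullable integer dtype keeps the gap as <NA>

print(lengths.dtype, nullable.dtype)
```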