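Programming environment: Spyder 3.5.
The file we are working with, mn.csv, has column headers of this form:
HH1 HH2 LN MWM1 MWM2 MWM4 MWM5 MWM6D MWM6M MWM6Y MWM7 MWM8 MWM9
We looked up what these abbreviated headers mean and saved the definitions in mn_headers.csv.
How can we match these headers to the survey data one-to-one so that the file is more readable? That is what we explore below.
Replacing the headers
The simplest way to make the headers more readable is to replace the short headers with the long ones.
Read both files into dictionaries with csv.DictReader, then collect the rows with list comprehensions.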
from csv import DictReader
data_r = DictReader(open('C:/Users/elenawang/Documents/data/mn.csv','r',encoding="utf-8"))
header_r = DictReader(open('C:/Users/elenawang/Documents/data/mn_headers.csv','r',encoding="utf-8"))
data_rows = [d for d in data_r]
header_rows = [h for h in header_r]
Replace each key in the data_rows dictionaries with the long header stored under 'Label' in header_rows:
new_rows = []
for data_dict in data_rows:
    new_row = {}
    for dkey, dval in data_dict.items():
        for header_dict in header_rows:
            if dkey in header_dict.values():
                # Found the matching header row: store the value under its long label
                new_row[header_dict.get('Label')] = dval
    new_rows.append(new_row)
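The nested loop above scans every header row for every cell. A minimal sketch of the same idea using a single Name-to-Label lookup dictionary follows. The in-memory CSV text, the labels other than 'Cluster number', and the helper name relabel_rows are assumptions made for illustration, not part of the original files. Unlike the loop above, this sketch keeps the short name when no label is found instead of dropping the column.

```python
import io
from csv import DictReader

# Hypothetical in-memory stand-ins for mn.csv and mn_headers.csv
data_csv = io.StringIO("HH1,HH2,LN\n1,17,1\n1,20,2\n")
headers_csv = io.StringIO(
    "Name,Label,Question\n"
    "HH1,Cluster number,\n"
    "HH2,Household number,\n"
    "LN,Line number,\n"
)

def relabel_rows(data_file, header_file):
    """Return the data rows as dicts keyed by long labels (hypothetical helper)."""
    # Build the short-name -> long-label mapping once
    labels = {h['Name']: h['Label'] for h in DictReader(header_file)}
    new_rows = []
    for row in DictReader(data_file):
        # Fall back to the short name when no label is known
        new_rows.append({labels.get(key, key): val for key, val in row.items()})
    return new_rows

relabeled = relabel_rows(data_csv, headers_csv)
print(relabeled[0])
```

Building the mapping once means each cell needs only one dictionary lookup, which matters when the file has many rows and columns.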
Another approach:
Use zip on lists.
Read the data in again. Since we want lists this time, a plain reader is enough.
from csv import reader
data_r = reader(open('C:/Users/elenawang/Documents/data/mn.csv','r',encoding="utf-8"))
header_r = reader(open('C:/Users/elenawang/Documents/data/mn_headers.csv','r',encoding="utf-8"))
data_rows = [d for d in data_r]
header_rows = [h for h in header_r]
print (len(data_rows[0]))
print (len(header_rows))
#159
#210
The lengths do not match, so let's find out why:
data_rows[0]
Out[30]:
['',
'HH1',
'HH2',
'LN',
'MWM1',
'MWM2',
'MWM4',
'MWM5',
'MWM6D',
'MWM6M',
header_rows[:2]
Out[31]: [['Name', 'Label', 'Question'], ['HH1', 'Cluster number', '']]
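We can see that the second element of data_rows[0] ('HH1') corresponds to the row at index 1 of header_rows.
Find the header rows that have no match in the data:
bad_rows = []
for h in header_rows:
    if h[0] not in data_rows[0]:
        bad_rows.append(h)
# Remove the unmatched rows
for h in bad_rows:
    header_rows.remove(h)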
print (len(header_rows))
#150
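The code above collects the unmatched rows first and removes them in a second loop, which avoids modifying a list while iterating over it. As a rough alternative, the same filtering can be sketched with one list comprehension. The sample rows below (apart from 'HH1' / 'Cluster number') and the helper name keep_matching are hypothetical.

```python
# Hypothetical sample data; only the 'HH1' row is taken from the output above
data_header = ['', 'HH1', 'HH2', 'LN']
header_rows = [
    ['Name', 'Label', 'Question'],
    ['HH1', 'Cluster number', ''],
    ['XX9', 'Made-up unused column', ''],
    ['LN', 'Line number', ''],
]

def keep_matching(header_rows, data_header):
    """Keep only header rows whose short name is a column of the data."""
    known = set(data_header)  # a set makes each membership test cheap
    return [h for h in header_rows if h[0] in known]

filtered = keep_matching(header_rows, data_header)
print(len(filtered))
```

The comprehension builds a new list and leaves the original header_rows untouched, so the rows that were filtered out can still be inspected afterwards.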
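The header list is now nine entries shorter than the data header (150 versus 159). Let's see which nine data columns have no matching header:
all_short_headers = [h[0] for h in header_rows]
for header in data_rows[0]:
    if header not in all_short_headers:
        print('mismatch', header)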
mismatch
mismatch MDV1F
mismatch MTA8E
mismatch mwelevel
mismatch mnweight
mismatch wscoreu
mismatch windex5u
mismatch wscorer
mismatch windex5r
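The mismatch check above looks in one direction only (data columns without a header). A small sketch using set differences reports both directions at once. The sample lists and the function name compare_headers are assumptions for illustration; sets lose the original column order, so the results are sorted.

```python
def compare_headers(data_header, short_headers):
    """Return (columns only in the data, names only in the header file)."""
    data_set, header_set = set(data_header), set(short_headers)
    return sorted(data_set - header_set), sorted(header_set - data_set)

# Hypothetical sample values, loosely modeled on the output above
data_header = ['', 'HH1', 'MDV1F', 'mnweight']
short_headers = ['Name', 'HH1', 'HH2']

only_in_data, only_in_headers = compare_headers(data_header, short_headers)
print('only in data:', only_in_data)
print('only in headers:', only_in_headers)
```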
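When working with raw data, we have to convert it into a usable format, and sometimes we need to discard data we do not need. Whether to discard it depends on how important that data is to you.
After looking them up, we decided that MDV1F and MTA8E are important, and we drop the others.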
from csv import reader
data_r = reader(open('C:/Users/elenawang/Documents/data/mn.csv','r',encoding="utf-8"))
header_r = reader(open('C:/Users/elenawang/Documents/data/mn_headers_updated.csv','r',encoding="utf-8"))
data_rows = [d for d in data_r]
# Keep only the header rows whose short name is a column in the data
header_rows = [h for h in header_r if h[0] in data_rows[0]]
print(len(data_rows[0]))
print(len(header_rows))
all_short_headers = [h[0] for h in header_rows]
# Indexes of the data columns that have no matching header
skip_index = []
for header in data_rows[0]:
    if header not in all_short_headers:
        index = data_rows[0].index(header)
        skip_index.append(index)
# Rebuild every data row without the skipped columns
new_data = []
for row in data_rows[1:]:
    new_row = []
    for i, d in enumerate(row):
        if i not in skip_index:
            new_row.append(d)
    new_data.append(new_row)
# Pair each header row with the value in the same position
zipped = []
for drow in new_data:
    # zip() is lazy in Python 3, so turn each row of pairs into a list
    zipped.append(list(zip(header_rows, drow)))
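To make the zip step concrete, here is a small sketch of how zip pairs header rows with one data row. The header labels other than 'Cluster number' and the sample values are hypothetical. Note that zip pairs items purely by position and stops at the shorter input, so the headers and the data columns must be in the same order for the pairs to be meaningful.

```python
# Hypothetical header rows in [Name, Label, Question] form and one data row
headers = [['HH1', 'Cluster number', ''], ['LN', 'Line number', '']]
row = ['1', '2']

# zip() returns a lazy iterator in Python 3; list() materializes the pairs
pairs = list(zip(headers, row))
print(pairs[0])

# Optionally turn the pairs into a dict keyed by the long label
labeled = {h[1]: value for h, value in pairs}
print(labeled)
```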