[Link to the dataset used in this analysis](http://github.com/wesm/pydata-book)
First, use pandas.read_table to load each table into a pandas.DataFrame object:
import pandas as pd
# Limit how many rows are displayed
pd.options.display.max_rows = 10
# engine='python' is required because the multi-character separator '::' is parsed as a regex
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('datasets/movielens/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('datasets/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
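If you do not have the files at hand, the parsing step can be tried on a small in-memory sample. This is only a sketch: the `sample` string and the name `sample_users` are made up for illustration and simply follow the same '::'-separated layout as users.dat.

```python
import io
import pandas as pd

# Illustrative sample rows in the same '::'-separated layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A separator longer than one character is treated as a regular expression,
# which only the Python parsing engine supports.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python')
print(sample_users)
```

Passing `engine='python'` explicitly avoids the warning pandas emits when it has to fall back from the C parser.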
Next, merge the ratings table with the users table, then merge that result with the movies table:
data = pd.merge(pd.merge(ratings,users),movies)
print(data)
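When no `on` argument is given, pandas.merge joins on the column names the two frames have in common. The sketch below uses tiny made-up tables (`ratings_toy`, `users_toy`, `movies_toy` are hypothetical stand-ins, not the real data) to show that spelling the keys out gives the same result:

```python
import pandas as pd

# Toy stand-ins for the three tables (all values are made up)
ratings_toy = pd.DataFrame({'user_id': [1, 1, 2],
                            'movie_id': [10, 20, 10],
                            'rating': [5, 3, 4]})
users_toy = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_toy = pd.DataFrame({'movie_id': [10, 20],
                           'title': ['Movie A', 'Movie B']})

# Without `on`, merge joins on the columns shared by both frames
implicit = pd.merge(pd.merge(ratings_toy, users_toy), movies_toy)

# Naming the keys explicitly documents the intent
explicit = (ratings_toy.merge(users_toy, on='user_id')
                       .merge(movies_toy, on='movie_id'))
print(explicit)
```

Naming the keys is a little more verbose, but it protects against an accidental join on a column that happens to share a name across tables.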
Use the pivot_table method to get the mean rating for each movie, broken down by gender:
mean_ratings = data.pivot_table('rating',index = 'title',columns='gender',aggfunc='mean')
print(mean_ratings[:5])
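To see what pivot_table is doing here, it may help to run it on a few made-up rows. In this sketch the `toy` frame is hypothetical, and the groupby/unstack version is shown only as an alternative way to arrive at the same table:

```python
import pandas as pd

# Made-up ratings: two movies, rated by both genders
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, each cell a mean rating
toy_means = toy.pivot_table('rating', index='title', columns='gender',
                            aggfunc='mean')

# Alternative formulation: group by both keys, then move gender to the columns
toy_means_alt = toy.groupby(['title', 'gender'])['rating'].mean().unstack()
print(toy_means)
```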
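To filter out movies with fewer than 250 ratings, group the data by title and use size() to get a Series of group sizes, one per title. The index of the titles with at least 250 ratings can then be used to select the corresponding rows from mean_ratings: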
ratings_by_title = data.groupby('title').size()
print(ratings_by_title[:10])
active_titles = ratings_by_title.index[ratings_by_title >= 250]
print(active_titles)
mean_ratings = mean_ratings.loc[active_titles]
print(mean_ratings)
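The same count-then-select pattern can be checked on a handful of made-up rows. The `toy` frame and the threshold `min_ratings = 2` are assumptions for illustration only; the text uses 250.

```python
import pandas as pd

# Made-up ratings: Movie A has 3, Movie B has 1, Movie C has 2
toy = pd.DataFrame({
    'title': ['Movie A'] * 3 + ['Movie B'] * 1 + ['Movie C'] * 2,
    'rating': [5, 3, 4, 2, 4, 5],
})
min_ratings = 2  # hypothetical threshold for this toy example

counts = toy.groupby('title').size()           # number of ratings per title
popular = counts.index[counts >= min_ratings]  # titles that pass the threshold
toy_means = toy.groupby('title')['rating'].mean()
print(toy_means.loc[popular])                  # Movie B is dropped
```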
To see the top movies among female viewers, sort by the F column in descending order:
top_female_ratings = mean_ratings.sort_values(by = 'F',ascending = False)
print(top_female_ratings[:10])
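To find the movies that are most divisive between male and female viewers, one approach is to add a column to mean_ratings containing the difference in means: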
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
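Sorting by 'diff' in ascending order puts the movies with the most negative difference first, that is, the ones women rated higher than men did: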
sorted_by_diff = mean_ratings.sort_values(by = 'diff')
print(sorted_by_diff[:10])
Reversing the order of the rows and slicing off the top 10 gives the movies that men preferred but women rated lower:
print(sorted_by_diff[::-1][:10])
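Because 'diff' is computed as M minus F, its sign tells you which group rated a movie higher. A small sketch with made-up means (`toy_means` is hypothetical) makes the two ends of the sorted table concrete:

```python
import pandas as pd

# Made-up mean ratings per gender
toy_means = pd.DataFrame(
    {'F': [4.5, 3.0, 4.0], 'M': [3.5, 4.2, 4.0]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)
toy_means['diff'] = toy_means['M'] - toy_means['F']

by_diff = toy_means.sort_values(by='diff')
# Most negative diff first: rated higher by women than by men
print(by_diff.head(1))
# Reversed order, most positive diff first: rated higher by men
print(by_diff[::-1].head(1))
```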
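Suppose instead that you want the movies that caused the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings: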
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
print(rating_std_by_title.sort_values(ascending = False)[:10])
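As a quick sanity check of the idea, here is a sketch on made-up ratings (the `toy` frame is hypothetical): a movie whose ratings are split between 1 and 5 should show a much larger standard deviation than one rated consistently. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Movie A splits viewers; Movie B is rated consistently
toy = pd.DataFrame({
    'title': ['Movie A'] * 4 + ['Movie B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation of the ratings for each title
toy_std = toy.groupby('title')['rating'].std()
print(toy_std.sort_values(ascending=False))
```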