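The core idea of a content-based recommender system is to recommend to user x items that are similar to the items x has rated highly.
The method works as follows:
Build a profile for each item (item profiles).
Build a profile for each user (user profiles) from the ratings the user has given to items.
At recommendation time, recommend items according to the similarity between the user profile and the item profiles.
Taking the earlier documents as an example, the TF-IDF matrix can serve as the item profiles: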

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from scipy.sparse import csr_matrix

    items = ['this is the first document, this really is',
             'nothing will stop this from been the second doument, second is not a bad order',
             'I wonder if three documents would be ok as an example, example like this is stupid',
             'ok I think four documents is enough, I I I I think so.']

    # Simply use TF-IDF as the item profile.
    # After fit_transform: rows = items (documents), columns = features (terms)
    vectorizer = CountVectorizer(min_df=1)
    counts = vectorizer.fit_transform(items)

    # After the transpose: rows = features (terms), columns = items (documents)
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(counts).transpose()
    print(tfidf.shape)  # (32, 4)

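The four documents in the example are the four items to be recommended, and they correspond to the four columns of the matrix. Each row represents a term, and each entry is a TF-IDF value indicating how important that term is in the document.
If the items were movies, each column could represent a movie and each row an actor, a director, or a genre, with entries of 1 or 0 indicating whether the movie has that feature (for example, whether the actor appears in it). This binary matrix also needs to be normalized, for example by dividing every entry in a row by that row's sum.
Now four users arrive, each rating one or more of the four items on a scale of 0 to 5. In the ratings matrix each column represents a user and each row a document. Taking the dot product of the item-profile matrix above with the ratings matrix gives the user profiles. A user profile can be read as how much the user likes each component of the documents; in the movie example, it would be how much the user likes each actor, director, or genre.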
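To make the movie case concrete, here is a minimal sketch using made-up data: the feature names, movie names, and 0/1 values are hypothetical and only illustrate the layout. It builds a binary feature-by-movie matrix and divides each row by its row sum, which is one possible normalization of the kind mentioned above.

```python
import numpy as np

# Hypothetical data: rows = features, columns = movies.
features = ["actor_1", "actor_2", "director_1", "genre_comedy"]
movies = ["movie_1", "movie_2", "movie_3"]
profile = np.array([
    [1, 0, 1],   # actor_1 appears in movie_1 and movie_3
    [0, 1, 1],   # actor_2 appears in movie_2 and movie_3
    [1, 0, 0],   # director_1 directed movie_1 only
    [1, 1, 1],   # all three movies are comedies
], dtype=float)

# Divide every entry by the sum of its row, so each row sums to 1.
row_sums = profile.sum(axis=1, keepdims=True)
normalized = profile / row_sums
print(normalized)
```

A feature that occurs in no movie would have a row sum of zero, so real data would need a guard against division by zero before this step.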

    # Users' ratings of the four documents.
    # rows = items (documents), columns = users
    ratings = csr_matrix([[5, 0, 0, 2],
                          [3, 3, 0, 0],
                          [2, 1, 1, 1],
                          [0, 0, 1, 1]], dtype='double')

    # Normalize: subtract each user's mean rating (the column mean, zeros included).
    # The result is a dense matrix.
    usersn = ratings - ratings.mean(0)

    # User profiles: rows = features (terms), columns = users
    userprofile = csr_matrix(tfidf.dot(usersn))

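To make a recommendation we need a distance measure that quantifies how similar a user profile x is to an item profile i, for example the cosine of the angle θ between the two vectors:

$$u(x, i) = \cos\theta = \frac{x \cdot i}{\lVert x \rVert \, \lVert i \rVert}$$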

    import numpy as np

    # A larger cosine score means the user profile and the item profile are more similar.
    for i in range(4):  # iterate over the 4 users
        u = userprofile[:, i].toarray().ravel()
        scores = []
        for j in range(4):  # iterate over the 4 documents
            v = tfidf[:, j].toarray().ravel()
            scores.append(u.dot(v) / np.linalg.norm(u) / np.linalg.norm(v))
        print("document recommended for user {0} is document number {1}".format(
            i + 1, scores.index(max(scores)) + 1))

Output:

    document recommended for user 1 is document number 1
    document recommended for user 2 is document number 2
    document recommended for user 3 is document number 4
    document recommended for user 4 is document number 1

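In addition, the item profiles let us compute the similarity between pairs of items (for example with the same cosine distance as above). We can therefore pick out the items a user has rated highly and recommend the items most similar to them.
For example, take three movies and one user's ratings of them, and compute the pairwise similarity of the movies. Each row of the matrix represents a movie; the first five columns are the movie's features (actors, directors, and so on), and the last column is user A's rating. When computing similarity we include the user's rating as well, multiplied by a scaling factor a (called alpha in the code):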
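The item-to-item approach just described can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the profile matrix and ratings are made up, the helper names `cosine_similarity_matrix` and `recommend_similar` are hypothetical, and a rating of 0 is assumed to mean "not rated yet".

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Pairwise cosine similarity between the columns (items) of `profiles`."""
    unit = profiles / np.linalg.norm(profiles, axis=0, keepdims=True)
    return unit.T.dot(unit)

def recommend_similar(profiles, ratings):
    """Return (liked, recommended): the user's top-rated item and the
    unrated item most similar to it. Assumes a rating of 0 means 'unrated'."""
    sim = cosine_similarity_matrix(profiles)
    liked = int(np.argmax(ratings))
    unrated = [j for j, r in enumerate(ratings) if r == 0]
    recommended = max(unrated, key=lambda j: sim[liked, j])
    return liked, recommended

# Hypothetical item profiles: rows = features, columns = items.
profiles = np.array([
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.8],
    [0.5, 0.4, 0.0, 0.0],
])
ratings = [5, 0, 1, 0]  # hypothetical ratings by one user; 0 = not rated

liked, recommended = recommend_similar(profiles, ratings)
print("user liked item {0}; recommend item {1}".format(liked + 1, recommended + 1))
```

With this data the user's favorite is item 1, and item 2 is recommended because its profile points in almost the same direction as item 1's, while item 4 is nearly orthogonal to it.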
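
    import numpy as np
    import pandas as pd

    def angle(u, v):
        """Angle in radians between two vectors; a smaller angle means more similar."""
        cos = np.dot(u, v) / np.linalg.norm(u) / np.linalg.norm(v)
        return np.arccos(np.clip(cos, -1, 1))

    for alpha in [0, 0.5, 1, 2]:
        # rows = movies A, B, C; columns 0-4 = features, column 5 = rating
        df = pd.DataFrame([[1, 0, 1, 0, 1, 2],
                           [1, 1, 0, 0, 1, 6],
                           [0, 1, 0, 1, 0, 2]], index=['A', 'B', 'C'])
        df[5] = df[5] * alpha  # scale the rating column by alpha
        a = df.loc['A'].to_numpy(dtype=float)
        b = df.loc['B'].to_numpy(dtype=float)
        c = df.loc['C'].to_numpy(dtype=float)
        print('alpha is {0}'.format(alpha))
        print('Distance between AB is {:.12g}'.format(angle(a, b)))
        print('Distance between AC is {:.12g}'.format(angle(a, c)))
        print('Distance between BC is {:.12g}'.format(angle(b, c)))
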
Output:

    alpha is 0
    Distance between AB is 0.841068670568
    Distance between AC is 1.57079632679
    Distance between BC is 1.15026199151
    alpha is 0.5
    Distance between AB is 0.764558797186
    Distance between AC is 1.27795355507
    Distance between BC is 0.841068670568
    alpha is 1
    Distance between AB is 0.559880567744
    Distance between AC is 0.905600271782
    Distance between BC is 0.555121167557
    alpha is 2
    Distance between AB is 0.329838880928
    Distance between AC is 0.525285295294
    Distance between BC is 0.309193320943
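
As a hand check of the alpha = 0 case (a worked example added for illustration), the rating column is scaled to zero, so A = (1, 0, 1, 0, 1, 0), B = (1, 1, 0, 0, 1, 0) and C = (0, 1, 0, 1, 0, 0). The reported distances are the angles between these vectors:

$$\cos\theta_{AB} = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{2}{\sqrt{3}\,\sqrt{3}} = \frac{2}{3}, \qquad \theta_{AB} = \arccos\frac{2}{3} \approx 0.8411$$

$$\cos\theta_{AC} = \frac{A \cdot C}{\lVert A \rVert \, \lVert C \rVert} = \frac{0}{\sqrt{3}\,\sqrt{2}} = 0, \qquad \theta_{AC} = \frac{\pi}{2} \approx 1.5708$$

Both values agree with the printed results. In the output, all three angles shrink as alpha grows, because the scaled rating column carries more and more weight relative to the five feature columns.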