[Repost] Content-based recommender systems (with Python code) - concise

http://www.ryanzhang.info/archives/2594
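The core idea of a content-based recommender system is to recommend to user x items that are similar to the items x has rated highly.

The method is as follows:

Build a "profile" for each item (item profiles)
Build a "profile" for each user from that user's ratings of the items (user profiles)
At recommendation time, recommend items according to the similarity between the user profile and the item profiles
Using the documents from the earlier post as an example, the TF-IDF matrix can be treated as the item profiles: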

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix

items = ['this is the first document, this really is',
             'nothing will stop this from been the second doument, second is not a bad order',
             'I wonder if three documents would be ok as an example, example like this is stupid',
             'ok I think four documents is enough, I I I I think so.']

# simply use the TF-IDF matrix as the item profiles
# counts: row = item (document), column = feature (term)
vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(items)
# after the transpose: column = item (document), row = feature (term)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).transpose()

print(tfidf.shape)

(32, 4)

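The 4 documents in the example are the four items to be recommended to users, and they correspond to the 4 columns of the matrix. Each row of the matrix represents a term, and each element is a TF-IDF value that measures how important that term is within the document.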
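To make the meaning of a TF-IDF value more concrete, here is a minimal hand-rolled sketch. It assumes the smoothed idf variant ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn's TfidfTransformer does by default; the function name tfidf_by_hand, the whitespace tokenizer, and the toy documents are all made up for illustration. Note that it keeps one row per document, whereas the article transposes the matrix so that documents become columns.

```python
import math
from collections import Counter


def tfidf_by_hand(docs):
    """Return (vocab, rows); rows[d][t] is the L2-normalized TF-IDF weight
    of term vocab[t] in document d. Hypothetical helper, not a library API."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed idf (assumed variant): rare terms get a larger weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        length = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / length for w in raw])
    return vocab, rows


toy_docs = ["red apple red", "green apple", "green pear"]
vocab, weights = tfidf_by_hand(toy_docs)
for doc, row in zip(toy_docs, weights):
    # Show the weight of every term that actually occurs in the document.
    print(doc, {t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

In the first toy document "red" occurs twice and appears in no other document, so it receives a higher weight than "apple", which is shared with the second document.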

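If the items being recommended are movies, each column can represent a movie and each row an actor, a director, or a genre. The entries are 1 or 0, indicating whether the movie has that feature (for example, whether the actor appears in it). This binary matrix also needs to be normalized, for example by dividing every element of a row by that row's sum.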
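A minimal sketch of the movie case, assuming NumPy and entirely made-up feature and movie names. It builds the binary feature matrix in the orientation described above (rows are features, columns are movies) and divides each row by its sum.

```python
import numpy as np

# Hypothetical data: rows are features, columns are movies.
features = ["actor_a", "actor_b", "director_c", "genre_comedy"]
movies = ["movie_1", "movie_2", "movie_3"]
binary = np.array([
    [1, 1, 0],   # actor_a appears in movie_1 and movie_2
    [0, 1, 0],   # actor_b appears in movie_2 only
    [1, 0, 1],   # director_c directed movie_1 and movie_3
    [1, 1, 1],   # all three movies are comedies
], dtype=float)

# Divide every element of a row by that row's sum.
row_sums = binary.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0  # guard: leave all-zero rows unchanged
item_profiles = binary / row_sums

print(item_profiles)
```

With this normalization a feature shared by every movie (the genre row here) contributes less per movie than a feature that singles out one movie, which is the same intuition as the idf part of TF-IDF.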

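Now 4 users arrive, and each of them rates one or more of the four items on a scale from 0 to 5. In the ratings matrix each column represents a user and each row a document. Taking the dot product of the item-profile matrix above with this ratings matrix gives the user profiles, which can be read as how much each user likes each component (term) of the documents. In the movie example they would be read as each user's preference for each actor, director, or genre.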

# users' ratings for the four documents
# column = user, row = item (document)
ratings = csr_matrix([[5, 0, 0, 2], [3, 3, 0, 0], [2, 1, 1, 1], [0, 0, 1, 1]], dtype='double')
# normalize: subtract each user's mean rating (the column mean)
usersn = ratings - ratings.mean(0)

# (terms x items) . (items x users) = (terms x users)
userprofile = tfidf.dot(usersn)

userprofile = csr_matrix(userprofile)
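The shapes involved in this step are easy to lose track of, so here is a small dense NumPy sketch with made-up numbers (3 features, 2 items, 2 users). It only illustrates the same two operations, centering each user's ratings and taking the dot product, and is not meant as a replacement for the sparse code above.

```python
import numpy as np

# Hypothetical item profiles: 3 features (rows) x 2 items (columns).
item_profiles = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

# Hypothetical ratings: 2 items (rows) x 2 users (columns).
ratings = np.array([
    [5.0, 1.0],
    [1.0, 3.0],
])

# Subtract each user's mean rating (the mean of each column).
centered = ratings - ratings.mean(axis=0)

# (features x items) . (items x users) = (features x users)
user_profiles = item_profiles.dot(centered)

print(user_profiles.shape)
print(user_profiles)
```

After centering, a positive entry in a user's column means the user likes that feature more than their average, a negative entry means less, and a feature shared equally by a liked and a disliked item cancels out to zero.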

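When recommending, we need a distance measure to compute how similar user profile i is to item profile x, for example the cosine of the angle θ between the two vectors:
U(x, i) = cos(θ) = (x · i) / (‖x‖ ‖i‖)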
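The formula can be written as a small standalone function. This is a sketch assuming NumPy and dense inputs; the name cosine_similarity and the choice to return 0.0 for a zero vector are conventions picked for this example.

```python
import numpy as np


def cosine_similarity(x, i):
    """Cosine of the angle between item profile x and user profile i."""
    x = np.asarray(x, dtype=float).ravel()
    i = np.asarray(i, dtype=float).ravel()
    denom = np.linalg.norm(x) * np.linalg.norm(i)
    if denom == 0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return float(np.dot(x, i) / denom)


print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

The value ranges from -1 to 1, and a larger value means the two profiles point in more similar directions, so the best recommendation is the item with the highest score.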

# a larger score (cosine) means the user and the item are more similar
for i in range(4):
    # iterate over the 4 users
    scores = []
    u = userprofile[:, i]
    for j in range(4):
        # iterate over the 4 documents
        v = tfidf[:, j]
        cosine = u.transpose().dot(v)[0, 0] / (np.linalg.norm(u.toarray()) * np.linalg.norm(v.toarray()))
        scores.append(cosine)
    print("document recommended for user {0} is document number {1}".format(i + 1, scores.index(max(scores)) + 1))

document recommended for user 1 is document number 1
document recommended for user 2 is document number 2
document recommended for user 3 is document number 4
document recommended for user 4 is document number 1

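In addition, from the item profiles we can compute the similarity between pairs of items (for example with the cosine distance, as above). We can therefore find the items a user rated highly and then recommend the items that are most similar to them.
For example, take three movies and one user's ratings of those 3 movies, and compute the similarity between each pair of movies. Each row of the matrix represents a movie, the first 5 columns are the movie's features (actors, directors, and so on), and the last column is the user's rating. When computing the similarity we include the user's rating as well, multiplied by a coefficient a (alpha in the code):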

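from numpy import arccos, clip, dot
from numpy.linalg import norm
from pandas import DataFrame

avals = [0, 0.5, 1, 2]
for aval in avals:
    # rows = movies A, B, C; columns 0-4 = features, column 5 = the user's rating
    df = DataFrame([[1,0,1,0,1,2],[1,1,0,0,1,6],[0,1,0,1,0,2]],
                   index=['A', 'B', 'C'])
    # scale the rating column by the coefficient alpha
    df[5] = df[5] * aval
    a = df.loc['A']
    b = df.loc['B']
    c = df.loc['C']
    ab = dot(a, b) / norm(a) / norm(b)
    ac = dot(a, c) / norm(a) / norm(c)
    bc = dot(b, c) / norm(c) / norm(b)
    print('alpha is {0}'.format(aval))
    # the "distance" is the angle in radians, i.e. arccos of the cosine similarity
    for pair, cosine in [('AB', ab), ('AC', ac), ('BC', bc)]:
        print('    Distance between {} is {:.12g}'.format(pair, arccos(clip(cosine, -1, 1))))
    print('')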

alpha is 0
    Distance between AB is 0.841068670568
    Distance between AC is 1.57079632679
    Distance between BC is 1.15026199151

alpha is 0.5
    Distance between AB is 0.764558797186
    Distance between AC is 1.27795355507
    Distance between BC is 0.841068670568

alpha is 1
    Distance between AB is 0.559880567744
    Distance between AC is 0.905600271782
    Distance between BC is 0.555121167557

alpha is 2
    Distance between AB is 0.329838880928
    Distance between AC is 0.525285295294
    Distance between BC is 0.309193320943
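The item-to-item approach described above (find an item the user rated highly, then recommend the item most similar to it) can be sketched as follows. The sketch assumes NumPy, reuses the three movies and the user's ratings from the example, and compares movies on their features only, which corresponds to alpha = 0. The variable names are chosen for illustration.

```python
import numpy as np

# One row per movie, one column per feature (same values as the example).
item_names = ["A", "B", "C"]
item_profiles = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=float)
user_ratings = np.array([2.0, 6.0, 2.0])

# Scale each row to unit length so a plain dot product gives the cosine.
unit = item_profiles / np.linalg.norm(item_profiles, axis=1, keepdims=True)
similarity = unit.dot(unit.T)  # similarity[p, q] = cosine between items p and q

# Start from the item the user rated highest ...
favourite = int(np.argmax(user_ratings))
# ... and pick the most similar other item.
candidates = similarity[favourite].copy()
candidates[favourite] = -np.inf  # never recommend the item itself
recommended = int(np.argmax(candidates))

print("user's favourite item: {}".format(item_names[favourite]))
print("most similar item: {}".format(item_names[recommended]))
```

In a real system the candidates would normally be restricted to items the user has not rated yet. Here all three movies are already rated, so the sketch only shows the mechanics of the lookup.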
