1. Import the library and load the data
import graphlab
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)
products = graphlab.SFrame('amazon_baby.gl/')
products.head()
2. Build the word_count vector
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
products.head()
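To make the word_count column concrete, here is a minimal pure-Python sketch of what a per-review word count looks like. It assumes lowercasing and whitespace splitting; the helper name count_words is only illustrative, and GraphLab's own tokenization may differ in details such as punctuation handling.

```python
from collections import Counter


def count_words(text):
    """Map each lowercase, whitespace-separated token to its number of occurrences."""
    return dict(Counter(text.lower().split()))


# Hypothetical review text, not taken from the dataset.
review = "Great teether great price"
print(count_words(review))
```

Each review becomes a dictionary of word to count, which is the shape of feature the classifier in step 5 consumes.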
3. Examine the reviews of the Giraffe, one of the most popular products
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether']
4. Label reviews as positive or negative
# ignore all 3* reviews
products = products[products['rating'] != 3]
# positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >= 4
products.head()
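The labeling rule above can be checked on a handful of ratings without loading the dataset. This is a small sketch on a plain Python list; label_sentiment is a hypothetical helper, not part of GraphLab.

```python
def label_sentiment(ratings):
    """Drop 3-star ratings, then label the rest: True for 4-5 stars, False for 1-2 stars."""
    return [(rating, rating >= 4) for rating in ratings if rating != 3]


# Made-up ratings for illustration.
print(label_sentiment([5, 3, 1, 4, 2]))
```

The 3-star review is removed because it is ambiguous, and every remaining review gets an unambiguous positive or negative label.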
5. Split the dataset, then train and evaluate the model
train_data, test_data = products.random_split(.8, seed=0)
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=test_data)
sentiment_model.evaluate(test_data, metric='roc_curve')
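For intuition about the 80/20 split, here is a rough sketch of a seeded random split in plain Python. It assumes each row is assigned to the first split independently with the given probability, so the sizes are only approximately 80% and 20%. This is an illustration of the idea, not GraphLab's actual implementation.

```python
import random


def random_split(rows, fraction, seed=0):
    """Send each row to the first split with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


train, test = random_split(list(range(1000)), 0.8, seed=0)
print(len(train), len(test))
```

Fixing the seed, as the original call does with seed=0, means rerunning the notebook gives the same train and test rows.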
6. Use the model to predict sentiment for the Giraffe reviews
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')
giraffe_reviews.head()
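With output_type='probability', each review gets a score between 0 and 1. A logistic classifier over word counts typically computes this as a sigmoid of a weighted sum of the counts. The sketch below shows that calculation with made-up weights; the weights, the intercept, and the function name are assumptions for illustration, not values learned by sentiment_model.

```python
import math


def predict_probability(word_count, weights, intercept=0.0):
    """Sigmoid of intercept + sum(weight * count); words without a weight contribute 0."""
    score = intercept + sum(weights.get(word, 0.0) * count
                            for word, count in word_count.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights: positive values push toward positive sentiment.
weights = {"love": 1.5, "great": 1.0, "broke": -2.0}
print(predict_probability({"love": 1, "great": 2}, weights))
print(predict_probability({"broke": 1}, weights))
```

Values near 1 indicate a confidently positive review and values near 0 a confidently negative one, so sorting by this column surfaces the most positive and most negative reviews.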
7. Test
To see the most frequent words in word_count, ranked by frequency, use the following code:
products['word_count'].show()
Output, most frequent first: the and to i a it this is for my of
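The same ranking can be reproduced without the interactive view by summing the per-review dictionaries. This is a sketch on toy data using the standard library; most_common_words is a hypothetical helper, and on the real data you would pass in the rows of the word_count column.

```python
from collections import Counter


def most_common_words(word_counts, n):
    """Sum per-review word-count dicts and return the n most frequent words."""
    total = Counter()
    for word_count in word_counts:
        total.update(word_count)  # adds counts for words already seen
    return [word for word, _ in total.most_common(n)]


# Toy data standing in for the word_count column.
word_counts = [{"the": 3, "and": 1}, {"the": 2, "to": 1, "and": 2}]
print(most_common_words(word_counts, 2))
```

As in the output above, common function words dominate the top of the list, which is why raw frequency alone says little about sentiment.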