Feature Selection
sklearn provides two main univariate feature selection tools: SelectPercentile and SelectKBest. The difference is clear from the names: SelectPercentile keeps the strongest X% of features (X is a parameter), while SelectKBest keeps the K strongest features (K is a parameter).
from sklearn.feature_selection import SelectPercentile, f_classif

# keep only the top 1% of features, scored by the ANOVA F-value
selector = SelectPercentile(f_classif, percentile=1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
features_test_transformed = selector.transform(features_test_transformed).toarray()
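For comparison, here is a minimal SelectKBest sketch under the same setup; it assumes the features_train_transformed / features_test_transformed matrices and labels_train from the example above, and k=10 is an arbitrary illustration value.

from sklearn.feature_selection import SelectKBest, f_classif

# keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(f_classif, k=10)
selector.fit(features_train_transformed, labels_train)
features_train_k = selector.transform(features_train_transformed)
features_test_k = selector.transform(features_test_transformed)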
## Regularization
## Lasso Regression
from sklearn.linear_model import Lasso

features, labels = get_my_data()   # placeholder for your own data-loading routine
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])       # predict() expects a 2D array of samples
print(regression.coef_)            # Lasso drives unimportant coefficients toward zero
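Since get_my_data() above is only a placeholder, here is a self-contained sketch on synthetic data showing how Lasso zeroes out the coefficient of an uninformative feature; the data-generating choices and alpha=0.1 are assumptions for illustration.

import numpy as np
from sklearn.linear_model import Lasso

# synthetic data: y depends only on the first feature;
# the second feature is pure noise
rng = np.random.RandomState(42)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(100)

regression = Lasso(alpha=0.1)
regression.fit(X, y)
print(regression.coef_)            # coefficient on the noise feature is driven to ~0
print(regression.predict([[2, 4]]))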
## Number of Features and Overfitting
# train on only 150 events to force the overfitting regime
features_train = features_train[:150].toarray()
labels_train = labels_train[:150]
## Identifying the Most Powerful Features
#!/usr/bin/python
import pickle
import numpy
numpy.random.seed(42)

### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl"
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load(open(words_file, "r"))
authors = pickle.load(open(authors_file, "r"))

### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
    word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test).toarray()
words = vectorizer.get_feature_names()

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train = labels_train[:150]

### your code goes here
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)

# accuracy, method 1
acc = clf.score(features_test, labels_test)
print acc

# accuracy, method 2
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(labels_test, pred)

# list every feature with importance above 0.2, plus the word it maps to
print "Important features:"
for index, importance in enumerate(clf.feature_importances_):
    if importance > 0.2:
        print "feature no", index
        print "importance", importance
        print "word", words[index]
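With only 150 training points but thousands of TF-IDF features, the tree can reach suspiciously high test accuracy by latching onto a handful of giveaway words; the feature-importance loop above (threshold 0.2) is how you track those words down.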
Principal Component Analysis (PCA)
A principal component is determined by the direction of maximum variance in the data, because that direction retains the most information.
This amounts to dimensionality reduction: the original features are compressed into a smaller number of dimensions.
Maximizing the variance along a component is equivalent to minimizing the total projection distance of all the points onto it; note that this "variance" is the spread along a direction, not quite the variance in the everyday sense.
PCA can help you uncover latent features hidden in your data, for example revealing that, overall, two underlying factors drive house-price movements.
The maximum number of principal components equals the smaller of the number of training points and the number of features.
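A minimal sklearn PCA sketch on toy data with one dominant latent factor; the synthetic data and n_components=2 are assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA

# toy data: two raw features driven mostly by one latent factor
rng = np.random.RandomState(42)
latent = rng.randn(200, 1)
X = np.hstack([2.0 * latent, 1.0 * latent]) + 0.1 * rng.randn(200, 2)

pca = PCA(n_components=2)   # n_components can be at most min(n_samples, n_features)
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # first component captures nearly all the variance
print(pca.components_)                # directions of maximum variance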
When to use PCA
When you want access to latent features that you suspect are showing up in the patterns of your data.
For dimensionality reduction.
For visualizing high-dimensional data: a plot can only show two features at a time, so PCA becomes useful when there are many features.
When you suspect the data is noisy: PCA helps discard the low-importance components and thus remove noise.
To make an algorithm more effective on a small number of features, e.g. face recognition: reduce the dimensionality to roughly one tenth, then train an SVM on the reduced features to identify the person being photographed (see the sketch below).
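A sketch of the PCA-then-SVM face-recognition pipeline described above, loosely following sklearn's eigenfaces example; fetch_lfw_people, n_components=150, and the SVM parameters are illustrative assumptions (it also uses the modern sklearn.model_selection API, unlike the older snippet earlier in these notes).

from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# labelled faces dataset (downloads on first use)
lfw = fetch_lfw_people(min_faces_per_person=70)
X_train, X_test, y_train, y_test = train_test_split(
    lfw.data, lfw.target, test_size=0.25, random_state=42)

# compress the raw pixel features down to 150 "eigenface" components,
# roughly a tenth of the original dimensionality
pca = PCA(n_components=150, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# train an SVM on the reduced features to identify the person
clf = SVC(kernel='rbf', C=1000, gamma=0.005)
clf.fit(X_train_pca, y_train)
print(clf.score(X_test_pca, y_test))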