LR: a linear weighted sum of the features passed through a sigmoid activation. Keep it distinct from linear regression; the principles differ: linear regression is fit by least squares, while LR is fit by maximum likelihood.
The derivation is based on maximum likelihood; the weight vector W is then updated iteratively by gradient descent.
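The two points above can be sketched in a few lines: the gradient of the negative log-likelihood of LR is X^T(p - y), and gradient descent just steps along it. This is a minimal illustration on toy data, not a production trainer; `fit_lr` and the toy arrays are made up for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, n_iter=1000):
    """Gradient descent on the negative log-likelihood of LR."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)               # predicted probabilities
        grad = X.T @ (p - y) / len(y)    # gradient of the NLL w.r.t. w
        w -= lr * grad
    return w

# toy data: one feature with a clear threshold between the classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
Xb = np.hstack([X, np.ones((4, 1))])     # append a bias column
w = fit_lr(Xb, y)
```

After training, `sigmoid(Xb @ w)` gives probabilities below 0.5 for the class-0 points and above 0.5 for the class-1 points.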
The hyperparameters are as follows; for details see https://zhuanlan.zhihu.com/p/39780207
1. Regularization: L1 and L2
L1: suits high-dimensional data with many features (it pushes weights to exactly zero, giving a sparse model)
L2: suits data that is not high-dimensional
2. Regularization coefficient
Usually written C; note that in sklearn C is the inverse of the regularization strength, so a smaller C means stronger regularization
3. solver: the optimization algorithm
With large sample sizes, consider stochastic/mini-batch gradient methods (sag/saga in sklearn)
With small sample sizes, quasi-Newton methods such as lbfgs work well
With L1 regularization, use liblinear (coordinate descent) or saga to optimize the loss: the L1 term is the sum of the absolute values of the weights, which is not differentiable at 0 and hence has no second derivative, so smooth solvers such as lbfgs and newton-cg cannot handle it.
For online serving, simply dump the learned coefficients and have the server load them for scoring.
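The "dump the coefficients, load them on the server" idea needs nothing more than a text file and a dot product through the sigmoid. A minimal sketch; the file name, format, and helper names here are assumptions for illustration:

```python
import math

def save_weights(weights, path):
    # one line of comma-separated floats
    with open(path, "w") as f:
        f.write(",".join(str(w) for w in weights))

def load_weights(path):
    with open(path) as f:
        return [float(v) for v in f.read().split(",")]

def score(weights, features):
    # serving-side LR: sigmoid of the weighted sum
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

save_weights([0.5, -1.2, 0.3], "lr_weights.txt")
w = load_weights("lr_weights.txt")
p = score(w, [1.0, 0.0, 1.0])  # probability for one sample
```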
In ranking, what we care about most is the model's ranking ability, i.e. its AUC. In practice a model with AUC above 0.7 is considered usable.
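Putting the hyperparameters and the AUC criterion together, here is a sketch using scikit-learn's LogisticRegression on a synthetic dataset (the dataset and parameter values are assumptions for illustration, not from the original notes). Note again that C is the inverse of the regularization strength.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic binary-classification data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L1 penalty requires liblinear (or saga); lbfgs only handles L2
clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
clf.fit(X_tr, y_tr)

# evaluate ranking ability with AUC on the held-out split
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```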
Feature combination:
Continuous features are rarely crossed; combinations are usually built from discrete features. Note that the dimensionality of the new feature is the product of the one-hot lengths of the two original (discretized) features.
See the code below:
def add(str_one, str_two):
    """Cross two one-hot encoded features.

    Args:
        str_one: one-hot string such as "0,0,1,0"
        str_two: one-hot string such as "1,0,0,0"
    Return:
        one-hot string of length len(str_one) * len(str_two), e.g.
        "0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0" for the inputs above
    """
    list_one = str_one.split(",")
    list_two = str_two.split(",")
    list_two_len = len(list_two)
    return_list = [0] * (len(list_one) * list_two_len)
    try:
        index_one = list_one.index("1")
    except ValueError:  # no "1" present: fall back to position 0
        index_one = 0
    try:
        index_two = list_two.index("1")
    except ValueError:
        index_two = 0
    # hot position of the crossed feature
    return_list[index_one * list_two_len + index_two] = 1
    return ",".join(str(ele) for ele in return_list)
import sys  # for sys.exit below

def combine_feature(feature_one, feature_two, new_feature,
                    train_data_df, test_data_df, feature_num_dict):
    """Add a crossed feature column to the train and test DataFrames.

    Args:
        feature_one: name of the first feature column
        feature_two: name of the second feature column
        new_feature: name of the combined feature column
        train_data_df: training DataFrame
        test_data_df: test DataFrame
        feature_num_dict: ndim of every feature; key is the feature name,
            value is the length of its one-hot encoding
    Return:
        ndim of the new feature (product of the two original dims)
    """
    train_data_df[new_feature] = train_data_df.apply(
        lambda row: add(row[feature_one], row[feature_two]), axis=1)
    test_data_df[new_feature] = test_data_df.apply(
        lambda row: add(row[feature_one], row[feature_two]), axis=1)
    if feature_one not in feature_num_dict:
        print("error: %s not in feature_num_dict" % feature_one)
        sys.exit()
    if feature_two not in feature_num_dict:
        print("error: %s not in feature_num_dict" % feature_two)
        sys.exit()
    return feature_num_dict[feature_one] * feature_num_dict[feature_two]
# example call
new_feature_len = combine_feature("age", "capital-gain", "age_gain", train_data_df, test_data_df, feature_num_dict)