Study Notes (7): Six Applications of Naive Bayes in Web Security

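I. Detecting Anomalous Web Operations

1. Data collection: same as before

2. Feature extraction

Use a set-of-words model: collect all the operation commands and deduplicate them to form a dictionary (vocabulary):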

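from nltk.probability import FreqDist

dist = []
with open(filename) as f:
    for line in f:
        line = line.strip('\n')
        dist.append(line)
# Deduplicate the commands; list() keeps the vocabulary indexable on Python 3
fdist = list(FreqDist(dist).keys())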

Using this vocabulary as the vector space, convert each command sequence into its corresponding vector:

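def get_user_cmd_feature_new(user_cmd_list, dist):
    # dist is the vocabulary; each command sequence becomes one 0/1 vector
    user_cmd_feature = []
    for cmd_list in user_cmd_list:
        v = [0] * len(dist)
        for i in range(0, len(dist)):
            # Set-of-words: record only whether the command occurs at all
            if dist[i] in cmd_list:
                v[i] = 1
        user_cmd_feature.append(v)
    return user_cmd_feature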
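
To make the set-of-words idea concrete, here is a minimal self-contained sketch. The helper name `to_set_of_words`, the vocabulary and the command sequences are all made up for illustration; they are not from the original data set.

```python
def to_set_of_words(cmd_sequences, vocabulary):
    """Return one 0/1 vector per command sequence (1 = command occurs)."""
    vectors = []
    for seq in cmd_sequences:
        present = set(seq)
        vectors.append([1 if cmd in present else 0 for cmd in vocabulary])
    return vectors


# Made-up vocabulary and command sequences, for illustration only
vocabulary = ["ls", "cd", "cat", "vi", "rm"]
sequences = [["ls", "cd", "ls"], ["rm", "cat"]]

vectors = to_set_of_words(sequences, vocabulary)
print(vectors)  # [[1, 1, 0, 0, 0], [0, 0, 1, 0, 1]]
```

Note that `ls` appears twice in the first sequence but still contributes a single 1, which is what separates a set-of-words model from a bag-of-words model.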

3. Train the model

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB().fit(x_train,y_train)
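
For intuition about what fitting a Gaussian naive Bayes model involves, the following is a rough pure-Python sketch. It is not scikit-learn's implementation: the function names, the `eps` smoothing constant and the toy data are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the log prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log prior plus Gaussian log-likelihood."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-feature data, made up for illustration
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]))  # 0
print(predict(model, [5.1, 6.1]))  # 1
```

The "naive" part is the inner loop: each feature contributes its own log-likelihood term independently of the others.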

4. Validate the results
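
The notes give no code for this step. One possible way to score predictions on held-out data is sketched here; the helper `evaluate` and the label lists are hypothetical, and the notes do not say which metrics were actually used.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up ground truth and predictions (1 = anomalous, 0 = normal)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

For security data, where anomalies are usually rare, precision and recall tend to say more than accuracy alone.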

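II. Detecting WebShells (1)

Webshells collected from the Internet serve as the black (malicious) samples, and the latest WordPress source code serves as the white (benign) samples. Each PHP file is treated as a single string and split into word-based 2-grams; all files are traversed to build a 2-gram vocabulary, and each PHP file is then vectorized against it.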

def load_file(file_path):
    t=""
    with open(file_path) as f:
        for line in f:
            line = line.strip('\n')
            t+=line
    return t

import os

def load_files(path):
    files_list = []
    for r, d, files in os.walk(path):
        for file in files:
            if file.endswith('.php'):
                # Join with the directory currently being walked, not the root path
                file_path = os.path.join(r, file)
                print("Load %s" % file_path)
                t = load_file(file_path)
                files_list.append(t)
    return files_list

from sklearn.feature_extraction.text import CountVectorizer

webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2,2),decode_error="ignore",token_pattern=r'\b\w+\b')
webshell_files_list = load_files("...")
x1 = webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
# Black samples (webshells) are labeled 1
y1 = [1]*len(x1)
vocabulary = webshell_bigram_vectorizer.vocabulary_
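
To see what word-based 2-gram splitting produces, here is a small pure-Python sketch that mimics the tokenization with the same `\b\w+\b` pattern. The helper `word_bigrams` and the one-line PHP sample are made up for illustration; lowercasing is included on the assumption that the vectorizer's default lowercasing is in effect.

```python
import re
from collections import Counter


def word_bigrams(text):
    """Split text into word tokens and join each adjacent pair with a space."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Made-up one-line PHP sample, for illustration only
sample = "<?php eval($_POST['cmd']); ?>"

bigrams = word_bigrams(sample)
print(bigrams)  # ['php eval', 'eval _post', '_post cmd']

# Per-file counts of each 2-gram; the vocabulary is the union of keys over all files
counts = Counter(bigrams)
```

Punctuation such as `$`, `(` and `[` is dropped by the token pattern, so only the word sequence survives.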

2. Feature extraction

import numpy as np

# Vectorize the white samples against the vocabulary built from the black samples
wp_bigram_vectorizer = CountVectorizer(ngram_range=(2,2),decode_error="ignore",token_pattern=r'\b\w+\b',min_df=1,vocabulary = vocabulary)
wp_files_list = load_files("...")
x2 = wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
y2 = [0]*len(x2)
x=np.concatenate((x1,x2))
y=np.concatenate((y1,y2))
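
Passing a fixed vocabulary means both sample sets share the same columns, and any 2-gram not in the vocabulary is ignored. The sketch below shows that effect in plain Python; the helper `count_vector`, the vocabulary and the 2-gram lists are hypothetical.

```python
def count_vector(ngrams, vocabulary):
    """Count n-grams against a fixed vocabulary; unknown n-grams are skipped."""
    vector = [0] * len(vocabulary)
    for gram in ngrams:
        index = vocabulary.get(gram)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical vocabulary learned from the black samples (2-gram -> column index)
vocabulary = {"php eval": 0, "eval _post": 1, "_post cmd": 2}

black = ["php eval", "eval _post", "_post cmd"]  # 2-grams of a made-up webshell
white = ["php echo", "echo hello", "php eval"]   # 2-grams of a made-up benign file

x1 = [count_vector(black, vocabulary)]
x2 = [count_vector(white, vocabulary)]
x = x1 + x2
y = [1] * len(x1) + [0] * len(x2)
print(x)  # [[1, 1, 1], [1, 0, 0]]
print(y)  # [1, 0]
```

One consequence of this design is that the white samples are described only in terms of 2-grams that also occur in webshells.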

3. Train the model and 4. validate the results: same approach as above

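III. Detecting WebShells (2)

A webshell is essentially a remote administration tool, and its management features are really just a series of function calls, so here we try building features from the function calls.

1. Data collection

For the black sample set, generate a global vocabulary with the 1-gram algorithm.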

webshell_bigram_vectorizer=CountVectorizer(ngram_range=(1,1),decode_error="ignore",token_pattern=r'\b\w+\b\(|\'\w+\'')
webshell_files_list = load_files("...")
x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
vocabulary = webshell_bigram_vectorizer.vocabulary_
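
The token pattern here is meant to pick out function calls rather than ordinary words. Assuming its intended form is "a name followed by `(`, or a single-quoted word", a quick check with `re.findall` shows what it extracts. The constant name and the PHP sample are made up for illustration.

```python
import re

# Assumed form of the token pattern: a name followed by "(" or a single-quoted word
FUNC_CALL_PATTERN = r"\b\w+\b\(|'\w+'"

# Made-up one-line PHP sample, for illustration only
sample = "<?php @eval(base64_decode('payload')); ?>"

tokens = re.findall(FUNC_CALL_PATTERN, sample.lower())
print(tokens)  # ['eval(', 'base64_decode(', "'payload'"]
```

The plain word `php` is skipped because it is not followed by an opening parenthesis, so the vocabulary ends up dominated by call names such as `eval(`.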

2. Feature extraction

# Use the same 1-gram range and token pattern as the black-sample vocabulary
wp_bigram_vectorizer = CountVectorizer(ngram_range=(1,1),decode_error="ignore",token_pattern=r'\b\w+\b\(|\'\w+\'',min_df=1,vocabulary = vocabulary)
wp_files_list = load_files("...")
x2 = wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()

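Steps 3 and 4 are the same as above

IV. Detecting DGA Domains

1. Data collection: load the top 1000 Alexa domains as white samples, labeled 0, then load the DGA domains of the cryptolocker and post-tovar-goz families, labeled 1 and 2 respectively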

x1_domain_list = load_alexa("...")
x2_domain_list = load_dga("...")
x3_domain_list = load_dga("...")
x_domain_list = np.concatenate((x1_domain_list,x2_domain_list,x3_domain_list))
y1=[0]*len(x1_domain_list)
y2=[1]*len(x2_domain_list)
y3=[2]*len(x3_domain_list)
y = np.concatenate((y1,y2,y3))

2. Feature extraction

cv=CountVectorizer(ngram_range=(2,2),decode_error="ignore",token_pattern=r"\w",min_df=1)
x=cv.fit_transform(x_domain_list).toarray()
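
With `token_pattern=r"\w"` every word character is its own token, so the 2-grams are pairs of adjacent characters. A small pure-Python sketch of that splitting follows; the helper `char_bigrams` is hypothetical and the domain is just an example.

```python
import re
from collections import Counter


def char_bigrams(domain):
    """Treat each word character as a token and pair up adjacent tokens."""
    chars = re.findall(r"\w", domain.lower())
    return [a + " " + b for a, b in zip(chars, chars[1:])]


bigrams = char_bigrams("google.com")
print(bigrams)
# ['g o', 'o o', 'o g', 'g l', 'l e', 'e c', 'c o', 'o m']

# Per-domain counts of each character 2-gram
counts = Counter(bigrams)
```

The dot is not a word character, so it is dropped and the pair `e c` spans the boundary between the name and the TLD.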

Steps 3 and 4 work the same way as above

V. Detecting DDoS Attacks on Apache

The DDoS-related features in the KDD 99 dataset are mainly:

· Basic network connection features: duration, src_bytes, dst_bytes, land, wrong_fragment, urgent

· Time-based network traffic statistics: count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate

· Host-based network traffic statistics: dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate

1. Data collection: load the KDD 99 dataset and keep only the records labeled apache2 or normal that use the HTTP protocol, same as in (Detecting Rootkits)

2. Feature extraction: select the DDoS-related features, same as in (Detecting Rootkits)
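
A rough sketch of the filtering and feature selection, assuming the standard KDD 99 column order (41 features followed by the label) and labels written with a trailing dot such as `normal.`. The helper names and the sample rows are made up; real rows would come from parsing the dataset file.

```python
# Column positions in a KDD 99 record (41 features followed by the label)
SERVICE_COL = 2
LABEL_COL = 41
BASIC = [0, 4, 5, 6, 7, 8]         # duration, src_bytes, dst_bytes, land, wrong_fragment, urgent
TIME_BASED = list(range(22, 31))   # count ... srv_diff_host_rate
HOST_BASED = list(range(31, 41))   # dst_host_count ... dst_host_srv_rerror_rate
FEATURE_COLS = BASIC + TIME_BASED + HOST_BASED


def select(rows):
    """Keep HTTP records labeled apache2 or normal; return features and labels."""
    x, y = [], []
    for row in rows:
        if row[SERVICE_COL] == "http" and row[LABEL_COL] in ("apache2.", "normal."):
            x.append([float(row[i]) for i in FEATURE_COLS])
            y.append(1 if row[LABEL_COL] == "apache2." else 0)
    return x, y


def make_row(service, label):
    """Build a made-up 42-column record; the numeric fields are all zero."""
    row = ["0"] * 41 + [label]
    row[1], row[SERVICE_COL], row[3] = "tcp", service, "SF"
    return row


rows = [make_row("http", "normal."), make_row("http", "apache2."),
        make_row("smtp", "normal."), make_row("http", "back.")]
x, y = select(rows)
print(len(x), len(x[0]), y)  # 2 25 [0, 1]
```

The 25 selected columns are the 6 basic, 9 time-based and 10 host-based features listed above.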

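Steps 3 and 4 work the same way as above

VI. Recognizing CAPTCHAs

1. Data collection: MNIST is an entry-level computer vision dataset containing images of handwritten digits together with their labels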

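import gzip
import pickle

def load_data():
    with gzip.open('..') as fp:
        # encoding='latin1' is needed on Python 3 if the pickle was written by Python 2
        training_data, valid_data, test_data = pickle.load(fp, encoding='latin1')
    return training_data, valid_data, test_data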

2. Feature extraction: MNIST has already flattened each 28*28 image into a one-dimensional vector of length 784.
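
As a small illustration of that flattening, the sketch below turns a 28*28 grid into a 784-element vector in row-major order. The helper `flatten` and the blank image with a single lit pixel are made up for the example.

```python
def flatten(image):
    """Concatenate the rows of a 2-D image into one flat list (row-major)."""
    return [pixel for row in image for pixel in row]


# Made-up 28x28 image: all background except one pixel at row 14, column 14
image = [[0.0] * 28 for _ in range(28)]
image[14][14] = 1.0

vector = flatten(image)
print(len(vector))        # 784
print(vector.index(1.0))  # 406, i.e. 14 * 28 + 14
```

Each of the 784 positions then acts as one feature for the Gaussian naive Bayes classifier.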

3. Train the model

training_data, valid_data, test_data = load_data()
x1,y1=training_data
x2,y2=test_data
clf=GaussianNB()
clf.fit(x1,y1)

4. Validate the results: same approach as above
