Feature Selection Methods for Big Data Bioinformatics: A Survey from the Search Perspective

Citation

BibTeX

@article{WANG201621,
title = "Feature selection methods for big data bioinformatics: A survey from the search perspective",
journal = "Methods",
volume = "111",
pages = "21 - 31",
year = "2016",
note = "Big Data Bioinformatics",
issn = "1046-2023",
doi = "https://doi.org/10.1016/j.ymeth.2016.08.014",
url = "http://www.sciencedirect.com/science/article/pii/S1046202316302742",
author = "Lipo Wang and Yaoli Wang and Qing Chang",
keywords = "Biomarkers, Classification, Clustering, Computational biology, Computational intelligence, Data mining, Evolutionary computation, Evolutionary algorithms, Fuzzy logic, Genetic algorithms, Machine learning, Microarray, Neural networks, Particle swarm optimization, Pattern recognition, Random forests, Rough sets, Soft computing, Swarm intelligence, Support vector machines"
}

Normal

Lipo Wang, Yaoli Wang, Qing Chang,
Feature selection methods for big data bioinformatics: A survey from the search perspective,
Methods,
Volume 111,
2016,
Pages 21-31,
ISSN 1046-2023,
https://doi.org/10.1016/j.ymeth.2016.08.014.
(http://www.sciencedirect.com/science/article/pii/S1046202316302742)
Keywords: Biomarkers; Classification; Clustering; Computational biology; Computational intelligence; Data mining; Evolutionary computation; Evolutionary algorithms; Fuzzy logic; Genetic algorithms; Machine learning; Microarray; Neural networks; Particle swarm optimization; Pattern recognition; Random forests; Rough sets; Soft computing; Swarm intelligence; Support vector machines


Abstract

Big data bioinformatics

Applications of feature selection

Traditional taxonomy:
1. filter
2. wrapper
3. embedded

New taxonomy (viewing feature selection as a combinatorial optimization / search problem):
1. exhaustive search
2. heuristic search, with or without a data-extracted feature-ranking method
3. hybrid methods


1 Truly optimal feature selection: exhaustive search

Classifiers:

  1. random forests
  2. support vector machines (SVMs)
  3. cluster-oriented ensemble classifiers
  4. random vector functional link (RVFL) networks
  5. radial basis function (RBF) neural networks

Searching for the truly optimal feature subset is computationally expensive: the problem is NP-hard.

It requires exhausting all possible feature combinations.

"Combinatorial explosion": the number of candidate subsets grows exponentially with the number of features.

With more than about 30 original features, exhaustive search becomes impossible in practice.
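The combinatorial explosion can be made concrete with a minimal sketch of exhaustive subset search. The scoring function below is a toy stand-in for a real wrapper criterion such as classifier accuracy; all names are illustrative, not from the survey.

```python
from itertools import combinations

def exhaustive_search(n_features, evaluate):
    """Enumerate every non-empty feature subset and return the best one.

    `evaluate` maps a tuple of feature indices to a score (higher is better).
    There are 2^n - 1 non-empty subsets, hence the combinatorial explosion:
    for n > 30 this loop already exceeds a billion evaluations.
    """
    best_subset, best_score = None, float("-inf")
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy criterion: features 1 and 3 are jointly informative,
# and every extra selected feature costs a small penalty.
def toy_score(subset):
    return len({1, 3} & set(subset)) - 0.1 * len(subset)

best, score = exhaustive_search(5, toy_score)
print(best)  # (1, 3)
```

With only 5 features the loop visits 31 subsets; at 30 features it would visit over 10^9, which is why the survey treats exhaustive search as feasible only for small problems.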


2 Suboptimal feature selection: heuristic search

"Heuristic search": a search guided by "experience" or by "educated choices", in the hope of finding a good suboptimal solution, or even the global optimum.

Better than random search.

Necessary ingredients of such an algorithm:
1. local improvement
2. innovation (exploration)

Simulated annealing: accepts a worse solution with some probability, which helps the search escape local optima.

genetic algorithm (GA)
ant colony optimization (ACO)
particle swarm optimization (PSO)
chaotic simulated annealing
tabu search
noisy chaotic simulated annealing
branch-and-bound
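The role of probabilistic acceptance in simulated annealing can be sketched as follows. The neighbourhood (single-bit flips of a feature mask), the geometric cooling schedule, and the toy objective are my own illustrative assumptions, not details taken from the survey.

```python
import math
import random

def simulated_annealing_fs(n_features, evaluate, n_iter=2000,
                           t0=1.0, cooling=0.995, seed=0):
    """Heuristic search over binary feature masks.

    A worse neighbour is accepted with probability exp(delta / T),
    which lets the search escape local optima; T decays geometrically.
    """
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    score = evaluate(mask)
    best_mask, best_score = mask[:], score
    t = t0
    for _ in range(n_iter):
        i = rng.randrange(n_features)       # neighbour: flip one random bit
        mask[i] ^= 1
        new_score = evaluate(mask)
        delta = new_score - score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            score = new_score               # accept (possibly worse) move
            if score > best_score:
                best_mask, best_score = mask[:], score
        else:
            mask[i] ^= 1                    # reject: undo the flip
        t *= cooling
    return best_mask, best_score

# Toy criterion: features 1 and 3 matter, extra features are penalized.
target = {1, 3}
def toy_score(mask):
    chosen = {i for i, b in enumerate(mask) if b}
    return len(target & chosen) - 0.1 * len(chosen)

mask, score = simulated_annealing_fs(12, toy_score)
```

Only 2000 evaluations are spent here, versus 2^12 - 1 = 4095 for exhaustive search; the gap widens exponentially as the feature count grows.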


A Heuristic-search-based feature selection without data-extracted feature-importance ranking

Binary vector: each bit indicates whether the corresponding feature is selected.
the nearest neighbor classifier
case-based reasoning
a leave-one-out procedure
succinct rules
silhouette statistics
microarray
peak tree
Input weights of an SVM or a neural network (embedded): a feature-importance ranking that is not derived directly from the data.
statistical analysis of weights
K-means + SVM
margin influence analysis (MIA) + SVM
Mann–Whitney U test: a nonparametric test method, with no distribution-related assumptions
in a hybrid descriptor space
Blocking (modularization)
Aggregating the outputs of multiple learning algorithms to evaluate gene subsets clearly improves performance, independently of the classification algorithm used.
Quantitative structure–activity relationships (QSARs): biological activities of chemical compounds + their physicochemical descriptors
lexico-semantic event structures
a noun argument structure
corpus-based SRL systems
nonparallel plane proximal classifiers
SVM + L_p regularization (for high-dimensional data)
the support feature machine (SFM)
fuzzy-rough sets
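Several of the works above wrap a classifier around candidate binary masks, for example a nearest-neighbour classifier scored with a leave-one-out procedure. A minimal sketch of such an evaluation, on an invented toy data set:

```python
def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier,
    using only the features whose bit is set in `mask`."""
    idx = [i for i, b in enumerate(mask) if b]
    if not idx:
        return 0.0  # empty subset: nothing to classify with
    correct = 0
    for i in range(len(X)):
        best_d, best_j = float("inf"), -1
        for j in range(len(X)):
            if j == i:
                continue  # leave sample i out
            d = sum((X[i][k] - X[j][k]) ** 2 for k in idx)
            if d < best_d:
                best_d, best_j = d, j
        correct += y[best_j] == y[i]
    return correct / len(X)

# Feature 0 separates the two classes; feature 1 is pure noise.
X = [[0.0, 5.1], [0.2, -3.0], [0.1, 9.9],
     [1.0, 4.8], [1.2, -2.7], [1.1, 0.3]]
y = [0, 0, 0, 1, 1, 1]
print(loo_1nn_accuracy(X, y, [1, 0]))  # prints 1.0
```

A heuristic search would call this function once per candidate mask, which is exactly why wrapper evaluations dominate the running time.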

feature evaluation criteria 特征评价标准:
1. dependency
2. relevance
3. redundancy
4. significance

the signal-to-noise ratio (SNR)

a Laplace naive Bayes model: uses the Laplace distribution in place of the normal distribution

Array comparative genomic hybridization (aCGH)

V. Metsis, F. Makedon, D. Shen, H. Huang, DNA copy number selection using robust structured sparsity-inducing norms, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 168–181, http://dx.doi.org/10.1109/TCBB.2013.141.


B Greedy search with data-extracted feature-importance ranking

First evaluate the importance of each individual feature.

A feature subset that is best for one classifier is not necessarily good for another.

Importance measures (derived directly from the input data):
1. t-test
2. fold-change difference
3. Z-score
4. Pearson correlation coefficient
5. relative entropy
6. mutual information
7. separability-correlation measure
8. feature relevance
9. label changes produced by each feature
10. information gain
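As one example of a data-extracted measure, a signal-to-noise-style ratio (the SNR mentioned earlier, closely related to the t-test in this list) can be computed per feature and used to rank them. The data set and function names below are illustrative:

```python
from statistics import mean, pstdev

def snr_ranking(X, y):
    """Rank features by the two-class signal-to-noise ratio
    |mu_0 - mu_1| / (sigma_0 + sigma_1), computed per feature."""
    n_features = len(X[0])
    scores = []
    for k in range(n_features):
        a = [row[k] for row, label in zip(X, y) if label == 0]
        b = [row[k] for row, label in zip(X, y) if label == 1]
        denom = pstdev(a) + pstdev(b)
        scores.append(abs(mean(a) - mean(b)) / denom if denom else 0.0)
    # feature indices sorted by decreasing importance
    return sorted(range(n_features), key=lambda k: -scores[k]), scores

# Feature 0 cleanly separates the classes; feature 1 is noisy.
X = [[0.0, 3.0], [0.1, -2.0], [0.2, 4.0],
     [1.0, 3.5], [1.1, -2.5], [1.2, 4.5]]
y = [0, 0, 0, 1, 1, 1]
ranking, scores = snr_ranking(X, y)
print(ranking[0])  # prints 0: feature 0 is ranked most important
```

A greedy method then simply keeps the top-k features of the ranking, which costs one pass over the data instead of a combinatorial search.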

Dimensionality reduction methods:

  • class-separability measure
  • Fisher ratio
  • principal components analysis (PCA)
  • t-test

Four feature selection (FS) methods:

  • t-test
  • significance analysis of microarrays (SAM)
  • rank products (RP)
  • random forest (RF)

3 Hybrid feature selection techniques


A Semi-exhaustive search

1. Pick a small number of important features, using e.g.:
- a feature-importance ranking measure
- the Fisher-Markov selector
- an equal-width discretization scheme
- an ensemble of several classical statistical methods
- high predictive power
2. Run a further search over the reduced feature set, using e.g.:
- exhaustive search
- multi-objective optimization
- an embedded GA, tabu search (TS), and SVM
- a graph optimization model
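The two-stage idea, filter first and then exhaustively search the survivors, can be sketched as follows; the scores, evaluator, and parameter names are illustrative assumptions:

```python
from itertools import combinations

def hybrid_select(scores, evaluate, top_k=8, subset_size=3):
    """Two-stage semi-exhaustive search.

    Stage 1: keep only the `top_k` features with the highest importance
    scores (any filter measure would do).
    Stage 2: exhaustively evaluate every `subset_size`-feature subset of
    the survivors, which is cheap because top_k is small.
    """
    candidates = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    best, best_val = None, float("-inf")
    for subset in combinations(candidates, subset_size):
        val = evaluate(subset)
        if val > best_val:
            best, best_val = subset, val
    return best

# Toy setup: 20 features, 8 of them pre-scored as important; the
# evaluator rewards picking features 2, 5 and 7 together.
scores = [1.0 if i in (2, 5, 7, 11, 13, 17, 18, 19) else 0.0
          for i in range(20)]
best = hybrid_select(scores, lambda s: len({2, 5, 7} & set(s)))
print(best)  # (2, 5, 7)
```

Stage 2 evaluates C(8, 3) = 56 subsets instead of C(20, 3) = 1140 over the full feature set, which is the whole point of the hybrid design.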


B Other hybrid feature selection methods

Feature extraction methods:

spectral biclustering
sparse component analysis
Poisson model
scatter matrix
singular value decomposition
weighted PCA
robust principal component analysis
linear discriminant analysis
Laplacian linear discriminant analysis (LLDA)
Laplacian score
SVD-entropy
nonnegative matrix factorization (NMF)
sparse NMF (SNMF)
artificial neural network classification scheme
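Most methods in this list project the data onto a small number of extracted components rather than selecting raw features. As a minimal stand-in for PCA/SVD, the leading principal component can be found by power iteration on the covariance matrix; this is an illustrative sketch, not any specific method from the list:

```python
def top_principal_component(X, n_iter=200):
    """Leading principal component via power iteration on the
    covariance matrix (a minimal stand-in for full PCA / SVD)."""
    n, d = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(d)]
    Xc = [[row[k] - means[k] for k in range(d)] for row in X]  # center
    # covariance matrix C = Xc^T Xc / n
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalize each iteration
    return v

# Almost all the variance lies along feature 0, so the leading
# component should point (nearly) along that axis.
X = [[-2.0, 0.1], [-1.0, -0.1], [0.0, 0.0], [1.0, 0.1], [2.0, -0.1]]
v = top_principal_component(X)
```

Projecting each sample onto `v` then yields a one-dimensional extracted feature, the same pattern the listed decomposition methods follow with more components.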


4 Summary and outlook

Big data bioinformatics


A The small-sample problem

Dimensionality (number of genes) is very high: often > 20,000.
Sample size is tiny: on the order of 50 patients.

overfitting and overoptimism


B Imbalanced data

The classes contain unequal numbers of samples.

Up-sampling the classes with fewer data, down-sampling the classes with more data.

Making classification errors sensitive to classes (cost-sensitive learning).
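A minimal sketch of the up-sampling option, duplicating minority-class samples at random until every class matches the largest one; the function name and data are illustrative:

```python
import random

def oversample_minority(X, y, seed=0):
    """Random up-sampling: duplicate samples of the smaller classes
    until every class has as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

# 4 samples of class 0 versus a single sample of class 1.
X = [[0], [1], [2], [3], [10]]
y = [0, 0, 0, 0, 1]
Xb, yb = oversample_minority(X, y)
print(yb.count(0), yb.count(1))  # prints 4 4
```

Down-sampling is the mirror image (randomly discarding majority-class samples), and cost-sensitive learning instead reweights the errors without touching the data.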

signal-to-noise correlation coefficient (S2N)
Feature Assessment by Sliding Thresholds (FAST)

empirical mutual information — the data sparseness issue
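Empirical mutual information is estimated from joint frequencies, which is exactly where data sparseness bites: with few samples the plug-in estimate is biased upward. A minimal sketch over discrete sequences:

```python
from math import log2
from collections import Counter

def empirical_mi(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two discrete
    sequences, using plug-in (frequency) probability estimates."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n  # joint probability estimate
        # p * n * n / (px[x] * py[y]) equals p(x,y) / (p(x) * p(y))
        mi += p * log2(p * n * n / (px[x] * py[y]))
    return mi

# A feature perfectly predictive of the class gives I = H(Y) = 1 bit.
print(round(empirical_mi([0, 0, 1, 1], [0, 0, 1, 1]), 6))  # prints 1.0
```

With many feature values and few samples, most joint cells are empty or hold a single count, so the estimate becomes unreliable; this is the sparseness issue the survey raises.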

multivariate normal distributions


C Class-dependent feature selection

Select a different feature subset for each class.

class-independent FS
class-dependent FS

class distributions
RBF neural classifier — the clustering property
GA
SVM
the multi-layer perceptron (MLP) neural network
the probability density function (PDF) projection theorem
principal component analysis (PCA) from class-specific subspaces

A C-class classification problem can be decomposed into C two-class classifiers.
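The one-vs-rest decomposition with class-dependent feature subsets can be sketched structurally as follows; the per-class scoring functions and feature indices are hypothetical placeholders, not any classifier from the survey:

```python
def predict_one_vs_rest(x, class_models):
    """Class-dependent FS: each class c owns its own feature subset and
    two-class scoring function; a C-class problem becomes C two-class
    problems, and the highest-scoring class wins."""
    best_c, best_s = None, float("-inf")
    for c, (feature_idx, score_fn) in class_models.items():
        s = score_fn([x[k] for k in feature_idx])  # class-specific features
        if s > best_s:
            best_c, best_s = c, s
    return best_c

# Hypothetical models: class 0 looks only at feature 0,
# class 1 looks only at feature 2; each scores proximity to a prototype.
models = {
    0: ([0], lambda v: -abs(v[0] - 1.0)),  # high when feature 0 is near 1
    1: ([2], lambda v: -abs(v[0] - 5.0)),  # high when feature 2 is near 5
}
print(predict_one_vs_rest([1.0, 9.9, 0.0], models))  # prints 0
```

The point of the structure is that each two-class model never even reads the features selected for the other classes, which is what distinguishes class-dependent from class-independent FS.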

feature importance measures:

  • RELIEF
  • class separability
  • minimal-redundancy-maximal-relevancy

full class relevant (FCR) and partial class relevant (PCR) features

Markov blanket

multiclass ranking statistics
class-specific statistics
Pareto-front — alleviates the bias
F-score and KW-score

a binary tree of simpler classification subproblems

feature subsets of every class


Reprinted from blog.csdn.net/u010203404/article/details/80154902