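0. Preface

In machine learning algorithms for regression, classification, clustering and so on, computing the distance (similarity) between features is very important. Most common distance and similarity measures, cosine similarity for example, are computed in Euclidean space, which assumes that the data is continuous and ordered. In many machine learning tasks, however, the data is discrete: the days of the week (Monday, Tuesday, ..., Sunday), a person's gender (male or female), a home country (China, USA, France) and so on. These feature values are discrete and unordered rather than continuous.
To use them as input to a machine learning algorithm, we usually need to digitize the features. What does digitizing a feature mean? For example:

       Gender feature: ["male", "female"]

       Country feature: ["China", "USA", "France"]

       Sport feature: ["football", "basketball", "badminton", "table tennis"]

How do we digitize the features above? Suppose a person has the features ["male", "China", "table tennis"]; how do we represent him?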

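1. One-Hot Encoding

One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time. It converts a categorical variable into a form that machine learning algorithms can use easily. The resulting vector is the feature vector of a single attribute with exactly one active position: one entry is nonzero and all the others are 0, so the vector is very sparse.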

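1.1 One-Hot Encoding Examples

Example 1:
We have four samples, each with three features, as shown in the table:

	Feature 1	Feature 2	Feature 3
Sample 1	1	4	3
Sample 2	2	3	2
Sample 3	1	2	2
Sample 4	2	1	1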

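In the samples above, feature 1 has two possible values; if it represents gender, 1 could mean male and 2 female. Feature 2 has four possible values and could represent some other attribute, and feature 3, with three possible values, likewise has its own meaning.
One-hot encoding guarantees that, within each sample, every single feature has exactly one bit set to 1 and all the others set to 0. After encoding, the samples become:

	Feature 1	Feature 2	Feature 3
Sample 1	01	1000	100
Sample 2	10	0100	010
Sample 3	01	0010	010
Sample 4	10	0001	001

Each feature is one-hot encoded separately: a feature with 2 possible values uses 2 bits, and one with 4 possible values uses 4 bits. (In this table, value k sets the k-th bit counting from the right.)
For the example in the preface, each feature value can be mapped to a concrete code: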

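Gender feature: ["male", "female"] (there are only two possible values here, so N=2):
male => 10
female => 01
Country feature: ["China", "USA", "France"] (N=3):
China => 100
USA => 010
France => 001
Sport feature: ["football", "basketball", "badminton", "table tennis"] (N=4):
football => 1000
basketball => 0100
badminton => 0010
table tennis => 0001
So for a sample ["male", "China", "table tennis"], the complete digitized feature vector is:
[1, 0, 1, 0, 0, 0, 0, 0, 1]
The first two bits encode gender, the middle three the country, and the last four the sport.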
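
The concatenation described above can be sketched in plain Python. The `FEATURES` table and the helper `one_hot_encode` are illustrative names, not part of any library; English category names are assumed, listed in the same order as in the preface.

```python
# Category lists in the order used in the text; the position of a value
# in its list decides which bit is set.
FEATURES = [
    ["male", "female"],                                       # gender
    ["China", "USA", "France"],                               # country
    ["football", "basketball", "badminton", "table tennis"],  # sport
]


def one_hot_encode(sample, features=FEATURES):
    """Concatenate one one-hot block per feature (hypothetical helper)."""
    encoded = []
    for value, categories in zip(sample, features):
        block = [0] * len(categories)
        # list.index raises ValueError for a value that is not in the list
        block[categories.index(value)] = 1
        encoded.extend(block)
    return encoded


vector = one_hot_encode(["male", "China", "table tennis"])
print(vector)  # [1, 0, 1, 0, 0, 0, 0, 0, 1]
```

The total length is simply the sum of the per-feature category counts (2 + 3 + 4 = 9).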

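1.2 Advantages of One-Hot Encoding

It handles discrete feature values that machine learning algorithms cannot easily handle directly.
It expands the feature dimension to some extent: gender is originally a single feature, but after one-hot encoding it becomes two features, male and female.
It extends the values of a discrete feature into Euclidean space, so that each value corresponds to a point in that space. One-hot encoding discrete features therefore makes distance computations between feature values more reasonable.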
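
To make the last point concrete, here is a small sketch comparing Euclidean distances under integer labels and under one-hot vectors. The integer labels (football=0, basketball=1, table tennis=3) are an assumption chosen for illustration; the one-hot codes are the ones from section 1.1.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Integer labels impose an artificial order on the categories.
label = {"football": [0], "basketball": [1], "table tennis": [3]}

# One-hot vectors place every category at the same distance from the others.
one_hot = {
    "football":     [1, 0, 0, 0],
    "basketball":   [0, 1, 0, 0],
    "table tennis": [0, 0, 0, 1],
}

label_near = euclidean(label["football"], label["basketball"])
label_far = euclidean(label["football"], label["table tennis"])
onehot_near = euclidean(one_hot["football"], one_hot["basketball"])
onehot_far = euclidean(one_hot["football"], one_hot["table tennis"])

print(label_near, label_far)    # distances differ: 1.0 versus 3.0
print(onehot_near, onehot_far)  # both equal sqrt(2)
```

With integer labels, football looks three times closer to basketball than to table tennis, which has no real meaning; with one-hot vectors all pairs are equally far apart.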

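1.3 Disadvantages of One-Hot Encoding

If the original label encoding is ordered, one-hot encoding loses the order information.
If a feature has a very large number of values, the result is a large, highly redundant sparse matrix.
Relationships between dimensions (words) are not captured. Each word is its own dimension, independent of all the others, yet this independence clearly does not match reality, because a great many words are related. For example:
- Semantics: girl and woman apply to different ages, but both refer to a female.
- Plural: word and words differ only in number.
- Tense: buy and bought both mean "to buy", but at different times.
With a one-hot representation, none of these properties is taken into account.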
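
The lack of any relationship between dimensions can be seen directly from the dot product. The sketch below assumes a toy six-word vocabulary built from the examples above; `one_hot` and `dot` are illustrative helpers.

```python
vocab = ["girl", "woman", "word", "words", "buy", "bought"]


def one_hot(word):
    """One-hot vector of a word over the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


related = dot(one_hot("girl"), one_hot("woman"))     # semantically close pair
unrelated = dot(one_hot("girl"), one_hot("bought"))  # unrelated pair
same = dot(one_hot("girl"), one_hot("girl"))

print(related, unrelated, same)  # 0 0 1
```

Any two different words have a dot product of 0, so a related pair such as girl/woman is indistinguishable from an unrelated pair.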

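1.4 When to Use One-Hot Encoding

One-hot encoding addresses the problem of discrete values in categorical data. If a feature is discrete but distances can already be computed sensibly without one-hot encoding, there is no need to apply it.
Some tree-based algorithms do not rely on a vector-space metric when handling variables: the numeric values are only category symbols with no partial order, so tree models generally do not need one-hot encoding. For a decision tree there is no notion of feature magnitude, only of which partition a feature value falls into, and one-hot encoding essentially just increases the depth of the tree. GBDT, for example, does not perform well on high-dimensional sparse matrices, and even on low-dimensional sparse matrices it is not necessarily better than an SVM.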

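2. Implementing One-Hot Encoding

2.1 A Simple One-Hot Encoding in Python

import numpy as np

samples = ['I like playing basketball', 'I played football yesterday morning']

# Build the vocabulary: each new word gets the next index, starting at 1
# (index 0 is left unused).
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are encoded.
max_length = 10
# Shape: (number of samples, max_length, largest index + 1)
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1  # word j of sample i -> one-hot row
print(results)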
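
One way to check such an encoding is to map each one-hot row back to its word. The sketch below rebuilds the same vocabulary in plain Python; `encode`, `decode` and `index_token` are illustrative names that do not appear in the code above.

```python
samples = ["I like playing basketball", "I played football yesterday morning"]

# Same indexing rule as above: first-seen order, starting at 1 (0 stays unused).
token_index = {}
for sample in samples:
    for word in sample.split():
        token_index.setdefault(word, len(token_index) + 1)

# Reverse lookup: index -> word
index_token = {index: word for word, index in token_index.items()}
width = max(token_index.values()) + 1


def encode(sentence):
    """Return one one-hot row per word of the sentence."""
    rows = []
    for word in sentence.split():
        row = [0] * width
        row[token_index[word]] = 1
        rows.append(row)
    return rows


def decode(rows):
    """Recover the sentence from the position of the 1 in each row."""
    return " ".join(index_token[row.index(1)] for row in rows)


encoded = encode("I played football yesterday morning")
decoded = decode(encoded)
print(decoded)
```

Because each row keeps its position in the sentence, this word-level encoding is lossless for known words; unknown words would raise a KeyError in this sketch.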

2.2 sklearn

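sklearn's OneHotEncoder() produces a one-hot encoding; in older versions of scikit-learn (before 0.20) it only accepted numeric data. In those versions, the feature_indices_ attribute of OneHotEncoder() shows which output columns correspond to which original feature.
numpy.hstack() can be used to concatenate the results of several encodings into the final transformed result.
Limitation: in those older versions, string data cannot be encoded directly (LabelEncoder() + OneHotEncoder() works, but requires converting the data format first).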

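import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# n_values_ and feature_indices_ were removed in scikit-learn 0.22;
# the same values are derived here from categories_.
n_values = np.array([len(c) for c in enc.categories_])
feature_indices = np.concatenate(([0], np.cumsum(n_values)))
print("enc.n_values_ is:", n_values)
print("enc.feature_indices_ is:", feature_indices)
print(enc.transform([[0, 1, 1]]).toarray())
print(enc.transform([[1, 1, 1]]).toarray())
print(enc.transform([[1, 2, 1]]).toarray())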

Output:

enc.n_values_ is: [2 3 4]
enc.feature_indices_ is: [0 2 5 9]    # feature offsets
[[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
[[0. 1. 0. 1. 0. 0. 1. 0. 0.]]
[[0. 1. 0. 0. 1. 0. 1. 0. 0.]]

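enc.n_values_: the number of distinct values of each feature. The first feature has 2, the second has 3, and the third has 4.
enc.feature_indices_: the column range each feature occupies in the one-hot vector (end exclusive): columns 0-2 belong to the first feature, 2-5 to the second, and 5-9 to the third.
The last three lines convert feature values to one-hot codes; comparing the results shows how the encodings differ.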
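
From scikit-learn 0.20 onward, OneHotEncoder also accepts string categories directly, so the LabelEncoder() step is no longer needed. The following is a minimal sketch assuming scikit-learn 0.20 or later; the `people` data is made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one row per person, one column per feature.
people = [
    ["male", "China", "table tennis"],
    ["female", "USA", "football"],
    ["male", "France", "basketball"],
    ["female", "China", "badminton"],
]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(people)

# categories_ holds the learned categories of each column, sorted.
for column in encoder.categories_:
    print(list(column))

row = encoder.transform([["male", "China", "table tennis"]]).toarray()
print(row)  # [[0. 1. 1. 0. 0. 0. 0. 0. 1.]]
```

Note that the categories are sorted alphabetically within each feature, so the column order differs from the hand-written codes in section 1.1 (here "female" comes before "male").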

2.3 Keras

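from keras.preprocessing.text import Tokenizer

samples = ['I like playing basketball', 'I played football yesterday morning']
# Limit the vocabulary to the most frequent words (num_words=1000)
tokenizer = Tokenizer(num_words=1000)
# Build the word index from the samples
tokenizer.fit_on_texts(samples)
# Convert each text into a list of integer word indices
sequences = tokenizer.texts_to_sequences(samples)
# One binary vector of length num_words per text
one_hot_results = tokenizer.texts_to_matrix(samples, mode="binary")
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))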

2.4 tensorflow

Official documentation:

tf.one_hot(
    indices,
    depth,
    on_value=None,
    off_value=None,
    axis=None,
    dtype=None,
    name=None
)
Returns a one-hot tensor.

The locations represented by indices in indices take value on_value, while all other locations take value off_value.

on_value and off_value must have matching data types. If dtype is also provided, they must be the same data type as specified by dtype.

If on_value is not provided, it will default to the value 1 with type dtype.

If off_value is not provided, it will default to the value 0 with type dtype.

If the input indices is rank N, the output will have rank N+1. The new axis is created at dimension axis (default: the new axis is appended at the end).

If indices is a scalar the output shape will be a vector of length depth.

If indices is a vector of length features, the output shape will be:

  features x depth if axis == -1
  depth x features if axis == 0

If indices is a matrix (batch) with shape [batch, features], the output shape will be:

  batch x features x depth if axis == -1
  batch x depth x features if axis == 1
  depth x batch x features if axis == 0

Implementation:

import tensorflow as tf

indices = [0, 1, 2]  # input vector: the indices to encode are [0, 1, 2]
depth = 3
tf.one_hot(indices, depth)  # output: [3 x 3]
# [[1., 0., 0.],
#  [0., 1., 0.],
#  [0., 0., 1.]]

indices = [0, 2, -1, 1]  # input vector: the indices to encode are [0, 2, -1, 1]
depth = 3
tf.one_hot(indices, depth,
           on_value=5.0, off_value=0.0,
           axis=-1)  # output: [4 x 3]
# [[5.0, 0.0, 0.0],  # one_hot(0)   index 0: slot 0 is set to on_value
#  [0.0, 0.0, 5.0],  # one_hot(2)   index 2: slot 2 is set to on_value
#  [0.0, 0.0, 0.0],  # one_hot(-1)  index -1 is out of range: every slot is off_value
#  [0.0, 5.0, 0.0]]  # one_hot(1)   index 1: slot 1 is set to on_value

indices = [[0, 2], [1, -1]]  # the input is a matrix
depth = 3
tf.one_hot(indices, depth,
           on_value=1.0, off_value=0.0,
           axis=-1)  # output: [2 x 2 x 3]
# [[[1.0, 0.0, 0.0],   # one_hot(0)   element at position (0, 0)
#   [0.0, 0.0, 1.0]],  # one_hot(2)   element at position (0, 1)
#  [[0.0, 1.0, 0.0],   # one_hot(1)   element at position (1, 0)
#   [0.0, 0.0, 0.0]]]  # one_hot(-1)  element at position (1, 1)
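
The shape rule for the default axis=-1 can be mimicked in a few lines of NumPy, which may help when TensorFlow is not at hand. `one_hot_np` is a hypothetical helper written for this sketch, not a TensorFlow or NumPy function, and it only covers the default axis.

```python
import numpy as np


def one_hot_np(indices, depth, on_value=1.0, off_value=0.0):
    """NumPy sketch of tf.one_hot with the default axis=-1 (hypothetical helper)."""
    indices = np.asarray(indices)
    # Compare every index against 0..depth-1 along a new trailing axis;
    # an out-of-range index such as -1 matches nothing and stays all off_value.
    mask = indices[..., np.newaxis] == np.arange(depth)
    return np.where(mask, on_value, off_value)


vector_result = one_hot_np([0, 2, -1, 1], depth=3, on_value=5.0)
matrix_result = one_hot_np([[0, 2], [1, -1]], depth=3)

print(vector_result.shape)  # (4, 3)
print(matrix_result.shape)  # (2, 2, 3)
```

A rank-N input produces a rank-(N+1) output because the comparison adds exactly one trailing axis of length depth.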

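3. One-Hot Representation in NLP

One-hot representations used to be popular in NLP, but with the arrival of TF-IDF and word vectors they have gradually fallen out of use. The main drawbacks are:

Word order is ignored (the order of words in a text carries important information);
Words are assumed to be independent of each other, so the distance between any two distinct words is always $\sqrt 2$ (in most cases, words do influence each other);
The resulting features are discrete and sparse: the vector has as many dimensions as there are words in the vocabulary (this is the most serious problem).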
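
The first drawback shows up as soon as the one-hot vectors of a sentence are merged into a single binary vector per text. The sketch below uses a made-up three-word vocabulary; `binary_bag_of_words` is an illustrative helper, not a library function.

```python
def binary_bag_of_words(sentence, vocab):
    """1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocab]


vocab = ["dog", "bites", "man"]
a = binary_bag_of_words("dog bites man", vocab)
b = binary_bag_of_words("man bites dog", vocab)

print(a)       # [1, 1, 1]
print(a == b)  # True
```

The two sentences mean very different things, yet their vectors are identical, because the representation records only which words occur.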

Elenstone


Word Vector Series: One-Hot Encoding Explained
