VectorIndexer
Main purpose: improving the classification performance of ML methods such as decision trees and random forests.
VectorIndexer indexes the categorical (discrete-valued) features in a dataset's feature vectors.
It can automatically decide which features are categorical and index them. Concretely, you set a parameter maxCategories: any feature with at most maxCategories distinct values is re-indexed to 0~K (K <= maxCategories-1), while a feature with more than maxCategories distinct values is treated as continuous and left completely unchanged. The example below makes this concrete.
- <code> VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
-
- Take an input column of type Vector and a parameter maxCategories. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories distinct values are declared categorical.
- Compute 0-based category indices for each categorical feature.
- Index categorical features and transform original feature values to indices.
-
- Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
-
- This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.
- </code>
A quick example on a simple dataset:
- <code>
-
- VectorIndexerModel featureIndexerModel=new VectorIndexer()
- .setInputCol("features")
- .setMaxCategories(5)
- .setOutputCol("indexedFeatures")
- .fit(rawData);
-
- Pipeline pipeline=new Pipeline()
- .setStages(new PipelineStage[]
- {labelIndexerModel,
- featureIndexerModel,
- dtClassifier,
- converter});
- pipeline.fit(rawData).transform(rawData).select("features","indexedFeatures").show(20,false);
-
- +-------------------------+-------------------------+
- |features |indexedFeatures |
- +-------------------------+-------------------------+
- |(3,[0,1,2],[2.0,5.0,7.0])|(3,[0,1,2],[2.0,1.0,1.0])|
- |(3,[0,1,2],[3.0,5.0,9.0])|(3,[0,1,2],[3.0,1.0,2.0])|
- |(3,[0,1,2],[4.0,7.0,9.0])|(3,[0,1,2],[4.0,3.0,2.0])|
- |(3,[0,1,2],[2.0,4.0,9.0])|(3,[0,1,2],[2.0,0.0,2.0])|
- |(3,[0,1,2],[9.0,5.0,7.0])|(3,[0,1,2],[9.0,1.0,1.0])|
- |(3,[0,1,2],[2.0,5.0,9.0])|(3,[0,1,2],[2.0,1.0,2.0])|
- |(3,[0,1,2],[3.0,4.0,9.0])|(3,[0,1,2],[3.0,0.0,2.0])|
- |(3,[0,1,2],[8.0,4.0,9.0])|(3,[0,1,2],[8.0,0.0,2.0])|
- |(3,[0,1,2],[3.0,6.0,2.0])|(3,[0,1,2],[3.0,2.0,0.0])|
- |(3,[0,1,2],[5.0,9.0,2.0])|(3,[0,1,2],[5.0,4.0,0.0])|
- +-------------------------+-------------------------+
- Analysis: the feature vector contains 3 features, i.e. feature 0, feature 1, feature 2. For Row=1, the values 2.0, 5.0, 7.0 are transformed into 2.0, 1.0, 1.0.
- Notice that only feature 1 and feature 2 were transformed, while feature 0 was not: feature 0 has 6 distinct values (2, 3, 4, 5, 8, 9), more than the setMaxCategories(5) set above, so it is treated as continuous and left unchanged.
- Feature 1: (4,5,6,7,9) --> (0,1,2,3,4)
- Feature 2: (2,7,9) --> (0,1,2)
-
- Output DataFrame format, explained on Row=1:
-  3 features   feature indices 0,1,2   values before transformation
- |(3, [0,1,2], [2.0,5.0,7.0])
-  3 features   feature indices 0,1,2   values after transformation
- |(3, [0,1,2], [2.0,1.0,1.0])|</code>
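For reference, here is a minimal self-contained sketch of the example above. It assumes an existing SparkSession named spark, rebuilds rawData from the ten rows shown, and fits the VectorIndexer alone, without the surrounding Pipeline:
- <code>
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.VectorIndexer;
- import org.apache.spark.ml.feature.VectorIndexerModel;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- // The ten sample rows, as sparse vectors of size 3.
- List<Row> rows = Arrays.asList(
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{2.0,5.0,7.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{3.0,5.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{4.0,7.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{2.0,4.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{9.0,5.0,7.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{2.0,5.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{3.0,4.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{8.0,4.0,9.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{3.0,6.0,2.0})),
-     RowFactory.create(Vectors.sparse(3, new int[]{0,1,2}, new double[]{5.0,9.0,2.0})));
- StructType schema = new StructType(new StructField[]{
-     new StructField("features", new VectorUDT(), false, Metadata.empty())});
- Dataset<Row> rawData = spark.createDataFrame(rows, schema);
-
- // Features with <= 5 distinct values are declared categorical and re-indexed.
- VectorIndexerModel featureIndexerModel = new VectorIndexer()
-     .setInputCol("features")
-     .setOutputCol("indexedFeatures")
-     .setMaxCategories(5)
-     .fit(rawData);
- featureIndexerModel.transform(rawData).select("features", "indexedFeatures").show(20, false);
- </code>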
StringIndexer
Once VectorIndexer is understood, StringIndexer, which re-indexes the dataset's label column, is easy to grasp: it uses the same transformation idea. The example below should suffice.
- <code>
- StringIndexerModel labelIndexerModel=new StringIndexer()
-     .setInputCol("label")
-     .setOutputCol("indexedLabel")
-     .fit(rawData);
-
- Pipeline pipeline=new Pipeline()
- .setStages(new PipelineStage[]
- {labelIndexerModel,
- featureIndexerModel,
- dtClassifier,
- converter});
-
- pipeline.fit(rawData).transform(rawData).select("label","indexedLabel").show(20,false);
-
- Labels are re-indexed by frequency of occurrence, from 0 to numLabels-1 (the number of classes); the most frequent label becomes 0, and so on:
- label=3 occurs most often (4 times), so it is indexed as 0
- next is label=2 with 3 occurrences, indexed as 1, and so on
- +-----+------------+
- |label|indexedLabel|
- +-----+------------+
- |3.0 |0.0 |
- |4.0 |3.0 |
- |1.0 |2.0 |
- |3.0 |0.0 |
- |2.0 |1.0 |
- |3.0 |0.0 |
- |2.0 |1.0 |
- |3.0 |0.0 |
- |2.0 |1.0 |
- |1.0 |2.0 |
- +-----+------------+
- </code>
Two caveats when applying StringIndexer elsewhere:
(1) StringIndexer is essentially a String --> index (number) mapping; for numeric input it first casts the values to String and then indexes them (cast numeric to string and then index the string values). In other words, both String and numeric columns can be re-indexed.
(2) When the fitted model is used to transform a new dataset, it can run into unseen labels; see the example below.
- <code>StringIndexer indexes Strings by frequency:
- id | category | categoryIndex
- ----|----------|---------------
- 0 | a | 0.0
- 1 | b | 2.0
- 2 | c | 1.0
- 3 | a | 0.0
- 4 | a | 0.0
- 5 | c | 1.0
- If the fitted mapping (a,b,c)->(0.0,2.0,1.0) is based on the data above, then transforming data whose categories go beyond (a,b,c), say with d and e added, runs into trouble:
- id | category | categoryIndex
- ----|----------|---------------
- 0 | a | 0.0
- 1 | b | 2.0
- 2 | d | ?
- 3 | e | ?
- 4 | a | 0.0
- 5 | c | 1.0
- Spark provides two ways to handle this:
- StringIndexerModel labelIndexerModel=new StringIndexer()
-     .setInputCol("label")
-     .setOutputCol("indexedLabel")
-     .setHandleInvalid("skip")
-     .fit(rawData);
- (1) The default setting, .setHandleInvalid("error"), throws an exception:
- org.apache.spark.SparkException: Unseen label: d,e
- (2) .setHandleInvalid("skip") drops the rows containing those labels and runs normally, producing:
- id | category | categoryIndex
- ----|----------|---------------
- 0 | a | 0.0
- 1 | b | 2.0
- 4 | a | 0.0
- 5 | c | 1.0
- </code>
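A minimal self-contained sketch of this scenario (assuming an existing SparkSession named spark; the variable names are illustrative): fit on (a,b,c), then transform a dataset that also contains d and e.
- <code>
- import java.util.Arrays;
- import org.apache.spark.ml.feature.StringIndexer;
- import org.apache.spark.ml.feature.StringIndexerModel;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- StructType schema = new StructType(new StructField[]{
-     new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
-     new StructField("category", DataTypes.StringType, false, Metadata.empty())});
- // Training data contains only categories a, b, c.
- Dataset<Row> train = spark.createDataFrame(Arrays.asList(
-     RowFactory.create(0, "a"), RowFactory.create(1, "b"), RowFactory.create(2, "c"),
-     RowFactory.create(3, "a"), RowFactory.create(4, "a"), RowFactory.create(5, "c")), schema);
- // New data additionally contains the unseen categories d and e.
- Dataset<Row> test = spark.createDataFrame(Arrays.asList(
-     RowFactory.create(0, "a"), RowFactory.create(1, "b"), RowFactory.create(2, "d"),
-     RowFactory.create(3, "e"), RowFactory.create(4, "a"), RowFactory.create(5, "c")), schema);
-
- StringIndexerModel model = new StringIndexer()
-     .setInputCol("category")
-     .setOutputCol("categoryIndex")
-     .setHandleInvalid("skip")   // the default "error" would throw on d and e
-     .fit(train);
- model.transform(test).show();   // the rows with d and e are silently dropped
- </code>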
IndexToString
Correspondingly, where there is a StringIndexer there should be an IndexToString. After StringIndexer re-indexes the labels, the model is trained on those indexed labels and then makes predictions on other data; the predicted labels are indexed too, so they have to be converted back. See the example below: only the converted column convetedPrediction matches the original label values.
- <code> Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString.
- </code>
- <code>IndexToString converter=new IndexToString()
- .setInputCol("prediction")
- .setOutputCol("convetedPrediction")
- .setLabels(labelIndexerModel.labels());
- Pipeline pipeline=new Pipeline()
- .setStages(new PipelineStage[]
- {labelIndexerModel,
- featureIndexerModel,
- dtClassifier,
- converter});
- pipeline.fit(rawData).transform(rawData)
- .select("label","prediction","convetedPrediction").show(20,false);
- |label|prediction|convetedPrediction|
- +-----+----------+------------------+
- |3.0 |0.0 |3.0 |
- |4.0 |1.0 |2.0 |
- |1.0 |2.0 |1.0 |
- |3.0 |0.0 |3.0 |
- |2.0 |1.0 |2.0 |
- |3.0 |0.0 |3.0 |
- |2.0 |1.0 |2.0 |
- |3.0 |0.0 |3.0 |
- |2.0 |1.0 |2.0 |
- |1.0 |2.0 |1.0 |
- +-----+----------+------------------+</code>
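Standalone, the round trip looks like the sketch below. This is a minimal illustration reusing the model and train variables from the StringIndexer sketch above; note that if setLabels is omitted, IndexToString tries to recover the labels from the input column's metadata.
- <code>
- import org.apache.spark.ml.feature.IndexToString;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
-
- // Map the index column produced by a StringIndexerModel back to strings.
- Dataset<Row> indexed = model.transform(train);
- Dataset<Row> restored = new IndexToString()
-     .setInputCol("categoryIndex")
-     .setOutputCol("originalCategory")
-     .setLabels(model.labels())   // optional; defaults to the column metadata
-     .transform(indexed);
- restored.select("category", "categoryIndex", "originalCategory").show();
- </code>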
Converting between discrete and continuous features or labels
OneHotEncoder
One-hot encoding maps a categorical feature (discrete, already converted to a numeric index) to a one-hot vector, so that classifiers that expect continuous numeric inputs, such as Logistic Regression, can also make use of categorical (discrete) features.
One-hot encoding, also known as one-of-N encoding, uses an N-bit status register to encode N states: each state gets its own register bit, and at any time exactly one bit is active.
For example, the natural binary codes 000, 001, 010, 011, 100, 101
have the one-hot codes 000001, 000010, 000100, 001000, 010000, 100000.
Put differently, a feature with m possible values becomes m binary features after one-hot encoding; these features are mutually exclusive, with exactly one active at a time, so the data becomes sparse.
The main benefits: it sidesteps the problem that many classifiers handle categorical attributes poorly, and to some extent it also enlarges the feature space.
One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
- <code>
- Dataset<Row> indexedDf=new StringIndexer()
-     .setInputCol("category")
-     .setOutputCol("indexCategory")
-     .fit(df)
-     .transform(df);
-
- Dataset<Row> coderDf=new OneHotEncoder()
-     .setInputCol("indexCategory")
-     .setOutputCol("oneHotCategory")
-     .transform(indexedDf);</code>
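One detail worth knowing: by default Spark's OneHotEncoder drops the last category (dropLast=true), so a column with m distinct indices becomes an (m-1)-dimensional vector, and the all-zero vector stands for the dropped category. If the full m-dimensional encoding is needed, it can be disabled, e.g.:
- <code>
- Dataset<Row> fullDf=new OneHotEncoder()
-     .setInputCol("indexCategory")
-     .setOutputCol("oneHotCategory")
-     .setDropLast(false)   // keep all m categories instead of m-1
-     .transform(indexedDf);
- </code>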
Bucketizer
Binning (segmentation): converting continuous values into discrete categories.
For example, age is a continuous number; to turn it into discrete categories (minor, young adult, middle-aged, senior), Bucketizer is the tool.
The bucket boundaries are user-defined; in Spark they are passed as the splits parameter, defined like this:
double[] splits = {0, 18, 35, 55, Double.POSITIVE_INFINITY};
This splits age into four buckets: 0-18, 18-35, 35-55, and 55+.
If you are unsure about the outer boundaries, set them to Double.NEGATIVE_INFINITY and Double.POSITIVE_INFINITY; that is always safe.
Bucketizer transforms a column of continuous features to a column of
feature buckets, where the buckets are specified by users.
- <code>
- double[] splits={0,18,35,55,Double.POSITIVE_INFINITY};
- Dataset<Row> bucketDf=new Bucketizer()
-     .setInputCol("ages")
-     .setOutputCol("bucketCategory")
-     .setSplits(splits)
-     .transform(df);
- </code>
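A minimal self-contained version, assuming a SparkSession named spark and made-up sample ages. Note that the buckets are half-open intervals [0,18), [18,35), [35,55), [55,+Infinity):
- <code>
- import java.util.Arrays;
- import org.apache.spark.ml.feature.Bucketizer;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- StructType schema = new StructType(new StructField[]{
-     new StructField("ages", DataTypes.DoubleType, false, Metadata.empty())});
- Dataset<Row> df = spark.createDataFrame(Arrays.asList(
-     RowFactory.create(2.0), RowFactory.create(19.0),
-     RowFactory.create(35.0), RowFactory.create(70.0)), schema);
-
- double[] splits = {0, 18, 35, 55, Double.POSITIVE_INFINITY};
- new Bucketizer()
-     .setInputCol("ages")
-     .setOutputCol("bucketCategory")
-     .setSplits(splits)
-     .transform(df)
-     .show();   // 2.0 -> 0.0, 19.0 -> 1.0, 35.0 -> 2.0, 70.0 -> 3.0
- </code>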
QuantileDiscretizer
Quantile discretization. Like Bucketizer (binning), QuantileDiscretizer converts continuous numeric features into discrete categorical features; in fact it is an Estimator whose fit() returns a Bucketizer model.
Parameter 1: the difference is that you no longer define the splits yourself but only say how many buckets (segments) you want. QuantileDiscretizer computes the quantiles itself and carries out the discretization.
Parameter 2: the other parameter is the relative error. If set to 0, exact quantiles are computed, which is an expensive operation. The lower and upper bin bounds are set to negative and positive infinity, covering the whole real line.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set by the numBuckets parameter. The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. When set to zero, exact quantiles are calculated (Note: Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be -Infinity and +Infinity covering all real values.
- <code>new QuantileDiscretizer()
- .setInputCol("ages")
- .setOutputCol("qdCategory")
- .setNumBuckets(4)
- .setRelativeError(0.1)
- .fit(df)
- .transform(df)
- .show(10,false);
-
- +---+----+----------+
- |id |ages|qdCategory|
- +---+----+----------+
- |0.0|2.0 |0.0 |
- |1.0|67.0|3.0 |
- |2.0|36.0|2.0 |
- |3.0|14.0|1.0 |
- |4.0|5.0 |0.0 |
- |5.0|98.0|3.0 |
- |6.0|65.0|2.0 |
- |7.0|23.0|1.0 |
- |8.0|37.0|2.0 |
- |9.0|76.0|3.0 |
- +---+----+----------+</code>
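For completeness, a sketch of how the df above could be built (assuming a SparkSession named spark; the id/ages pairs are exactly the ones shown in the output):
- <code>
- import java.util.Arrays;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- StructType schema = new StructType(new StructField[]{
-     new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
-     new StructField("ages", DataTypes.DoubleType, false, Metadata.empty())});
- Dataset<Row> df = spark.createDataFrame(Arrays.asList(
-     RowFactory.create(0.0, 2.0), RowFactory.create(1.0, 67.0),
-     RowFactory.create(2.0, 36.0), RowFactory.create(3.0, 14.0),
-     RowFactory.create(4.0, 5.0), RowFactory.create(5.0, 98.0),
-     RowFactory.create(6.0, 65.0), RowFactory.create(7.0, 23.0),
-     RowFactory.create(8.0, 37.0), RowFactory.create(9.0, 76.0)), schema);
- </code>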