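I. Preface

Spark 2.0 Machine Learning ML Library: Summary of Data Analysis Methods (Scala Edition)
Spark 2.0 Machine Learning ML Library: Machine Learning Pipelines and Cross-Validation (Scala Edition)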

II. Code

1. TF-IDF (Term Frequency-Inverse Document Frequency)

TF (Term Frequency): both HashingTF and CountVectorizer can be used to generate term-frequency (TF) vectors.

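HashingTF is a Transformer that takes sets of terms and converts them into fixed-length
feature vectors. It relies on the hashing trick: each raw feature is mapped to an index by
applying a hash function, and term frequencies are then computed over the mapped indices.
This avoids building a global term-to-index map, which can be expensive for a large corpus,
but it is subject to hash collisions, where different raw features end up as the same term
after hashing. To reduce the chance of collisions, increase the target feature dimension,
i.e. the number of buckets in the hash table. Because a simple modulo turns the hash value
into a column index, a power of two is recommended for the feature dimension; otherwise the
features are not mapped evenly to the columns. The default feature dimension is
2^18 = 262144. An optional binary toggle parameter controls the term-frequency counts: when
set to true, all non-zero counts are set to 1. This is especially useful for discrete
probabilistic models that model binary rather than integer counts.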
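The hashing trick described above can be sketched in plain Scala without Spark. This is only an illustration: the object and method names (`HashingTrickSketch`, `bucket`, `termFrequencies`) are hypothetical, and it uses `scala.util.hashing.MurmurHash3.stringHash` as a stand-in hash function, so the indices it produces are not expected to match those of Spark's HashingTF.

```scala
import scala.util.hashing.MurmurHash3

object HashingTrickSketch {

  // Map a term to a bucket index in [0, numFeatures) using a non-negative modulo.
  // The hash function is an assumption of this sketch, not Spark's exact one.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Build a dense term-frequency vector.
  // With binary = true every non-zero count is recorded as 1.0.
  def termFrequencies(terms: Seq[String], numFeatures: Int, binary: Boolean = false): Array[Double] = {
    val tf = Array.fill(numFeatures)(0.0)
    terms.foreach { term =>
      val i = bucket(term, numFeatures)
      tf(i) = if (binary) 1.0 else tf(i) + 1.0
    }
    tf
  }

  def main(args: Array[String]): Unit = {
    val words = "i love you not because who you are".split(" ").toSeq
    // Few buckets make collisions likely; more buckets make them rarer.
    println(termFrequencies(words, 16).mkString(", "))
    println(termFrequencies(words, 16, binary = true).mkString(", "))
  }
}
```

No vocabulary is ever stored, which is the point of the technique. The price is that two different words can land in the same bucket and become indistinguishable, and a small `numFeatures` makes that more likely.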

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

case class Love(id: Long, text: String, label: Double)

case class Test(id: Long, text: String)

/**
  * 1. TF-IDF (term frequency-inverse document frequency)
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // for local testing; otherwise Spark fails with: A master URL must be set in your configuration
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse the existing session if there is one, otherwise create it

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; HDFS is the better choice on a cluster

    // 1. Training samples
    val sentenceData = spark.createDataFrame(
      Seq(
        Love(1L, "I love you", 1.0),
        Love(2L, "There is nothing to do", 0.0),
        Love(3L, "Work hard and you will success", 0.0),
        Love(4L, "We love each other", 1.0),
        Love(5L, "Where there is love, there are always wishes", 1.0),
        Love(6L, "I love you not because who you are,but because who I am when I am with you", 1.0),
        Love(7L, "Never frown,even when you are sad,because youn ever know who is falling in love with your smile", 1.0),
        Love(8L, "Whatever is worth doing is worth doing well", 0.0),
        Love(9L, "The hard part isn’t making the decision. It’s living with it", 0.0),
        Love(10L, "Your happy passer-by all knows, my distressed there is no place hides", 0.0),
        Love(11L, "When the whole world is about to rain, let’s make it clear in our heart together", 0.0)
      )
    ).toDF()
    sentenceData.show(false)

    /**
      * +---+-----------------------------------------------------------------------------------------------+-----+
      * |id |text                                                                                           |label|
      * +---+-----------------------------------------------------------------------------------------------+-----+
      * |1  |I love you                                                                                     |1.0  |
      * |2  |There is nothing to do                                                                         |0.0  |
      * |3  |Work hard and you will success                                                                 |0.0  |
      * |4  |We love each other                                                                             |1.0  |
      * |5  |Where there is love, there are always wishes                                                   |1.0  |
      * |6  |I love you not because who you are,but because who I am when I am with you                     |1.0  |
      * |7  |Never frown,even when you are sad,because youn ever know who is falling in love with your smile|1.0  |
      * |8  |Whatever is worth doing is worth doing well                                                    |0.0  |
      * |9  |The hard part isn’t making the decision. It’s living with it                                   |0.0  |
      * |10 |Your happy passer-by all knows, my distressed there is no place hides                          |0.0  |
      * |11 |When the whole world is about to rain, let’s make it clear in our heart together               |0.0  |
      * +---+-----------------------------------------------------------------------------------------------+-----+
      */

    // 2. Configure the stages: tokenizer, hashingTF, idf
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(20)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("rawFeatures")
    val idf = new IDF() // the term-frequency vectors could also be produced by CountVectorizer
      .setInputCol(hashingTF.getOutputCol)
      .setOutputCol("features")

    val wordsData = tokenizer.transform(sentenceData)
    val featurizedData = hashingTF.transform(wordsData)
    val idfModel = idf.fit(featurizedData)

    // 3. Vector representation of the documents
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData
      .select("label", "features")
      .show(false)

    /** Note: the longer the sentence (the more words), the more non-zero entries the feature vector has
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |label|features                                                                                                                                                                                                                                                                                        |
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |1.0  |(20,[0,5,9],[0.28768207245178085,0.4054651081081644,0.8754687373538999])                                                                                                                                                                                                                        |
      * |0.0  |(20,[1,4,8,11,14],[0.4054651081081644,1.3862943611198906,1.0986122886681098,1.0986122886681098,0.8754687373538999])                                                                                                                                                                             |
      * |0.0  |(20,[0,5,7,13],[0.28768207245178085,1.2163953243244932,1.3862943611198906,0.8754687373538999])                                                                                                                                                                                                  |
      * |1.0  |(20,[0,5,13,14],[0.28768207245178085,0.4054651081081644,0.8754687373538999,0.8754687373538999])                                                                                                                                                                                                 |
      * |1.0  |(20,[1,11,13,14,17,18,19],[0.4054651081081644,2.1972245773362196,0.8754687373538999,0.8754687373538999,0.6931471805599453,1.0986122886681098,1.0986122886681098])                                                                                                                               |
      * |1.0  |(20,[0,1,5,9,10,13,15,16,17,18],[0.28768207245178085,0.8109302162163288,1.2163953243244932,2.6264062120616996,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098])                                                               |
      * |1.0  |(20,[0,1,2,3,5,6,9,10,14,16,17,18,19],[0.28768207245178085,0.4054651081081644,1.3862943611198906,0.8754687373538999,0.8109302162163288,1.3862943611198906,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098,2.1972245773362196])|
      * |0.0  |(20,[0,1,3,15,17],[0.5753641449035617,0.8109302162163288,1.7509374747077997,0.8754687373538999,0.6931471805599453])                                                                                                                                                                             |
      * |0.0  |(20,[0,5,7,10,15,16,19],[0.5753641449035617,0.8109302162163288,1.3862943611198906,2.6264062120616996,0.8754687373538999,0.6931471805599453,1.0986122886681098])                                                                                                                                 |
      * |0.0  |(20,[1,2,3,6,8,9,11,16],[1.2163953243244932,1.3862943611198906,0.8754687373538999,2.772588722239781,1.0986122886681098,0.8754687373538999,2.1972245773362196,0.6931471805599453])                                                                                                               |
      * |0.0  |(20,[0,1,3,4,5,8,10,12,15,16,17],[0.28768207245178085,0.4054651081081644,0.8754687373538999,2.772588722239781,0.8109302162163288,1.0986122886681098,1.7509374747077997,1.791759469228055,0.8754687373538999,2.0794415416798357,0.6931471805599453])                                             |
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */

  }

}
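The values in the features column above can be checked by hand. Spark's IDF uses the smoothed formula idf(t) = ln((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents whose vector has a non-zero entry at index t. Each feature value is then the raw term frequency multiplied by that weight. A minimal sketch in plain Scala (the object and method names are hypothetical):

```scala
object IdfSketch {

  // Smoothed inverse document frequency: ln((numDocs + 1) / (docFreq + 1)).
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // TF-IDF weight of one vector entry: raw term frequency times idf.
  def tfIdf(termFreq: Double, numDocs: Long, docFreq: Long): Double =
    termFreq * idf(numDocs, docFreq)

  def main(args: Array[String]): Unit = {
    val numDocs = 11L // the sample has 11 sentences
    // In the output above, index 0 is non-zero in 8 rows and index 5 in 7 rows.
    println(s"index 0: ${idf(numDocs, 8L)}")
    println(s"index 5: ${idf(numDocs, 7L)}")
    // Third row, index 5: three tokens hashed into that bucket.
    println(s"row 3, index 5: ${tfIdf(3.0, numDocs, 7L)}")
  }
}
```

With 11 documents, index 0 gives ln(12/9) ≈ 0.2877 and index 5 gives ln(12/8) ≈ 0.4055, which agree with the table up to floating-point rounding. Buckets shared by many sentences therefore receive a small weight, and rare buckets a large one.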

2. Word2Vec (Document Similarity)

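Word2Vec is an Estimator that takes sequences of words representing documents and trains a
Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel
transforms each document into a vector by averaging the vectors of all the words in the
document; this vector can then be used as features for prediction, document similarity
calculations, and similar tasks.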
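The averaging step and its use for document similarity can be sketched in plain Scala. The embeddings below are made-up toy values, the names (`DocVectorSketch`, `docVector`, `cosine`) are hypothetical, and skipping unknown words is a simplification of this sketch rather than a statement about how Word2VecModel treats them.

```scala
object DocVectorSketch {

  // Average the vectors of the words that have an embedding.
  // Unknown words are skipped (an assumption of this sketch).
  def docVector(words: Seq[String], embeddings: Map[String, Vector[Double]], size: Int): Vector[Double] = {
    val known = words.flatMap(w => embeddings.get(w))
    if (known.isEmpty) Vector.fill(size)(0.0)
    else Vector.tabulate(size)(i => known.map(_(i)).sum / known.size)
  }

  // Cosine similarity between two vectors of the same length; 0.0 if either is all zeros.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    // Toy 2-dimensional embeddings, invented for illustration only.
    val embeddings = Map(
      "love" -> Vector(0.9, 0.1),
      "you"  -> Vector(0.7, 0.3),
      "work" -> Vector(0.1, 0.9)
    )
    val d1 = docVector(Seq("i", "love", "you"), embeddings, 2)
    val d2 = docVector(Seq("work", "hard"), embeddings, 2)
    println(s"d1 = $d1, d2 = $d2, similarity = ${cosine(d1, d2)}")
  }
}
```

Applied to the `result` column of a fitted model, the same cosine measure gives a simple document-similarity score.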

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
  * 2. Word2Vec
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // for local testing; otherwise Spark fails with: A master URL must be set in your configuration
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse the existing session if there is one, otherwise create it

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; HDFS is the better choice on a cluster

    // 1. Training samples
    val documentDF = spark.createDataFrame(
      Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
      ).map(Tuple1.apply)
    ).toDF("text") // needs Scala 2.11+; otherwise it fails with: No TypeTag available
    documentDF.show(false)
    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |text                                                                                                             |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      **/

    // 2. word2Vec
    val word2VecModel = new Word2Vec()
      .setInputCol("text") // the input column must hold arrays of words
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(documentDF)

    // 3. Vector representation of the documents
    val result = word2VecModel.transform(documentDF)
    result
      .select("result","text")
      .show(false)

    /**
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |result                                                              |text                                                                                                             |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |[-0.05712633579969406,0.01896375169356664,-0.021923241515954334]    |[I, love, you]                                                                                                   |
      * |[0.006795959174633027,-0.05859951674938202,-0.02231040205806494]    |[There, is, nothing, to, do]                                                                                     |
      * |[-0.01718233898282051,-0.044684726279228926,0.022707909112796187]   |[Work, hard, and, you, will, success]                                                                            |
      * |[0.014710488263517618,0.04914409201592207,-0.0535422433167696]      |[We, love, each, other]                                                                                          |
      * |[0.056647833436727524,-0.013540415093302727,-0.007903479505330324]  |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[-0.012073692482183962,0.0068947237587588675,-0.007010678075911368] |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[-0.009022715939756702,0.007438146413358695,-0.00402127337806365]   |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[-0.007301235804334283,-0.025249323691241443,0.05116166779771447]   |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[0.055422113192352386,0.04088194024833766,-0.008757691322402521]    |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[0.0017315041817103822,0.026252828383197386,-0.004247877125938733]  |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[-0.013085987884551287,-3.071942483074963E-4,-0.0029873197781853378]|[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      **/

  }

}

3. CountVectorizer

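CountVectorizer and CountVectorizerModel convert a collection of text documents into
vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be
used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The
model produces sparse representations of the documents over the vocabulary, which can then
be passed to other algorithms such as LDA.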

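During fitting, CountVectorizer selects the top vocabSize words, ordered by term frequency
across the corpus. The optional parameter minDF also affects fitting: it specifies the
minimum number of documents (or the minimum fraction of documents, if the value is less
than 1.0) in which a term must appear to be included in the vocabulary. Another optional
binary toggle parameter controls the output vector: if set to true, all non-zero counts are
set to 1. This is especially useful for discrete probabilistic models that model binary
rather than integer counts.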
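The fitting rule described above (filter by minDF, keep the vocabSize most frequent terms, then count) can be sketched in plain Scala. The names (`CountVectorizerSketch`, `fitVocabulary`, `transform`) are hypothetical, only the integer form of minDF is handled, and the alphabetical tie-break is an assumption of this sketch rather than documented Spark behavior.

```scala
object CountVectorizerSketch {

  // Choose the vocabulary: keep terms that occur in at least minDF documents,
  // then take the vocabSize terms with the highest corpus-wide count.
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Vector[String] = {
    val termCount = docs.flatten.groupBy(identity).map { case (t, occ) => t -> occ.size }
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    termCount.toVector
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, c) => (-c, t) } // ties broken alphabetically (assumption of this sketch)
      .take(vocabSize)
      .map(_._1)
  }

  // Count the vocabulary terms in one document.
  // With binary = true every non-zero count is clipped to 1.0.
  def transform(doc: Seq[String], vocab: Vector[String], binary: Boolean = false): Vector[Double] =
    vocab.map { term =>
      val c = doc.count(_ == term).toDouble
      if (binary && c > 0) 1.0 else c
    }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      "I love you".split(" ").toSeq,
      "We love each other".split(" ").toSeq,
      "I love you not because who you are".split(" ").toSeq
    )
    val vocab = fitVocabulary(docs, vocabSize = 3, minDF = 2)
    println(vocab)
    docs.foreach(d => println(transform(d, vocab)))
  }
}
```

Words outside the chosen vocabulary contribute nothing, which is why a sentence containing none of the top terms maps to an all-zero vector.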

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
  * 3. CountVectorizer
  * Computes term counts
  */
object FeaturesTests {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // for local testing; otherwise Spark fails with: A master URL must be set in your configuration
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse the existing session if there is one, otherwise create it

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; HDFS is the better choice on a cluster

    // 1. Training samples
    val documentDF = spark.createDataFrame(
      Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
      ).map(Tuple1.apply)
    ).toDF("words") // needs Scala 2.11+; otherwise it fails with: No TypeTag available
    documentDF.show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |words                                                                                                            |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      **/

    // 2. CountVectorizer
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(documentDF)

    // 3. Vector representation of the documents
    cvModel.transform(documentDF).show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |words                                                                                                            |features                 |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |[I, love, you]                                                                                                   |(3,[1,2],[1.0,1.0])      |
      * |[There, is, nothing, to, do]                                                                                     |(3,[0],[1.0])            |
      * |[Work, hard, and, you, will, success]                                                                            |(3,[1],[1.0])            |
      * |[We, love, each, other]                                                                                          |(3,[2],[1.0])            |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |(3,[0],[1.0])            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |(3,[1,2],[3.0,1.0])      |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|(3,[0,1,2],[1.0,1.0,1.0])|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |(3,[0],[2.0])            |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |(3,[],[])                |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |(3,[0],[1.0])            |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |(3,[0],[1.0])            |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      **/

  }

}

Spark 2.0 Machine Learning ML Library: Feature Extraction, Transformation, and Selection (Scala Edition)
