Spark 2.0 Machine Learning (ML) Library: Feature Extraction, Transformation, and Selection (Scala Edition)





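TF (term frequency): both HashingTF and CountVectorizer can be used to generate term-frequency (TF) vectors.

HashingTF is a Transformer that takes sets of terms and converts them into fixed-length feature vectors. It uses the hashing trick: each raw feature is mapped to an index by applying a hash function, and term frequencies are then computed over the mapped indices.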
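
To make the hashing trick concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not how HashingTF is implemented: `String.hashCode` stands in for Spark's hash function, and the names `HashingTrickSketch`, `bucket`, and `termFrequencies` are made up for this illustration. The point is only that every term lands in one of `numFeatures` buckets, so different terms may collide in the same bucket.

```scala
object HashingTrickSketch {

  // Map a term to a bucket in [0, numFeatures).
  // String.hashCode is a stand-in chosen for brevity; it is NOT Spark's hash function.
  def bucket(term: String, numFeatures: Int): Int =
    java.lang.Math.floorMod(term.hashCode, numFeatures)

  // Count how many tokens of one document fall into each bucket.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens
      .groupBy(token => bucket(token, numFeatures))
      .map { case (index, hits) => index -> hits.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tokens = "i love you not because who you are".split(" ").toSeq
    val tf = termFrequencies(tokens, numFeatures = 20)
    // Print the non-empty buckets in index order, like a sparse vector.
    tf.toSeq.sortBy(_._1).foreach { case (index, count) =>
      println(s"bucket $index -> $count")
    }
  }
}
```

A small `numFeatures` (20 here, as in the example that follows) makes collisions likely; a larger value reduces them at the cost of longer vectors.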

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

case class Love(id: Long, text: String, label: Double)

case class Test(id: Long, text: String)

/**
  * 1. TF-IDF (term frequency-inverse document frequency)
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the Spark session
    val spark = SparkSession
      .builder()
      .master("local") // local mode for testing; without a master Spark fails with "A master URL must be set in your configuration"
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; an HDFS path is preferable

    // 1. Training samples
    val sentenceData = spark.createDataFrame(Seq(
        Love(1L, "I love you", 1.0),
        Love(2L, "There is nothing to do", 0.0),
        Love(3L, "Work hard and you will success", 0.0),
        Love(4L, "We love each other", 1.0),
        Love(5L, "Where there is love, there are always wishes", 1.0),
        Love(6L, "I love you not because who you are,but because who I am when I am with you", 1.0),
        Love(7L, "Never frown,even when you are sad,because youn ever know who is falling in love with your smile", 1.0),
        Love(8L, "Whatever is worth doing is worth doing well", 0.0),
        Love(9L, "The hard part isn’t making the decision. It’s living with it", 0.0),
        Love(10L, "Your happy passer-by all knows, my distressed there is no place hides", 0.0),
        Love(11L, "When the whole world is about to rain, let’s make it clear in our heart together", 0.0)
    ))
    sentenceData.show(false)

    /**
      * +---+-----------------------------------------------------------------------------------------------+-----+
      * |id |text                                                                                           |label|
      * +---+-----------------------------------------------------------------------------------------------+-----+
      * |1  |I love you                                                                                     |1.0  |
      * |2  |There is nothing to do                                                                         |0.0  |
      * |3  |Work hard and you will success                                                                 |0.0  |
      * |4  |We love each other                                                                             |1.0  |
      * |5  |Where there is love, there are always wishes                                                   |1.0  |
      * |6  |I love you not because who you are,but because who I am when I am with you                     |1.0  |
      * |7  |Never frown,even when you are sad,because youn ever know who is falling in love with your smile|1.0  |
      * |8  |Whatever is worth doing is worth doing well                                                    |0.0  |
      * |9  |The hard part isn’t making the decision. It’s living with it                                   |0.0  |
      * |10 |Your happy passer-by all knows, my distressed there is no place hides                          |0.0  |
      * |11 |When the whole world is about to rain, let’s make it clear in our heart together               |0.0  |
      * +---+-----------------------------------------------------------------------------------------------+-----+
      */

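    // 2. Configure the stages: tokenizer, hashingTF, idf
    // (the intermediate column names "words" and "rawFeatures" are assumed here;
    //  numFeatures = 20 matches the vector size in the output below)
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF() // term-frequency vectors can also be obtained with CountVectorizer
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    val wordsData = tokenizer.transform(sentenceData)
    val featurizedData = hashingTF.transform(wordsData)
    val idfModel = idf.fit(featurizedData)

    // 3. Vectorized representation of the documents
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("label", "features").show(false)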
    /** Note: the longer the sentence (the more words it contains), the more non-zero entries its feature vector has
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |label|features                                                                                                                                                                                                                                                                                        |
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |1.0  |(20,[0,5,9],[0.28768207245178085,0.4054651081081644,0.8754687373538999])                                                                                                                                                                                                                        |
      * |0.0  |(20,[1,4,8,11,14],[0.4054651081081644,1.3862943611198906,1.0986122886681098,1.0986122886681098,0.8754687373538999])                                                                                                                                                                             |
      * |0.0  |(20,[0,5,7,13],[0.28768207245178085,1.2163953243244932,1.3862943611198906,0.8754687373538999])                                                                                                                                                                                                  |
      * |1.0  |(20,[0,5,13,14],[0.28768207245178085,0.4054651081081644,0.8754687373538999,0.8754687373538999])                                                                                                                                                                                                 |
      * |1.0  |(20,[1,11,13,14,17,18,19],[0.4054651081081644,2.1972245773362196,0.8754687373538999,0.8754687373538999,0.6931471805599453,1.0986122886681098,1.0986122886681098])                                                                                                                               |
      * |1.0  |(20,[0,1,5,9,10,13,15,16,17,18],[0.28768207245178085,0.8109302162163288,1.2163953243244932,2.6264062120616996,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098])                                                               |
      * |1.0  |(20,[0,1,2,3,5,6,9,10,14,16,17,18,19],[0.28768207245178085,0.4054651081081644,1.3862943611198906,0.8754687373538999,0.8109302162163288,1.3862943611198906,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098,2.1972245773362196])|
      * |0.0  |(20,[0,1,3,15,17],[0.5753641449035617,0.8109302162163288,1.7509374747077997,0.8754687373538999,0.6931471805599453])                                                                                                                                                                             |
      * |0.0  |(20,[0,5,7,10,15,16,19],[0.5753641449035617,0.8109302162163288,1.3862943611198906,2.6264062120616996,0.8754687373538999,0.6931471805599453,1.0986122886681098])                                                                                                                                 |
      * |0.0  |(20,[1,2,3,6,8,9,11,16],[1.2163953243244932,1.3862943611198906,0.8754687373538999,2.772588722239781,1.0986122886681098,0.8754687373538999,2.1972245773362196,0.6931471805599453])                                                                                                               |
      * |0.0  |(20,[0,1,3,4,5,8,10,12,15,16,17],[0.28768207245178085,0.4054651081081644,0.8754687373538999,2.772588722239781,0.8109302162163288,1.0986122886681098,1.7509374747077997,1.791759469228055,0.8754687373538999,2.0794415416798357,0.6931471805599453])                                             |
      * +-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */
  }
}

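The weights in the output above can be checked by hand. The sketch below assumes the smoothed IDF formula documented for Spark ML, IDF(t, D) = ln((|D| + 1) / (DF(t, D) + 1)), with |D| = 11 documents. The helper name `idf` is hypothetical, and the document frequencies 8, 7 and 4 are assumptions chosen because they reproduce the three weights of the first row ("I love you", where each term occurs once, so TF-IDF equals IDF). Because terms are hashed into 20 buckets, these are frequencies per bucket, not per word.

```scala
object IdfCheck {

  // Smoothed inverse document frequency: ln((numDocs + 1) / (docFreq + 1)).
  def idf(docFreq: Long, numDocs: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 11L // number of rows in sentenceData
    // Assumed per-bucket document frequencies for buckets 0, 5 and 9.
    Seq(8L, 7L, 4L).foreach { df =>
      println(f"DF = $df%d -> IDF = ${idf(df, numDocs)}%.6f")
    }
  }
}
```

Under these assumptions the values agree with the 0.2877, 0.4055 and 0.8755 shown for the first row.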



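Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all the words in the document; that vector can then be used as features for prediction, document similarity calculations, and similar tasks.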
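
The averaging step can be sketched in plain Scala. This is an illustration only, not Spark's code: the object and method names and the toy embeddings are hypothetical, and treating words without an embedding as zero vectors (while still counting them in the divisor) is a choice made for this sketch.

```scala
object DocumentVectorSketch {

  // Average the word vectors of a document, component by component.
  // Words missing from `embeddings` contribute nothing to the sum but still
  // count toward the divisor (a sketch-level assumption).
  def documentVector(
      words: Seq[String],
      embeddings: Map[String, Array[Double]],
      size: Int): Array[Double] = {
    val sum = new Array[Double](size)
    words.foreach { word =>
      embeddings.get(word).foreach { vector =>
        (0 until size).foreach(i => sum(i) += vector(i))
      }
    }
    if (words.isEmpty) sum else sum.map(_ / words.size)
  }

  def main(args: Array[String]): Unit = {
    // Toy embeddings invented for this example.
    val embeddings = Map(
      "love" -> Array(1.0, 2.0, 3.0),
      "you"  -> Array(3.0, 4.0, 5.0)
    )
    println(documentVector(Seq("love", "you"), embeddings, 3).mkString("[", ", ", "]"))
  }
}
```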

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

/**
  * 2. Word2Vec
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the Spark session
    val spark = SparkSession
      .builder()
      .master("local") // local mode for testing; without a master Spark fails with "A master URL must be set in your configuration"
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; an HDFS path is preferable

    // 1. Training samples
    val documentDF = spark.createDataFrame(Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
    ).map(Tuple1.apply)).toDF("text") // requires Scala 2.11+; otherwise the compiler reports "No TypeTag available"
    documentDF.show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |text                                                                                                             |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      */

    // 2. Word2Vec
    // (vectorSize = 3 and the output column name "result" match the output below;
    //  minCount = 0 is assumed so that every word of this tiny corpus gets a vector)
    val word2VecModel = new Word2Vec()
      .setInputCol("text") // the input column must hold an array of words
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(documentDF)

    // 3. Vectorized representation of the documents
    val result = word2VecModel.transform(documentDF)
    result.select("result", "text").show(false)

    /**
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |result                                                              |text                                                                                                             |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |[-0.05712633579969406,0.01896375169356664,-0.021923241515954334]    |[I, love, you]                                                                                                   |
      * |[0.006795959174633027,-0.05859951674938202,-0.02231040205806494]    |[There, is, nothing, to, do]                                                                                     |
      * |[-0.01718233898282051,-0.044684726279228926,0.022707909112796187]   |[Work, hard, and, you, will, success]                                                                            |
      * |[0.014710488263517618,0.04914409201592207,-0.0535422433167696]      |[We, love, each, other]                                                                                          |
      * |[0.056647833436727524,-0.013540415093302727,-0.007903479505330324]  |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[-0.012073692482183962,0.0068947237587588675,-0.007010678075911368] |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[-0.009022715939756702,0.007438146413358695,-0.00402127337806365]   |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[-0.007301235804334283,-0.025249323691241443,0.05116166779771447]   |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[0.055422113192352386,0.04088194024833766,-0.008757691322402521]    |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[0.0017315041817103822,0.026252828383197386,-0.004247877125938733]  |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[-0.013085987884551287,-3.071942483074963E-4,-0.0029873197781853378]|[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      */
  }
}
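
Document similarity is one of the uses mentioned for these vectors. As a rough sketch, cosine similarity between two document vectors could be computed as follows; the object and method names are hypothetical, and the two vectors are copied from the output above. With a corpus this small the embeddings are essentially noise, so the number itself should not be read as meaningful.

```scala
object CosineSimilaritySketch {

  // Cosine of the angle between two vectors; defined as 0.0 here if either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    // Document vectors taken from the `result` column printed above.
    val iLoveYou = Array(-0.05712633579969406, 0.01896375169356664, -0.021923241515954334)
    val weLoveEachOther = Array(0.014710488263517618, 0.04914409201592207, -0.0535422433167696)
    println(cosine(iLoveYou, weLoveEachOther))
  }
}
```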




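CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and it generates a CountVectorizerModel. The model produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.

During fitting, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. An optional parameter, minDF, also affects the fitting process by specifying the minimum number (or fraction, if less than 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector: if it is set to true, all non-zero counts are set to 1, which is useful for discrete probabilistic models that model binary rather than integer counts.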
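
The fitting and counting steps described above can be sketched in plain Scala. This is an illustration under stated assumptions, not Spark's implementation: the names `CountVectorizerSketch`, `fitVocabulary`, and `vectorize` are hypothetical, `minDF` is taken as an absolute document count only, ties are broken alphabetically, and the output is a dense array rather than a sparse vector.

```scala
object CountVectorizerSketch {

  // Keep terms that occur in at least `minDF` documents, then take the
  // `vocabSize` most frequent ones across the corpus.
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
    val termCounts = docs.flatten.groupBy(identity).map { case (t, hits) => t -> hits.size }
    val docFreqs = docs.flatMap(_.distinct).groupBy(identity).map { case (t, hits) => t -> hits.size }
    termCounts.toSeq
      .filter { case (term, _) => docFreqs(term) >= minDF }
      .sortBy { case (term, count) => (-count, term) } // alphabetical tie-break is a sketch-only choice
      .take(vocabSize)
      .map(_._1)
  }

  // Count each vocabulary term in one document; with binary = true,
  // every non-zero count is clamped to 1.0.
  def vectorize(doc: Seq[String], vocabulary: Seq[String], binary: Boolean = false): Array[Double] =
    vocabulary.map { term =>
      val count = doc.count(_ == term).toDouble
      if (binary && count > 0) 1.0 else count
    }.toArray

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      "I love you".split(" ").toSeq,
      "There is nothing to do".split(" ").toSeq,
      "Whatever is worth doing is worth doing well".split(" ").toSeq
    )
    val vocabulary = fitVocabulary(docs, vocabSize = 3, minDF = 1)
    println(vocabulary.mkString(", "))
    docs.foreach(doc => println(vectorize(doc, vocabulary).mkString("[", ", ", "]")))
  }
}
```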

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

/**
  * 3. CountVectorizer
  * Obtain term counts
  */
object FeaturesTests {

  def main(args: Array[String]): Unit = {

    // 0. Build the Spark session
    val spark = SparkSession
      .builder()
      .master("local") // local mode for testing; without a master Spark fails with "A master URL must be set in your configuration"
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory; an HDFS path is preferable

    // 1. Training samples
    val documentDF = spark.createDataFrame(Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
    ).map(Tuple1.apply)).toDF("words") // requires Scala 2.11+; otherwise the compiler reports "No TypeTag available"
    documentDF.show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |words                                                                                                            |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      */

    // 2. CountVectorizer
    // (vocabSize = 3 matches the vector size in the output below)
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .fit(documentDF)

    // 3. Vectorized representation of the documents
    cvModel.transform(documentDF).show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |words                                                                                                            |features                 |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |[I, love, you]                                                                                                   |(3,[1,2],[1.0,1.0])      |
      * |[There, is, nothing, to, do]                                                                                     |(3,[0],[1.0])            |
      * |[Work, hard, and, you, will, success]                                                                            |(3,[1],[1.0])            |
      * |[We, love, each, other]                                                                                          |(3,[2],[1.0])            |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |(3,[0],[1.0])            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |(3,[1,2],[3.0,1.0])      |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|(3,[0,1,2],[1.0,1.0,1.0])|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |(3,[0],[2.0])            |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |(3,[],[])                |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |(3,[0],[1.0])            |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |(3,[0],[1.0])            |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      */
  }
}



