A Deep Dive into How WordCount Works
For most people learning Spark, the first program they write is WordCount. In this post I take that program apart and look closely at how it works.
WordCount Program Code
/**
 * Created by cuiyufei on 2018/2/13.
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
  private val master = "spark://spark1:7077"
  // Note: a plain file path is resolved on every node, so this file must
  // exist at the same path on each worker when running against a cluster.
  private val remote_file = "F:\\spark\\spark.txt"

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster(master)
    val sc = new SparkContext(conf)
    val lines = sc.textFile(remote_file)                  // RDD of text lines
    val words = lines.flatMap(line => line.split(" "))    // split each line into words
    val pairs = words.map(word => (word, 1))              // pair each word with a count of 1
    val wordCounts = pairs.reduceByKey((a, b) => a + b)   // sum the counts per word
    //val wordCounts = pairs.reduceByKey(_ + _)           // equivalent shorthand
    // On a cluster, foreach(println) runs on the executors; use
    // wordCounts.collect().foreach(println) to print on the driver.
    wordCounts.foreach(println)
  }
}
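To see the transformation chain without standing up a Spark cluster, the same flatMap -> map -> reduceByKey pipeline can be sketched on plain Scala collections. This is only an analogy: `groupBy` plus a per-key sum plays the role of `reduceByKey`, and the input lines are made-up stand-ins for the contents of spark.txt. The object and method names here are hypothetical, not part of the program above.

```scala
object WordCountLocal {
  // Mirrors the RDD pipeline on an in-memory Seq: flatMap splits lines
  // into words, map builds (word, 1) pairs, and groupBy + sum stands in
  // for reduceByKey's per-key aggregation.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    countWords(Seq("hello spark", "hello world")).foreach(println)
}
```

Running it with the two sample lines yields counts of 2 for "hello" and 1 each for "spark" and "world", which is exactly what the Spark job would produce for the same input.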
The chain of RDD transformations in Spark is shown in the figure below.
This is Spark's distributed, in-memory, iterative computation model, and it is a key reason Spark is faster than MapReduce: MapReduce must write intermediate results to disk between the map and reduce stages, so its speed is inevitably bound by disk I/O.
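Part of that speed also comes from how `reduceByKey` aggregates: it combines values for each key within a partition before anything is shuffled, so far less data crosses the network than with a naive group-then-sum. The sketch below models that two-phase aggregation with plain Scala maps; the names `localCombine` and `merge` are hypothetical labels for the map-side combine and the post-shuffle merge, not Spark APIs.

```scala
object CombineSketch {
  // Map-side combine: pre-aggregate one partition's (word, count) pairs
  // locally, so only one entry per key leaves the partition.
  def localCombine(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Reduce side: merge the per-partition maps after the shuffle by
  // summing counts for keys that appear in more than one partition.
  def merge(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

For example, a partition holding ("a",1), ("b",1), ("a",1) is combined locally into a single ("a",2) and ("b",1) before the shuffle, and `merge` then adds up the partial counts from all partitions.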