Spark-Task not serializable错误解析
在学习SparkStreaming的时候偶然出现的一个问题,先看下面一段代码:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by Administrator on 2017/11/6.
*/
object ForEachTest { val checkpointDirectory="hdfs://hadoop1:9000/streamingchekpoint4" def functionToCreateContext(): StreamingContext = { //程序入口 val conf = new SparkConf().setMaster("local[2]").setAppName(s"${this.getClass.getSimpleName}") val sc = new SparkContext(conf) sc.setLogLevel("ERROR") val ssc = new StreamingContext(sc,Seconds(1)) //数据的输入 val dStream = ssc.socketTextStream("192.168.32.10",9999) //数据的处理 val resultDStream = dStream.flatMap(_.split(",")) .map((_, 1)) .updateStateByKey((values: Seq[Int], valuesState: Option[Int]) => { val currentCount = values.sum val lastCount = valuesState.getOrElse(0) Some(currentCount + lastCount) }) //程序的输出 resultDStream.foreachRDD( rdd =>{ //Driver val jdbcCoon = MysqlPool.getJdbcCoon() val statement = jdbcCoon.createStatement() rdd.foreachPartition( partition =>{ //Executor partition.foreach( recored =>{ //Executor val word = recored._1 val count = recored._2 val sql=s"insert into aura.1706wordcount values(now(),'${word}',${count})" statement.execute(sql) }) MysqlPool.releaseConn(jdbcCoon) }) }) //设置检查点 ssc.checkpoint(checkpointDirectory) ssc } def main(args: Array[String]): Unit = { val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _) //启动程序 ssc.start() ssc.awaitTermination() } }
这段代码是一个SparkStraming与mysql交互的Demo,用到了foreachRDD算子,mysql连接池的代码这里先省略,因为不是重点,会在另一片专门写SparkStreaming的博客中给出。这段代码看似没有问题,但是运行报错:
org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: java.lang.Object
表示任务没有被序列化,那么这个序列化到底是指哪里呢?通过查阅官网,发现在介绍foreachRDD的时候有过这么一个介绍:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
这个说明foreachRDD是在driver端执行的,而foreach是在worker端执行的。我们知道我们在提交代码的时候,提交这个动作是在driver端执行的,提交的这台服务器就是driver,那么哪些代码是在drvier端执行的呢?
val conf = new SparkConf()
conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
val sc = new SparkContext(conf) val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
以上的这些初始化的代码和:textfile、foreachRDD都是在driver端执行的;
而map、flatmap、reduceByKey、foreach、foreachPartition...这类算子都是在worker端执行的。
从driver到worker是要先序列化再可以传输的,所以你如果要在foreachRDD里面写代码,如果没有经过序列化,就会报错。那么怎么解决呢?
1、让它序列化啊
2、如果这个对象不支持序列化,那就不要写在foreachRDD里面啊
所以,原文的这段代码应该修改为:
resultDStream.foreachRDD( rdd =>{
//Driver
rdd.foreachPartition( partition =>{
//Executor
val jdbcCoon = MysqlPool.getJdbcCoon()
val statement = jdbcCoon.createStatement()
partition.foreach( recored =>{
//Executor val word = recored._1 val count = recored._2 val sql=s"insert into aura.1706wordcount values(now(),'${word}',${count})" statement.execute(sql) }) MysqlPool.releaseConn(jdbcCoon) }) })