版权声明:本文为博主九师兄(QQ群:spark源代码 198279782 欢迎来探讨技术)原创文章,未经博主允许不得转载。 https://blog.csdn.net/qq_21383435/article/details/89493393
1. 背景
Spark Structured Streaming 读取kafka,然后进行转换,最后写入到kafka中,中间运行的时候出现这个,但是清除checkpoin目录后,就可以正常使用。
错误代码:
assertion failed:concurrent update to the log .mutiple streaming jobs delete
2. 源码定位
spark 2.3版本
org.apache.spark.sql.execution.streaming.MicroBatchExecution#constructNextBatch
updateStatusMessage("Writing offsets to log")
reportTimeTaken("walCommit") {
assert(offsetLog.add(
currentBatchId,
availableOffsets.toOffsetSeq(sources, offsetSeqMetadata)),
s"Concurrent update to the log. Multiple streaming jobs detected for $currentBatchId")
logInfo(s"Committed offsets for batch $currentBatchId. " +
s"Metadata ${offsetSeqMetadata.toString}")
// NOTE: The following code is correct because runStream() processes exactly one
// batch at a time. If we add pipeline parallelism (multiple batches in flight at
// the same time), this cleanup logic will need to change.
// Now that we've updated the scheduler's persistent checkpoint, it is safe for the
// sources to discard data from the previous batch.
if (currentBatchId != 0) {
val prevBatchOff = offsetLog.get(currentBatchId - 1)
if (prevBatchOff.isDefined) {
prevBatchOff.get.toStreamProgress(sources).foreach {
case (src: Source, off) => src.commit(off)
case (reader: MicroBatchReader, off) =>
reader.commit(reader.deserializeOffset(off.json))
}
} else {
throw new IllegalStateException(s"batch $currentBatchId doesn't exist")
}
}
3.问题解决
暂无