1、repartition
repartition会根据用户传入的分区数重新通过网络分区所有数据,它会产生shuffle过程,所以是一个重型操作。
val kv1: RDD[(String, Int)] = sc.parallelize(List(
("zhangsan", 11),
("zhangsan", 12),
("lisi", 13),
("wangwu", 14)
))
val kv2: RDD[(String, Int)] = sc.parallelize(List(
("zhangsan", 21),
("zhangsan", 12),
("zhangsan", 22),
("lisi", 23),
("zhaoliu", 28)
))
val value1: RDD[(String, Int)] = kv1.repartition(3) ##结果:3
println(value1.partitions.length)
2、coalesce
coalesce同样对用户传入的分区数进行分区,但是它不会产生shuffle过程。我们知道,DAGScheduler创建Task的数量取决于Stage的最后一个RDD的分区数,如果不进行shuffle,那么coalesce根本无法精准控制分区数。
val kv1: RDD[(String, Int)] = sc.parallelize(List(
("zhangsan", 11),
("zhangsan", 12),
("lisi", 13),
("wangwu", 14)
))
val kv2: RDD[(String, Int)] = sc.parallelize(List(
("zhangsan", 21),
("zhangsan", 12),
("zhangsan", 22),
("lisi", 23),
("zhaoliu", 28)
))
val value: RDD[(String, Int)] = kv1.coalesce(5)
println(value.partitions.length) ##结果:1