Spark广播变量用于将一个小的数据集广播到每个节点上,从而使得每个task直接从对应的节点上拉取数据,而不用从Driver端拉取数据,可以节省数据的网络传输和带宽,提高job的运行效率
Spark 广播变量的主要使用是:使用sc.broadcast() 将数据广播出去,在Driver端直接.value 获取值
具体案例如下:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[2]").setAppName("BroadCastApp")
val sc = new SparkContext(conf)
val flights = sc.parallelize(List(
("SEA", "JFK", "DL", "7:00"),
("SFO", "LAX", "AA", "7:05"),
("SFO", "JFK", "VX", "7:05"),
("JFK", "LAX", "DL", "7:10"),
("LAX", "SEA", "DL", "7:10")
))
val airports = sc.parallelize(List(
("JFK", "John F .Kennedy International Airport", "New York", "NY"),
("LAX", "Los Angels International Airport", "Los Angeles", "CA"),
("SEA", "Seattle-Tacoma International Airport", "Seattle", "WA"),
("SFO", "San Francisco International Airport", "San Francisco", "CA")
))
val airlines = sc.parallelize(List(
("AA", "American Airlines"),
("DL", "Delta Airlines"),
("VX", "Virgin America")
))
// 将数据广播到每个节点上面
val airportsBC = sc.broadcast(airports.map(x => (x._1, x._3)).collectAsMap())
val airlinesBC = sc.broadcast(airlines.collectAsMap())
// 使用广播变量的值
flights.map {
case (a, b, c, d) => (airportsBC.value.get(a).get,
airportsBC.value.get(b).get,
airlinesBC.value.get(c).get,
d
)
}.foreach(println)
sc.stop()
}