The Five Main Properties of an RDD
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
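Property 1
A list of partitions
An RDD is fundamentally made up of partitions: the dataset is split into pieces, each of which can be processed independently.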
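To make the idea of "a list of partitions" concrete, here is a minimal sketch in plain Python (not the Spark API). The helper name `split_into_partitions` is hypothetical, and the slicing rule is only an assumption about how a local collection might be cut into contiguous pieces.

```python
def split_into_partitions(data, num_partitions):
    """Cut a list into num_partitions contiguous slices (toy model, not Spark)."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Ten records spread over three partitions.
partitions = split_into_partitions(list(range(10)), 3)
print(partitions)
```

Each inner list plays the role of one partition; in Spark the corresponding hook on the `RDD` class is `getPartitions`.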
Property 2
A function for computing each split
The same function is applied to compute every partition (split).
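The following sketch shows what "one function for every split" could look like, again in plain Python rather than Spark. The name `compute` mirrors the method of the same name on Spark's `RDD` class, but the signature here is simplified and hypothetical.

```python
def compute(partition, func):
    """Lazily apply func to every record of a single partition (toy model)."""
    return (func(x) for x in partition)


partitions = [[1, 2, 3], [4, 5], [6]]

# The same function is run against each partition independently.
doubled = [list(compute(p, lambda x: x * 2)) for p in partitions]
print(doubled)
```

Because each call only touches its own partition, the calls could run in parallel on different machines.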
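Property 3
A list of dependencies on other RDDs
- RDDs depend on one another.
- RDDA ==> RDDB ==> RDDC ==> RDDD
- Each RDD in this chain depends on the one before it. **If RDDC's data is lost, it can be recomputed from RDDB.** This is the "resilient" part of an RDD.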
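The recovery described above can be sketched with a toy lineage chain. `ToyRDD` and `lose_data` are hypothetical names, partitions are ignored for brevity, and every node caches its result here, whereas in Spark caching is opt-in.

```python
class ToyRDD:
    """Toy RDD that remembers its parent and the function that derives it."""

    def __init__(self, parent=None, func=None, source=None):
        self.parent = parent    # dependency on another ToyRDD, or None
        self.func = func        # how to derive records from the parent
        self.source = source    # raw data for the root of the chain
        self.cache = None       # materialized data, may be lost

    def map(self, func):
        return ToyRDD(parent=self, func=func)

    def collect(self):
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                # Rebuild from the parent by following the dependency.
                self.cache = [self.func(x) for x in self.parent.collect()]
        return self.cache

    def lose_data(self):
        """Simulate losing the materialized data."""
        self.cache = None


rdd_a = ToyRDD(source=[1, 2, 3])
rdd_b = rdd_a.map(lambda x: x + 1)
rdd_c = rdd_b.map(lambda x: x * 10)
rdd_d = rdd_c.map(lambda x: x - 1)

first = rdd_c.collect()
rdd_c.lose_data()            # RDDC's data is gone
recovered = rdd_c.collect()  # recomputed from RDDB via the lineage
```

Nothing is replicated in this sketch: the chain of dependencies alone is enough to rebuild the lost result.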
Property 4
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
When a key-value RDD is partitioned, records are by default distributed by hashing the key.
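As a rough illustration of hash partitioning, the sketch below assigns each key to a partition by taking a hash modulo the partition count. This is plain Python, not Spark: `get_partition` and `partition_by_key` are hypothetical helpers, and `zlib.crc32` is used only as a deterministic stand-in for the key's hash code.

```python
import zlib


def get_partition(key, num_partitions):
    """Map a string key to a partition index: hash(key) mod num_partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions):
    """Distribute (key, value) pairs into buckets by hashed key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[get_partition(key, num_partitions)].append((key, value))
    return buckets


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
buckets = partition_by_key(pairs, 2)
print(buckets)
```

The useful guarantee is that all records sharing a key land in the same partition, which is what lets per-key operations work partition by partition.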
Property 5
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
Data locality: a task should run where its data lives, since that gives the best performance.
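A minimal sketch of locality-aware placement, in plain Python rather than Spark. `choose_host`, the host names, and the block map are all hypothetical; the labels `NODE_LOCAL` and `ANY` are borrowed from Spark's locality levels purely as tags, and a real scheduler is considerably more involved than this.

```python
def choose_host(preferred_hosts, free_hosts):
    """Pick a free host that holds the data if possible, else any free host."""
    for host in preferred_hosts:
        if host in free_hosts:
            return host, "NODE_LOCAL"
    return free_hosts[0], "ANY"


# Assumed mapping from partition index to the hosts storing its block.
block_locations = {0: ["host1", "host2"], 1: ["host3"]}
free_hosts = ["host2", "host4"]

placement = {
    part: choose_host(hosts, free_hosts)
    for part, hosts in block_locations.items()
}
print(placement)
```

Partition 0 runs next to its data, while partition 1 has no free preferred host and falls back to a remote one, which means its data must travel over the network.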
Together, these five properties reflect what the name RDD (Resilient Distributed Dataset) stands for.