spark核心数据结构之RDD

RDD是Spark对数据的核心抽象，称为弹性分布式数据集(Resilient Distributed Dataset，简称RDD)，即为分布式的元素集合，其表示分布在多个计算节点上可以并行操作的元素集合。在Spark中，对数据的所有操作就是创建RDD、转化已有RDD以及调用RDD操作进行求值，然后Spark自动将RDD中的数据分发到集群上，将操作并行化执行。

/* Internally, each RDD is characterized by five main properties:
*
*  - A list of partitions
*  - A function for computing each split
*  - A list of dependencies on other RDDs
*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
*/
// 分区列表，用于执行任务时并行计算
protected def getPartitions: Array[Partition]
// 分区计算函数，使用分区函数对每一个分区进行计算
def compute(split: Partition, context: TaskContext): Iterator[T]
// RDD之间的依赖关系，多个计算模型进行组合时，需要建立多个RDD的依赖关系
protected def getDependencies: Seq[Dependency[_]] = deps
// 分区器，可以通过设定分区器自定义数据的分区(可选)
@transient val partitioner: Option[Partitioner] = None
// 首选位置，计算数据时，可根据计算节点的状态选择不同的节点位置来进行计算，可以减少一些不必要的网络传输
protected def getPreferredLocations(split: Partition): Seq[String] = Nil