persist in Spark

persist

/**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
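
To make the local-checkpointing exception in the source above concrete, here is a minimal sketch (assuming an existing SparkContext named sc; the RDD contents are placeholders). Normally, calling persist a second time with a different level throws an UnsupportedOperationException; after localCheckpoint(), the new level is instead accepted and adapted to use disk:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
rdd.localCheckpoint()                   // marks the RDD with a disk-backed storage level
rdd.persist(StorageLevel.MEMORY_ONLY)   // accepted: adapted internally to include disk
println(rdd.getStorageLevel)            // prints a level that uses both memory and disk

val other = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
// other.persist(StorageLevel.DISK_ONLY)  // would throw UnsupportedOperationException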
  • By default, an RDD is recomputed from its lineage every time an action is called on it. Persisting the RDD overcomes this: after the first computation, each node stores the partitions it computed, and later actions read those stored partitions instead of recomputing them.
  • The cache mechanism is used to speed up applications that access the same RDD several times.
  • cache is a synonym for persist with the default storage level, i.e. persist(StorageLevel.MEMORY_ONLY); see the sketch below.
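
A minimal, self-contained sketch of the default caching path (the application name and local master are illustrative placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PersistDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    squares.cache()
    // equivalent: squares.persist(StorageLevel.MEMORY_ONLY)

    println(squares.count()) // first action: computes and stores the partitions
    println(squares.sum())   // second action: served from the cached partitions

    spark.stop()
  }
}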
When to Use persist

In short, any RDD that will be reused is a candidate for persist. The following are typical situations in which the cache mechanism pays off:

  • When we re-use an RDD in iterative machine learning applications
  • When we re-use an RDD in standalone Spark applications
  • When RDD computations are expensive: caching also reduces the cost of recovery if an executor fails (see the sketch after this list)
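
For the iterative case, a minimal sketch (the data set, iteration count, and threshold logic are illustrative assumptions; sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// An RDD whose lineage is expensive to recompute (the flatMap stands in for real parsing work).
val data = sc.parallelize(1 to 100000)
  .flatMap(n => Seq(n, n * 2, n * 3))
  .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if memory runs short

// Each iteration runs an action over the SAME RDD; without persist,
// the flatMap lineage would be re-executed on every pass.
for (i <- 1 to 5) {
  val above = data.filter(_ > i * 10000).count()
  println(s"iteration $i: $above elements above the threshold")
}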
The Need for a Persistence Mechanism

Persistence lets us use the same RDD multiple times in Apache Spark. Without it, every action that touches the RDD re-evaluates its entire lineage, which costs both time and memory. Iterative algorithms are the worst case, since they look at the same data many times and pay that cost on every pass. The persistence techniques were introduced to overcome this repeated computation.

How to Unpersist an RDD in Spark

When cached data exceeds the available memory, Spark automatically evicts old blocks. This eviction follows an LRU (Least Recently Used) policy: the partitions that were accessed least recently are dropped first. Eviction can therefore happen automatically, or we can remove an RDD from the cache ourselves with the RDD.unpersist() method, as the sketch below shows.
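
A minimal sketch of manual eviction (assuming an existing SparkContext named sc):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)
println(rdd.count())            // the first action materializes the cached partitions

rdd.unpersist(blocking = true)  // blocking = true waits until all blocks are removed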

Conclusion

Hence, Spark's RDD persistence and caching mechanisms are optimization techniques that store the results of RDD evaluation, in memory or on disk, so that upcoming stages can reuse those results instead of recomputing them.

Reposted from blog.csdn.net/zhixingheyi_tian/article/details/85068697