Spark Optimization: A Summary of Coding Practices

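Spark optimization can go in two directions: writing good code, and configuring resources sensibly. This article covers the first. The material comes from "Spark Performance Tuning & Best Practices" on sparkbyexamples, an excellent site whose only drawback is that it is entirely in English.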

The optimization ideas for Spark programs are summarized below:

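Use DataFrame instead of RDD. DataFrames benefit from Project Tungsten and the Catalyst optimizer. Tungsten improves the memory and CPU efficiency of Spark jobs, and Catalyst is Spark SQL's integrated query optimizer.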
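Use coalesce() instead of repartition(). When reducing the number of partitions, coalesce() is usually the right choice because it does not shuffle by default. If you want to increase the number of partitions you have to use repartition(), which does shuffle.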
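As a rough sketch of that rule, the helper below picks between the two calls based on the current partition count. `resize_partitions` is a hypothetical name, not part of Spark, and `df` is assumed to be a PySpark DataFrame.

```python
def resize_partitions(df, target):
    """Return `df` with `target` partitions, avoiding a shuffle when shrinking.

    Hypothetical helper for illustration; `df` is assumed to be a
    pyspark.sql.DataFrame.
    """
    if target < 1:
        raise ValueError("target must be a positive integer")

    current = df.rdd.getNumPartitions()
    if target < current:
        # Shrinking: coalesce() merges existing partitions without a full shuffle.
        return df.coalesce(target)
    if target > current:
        # Growing: repartition() redistributes the rows, which shuffles.
        return df.repartition(target)
    # Already at the requested partition count: nothing to do.
    return df
```

One thing to keep in mind: because coalesce() only merges partitions, the resulting partitions can be uneven in size. If balanced partitions matter more than avoiding the shuffle, repartition() may still be the better call.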
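Use mapPartitions() instead of map(). This pays off when each record needs some expensive setup, because the setup can then run once per partition rather than once per row.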
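A minimal sketch of the pattern, assuming an RDD of log lines. The compiled regular expression stands in for any costly per-partition setup such as opening a connection; the function names and the `ERR-<number>` line format are made up for illustration.

```python
import re


def extract_error_codes(lines):
    """Yield the numeric error code from every line in one partition."""
    # Stand-in for expensive setup: runs once per partition, not once per row.
    pattern = re.compile(r"ERR-(\d+)")
    for line in lines:
        match = pattern.search(line)
        if match:
            yield int(match.group(1))


def error_codes(rdd):
    # mapPartitions() hands the function an iterator over a whole partition
    # and expects an iterable of results back.
    return rdd.mapPartitions(extract_error_codes)
```

Writing the partition function as a generator keeps memory use flat, since rows are processed one at a time instead of being collected into a list first.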
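Use serialized data formats, for example Avro or Parquet rather than text, CSV, or JSON. (The source article also lists Kryo here, but Kryo is a serialization library rather than a file format.)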

Most Spark jobs run as part of a pipeline in which one job writes data to a file and another job reads that data, processes it, and writes to another file for the next job to pick up. In such cases, prefer writing intermediate files in serialized, optimized formats such as Avro or Parquet; transformations on these formats perform better than on text, CSV, or JSON.
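The hand-off between two jobs could look roughly like this. Both function names are hypothetical, `path` is whatever location the two jobs agree on, and `spark` is assumed to be an existing SparkSession.

```python
def write_intermediate(df, path):
    """Producer job: save the result as Parquet for the next job."""
    # "overwrite" replaces the output of a previous run; Parquet stores the
    # schema together with the columnar data.
    df.write.mode("overwrite").parquet(path)


def read_intermediate(spark, path):
    """Consumer job: load what the producer wrote."""
    # The schema comes from the Parquet files, so no inference pass over
    # the data is needed, unlike with CSV or JSON.
    return spark.read.parquet(path)
```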
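Avoid UDFs (user defined functions) where possible. A UDF is a black box to Spark, so Spark has no way to optimize it. If Spark SQL already has a built-in function for the job, do not write a UDF.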
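For instance, upper-casing a string column does not need a UDF, because Spark SQL already ships `upper()`. The sketch below assumes a DataFrame with a string column called `name`; the column and function names are made up for illustration.

```python
def add_upper_name(df):
    """Add an upper-cased copy of the (assumed) `name` column."""
    # upper() is a Spark SQL built-in, so the expression stays visible to
    # the optimizer. A Python UDF wrapping str.upper would produce the same
    # values but would be opaque to Spark.
    return df.selectExpr("*", "upper(name) AS name_upper")
```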
Persist and cache data in memory.

Spark's cache(), persist(), and unpersist() methods, and the details to watch out for
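A small sketch of how the three calls usually fit together when one DataFrame feeds several actions. `count_and_sample` is a hypothetical helper, and it uses the default storage level rather than choosing one explicitly.

```python
def count_and_sample(df, n=5):
    """Run two actions on the same DataFrame, caching it in between."""
    # persist() is lazy: it only marks the DataFrame for caching.
    cached = df.persist()
    try:
        total = cached.count()   # first action computes and caches the data
        sample = cached.take(n)  # later actions can read from the cache
        return total, sample
    finally:
        # Release the cached blocks once they are no longer needed.
        cached.unpersist()
```

Caching only helps when the data is read more than once. For a DataFrame used by a single action, persist() just adds memory pressure.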
Reduce expensive shuffle operations.

A Spark shuffle is an expensive operation because it involves all of the following:
- Disk I/O
- Data serialization and deserialization
- Network I/O
Disable DEBUG & INFO Logging
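One way to do this from inside a job is to raise the log level on the SparkContext. The helper below is a sketch with a hypothetical name, and `spark` is assumed to be an existing SparkSession.

```python
# Levels accepted by SparkContext.setLogLevel().
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def quiet_spark_logs(spark, level="WARN"):
    """Silence DEBUG and INFO output by raising the log level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    # Takes effect at runtime and overrides the level set in the log config.
    spark.sparkContext.setLogLevel(level)
```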