Apache Spark - Best Practices and Tuning

Introduction
RDD
DataFrame
- Joining a large and a small Dataset
- Joining a large and a medium-sized Dataset
Storage
- Use the Best Data Format
- Cache Judiciously and use Checkpointing
Parallelism
Serialization and GC
- Tuning Java Garbage Collection
- Serialization
References