- Coalesce
- Coalesce is a narrow dependency (one input (parent) partition influences a single output (child) partition)
- Coalesce will still move some data. However, it is not a full shuffle, and always faster than a suffle from repartitioning.
- Repartition
- Repartition is a wide dependency (one input (parent) partition influences more than one output partitions)
- Repartition will evenly distribute the data and will always involve in a full shuffle.
- Optimal #partitions
- too few = =not ennough parallelism
- too many = thread context switch for executors
-
Optimal partition size = 10 to 100 MB of uncompressed data
-
Determining the size of data (3 methods)
- Cache Size: DF "native" size (compressed), uncompressed for RDDs.
- SizeEstimator: not super accurate, bbbut worth getting the order of magnitude.
- Query plan size in bytes: uncompressed data (DFs only)
- Partitioning determines the degree of parallelism in a job
- Each task processes one partition.
- Determines the degree of I/O parallelism
- Small partitions
- Data I/O overhead
- Large task launch overhead
- Easy to recompute if executor dies
- Large partitions
- More CPU uisage for actual data processing
- Few tasks/parallelism
- Long time to process
- Large amount of memory needed
- Hard to recompute if executor dies
Things to keep in mind
- Optimal partition size between 10 to 100 MB. (at max 200 MB)
- CPU cores must not be idle