Partitioning

Repartition vs Coalesce

Coalesce

Coalesce is a narrow dependency (one input (parent) partition influences a single output (child) partition)
Coalesce will still move some data. However, it is not a full shuffle, and always faster than a suffle from repartitioning.

Repartition

Repartition is a wide dependency (one input (parent) partition influences more than one output partitions)
Repartition will evenly distribute the data and will always involve in a full shuffle.

When to use what?

Partitioning Performance

Optimal #partitions

too few = =not ennough parallelism
too many = thread context switch for executors

Optimal partition size = 10 to 100 MB of uncompressed data
Determining the size of data (3 methods)

Cache Size: DF "native" size (compressed), uncompressed for RDDs.
SizeEstimator: not super accurate, bbbut worth getting the order of magnitude.
Query plan size in bytes: uncompressed data (DFs only)

Shuffle Partitioning

Partitioning determines the degree of parallelism in a job

Each task processes one partition.

Determines the degree of I/O parallelism
Small partitions

Data I/O overhead
Large task launch overhead
Easy to recompute if executor dies

Large partitions

More CPU uisage for actual data processing
Few tasks/parallelism
Long time to process
Large amount of memory needed
Hard to recompute if executor dies

Optimal Shuffle Partitions?

Things to keep in mind

Optimal partition size between 10 to 100 MB. (at max 200 MB)
CPU cores must not be idle

A Few Exercises

Partitioners