Partitioning

Repartition vs Coalesce

image

  1. Coalesce
  • Coalesce is a narrow dependency (one input (parent) partition influences a single output (child) partition)
  • Coalesce still moves some data, but it avoids a full shuffle, so it is usually cheaper than the shuffle triggered by repartition. Without a shuffle it can only reduce the number of partitions.
  1. Repartition
  • Repartition is a wide dependency (one input (parent) partition influences more than one output (child) partition)
  • Repartition distributes the data evenly across the requested number of partitions and always involves a full shuffle.
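
The difference between the two dependency types can be sketched in plain Python, with partitions modeled as lists of rows. This is only a toy model under stated assumptions: the functions `coalesce` and `repartition` below are hypothetical stand-ins, not Spark's implementation. Real Spark groups parent partitions by locality rather than by index, and its round-robin repartitioning differs in detail.

```python
from itertools import chain


def coalesce(partitions, n):
    """Toy narrow dependency: whole parent partitions are merged into n children."""
    # Without a shuffle, coalesce cannot increase the partition count.
    n = min(n, len(partitions))
    children = [[] for _ in range(n)]
    for i, parent in enumerate(partitions):
        # Each parent is sent, intact, to exactly one child.
        children[i * n // len(partitions)].extend(parent)
    return children


def repartition(partitions, n):
    """Toy wide dependency: every row is dealt round-robin across n children."""
    children = [[] for _ in range(n)]
    for i, row in enumerate(chain.from_iterable(partitions)):
        # Rows from one parent end up spread over many children (a full shuffle).
        children[i % n].append(row)
    return children


# Skewed input: one large partition and three small ones (made-up data).
parents = [[1, 2, 3, 4, 5, 6], [7], [8, 9], [10]]

print([len(p) for p in coalesce(parents, 2)])     # [7, 3]  skew is preserved
print([len(p) for p in repartition(parents, 2)])  # [5, 5]  evenly balanced
```

In this model, coalesce keeps each parent in one piece, so the output sizes stay uneven, while repartition touches every row and produces balanced partitions.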

When to use what?

image

Partitioning Performance

image

image

  1. Optimal #partitions
  • Too few = not enough parallelism
  • Too many = overhead from scheduling and context-switching many small tasks on the executors
  1. Optimal partition size = 10 to 100 MB of uncompressed data

  2. Determining the size of data (3 methods)

  • Cache size: cache the data and read its size. DataFrames are reported at their "native" (compressed) size, RDDs at their uncompressed size.
  • SizeEstimator: not very accurate, but good enough to get the order of magnitude.
  • Query plan size in bytes: uncompressed data (DataFrames only)

Shuffle Partitioning

  1. Partitioning determines the degree of parallelism in a job
  • Each task processes one partition.
  1. Determines the degree of I/O parallelism
  2. Small partitions
  • Data I/O overhead
  • Large task launch overhead
  • Easy to recompute if executor dies
  1. Large partitions
  • More CPU usage for actual data processing
  • Few tasks/parallelism
  • Long time to process
  • Large amount of memory needed
  • Hard to recompute if executor dies
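
Because each task processes one partition, the partition count decides how many "waves" of tasks a stage needs. The sketch below shows that arithmetic. It assumes one task per core at a time and roughly equal task durations; `task_waves` is a hypothetical helper, not a Spark API.

```python
import math


def task_waves(num_partitions, total_cores):
    """Return (waves, tasks_in_last_wave) for a stage with one task per partition."""
    if num_partitions == 0:
        return 0, 0
    waves = math.ceil(num_partitions / total_cores)
    # Tasks left over for the final wave; fewer than total_cores means idle cores.
    last_wave = num_partitions - (waves - 1) * total_cores
    return waves, last_wave


print(task_waves(200, 16))  # (13, 8): the last wave uses only half the cores
print(task_waves(8, 16))    # (1, 8): too few partitions, 8 cores never get work
print(task_waves(192, 16))  # (12, 16): every wave keeps all cores busy
```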

Optimal Shuffle Partitions?

image

Things to keep in mind

  1. Optimal partition size is between 10 and 100 MB (200 MB at most)
  2. CPU cores must not be idle
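
One way to combine these two rules is sketched below: divide the shuffle size by a target partition size, then round up to a multiple of the available cores. The function name, the 100 MB default and the example sizes are assumptions for illustration, not values from the notes. The result is the kind of number one might then set as `spark.sql.shuffle.partitions`.

```python
import math

MB = 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=100 * MB):
    """Hypothetical helper: partition count sized by data volume and core count."""
    # Rule 1: keep each partition at or below the target size.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Rule 2: round up to a whole number of waves so no core sits idle.
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


# 50 GB of shuffle data on 16 cores: 512 partitions of 100 MB each.
print(suggest_shuffle_partitions(50 * 1024 * MB, total_cores=16))  # 512

# 1 GB on 16 cores: 11 partitions would satisfy the size rule, but rounding
# up to 16 keeps every core busy, at 64 MB per partition.
print(suggest_shuffle_partitions(1024 * MB, total_cores=16))       # 16
```

Rounding up rather than down makes partitions slightly smaller than the target, which stays inside the recommended size range in both examples.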

A Few Exercises

image

image

image

Partitioners

image

image