Description
Raising this issue to revisit the scalability of our pairwise distance calculation and whether it's worth returning to a map-reduce style implementation that would allow chunking along both dimensions.
In the work that @aktech is doing on early scalability demonstrations (#345), some memory usage difficulties are emerging. I believe @aktech is trying to run a pairwise distance computation over data from Ag1000G phase 2, using all SNPs and samples from a single chromosome arm, which is on the order of ~20 million variants and ~1,000 samples. With the current implementation, it is hard to get this to run on systems with typical memory/CPU ratios, i.e., below 12 GB per CPU. My understanding is that this is essentially because the pairwise distance implementation does not currently support chunking along the variants dimension, so the only way to reduce the memory footprint is to use short chunks along the samples dimension. Depending on how the input data have been chunked natively, this may be suboptimal, i.e., you may need to run the computation with sample chunks that are (much) shorter than the chunks in native storage.
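For intuition, here is a rough back-of-envelope calculation of that constraint. It assumes float64 values and that, without variant chunking, each task must hold two full-variant-length sample chunks in memory; these are assumptions for illustration, not a precise description of the current implementation.

```python
# Back-of-envelope only; assumes float64 values and that each task holds two
# full-variant-length sample chunks (an assumption, not a statement about the
# exact memory profile of the current implementation).
n_variants = 20_000_000
bytes_per_value = 8
gb_per_sample_column = n_variants * bytes_per_value / 1e9   # ~0.16 GB per sample
mem_budget_gb = 12                                          # typical GB per CPU
max_chunk_width = mem_budget_gb / (2 * gb_per_sample_column)
print(f"{gb_per_sample_column:.2f} GB per full-length sample column")
print(f"sample chunk width capped at roughly {max_chunk_width:.0f} samples")
```

Under those assumptions the sample chunk width is capped at a few dozen samples, well below typical native storage chunk widths.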
If this is correct, then it raises two questions for discussion.
First, should we revisit the map-reduce implementation of pairwise distance? This would allow chunking along both the samples and variants dimensions, and so could naturally make use of whatever chunking the data already have in storage, without a large memory footprint.
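For concreteness, here is a minimal sketch of what a map-reduce style implementation could look like using `dask.array.blockwise`, with Euclidean distance as the example metric (its per-variant-chunk partial sums combine by simple addition). This is not sgkit's current API; the function names are illustrative, and a real implementation would avoid computing both the (i, j) and (j, i) blocks.

```python
import dask.array as da
import numpy as np


def _partial_sq_dists(a, b):
    # a: (variant_chunk, samples_i), b: (variant_chunk, samples_j).
    # "Map" step: partial sum of squared differences over this variant chunk
    # for every pair of samples across the two sample chunks.
    return ((a[:, :, None] - b[:, None, :]) ** 2).sum(axis=0)[None, :, :]


def pairwise_euclidean(x: da.Array) -> da.Array:
    # x: (variants, samples), chunked along both dimensions.
    partials = da.blockwise(
        _partial_sq_dists,
        "kij",
        x, "ki",
        x, "kj",
        dtype=np.float64,
        adjust_chunks={"k": lambda n: 1},  # each variant chunk collapses to one partial
    )
    # "Reduce" step: sum the partials across variant chunks, then take the sqrt.
    return da.sqrt(partials.sum(axis=0))


if __name__ == "__main__":
    # Small example, chunked along both dimensions.
    x = da.random.random((100_000, 200), chunks=(10_000, 50))
    print(pairwise_euclidean(x).compute().shape)  # (200, 200)
```

Because the partial matrices combine by addition, the `partials.sum(axis=0)` can be executed as a tree reduction, so per-task memory is governed by a single (variant_chunk × sample_chunk × sample_chunk) block rather than by the full variants dimension.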
Second, do we ever really need to run pairwise distance on arrays that are large in the variants dimension? I.e., do we care about scaling this computation up to very large numbers of variants? xref https://github.com/pystatgen/sgkit/pull/306#issuecomment-714654217