Pairwise distance scalability #375
Comments
I think you mean in the samples dimension? (If so, I agree - ~10K is as many samples as we could hope to support for O(n^2) like this)
On this, it is true that for many analyses it's entirely reasonable to work around memory limitations by downsampling data along the variants dimension. E.g., when doing a PCA we often randomly downsample variants, and this gives a very reasonable approximation; typically, information about population structure is very clear from just a fraction of the variants. In other words, the scalability test that @aktech is trying, running pairwise distance over all 20 million SNPs from a chromosome arm, is perhaps somewhat artificial, as you could reasonably downsample to 100,000 SNPs and get the information you need. On the other hand, we do have use cases where we want to compute pairwise distance using all SNPs. E.g., during QC we check for samples that are technically identical, and check that these match our expectations for replicate samples. To do this we compute pairwise distance at all SNPs. (To avoid unnecessary computation, we do an initial pass with a downsampled set of SNPs to exclude pairs of individuals that are obviously not identical, then rerun with all SNPs on the remaining samples. But that's a detail.)
No, I did mean the variants dimension. Re the samples dimension, yes, agreed, ~10k samples is probably the most we could hope to support due to computational complexity (although if you have access to GPUs then perhaps that allows you to reach a little further). My main point was about the variants dimension: computation scales linearly there, so there's no computational reason to expect any limit. We currently have a practical limit due to memory requirements, which could be removed if we supported chunking along the variants dimension via the original map/reduce implementation. Should we revisit that?
I see, thanks. Well, from my perspective I'd find it surprising as a user to be limited in the number of variants we could process by memory restrictions, since it's a sample x sample comparison. How have you used this in the past? Do you subset down to a representative set of sites, or just throw compute at it and let it run over the whole lot?
Yes, exactly.
We've done both, more details above.
Just to add, memory restrictions will be even tighter when moving this to GPU.
@alimanfoo I'm curious: I understand that the O(n^2) complexity is vicious, but how exactly have you arrived at ~10k (including potential horizontal scaling via dask)? Also a quick point regarding our current implementation, our current blockwise:

```python
x_distance = da.blockwise(
    # Lambda wraps reshape for broadcast
    lambda _x, _y: metric_ufunc(_x[:, None, :], _y),
    "jk",
    x,
    "ji",
    x,
    "ki",
    dtype="float64",
    concatenate=True,
)
```

we specify `concatenate=True`.
So afaiu in #345 we are concatenating on the variants blocks, and thus potentially the memory issue(?). @aktech, what's the reasoning behind …?
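(For reference, a toy illustration of what `concatenate=True` means here, not the sgkit code: when an index is contracted, dask either concatenates the corresponding blocks in memory before calling the function, or passes them as lists of blocks. Shapes and the function below are made up.)

```python
import numpy as np
import dask.array as da

x = da.ones((8, 6), chunks=(4, 2))  # three chunks along the contracted "i" axis

def func(a, b):
    # With concatenate=True each call sees the full contracted axis, e.g. a.shape == (4, 6):
    # dask concatenates the three "i" blocks in memory before calling us.
    # With concatenate=False, a and b would instead arrive as *lists* of (4, 2) blocks.
    return (a[:, None, :] * b[None, :, :]).sum(axis=-1)

out = da.blockwise(func, "jk", x, "ji", x, "ki", dtype="float64", concatenate=True)
print(out.compute().shape)  # (8, 8)
```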
I pulled this one out of thin air first. It was just an arbitrary number from experience - in practice, somewhere between 1K and 100K.
@ravwojdyla To summarise what I think of the current state: it's worth evaluating the map-reduce implementation for scalability.
Thanks @aktech, I was curious about the use of …, which might suggest we have not evaluated it. +1 to trying the "map-reduce" implementation on #345. Going forward, if the performance is important right now, we could also have both implementations and switch based on the shape of the data (and/or a parameter); unfortunately that would have a real maintenance cost (so that's something we need to be cautious about).
@ravwojdyla Yes, that's very true. I was also expecting the iterable of blocks to be lazy, which would make the memory consumption very low compared to the current implementation. I only realised this after I wrote a map-reduce implementation with … Also, one of the reasons the iterable of blocks is not lazy is probably due to the fact that the …
@aktech so in general it "seems" like for …
Yes, true.
What do you mean by "keep the chunks"? Those chunks are computed once, from the original array, to make the input for …
Yes, I think so. That would be worth mentioning.
@aktech yea, per process (threads within a process will reuse memory if needed). I just wonder at which stage exactly Dask will free up the memory required for the chunks on each process, and how exactly it will build up the input for …

Edit: but just to be clear, I agree that the …
Ah, that's interesting. For me it's only in the first iteration of the func that you'll see an empty numpy array; if you check on the next iteration it will be a Python list. That execution of the function happens even before you call `compute`.
@aktech I think I found the issue: if there is a contraction and …
Interesting, good to know. By the way, I discovered something interesting about utilising the symmetry of the matrix in dask. We had assumed so far that we could drop duplicate calculation via the following line, which basically sets the lower triangle to 0, since the distance matrix is symmetric. I was trying to prove this today, but from what it looks like, our assumption was not true, unless I am missing something. Here is a notebook of that analysis showing that the above line doesn't make dask utilise the symmetry of the matrix to drop duplicate calculations; in fact it takes more time. At the moment I am trying to evaluate the map-reduce approach, which does not use the …
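(The "following line" referred to above was lost in this copy of the thread; judging from the `da.triu` sketch later in the discussion, it was presumably something along these lines. A minimal, hypothetical reconstruction:)

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the pairwise distance matrix (symmetric, zero diagonal).
d = da.from_array(np.array([[0., 1., 2.],
                            [1., 0., 3.],
                            [2., 3., 0.]]), chunks=2)

# The assumption under test: zeroing the lower triangle should let dask skip
# the duplicate lower-triangle blocks...
t = da.triu(d)
# ...and the full matrix can be recovered afterwards (safe here only because the
# diagonal of a distance matrix is zero, so it is not doubled).
full = t + t.T
print(full.compute())
```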
FYI: https://stackoverflow.com/questions/64774771/does-blockwise-allow-iteration-over-out-of-core-arrays cc: @mrocklin, if you get a moment we could use your help understanding the scalability limits of blockwise. A more specific question is framed in the SO post.
Answered. I also hypothesize in my answer that you're running into problems with tensordot-style applications, which are heavily dependent on chunking structure. See dask/dask#2225. If folks here were interested in implementing some of the automatic rechunking heuristics mentioned in that issue, that would be very welcome.
Thanks @mrocklin. Would a simplistic example of this be to decrease chunking along non-contracted axes as the contracted axes get larger, so that the concatenated chunks still fit in memory? As a simpler example, say we want to do matrix multiplication via blockwise on 100 x 100 arrays with 10 x 10 chunks. Now say we want to do the same on 1M x 1M arrays with 10 x 10 chunks -- we now need to fit 1M / 10 chunks in memory instead of 100 / 10 chunks, so we could decrease the chunks in the first axes to try to compensate for this. This is what those heuristics would accomplish, correct?
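(A rough sketch of that idea, purely illustrative and not the heuristic dask actually implements: pick the samples-axis chunk size so that one task's concatenated input, i.e. a samples chunk spanning the full contracted variants axis, stays under a memory budget. The function name and the budget are made up.)

```python
import dask.array as da

def shrink_noncontracted_chunks(x, mem_budget_bytes=128e6):
    """Illustrative only: choose a samples-axis chunk size so that one
    (samples_chunk x all_variants) concatenated block fits the budget."""
    n_samples, n_variants = x.shape
    bytes_per_row = n_variants * x.dtype.itemsize  # the contracted axis gets concatenated in full
    samples_chunk = max(1, int(mem_budget_bytes // bytes_per_row))
    return x.rechunk({0: samples_chunk})

# At Ag1000G-like scale (~1000 samples x ~20M variants, float32) each row is ~80 MB,
# so only ~1 sample per chunk fits in a 128 MB budget -- i.e. very short chunks.
x = da.zeros((1_000, 20_000_000), chunks=(100, 100_000), dtype="f4")
print(shrink_noncontracted_chunks(x).chunks[0][:3])  # (1, 1, 1)
```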
If I recall correctly the heuristic was pretty unintuitive. I think that one wanted intermediate result chunks that were as large as possible, because this cut down significantly on how much communication and duplication of the input arrays you had to do. I don't fully recall though.

My recollection is that tensordot computations end up blowing up memory more often because of the all-to-all nature of the computation. The input dataset ends up being copied onto distributed memory many times when one doesn't do things carefully.
I ran the map-reduce pairwise implementation on the MalariaGEN dataset on a Coiled Cloud dask cluster with the following configuration:
Code:
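(The benchmark code itself isn't preserved in this copy of the thread; below is a minimal sketch of the kind of call being timed, assuming sgkit's `pairwise_distance` API. The simulated array is a stand-in for the real ~1000 samples x ~20M variants genotype matrix loaded from zarr; the shapes and metric here are placeholders.)

```python
import dask.array as da
from sgkit.distance.api import pairwise_distance

# Simulated stand-in for the real (samples x variants) genotype matrix,
# chunked at roughly 100 MB per block as in the runs described here.
x = da.random.random((1_000, 200_000), chunks=(100, 100_000))

d = pairwise_distance(x, metric="euclidean")
# Depending on the sgkit version, `d` may be a dask array that still needs d.compute().
```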
Here are the results:

1. Chunk size: 100 MB
Takeaway: Everything is quick except the …

2. (a) Chunk size: 100 MB, using Numba's …
I was comparing it with scikit-allel as suggested by Eric in the dev call today. It doesn't seem to work with even a small array of the MalariaGEN data. Here is the notebook for the same.
Hi @aktech
Looks like you need a petabyte of memory :-) Perhaps Eric was suggesting to compare with the scikit-allel v2 prototype? The pairwise distance function in scikit-allel v1 is just scipy pdist, not dask enabled.
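(For context, a back-of-the-envelope check of why a plain scipy pdist blows up here if the pairwise axis ends up being the ~20M-variant dimension rather than the ~1000-sample one:)

```python
# Rough arithmetic only; assumes ~2e7 items on the pairwise axis and float64 output.
n = 20_000_000
condensed_len = n * (n - 1) // 2      # number of pairs pdist would produce
bytes_needed = condensed_len * 8      # float64
print(bytes_needed / 1e15, "PB")      # ~1.6 PB -- hence "a petabyte of memory"
```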
Yes, that's the one I tried. Here is the environment I used for workers (note that I had to remove pinning from libraries to resolve conflicts). I also faced this issue, which was fixed by upgrading … Although, I just realised that the code was calling the numpy backend for some reason, even though a dask array was passed to it. I will play with it again to see why.
Sorry, I should've looked more carefully. I think the problem is transposing the input array. The skallel-stats pairwise distance function computes pairwise distance between columns, not rows. This is because the genotype data we have is typically arranged this way (columns are samples).
Just to add, I have run the skallel pairwise distance on a simulated dataset approaching the size of Ag1000G: https://github.com/scikit-allel/skallel-stats/blob/master/examples/pairwise_distance.ipynb Note that I created the simulated zarr dataset with chunking best for pairwise distance, i.e., no chunks along the samples dimension. I just tried running on Ag1000G on malariagen datalab and it's a little awkward because the chunks are quite tall (~500,000) and skallel rechunks the data to remove any chunks in the samples dimension, so memory is a challenge. I can get it to run if I use dask rechunk on the input data to make chunks shorter, which is then suboptimal for I/O (because the same chunk is being read multiple times) but does work. But it doesn't seem to be running particularly well. Perhaps try a simulated dataset for now?
I also had a hard time trying to run it on the MalariaGEN data; couldn't get it working.
I ran it on a simulated dataset, similar to the size of MalariaGEN. It seems to run very well; here is the notebook for the same. It took about 5 min 32 sec, on 25 workers (same configuration as above).
Also, to compare, the map-reduce implementation takes roughly 18 minutes on the same simulated data, whereas it takes ~10 minutes on the MalariaGEN data. Also, note that the MalariaGEN data and the simulated data are of nearly the same size (~28.5 GB).
@alimanfoo @eric-czech any thoughts on this? Is that fast enough?
So, just to clarify, we're looking at about a 4X slow-down for the map-reduce implementation? This seems OK, if it allows us to scale.
Hi @aktech, echoing @jeromekelleher, I would say that speed per se is less important than scalability. I.e., whether it takes 5 minutes or 18 minutes is less important than being able to run this robustly within reasonable memory limits, and demonstrating that increasing the cluster size makes the whole computation complete faster (up to some natural limit such as the number of chunks in the data).
That is true for simulated data with random values, although on the actual data it's ~10 mins, which is a 2X slow-down, assuming the skallel implementation is able to run on the MalariaGEN data with the same performance as it has on the simulated data.
So, this is reduced to 10 minutes with better chunking.

@alimanfoo I think it is able to scale well, and memory hasn't been a problem in the various experiments I have done on the map-reduce implementation (if each worker has an 8 GB memory allowance, which seems reasonable to me). The chunk size was 100 MB in all of these cases.

1. MacBook Pro 16 GB (2 workers, 4 threads)
I am able to run simulated data of the same size as MalariaGEN on my laptop as well; here is the notebook: https://nbviewer.jupyter.org/github/aktech/aktech.github.io/blob/master/work/sgkit/notebooks/sgkit_local_full.ipynb. Reports: Dask performance report (https://iamit.in/work/sgkit/pairwise-analysis/dask-report-16G-4cpu-4threads-local-full.html) and Dask task stream (https://iamit.in/work/sgkit/pairwise-analysis/task-stream-16G-4cpu-4threads-local-full.html). It took about 8 hours 19 minutes overnight; there would have been some background tasks running on my laptop too, and I suspect it might have slept and paused the workflow at night, but the point is that I was able to run it on my laptop. I chose two workers to give 8 GB of memory to each worker and didn't have any memory issues; 8 GB per worker seems like a good memory allowance.

2. Coiled cluster: 24 workers (16 GB, 4 cores each), MalariaGEN data
Reports: Dask performance report (https://iamit.in/work/sgkit/pairwise-analysis/dask-report-16G-4cpu-4threads.html) and Dask task stream (https://iamit.in/work/sgkit/pairwise-analysis/task-stream-16G-4cpu-4threads.html).
Time taken: 10 min 48 sec

3. Coiled cluster: 50 workers (16 GB, 4 cores each), MalariaGEN data
Reports: Dask performance report (https://iamit.in/work/sgkit/pairwise-analysis/dask-report-16G-4cpu-4threads-50W.html) and Dask task stream (https://iamit.in/work/sgkit/pairwise-analysis/task-stream-16G-4cpu-4threads-50W.html).
Time taken: 5 min 54 sec

Let me know if this sounds reasonable. Also, thanks to @mrocklin's @coiled for generously increasing my cores limit to 200.
I don't remember doing this, but I'm glad that it happened regardless. If you all want a team account with more capacity then do let me know.
Looking at the performance reports it seems like you're definitely not dask-communication bound, so I would not expect decreasing chunking to improve performance. This aligns with @aktech's intuition from the meeting.
That was me 😄 but yes, if more capacity is needed either of us is happy to help!
Hey @mrocklin, looking at matmul (https://github.com/dask/dask/blob/ff3ea6c74e8736a5823d648defcf9164f083d266/dask/array/routines.py#L313), it looks like it must have contracted axes .. right? It would be great to know if it's possible to implement matrix multiplication using reductions and no contractions instead, since we could easily extend that to our more general pairwise metrics.

I am surprised to see that matmul does not just reuse the tensordot implementation, or at least use the same trick used in tensordot. That trick is useful and definitely results in smoother computations.
Could we try replacing calls to `matmul` with `tensordot`?
Good idea @tomwhite, I tried swapping that in the notebook mentioned at https://github.com/pystatgen/sgkit/issues/390#issuecomment-730449672 and saw this on small simulated arrays: …

Do you think it is worth starting a conversation as a dask issue about changing the matmul implementation @mrocklin?
Btw in the meantime for 2d input we could also use dot, which in dask under the hood uses tensordot (and should give the same results as matmul for 2d). +1 to starting the discussion about matmul using the same trick of blockwise product followed by reduction, and maybe potentially having common code for tensordot, matmul, dot, einsum (it looks like it's also using this trick btw).
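(A quick toy illustration of the interim swap being suggested; the three expressions below are numerically equivalent for 2-D inputs, but their task graphs, and therefore their memory behaviour, differ, which is what this discussion is about. Shapes are made up.)

```python
import dask.array as da

x = da.random.random((1_000, 200_000), chunks=(200, 20_000))

xxt_matmul = x @ x.T                           # the da.matmul path discussed above
xxt_dot = da.dot(x, x.T)                       # for 2-D inputs this goes through tensordot
xxt_tensordot = da.tensordot(x, x.T, axes=1)   # blockwise product followed by a sum reduction
```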
> Do you think it is worth starting a conversation as a dask issue about changing the matmul implementation @mrocklin?

Definitely
Also, nice find
I have implemented a blockwise version of pairwise in this branch (based on @ravwojdyla's implementation of matmul in this PR: dask/dask#7000). The blockwise implementation doesn't seem to be any better than the blocks implementation (the one for which I shared the results above). I ran some benchmarks to compare them; here is the comparison of memory and CPU usage between the two implementations: …

Data:
In general the blockwise implementation was a bit slower than the blocks implementation; it took the following times:
Example notebook for this comparison is here.
@aktech I reviewed this version; I believe the blockwise implementation can perform as well as the bespoke blocks implementation. Here are some comments: …

Furthermore (orthogonal to the blockwise investigation):
Btw for future reference:
Don't hesitate to ping me if something is not clear or problematic.
When a pair of partial vectors are evaluated for the map step of the pairwise … To elaborate on this based on my understanding of blockwise: when we perform blockwise, the actual output dimension is not known to dask, since it can never know what the function is doing to the dimension of the output. The only way for blockwise to know its dimension is via its arguments, which is … Also, note that the … Let me know if this doesn't make sense or seems inaccurate.
I see, I have never used that. I'll have a look. Do you reckon it will make it more efficient? Or is that a standard way of doing reduction in dask?
I think so, I'll give it a go.
Not much really, I think it started with having some for loops initially maybe, and then it was never removed.
Yes, sure. Apologies for the missing information. I used this branch for comparison. I have now updated the notebook to reflect the actual experiment.
Ah, I see, it's for the …
Yes, here's a sketch of the …:

```python
o = da.blockwise(
    _pairwise,
    'ijk',
    x,
    'ik',
    x,
    'jk',
    adjust_chunks={'k': 1},
    dtype=x.dtype,
    concatenate=False,
    h=np.empty(n_map_param, dtype=x.dtype),
)
r = da.reduction(
    o,
    chunk=lambda x_chunk, axis, keepdims: x_chunk,
    combine=lambda x_chunk, axis, keepdims: x_chunk.sum(-1)[..., np.newaxis],
    aggregate=lambda x_chunk, axis, keepdims: metric_reduce_ufunc(x_chunk),
    axis=-1,
    dtype=np.float64,
    name="pairwise",
)
t = da.triu(r)
return t + t.T
```

On my laptop, this performs as well as the custom blocks. Here it's also worth mentioning that …
If there is no need for it, it should be removed?
Having the notebook as a GH gist isn't great, because it's hard to rerun it or track it. I assume the gist should be run using the code from the branch, but it would be a lot clearer if the notebook just lived in the branch for the time of the comparison, so it's easy to just clone it and run it or change stuff. Does that make sense @aktech?
Oh, nice! Thanks, let me do the same for correlation too.
Yes, definitely.
Ah, I see what you mean. I did that because many times the IPython notebooks in a repository are not loaded at all, so if people just want to see the results there isn't a way, and nbviewer doesn't seem to load from a repo link. Although I completely agree with your point, it makes review difficult. I'll keep that in mind for the future, thanks!
@ravwojdyla I have now used the …
I'll run it again on a slightly larger dataset to compare and will share a notebook thereafter; meanwhile the implementation is in this branch. I have also removed …
Raising this issue to revisit the scalability of our pairwise distance calculation and whether it's worth returning to a map-reduce style implementation that would allow chunking along both dimensions.
In the work that @aktech is doing on early scalability demonstrations (#345), some memory usage difficulties are emerging. @aktech is, I believe, trying to run a pairwise distance computation over data from Ag1000G phase 2, using all SNPs and samples from a single chromosome arm. This is of the order of ~20 million variants and ~1000 samples. With the current implementation, it is hard to get this to run on systems with average memory/CPU ratios (below 12G per CPU). My understanding is that, essentially, this is because the pairwise distance implementation currently does not support chunking along the variants dimension, and so to reduce the memory footprint you need short chunks along the samples dimension. Depending on how the input data have been chunked natively, this may be suboptimal, i.e., you may need to run the computation with sample chunks that are (much) shorter than the native storage.
If this is correct, then it raises two questions for discussion.
First, should we revisit the map-reduce implementation of pairwise distance? This would allow chunking on both the samples and variants dimensions, and so could naturally make use of whatever underlying chunking the data have in storage, without a large memory footprint (see the sketch at the bottom of this issue).
Secondly, do we ever really need to run pairwise distance on arrays that are large in the variants dimension? I.e., do we care about scaling this up to large numbers of variants? xref https://github.com/pystatgen/sgkit/pull/306#issuecomment-714654217
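(For concreteness, a minimal sketch, not the sgkit implementation, of what a map-reduce Euclidean pairwise distance could look like: each variants chunk contributes a samples x samples partial sum of squared differences, the partials are summed, and the square root is taken once at the end, so peak memory per task is roughly one samples x samples block plus one variants chunk, independent of the total number of variants.)

```python
import numpy as np
import dask.array as da

def pairwise_euclidean_mapreduce(x: da.Array) -> da.Array:
    """Sketch only: x is (samples, variants); the variants dimension may be chunked."""
    boundaries = np.cumsum((0,) + x.chunks[1])
    partials = []
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        block = x[:, start:stop]
        diff = block[:, None, :] - block[None, :, :]  # (samples, samples, chunk_variants)
        partials.append((diff ** 2).sum(axis=-1))     # map step for this variants chunk
    return da.sqrt(sum(partials))                     # reduce step: sum partials, sqrt once

# Toy usage:
x = da.random.random((100, 10_000), chunks=(100, 1_000))
print(pairwise_euclidean_mapreduce(x).compute().shape)  # (100, 100)
```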