Implement pairwise distance calculation #306
Conversation
Thanks @aktech! Looks good, nice work figuring out how to omit the blocks for one of the triangular matrices. I was feeling a little nervous about the map/reduce recommendation, so I did some more research and found a couple of key things:
It is certainly worth considering that the level of scalability we're trying for here isn't something dask attempts in its core operations, so I put together a bunch of different options and compared them in this notebook. My takeaways from this:
tl;dr I think it still makes sense to move forward with this implementation, but we need to communicate the fact that performance is at least somewhat comparable to functions like:

```python
@guvectorize(['void(int8[:], int8[:], float64[:])'], '(n),(n)->()', fastmath=True)
def corrcoef(x, y, out):
    # Equivalent to np.corrcoef(x, y)[0, 1] but 2x faster
    cov = ((x - x.mean()) * (y - y.mean())).sum()
    out[0] = cov / (x.std() * y.std()) / x.shape[0]

def pairwise(x, fn):
    return da.blockwise(
        # Lambda wraps reshape for broadcast
        lambda x, y: fn(x[:, None, :], y), 'jk',
        x, 'ji', x, 'ki',
        dtype='float64',
        concatenate=True
    )

pairwise(x, corrcoef)
```

I'm also thinking that there is a way to express the loop we have now that references …
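(For illustration only, a minimal sketch of how the snippet above might be exercised end to end; the toy array shape, chunking, and values here are my assumptions, not from the thread:)

```python
import dask.array as da

# (uses the corrcoef and pairwise definitions from the snippet above)

# Toy input: 1,000 row vectors of length 500, chunked along rows only
x = da.random.randint(0, 3, size=(1000, 500), chunks=(100, 500)).astype('int8')

result = pairwise(x, corrcoef)   # lazily builds a (1000, 1000) correlation matrix
print(result[:2, :2].compute())  # evaluate a small corner as a sanity check
```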
@eric-czech Thanks for the analysis, that's really very useful. Option 6 is indeed much faster than 5 with a slight change; let me try to get rid of the column chunks to incorporate that. Also, since the current implementation omits the blocks for one of the triangular matrices, it should be even faster! Also, do you have any comments on the proposed approach for running on GPU?
For posterity after talking on our dev call, it would be ideal if a user could initiate GPU computations like this:

```python
(
    ds
    .pipe(sg.count_call_alternate_alleles)
    # Move to gpu as in https://docs.dask.org/en/latest/gpu.html#arrays
    .assign(count_call_alternate_alleles=lambda ds: ds.count_call_alternate_alleles.map_blocks(cupy.asarray))
    # pairwise_distance knows how to pick guvectorize functions based on type of array._meta
    .pipe(sg.pairwise_distance, metric='euclidean')
)
```

I think that would be best since many other dask functions will work automatically when cupy is the chunk backend. As far as detecting this goes, checking the type of the underlying `array._meta` should be enough.
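(A minimal sketch of what that detection might look like; the helper name and structure are my own illustration, not code from the PR — only the `_meta` attribute itself comes from dask:)

```python
import dask.array as da

def _is_cupy_chunked(arr: da.Array) -> bool:
    """Hypothetical helper: dask arrays expose an empty `_meta` array
    whose concrete type identifies the chunk backend."""
    try:
        import cupy
        return isinstance(arr._meta, cupy.ndarray)
    except ImportError:
        # cupy not installed, so the chunks cannot be cupy-backed
        return False
```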
Codecov Report

```
@@            Coverage Diff             @@
##           master     #306      +/-   ##
==========================================
- Coverage   96.49%   95.33%   -1.17%
==========================================
  Files          29       31       +2
  Lines        2053     2099      +46
==========================================
+ Hits         1981     2001      +20
- Misses         72       98      +26
```

Continue to review the full report at Codecov.
@eric-czech I was updating the implementation to get rid of column chunks and realised that chunking along the last dimension is not required anymore, i.e. the complete distance calculation can be done in the map step itself: because the map step receives a full pair of vectors, the reduce step can be performed right after it (together, basically), as there are no chunks to aggregate in the reduce step. Let me know if that doesn't make sense.
Makes sense, the custom functions can use full vectors if there are no column chunks (which may not necessarily be true). Is there a question I can help answer with that though?
Since we are getting rid of column chunks (via …), is there still any need for the separate map/reduce functions?
When I suggested that, I meant it as default but optional behavior. There would definitely be no need for map/reduce functions if we made that mandatory. Two things would make sense here in moving forward: …

In the interest of time/simplicity, I'll recommend switching to 2 now. We can just get something basic in that way and iterate from there. Does that make sense @aktech? Then all the custom functions can work on whole vectors (a sketch of what that implies follows below).
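(As a sketch of what option 2 implies in code; the chunk sizes here are placeholders of my own choosing:)

```python
import dask.array as da

x = da.random.random((10000, 500), chunks=(1000, 100))

# A single chunk along axis 1 (-1) means every block holds whole row
# vectors, so each vectorized call sees complete pairs of vectors and
# no separate reduce step is needed.
x = x.rechunk({0: 1000, 1: -1})
```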
Yes, that makes perfect sense, but one quick question: by moving forward with 2 now, are you suggesting we should skip 1 for now, i.e. in this PR? Also, things overlap between the 1st and 2nd points you made, for example (Make …
Yes.
Yep -- you can omit any code or documentation related to chunking. I meant everything in the number 1 and 2 scenarios to be exclusive. I am imagining a single …
@eric-czech Thanks for the clarification, I have pushed the changes. I'll update the docs in a bit. Can you confirm whether the change seems appropriate for what you suggested? Also, I did a comparison between Alistair's prototype and the map/reduce implementation in a notebook, which I'll share in a bit. (The gist was his application of …
Thanks @aktech, that's more or less what I was picturing. I'll hold off on any specific questions/suggestions though until the PR comes out of draft status.
Thanks, I'll bring this out of draft in an hour or so, after I update the documentation.
Here is a brief comparison of speed, memory, and CPU utilisation between this implementation, the previous implementation, and Alistair's prototype on a matrix of size 100 million: notebook. Note that the branch was changed for 3, 4, 5 in the above notebook, which means the function used is still …
@eric-czech @alimanfoo Thanks for enlightening me on this, that makes sense to me now. I have updated the metrics to handle missing values. Basically, negative values and NaN are considered missing, as per the discussion above. I have also added tests for this and for all the functionality so far. Also, I had to get rid of …
Thanks @aktech, this looks great! Whatever happened with the GPU experiments? Can you drop some details on that into an issue if you weren't able to get it working?
Good to know, but does it matter if all the checks in the guvectorize functions are for values > 0? |
I am planning to work on it this week. I gave it a quick try with a GPU on a borrowed machine last week and it didn't work straight away, throwing a NotImplementedError from cupy, probably due to the guvectorize functions; I didn't put enough time into it last week to conclude anything. I don't have access to that machine anymore, so now I am creating one properly in the cloud on GCP to do proper testing. I'll share the results after that; meanwhile, I'll create an issue for the same. Update: GPU issue: https://github.com/pystatgen/sgkit/issues/338

It does actually, for filtering nan values: when comparing > 0, it should know that np.nan > 0 returns False, which it doesn't in the case of fastmath=True.
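(To illustrate the point — a standalone sketch of mine, not code from the PR: with fastmath enabled, LLVM is allowed to assume no NaNs, so comparisons used as missing-value checks can silently misbehave.)

```python
import numpy as np
from numba import guvectorize

# Same kernel compiled with and without fastmath; only the safe variant
# is guaranteed to treat np.nan > 0 as False.
def make_counter(fastmath):
    @guvectorize(['void(float64[:], int64[:])'], '(n)->()', fastmath=fastmath)
    def count_positive(x, out):
        out[0] = 0
        for i in range(x.shape[0]):
            if x[i] > 0:  # NaN should fail this check
                out[0] += 1
    return count_positive

data = np.array([1.0, np.nan, 2.0, -1.0])
print(make_counter(False)(data))  # 2: NaN correctly excluded
print(make_counter(True)(data))   # may differ, since fastmath assumes no NaNs
```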
For GPU experiments, have you tried Google colab? I think they give you an Nvidia GPU for free.
I gave it a quick try yesterday and setting up the environment was a bit non-trivial, like installing conda, adding stuff to the path, and installing system requirements. I didn't think it would be easily usable, so I didn't put much effort into it; now that you have suggested it, I'll give it another go.
@alimanfoo did you want to take another look at this? Otherwise, I'll set it to merge today.
Some very minor suggestions, otherwise looks good. There really is some magic here! At some point I'd love to understand how this has eliminated the need for separate map and reduce steps.
""" | ||
|
||
try: | ||
metric_ufunc = getattr(metrics, f"{metric}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metric_ufunc = getattr(metrics, f"{metric}") | |
metric_ufunc = getattr(metrics, metric) |
Context:

```python
    x: ArrayLike,
    metric: str = "euclidean",
) -> np.ndarray:
    """Calculates the pairwise distance between all pairs of vectors in the
```

Suggested change:

```diff
-    """Calculates the pairwise distance between all pairs of vectors in the
+    """Calculates the pairwise distance between all pairs of row vectors in the
```

Just to help make obvious this is computing distance between rows. (And thus when computing distance between samples, the user will have to transpose the input.)
Suggested change:

```diff
-    [array-like, shape: (M, N)]
-        A two dimensional distance matrix, which will be symmetric. The dimension
-        will be (M, N). The (i, j) position in the resulting array
-        (matrix) denotes the distance between ith and jth vectors.
+    [array-like, shape: (M, M)]
+        A two dimensional distance matrix, which will be symmetric. The dimension
+        will be (M, M). The (i, j) position in the resulting array
+        (matrix) denotes the distance between ith and jth row vectors in the input array.
```
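(A minimal usage sketch of the corrected contract; the import path is my assumption about where the PR placed the function, not confirmed by the thread:)

```python
import numpy as np
from sgkit.distance.api import pairwise_distance  # assumed location

x = np.random.rand(10, 100)                 # 10 row vectors of length 100
d = pairwise_distance(x, metric="euclidean")

assert d.shape == (10, 10)                  # (M, M), not (M, N)
assert np.allclose(d, d.T)                  # symmetric
# To compare columns (e.g. samples), transpose first: pairwise_distance(x.T)
```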
Oops, didn't notice this was merged already. Hope the review comments are helpful anyway; perhaps a small follow-up PR to make the suggested clarifications would be good.
FYI @alimanfoo this implementation, like matrix multiplication in dask, assumes whole rows of blocks fit in memory. It isn't particularly scalable this way, but it is in greater alignment with how dask does similar things (e.g. …
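(As a back-of-the-envelope check of that memory constraint — my own arithmetic, not from the comment:)

```python
# A full row of blocks for float64 input occupies rows_per_chunk * n_cols * 8 bytes,
# e.g. 1,000-row chunks over 1,000,000 columns:
rows_per_chunk, n_cols = 1_000, 1_000_000
bytes_per_row_of_blocks = rows_per_chunk * n_cols * 8
print(f"{bytes_per_row_of_blocks / 1e9:.1f} GB")  # 8.0 GB must fit in worker memory
```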
@alimanfoo I'll fix these in a follow-up PR.
This is an attempt to implement: #241

It is based on @eric-czech's suggestion here: https://github.com/pystatgen/sgkit/issues/241#issuecomment-698878378. It implements Euclidean distance and Pearson correlation using the map/reduce idea. I have tried to document the algorithm in the pairwise function. It only calculates the upper triangular matrix, to avoid repeated calculations. For the public API, I have taken inspiration from `scipy.spatial.distance.pdist`. I haven't added it to the `api.rst` yet.

I haven't run this on GPU, but for the API, I am thinking of using the `target` (=`cuda`) argument of `guvectorize`. From the user's point of view, I think we can let the user set it either via an environment variable, say `SGKIT_USE_GPU`, or via some kind of sgkit config similar to dask's config object, and then we can compile the `guvectorize` functions to run on GPU. I am open to opinions on both.

cc @eric-czech @jeromekelleher
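(A rough sketch of the environment-variable idea described above; the variable name `SGKIT_USE_GPU` comes from the description, while everything else here is my own illustrative scaffolding:)

```python
import os
from numba import guvectorize

# Choose the numba compilation target once, at import time.
_TARGET = 'cuda' if os.environ.get('SGKIT_USE_GPU') else 'cpu'

@guvectorize(['void(float64[:], float64[:], float64[:])'],
             '(n),(n)->()', target=_TARGET)
def euclidean(x, y, out):
    # Plain loop so the kernel compiles for both CPU and CUDA targets
    s = 0.0
    for i in range(x.shape[0]):
        d = x[i] - y[i]
        s += d * d
    out[0] = s ** 0.5
```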