Why does transform() have exaggeration=2 by default? #62
So the rationale for using a higher exaggeration here is the same as why we use the "early exaggeration" phase. If the initialization for the transformed points is not ideal, then a slightly higher exaggeration allows the new points to move more easily to their "correct" locations. But instead of using exaggeration 12, I found 2 to be completely fine, and I figured that I could cut down the number of optimization iterations by a little bit, making the transform a bit faster.

When we do median initialization, the new point positions land basically on top of existing points, making the repulsive forces quite large. I found that the interpolation-based approximation would sometimes shoot points way off, which 1) makes it slow because the interpolation grid becomes huge, and 2) means these points take a lot of iterations to return to the main mass of points. This is why I set momentum to 0, so as to dampen the shooting off in case it does happen, and also why I added gradient clipping to the transform optimization.

I admit this was based mainly on intuition and maybe an hour of tinkering, so I am by no means dead-set on these defaults.
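For illustration, overriding these defaults might look something like the following sketch (x_train and x_new are placeholders, and it assumes transform() accepts the exaggeration and n_iter keywords discussed in this thread; check the signature of your openTSNE version):

import openTSNE

tsne = openTSNE.TSNE()
embedding = tsne.fit(x_train)  # x_train: the reference data

# By default transform() uses the settings discussed above (exaggeration=2,
# momentum 0, gradient clipping); they can be overridden explicitly:
new_embedding = embedding.transform(x_new, exaggeration=1, n_iter=100)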
Yes, it works. As for your last question, I usually use cosine distance when dealing with single-cell data; it acts as a kind of "rank" normalization, and I find it usually gives better-defined clusters than Euclidean distance. Actually, for your specific example, correlation would probably be better, since full-length sequencing data tends to be quite different from UMI-based data. That would be a cool thing to test out, actually. But typically I do PCA first, then run t-SNE on that with cosine distance.

How exactly does your case work? Do you run t-SNE on the entire data set? Or maybe do gene filtering? Or do you compute PCA on e.g. the Smart-seq2 data and then project the 10x data onto that basis (I imagine you might fail to capture some information that way?).

And yes, you're right. Currently, we don't support changing the distance metric between the initial embedding and the "transformed" points in that way, because it could be very slow. I'd have to think about it, but I may be reluctant to just add a metric argument to transform(). You can actually achieve this by manually recomputing the nearest neighbours with the new distance metric and then reassigning the affinities on the embedding:

tsne = openTSNE.TSNE(metric="euclidean")
embedding = tsne.fit(x_train)
# Set the new metric
embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train, metric="correlation")
# Do transform with the new metric
new_embedding = embedding.transform(x_new)

That's not too bad, but it does require being familiar with some of the internals. Maybe putting this as an example somewhere about more advanced use cases would suffice? I feel like this would rarely come up, and if it does, I'd expect the person to really know what they're doing.
Thanks for the reply! There are two separate issues here. First, I find exaggeration=2 confusing. I think it's okay to use early exaggeration in transform(), but keeping exaggeration above 1 for the entire optimization seems unintuitive. So my suggestion would be to have something like an early exaggeration phase followed by standard (exaggeration=1) optimization.
For concreteness, let's say I have a 10x reference with n=100k cells and n=100 Smart-seq2 points to map on top of that. For the 10x data I would usually do feature selection to 1000 genes, then PCA down to 50 components, then t-SNE with Euclidean distance (I don't mind cosine distance, but I haven't experimented with it myself, so I'm just going with Euclidean by default). Now for the Smart-seq cells, I select the same 1000 genes, use correlation distance to find the k=10 nearest neighbours in the 10x data, and use their median location to position each point. That's the procedure I describe in my preprint. (I don't use a 10x reference in the preprint, but I use it in some parallel work and it works fine.) To me it makes sense: I don't necessarily want to project the Smart-seq points onto the reference PC space because I have no idea what the batch effect will look like... I have to admit that I did not really experiment with other ways of doing it, but I found the correlation distance to work reasonably well in this situation. Now for the revision I wanted to use openTSNE for this. How would you do it instead?
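For reference, a minimal sketch of this median-based positioning step in plain NumPy/SciPy (array names are placeholders; this is a reconstruction of the procedure described above, not the preprint's actual code):

import numpy as np
from scipy.spatial.distance import cdist

def position_new_points(X_ref_genes, X_new_genes, ref_embedding, k=10):
    """Place each new cell at the median embedding location of its k
    nearest reference cells under correlation distance.

    X_ref_genes:   (n_ref, n_genes) reference expression matrix
    X_new_genes:   (n_new, n_genes) new cells, same genes in the same order
    ref_embedding: (n_ref, 2) t-SNE coordinates of the reference
    """
    # Correlation distances between every new cell and every reference cell
    D = cdist(X_new_genes, X_ref_genes, metric="correlation")
    # Indices of the k nearest reference cells for each new cell
    nn = np.argpartition(D, k, axis=1)[:, :k]
    # Median of the neighbours' embedding coordinates, one row per new cell
    return np.median(ref_embedding[nn], axis=1)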
I think I actually had it like that at some point, but then changed it to keep things simpler. I'd have nothing against that. These parameters were chosen by running it on a few examples and seeing what worked and what didn't. It wasn't very systematic, so I'm open to changes. I wasn't entirely sure how to evaluate whether it was doing "well" (how do you define "well"?), so I just looked at the plots and went with what seemed reasonable. I'll do a little bit of tinkering, but I think just changing the exaggeration back to 1 and running it for the full 100 iterations should be fine. The reason being that I want to avoid too many parameters: t-SNE already has a very large number of them, and cutting any of them out makes it a bit simpler to use. If you've done (or plan to do) any systematic testing, that would be even better.
Oh yeah, I remember now. That probably makes the most sense of all the approaches I've tried. I've found Euclidean distance to be useless when dealing with batch effects, so I never use it. Would it be reasonable to use correlation distance also for the initial embedding? Does PCA interfere with this?
This should still be possible with openTSNE, using almost exactly the same snippet as before:

tsne = openTSNE.TSNE(metric="euclidean")
embedding = tsne.fit(x_train_pca)
# Set the new metric
embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train_genes, metric="correlation")
# Do transform with the new metric
new_embedding = embedding.transform(x_new_genes)

If this turns out to work well, then maybe adding a more convenient way to do this would be worth considering.
That's fine, but I personally would think that having an early exaggeration phase in transform() would still be reasonable.
I've never tried it, but I would guess that yes, one could use correlation distance for the initial embedding. I don't see how one could combine it with PCA though: correlation is supposed to be computed across genes, not across PCs, and definitely not across only 50 PCs. So with correlation distance I think one would need to work in the ~1000-dimensional gene space.
Thanks! The snippet makes sense. I want to try it some time later this week or maybe next week, and will get back to you. Then we can discuss further.
Apologies, I haven't got around to trying this out yet, but it's high on my to-do list :-)
No worries! I've been playing around with this a little bit, but it seems quite finicky.
By the way, we have recently submitted the revision of my tSNE paper (https://www.biorxiv.org/content/10.1101/453449v2). I am still using FIt-SNE for all the analysis, but I explicitly mention openTSNE as an alternative. You might want to check whether you think it's mentioned appropriately (suggestions welcome). I am also thinking of maybe adding some openTSNE examples to the accompanying materials.

As a side note, the manuscript has changed quite a bit in response to the critical comments we received. I had to streamline and "unify" the procedure, so there is now a specific choice of parameters etc. that we suggest (at least as a rule of thumb). Also, the visualization of the Macosko data set now looks a bit different (and unfortunately less striking), mostly because I realized that the "heart" shape was an artifact of my normalization procedure: I was normalizing to counts per million, which does not make much sense for UMI data. I am now normalizing to counts per median library size (which for UMI datasets is a lot less than a million). As always, any comments on the preprint are very welcome, especially while we can still improve it.
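For reference, a minimal sketch of the two normalizations being contrasted here (the helper name and the dense counts matrix are assumptions for illustration; it also assumes no empty cells):

import numpy as np

def normalize_library_size(counts, scale=None):
    """Normalize a cells x genes count matrix by library size.

    scale=1e6 gives counts per million (CPM); scale=None rescales to the
    median library size, which for UMI data is far below a million.
    """
    lib_sizes = counts.sum(axis=1, keepdims=True)  # total counts per cell
    if scale is None:
        scale = np.median(lib_sizes)
    return counts / lib_sizes * scale

cpm = normalize_library_size(counts, scale=1e6)  # counts per million
median_norm = normalize_library_size(counts)     # counts per median library size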
Oh, cool, I'll definitely give it a read! I glanced at it a little bit and the figures are really nice. Thanks for mentioning openTSNE - it should be fine for now. We'll be pushing out a short report on openTSNE soon, but I don't know how long that will take.
That would definitely be welcome. I noticed that you recommend a learning rate of n/12 for larger data sets. That seems high for data sets with millions of points, and I had some issues with high learning rates, with some points shooting way off. I need to look into this...
I had noticed that as well, but CPM is pretty standard, and normalizing to the median library size might not work when combining multiple data sets. I also noticed that you have three evaluation metrics for embeddings: that's neat. The three numbers each describe one aspect, but it might be easier to reason about a single number. KNC is easy to compute if we have the class labels; I suppose if there are no labels, you'd do some clustering to get them?
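For what it's worth, a KNC-style score (the fraction of each class mean's k nearest class means preserved between the high-dimensional data and the embedding) only takes a few lines; this is a sketch based on the description above, not the paper's reference implementation:

import numpy as np
from scipy.spatial.distance import cdist

def knc(X_high, X_emb, labels, k=10):
    """KNC-style score: average overlap of the k nearest class means
    in the high-dimensional data and in the embedding.
    Requires k to be smaller than the number of classes."""
    classes = np.unique(labels)
    mu_high = np.array([X_high[labels == c].mean(axis=0) for c in classes])
    mu_emb = np.array([X_emb[labels == c].mean(axis=0) for c in classes])

    d_high = cdist(mu_high, mu_high)
    d_emb = cdist(mu_emb, mu_emb)
    np.fill_diagonal(d_high, np.inf)  # exclude the class itself
    np.fill_diagonal(d_emb, np.inf)

    nn_high = np.argsort(d_high, axis=1)[:, :k]
    nn_emb = np.argsort(d_emb, axis=1)[:, :k]
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_emb)]
    return float(np.mean(overlap))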
I never had any problems with n/12, including the Cao dataset with n = 2 million. The recommendation is taken from the Belkina et al. preprint, and they have n up to 20 million. It always seems to work fine.
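For illustration, the n/12 heuristic just means scaling the learning rate with the data set size (x_train is a placeholder, and this assumes your openTSNE version accepts a numeric learning_rate):

n = x_train.shape[0]
tsne = openTSNE.TSNE(learning_rate=n / 12)  # e.g. n = 2,000,000 gives a learning rate of ~167,000
embedding = tsne.fit(x_train)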
Not really. It's standard for full-length sequencing, but most UMI papers that I see normalize either to 10,000 or to the median. I think median has become pretty standard.
Yeah... It's clearly not perfect. I plan to work on this in the future. These particular metrics were a rather quick fix to please the reviewers :-)
Yeah, but what would you use to compare a full-length and a UMI data set?
Following this line of thought, we're currently using Moran's I correlation to measure how "good" an embedding is. It's useful because we can also use it to evaluate the "transformed" data points. It has a lot of problems on its own, but it does tell us how well-clustered points of the same class are and it does assign a single numeric value to each embedding.
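To make that concrete, here is a generic sketch of Moran's I computed over the kNN graph of an embedding (a sketch of the statistic itself, not of the exact protocol used here; `values` could be, say, a one-hot class indicator or a gene's expression):

import numpy as np
from sklearn.neighbors import kneighbors_graph

def morans_i(embedding, values, k=15):
    """Moran's I of a per-point variable over the kNN graph of an embedding."""
    W = kneighbors_graph(embedding, n_neighbors=k, mode="connectivity")
    z = values - values.mean()
    num = (z * (W @ z)).sum()   # sum_ij w_ij * z_i * z_j
    denom = (z ** 2).sum()      # sum_i z_i^2
    return len(values) / W.sum() * num / denom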
In what sense "to compare"? Anyway, I'd probably normalize them separately, each to the median within the corresponding data set.
Interesting! I've never heard about it. Let me know if/when you have some write-ups about it.
I'm working on an example where I've got a data set of full-length reads and a UMI count data set. Now, if I want to make a t-SNE embedding of both these data sets together, i.e. just concatenate them, I need to get them on a similar scale, so I use CPM. Or would you prefer median normalization here?
The idea came from Monocle 3, where they use it for detecting DE genes in developmental trajectories. It's actually really neat! But instead of doing a statistical test, we just compute the autocorrelation coefficient.
I don't really have a lot of experience with that. But I'd expect to see a very strong batch effect, independent of what kind of normalization one does.
I tried it out and it worked. Thanks! I managed to confirm that this openTSNE approach and my original procedure yield very similar results for my data :-) It wasn't super practical though. I used the same datasets and the same preprocessing as in the preprint: a reference data set with 24k cells and 3000 selected genes, and a test dataset with 46 cells. When I ran transform(), the nearest-neighbour search took surprisingly long.
Yes, unfortunately, I still rely on pynndescent to do this. Even so, that's surprisingly slow. Just yesterday I was generating figures where the reference embedding had 44k points and the new data 28k points, and the entire process, including optimization, took about 5 minutes. Granted, I used 1000 genes instead of 3000. So it's very surprising that yours took so long. The algorithm is actually pretty fast on a single core, but the implementation doesn't know how to take advantage of all available cores. Eventually, I plan to get rid of it, but I have yet to find a suitable replacement. I may just end up rolling my own implementation...
I've tried it with 1000 genes now as well. But still, direct calculation of all correlations takes 0.1 s. I guess when the test dataset is very small, it's faster than approximate nearest neighbours.
Doesn't the current version of pynndescent have multi-core support? I thought it was added recently.
I think they're working on it. The last time I checked (a couple of weeks ago) it didn't seem to be fully integrated yet. Same for sparse support. The main issue is that we can't just depend on it, i.e. just throw it into the list of dependencies, which is why we keep a frozen copy.
Hmm, my impression was different. The multi-threaded implementation seems to have been merged in March (lmcinnes/pynndescent#12), and sparse matrix support in April (lmcinnes/pynndescent#16).
Yes, I understand that. However, you already have a frozen copy of pynndescent here :) Even though it's less than ideal, it's clearly better to have a frozen copy of a more recent version than of an outdated one. I guess the question is how easy it is to update the copy of pynndescent. Does it require a lot of work?
Oh yeah, you're right. You mentioned this in #28. I think I did try to port that code over here, but I seem to have dropped it for some reason; maybe something wasn't working properly at the time. There seems to be quite a bit of activity there, and from my experience, if something's on the master branch and under heavy development, there are bound to be bugs here and there. According to PyPI, the latest release was 10 days ago, and it looks like they're working to get it merged into scanpy (scverse/scanpy#659), so it must be fairly stable at this point. You're entirely right that it's probably not difficult, but it needs to be tested. I have written some tests that will catch obvious errors, but "correctness" is harder to put into unit tests. You're welcome to take a crack at it; otherwise, I'll try to get to it in the coming weeks.
I think this can be closed now, so I'm closing it.
That's great. Good luck with the publication! I will insert the citation when we next revise our paper (currently waiting to hear back from the journal after a major revision). I have read it now, so here are some very brief comments:
Thanks for the awesome feedback; we'll be sure to consider it.
Actually, this should be O(max{N, M}), though in the cases I worked on the reference is larger. The reason is that we interpolate from the reference data (which contributes O(N)) onto the secondary data (which contributes O(M)). The matrix multiplication is still done through the FFT, and the number of interpolation points doesn't depend on the number of samples. Hence O(max{N, M}).
The reason I didn't mention it is that lowering the learning rate kind of removes the need for that. I still keep it on for safety, but it doesn't usually contribute much. I suppose it's worth mentioning.
Yes, so we use cosine distance in both cases, and that's exactly what it means. Cosine distance is very similar to correlation distance and tends to work better in higher-dimensional spaces. In the embedding space, we still use Euclidean distance to estimate the gradients, but that's in 2D, and Euclidean makes a lot of sense there.
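A quick numeric check of that relationship (correlation distance is just cosine distance computed on mean-centred vectors):

import numpy as np
from scipy.spatial.distance import cosine, correlation

rng = np.random.default_rng(0)
x, y = rng.random(1000), rng.random(1000)
# Correlation distance equals cosine distance after centring each vector
assert np.isclose(correlation(x, y), cosine(x - x.mean(), y - y.mean()))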
I'm not sure I follow; to compute any similarities, we rely on the (approximate) nearest-neighbour search.
Obviously, it's hard to argue which is better from a visual perspective, and I'll likely look into it further. In any case, it seems more intuitive that if we make a reference t-SNE embedding and want to add points to it, we'd expect the new points to be optimized in the same way. And we did have to mention a few things fairly briefly because we had a hard 15-page limit. Again, thanks for the feedback. Also, just a couple of things I noticed while going over your paper that you may want to fix in your next revision :)
What I meant is that to implement your pipeline in openTSNE one needs to do something like this (as you showed me earlier in this thread):
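# (the snippet from earlier in the thread, repeated here)
tsne = openTSNE.TSNE(metric="euclidean")
embedding = tsne.fit(x_train_pca)
# Set the new metric
embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train_genes, metric="correlation")
# Do transform with the new metric
new_embedding = embedding.transform(x_new_genes)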
which is pretty advanced usage, because one needs to "manually" change the affinities object after constructing the embedding and before running transform(). So I was wondering if it could be worthwhile to add extra parameters to transform() to support this more directly.
I am not saying it isn't important; I'm just saying that ideally I would like to see a more explicit demonstration of this... But maybe the only way to do it convincingly is to generate some kind of synthetic toy dataset with known ground truth.
Thanks a lot! All good catches.
Oh, I see! Yes, it's a bit annoying to do, and I set up a notebook demonstrating how to do this (here)... I've given this quite a bit of thought, and I don't think there's an elegant way to incorporate this into transform() itself. I think having a good example notebook that is easy to find should be sufficient.
That's a great idea, but I'll have to put a lot of thought into how to design a compelling experiment. I like the figure in the paper quite a bit, but I realise that may be just because I've been looking at it for so long. A synthetic example might be more compelling.
The parameters of the transform() function include exaggeration=2 by default. Why? This looks unintuitive to me: exaggeration is a slightly "weird" trick that can arguably be very useful for huge data sets, but I would expect the out-of-sample embedding to work just fine without it. Am I missing something? I am also curious why momentum is set to 0 (unlike in normal tSNE optimization), but here I don't have any intuition for what it should be.

Another question: will this function work with n_iter=0 if one just wants to get an embedding using the medians of k nearest neighbours? That would be handy. Or is there another way to get this? Perhaps from prepare_partial?

And lastly, when transform() is applied to points from a very different data set (imagine positioning Smart-seq2 cells onto a 10x Chromium reference), I prefer to use correlation distances because I suspect Euclidean distances might be completely off (even when the original tSNE was done using Euclidean distances). I think openTSNE currently does not support this, right? Did you have any problems with that? One could perhaps allow transform() to take a metric argument (is correlation among the supported metrics, btw?). The downside is that if this metric differs from the metric used to prepare the embedding, then the nearest neighbours object will have to be recomputed, so it will suddenly become much slower. Let me know if I should post it as a separate issue.