-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change for next (Rcpp)Annoy version #69
Comments
Using an optional PR coming in a minute assuming my now-committed change survives one round of Travis CI... |
Ah excellent. I was just wondering what I should do tonight. |
PR coming. I just made the change for you, |
Thanks @eddelbuettel I will look at this ASAP |
Appreciate it. The new version is not yet at CRAN so what I mostly need is a nod from you and @LTLA that you could follow-up with a release "shortly" after I send RcppAnnoy to CRAN. All that is not super-urgent, we had the new Annoy release in here for three weeks now but we should probably all move on this and have the option for multithreaded indexing. |
Thanks for the PR @eddelbuettel, I will keep this issue open to track making sure uwot works with the up-coming RcppAnnoy. With full acknowledgement of the irony of what I am about to say: the hard part of preparing a CRAN release is checking the reverse dependencies haven't broken (I have a feeling you may know a thing or two about that). That usually has to wait for a weekend but I can chip away at some other things before then. A couple of related questions:
|
Yes, sounds good. Regarding the questions:
|
@eddelbuettel would you say that there are no major surprises so far with the release of RcppAnnoy 0.0.17 and that uwot can start specifying that as the minimum version along with the changes around the |
Yes -- version 0.0.17 was released a couple of days ago and seems to be doing just fine with itself and apparently also with reverse depends as we all asserted with prior tests. So yes it was a pretty smooth release and I think you can just depend on that as the minimum version. "At some point" will I make another release for the not-so-urgent vignette updates we missed (my bad!) for 0.0.17, but nothing else should change now in the header setup. I will double check with you and @LTLA but I think I can also remove the 'interim' But I think you both can rely on just the versioned depends as well as on the new version checkers I add, i.e. #if RCPPANNOY_VERSION >= RcppAnnoyVersion(0,0,17,0)
// do stuff for newer code
#endif Makes sense from your side too? |
Sounds good to me @eddelbuettel, testing in a branch right now. Thanks for making this change relatively painless. |
Everything looks good on my end, package-wise. FWIW the first casualty of the underlying Annoy library update is that the search results seem to be different from the results acquired with the previous Annoy version, even after disabling multithreading. Testing suggests that this is because the default seeds have been modified for correct behavior in each thread, which affects the nature of the approximation used by Annoy. If you're curious, this is my MWE tested on 0.0.16 vs 0.0.17.2: set.seed(0)
library(RcppAnnoy)
vector_size <- 20
a <- new(AnnoyEuclidean, vector_size)
# a$setSeed(123456789) # <-- uncomment this to get the same result
for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
a$build(50)
output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
digest::digest(output) It seems that we'll have to do the equivalent of Couple of fixes we could go for, all of which would involve patching Annoy upstream - thoughts welcome. |
Oh, that's actually a somewhat important issue (I had once run into in another context) but missed here.
They are receptive to that. I am listed as a contributor because of a few such interface cleanups. |
There are a couple of levels of fixes that I can forsee: Short termThe first and immediate-term patch is to modify #ifdef ANNOYLIB_MULTITHREADED_BUILD
_random.set_seed(seed); // line 1150
#endif This means that, for non-multi-threaded builds, it simply reverts to the default seed for Long term (I)This would involve some changes to the contract around virtual void set_seed(typename Random::int_type q) = 0; // line 840
// ...
typename Random::int_type _seed; // line 867
// ...
void set_seed(typename Random::int_type seed) { // line 1141
_is_seeded = true;
_seed = seed;
} Maybe we don't need those Long term (II)A further fix would be for typename Random::int_type _seed = Random::default_seed; // line 867 Then we would get back our results without any modification of uwot or my C++ code, assuming that |
Good stuff. Can you open an issue at the upstream repo? |
Whoops, read that as "open a PR". Well, there it is. Couldn't figure out a clean way to strip the |
Wow. Nice. I tend to prefer to "talk first, code second" and prefer issues (as mentioned) over PRs but hey you delivered the goods all at one. |
BTW, what do you guys get for the
Running on R 4.0.3, Ubuntu 18.04.5, gcc 7.5.0. I think I get the same on my Mac + clang, though I'd need to check. I've become mildly concerned about the difficulty of reproducing results across different platforms (see the PR for some context). |
> setwd('/tmp/')
> set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "fbd2e41b2d3152a58efb69c02bfe7c83"
> Ubuntu 20.04, gcc 10, master branch of RcppAnnoy i.e. not your PR yet. I am not that concerned because I think we laid the groundwork years ago to rely on R's RNG which should shelter us from OS and compiler and whatnot. So for that it looks like your #552 does just that which is nice, and as intended. |
> set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> # a$setSeed(123456789) # <-- uncomment this to get the same result
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "fbd2e41b2d3152a58efb69c02bfe7c83" Windows 10, gcc 8.3.0 (via Rtools), RcppAnnoy 0.0.17. |
Thanks all. I was more concerned with the precision of floating point operations, especially across compiler settings. A few years ago I had a problem with reproducing RcppAnnoy's results with BiocNeighbors; this was because the latter was accidentally being compiled with |
Looks like the PR got merged. Can we suck it into RcppAnnoy 0.0.17.2? I can run it through the offending pipeline on the weekend, check that everything is back to normal. |
I just created 0.17.3 which the changes to the two files we carry over -- and that was pretty seamless. Currently in a branch. Can you test with that or do I need to PR it into the main branch? |
That's fine, I can work with a branch. Brief test indicates that I get back my original goodness: > set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> # a$setSeed(123456789) # <-- uncomment this to get the same result
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "0a85b62f0e58107ad5964321113b9b96" Nice. I'll fire up my actual use case tomorrow while I enjoy my last pre-lockdown outdoor dining experience. |
Last release was mid-November so by mid-December I can roll up 0.1.18 which should have all the new conditioning and your RNG fixes to keep us happy for a bit. Hopefully. |
Sorry to be a fly in the ointment, but I will need to release a new version of uwot before the next version of RcppAnnoy comes out (see the PR). |
No worries. As mentioned I have some time, and this coordination requirement is precisely why we're talking here :) |
Well, I have some bad news. I wasn't able to recover results with my actual use-case by using RcppAnnoy 0.0.17.3. (For anyone who's interested, this is http://bioconductor.org/books/release/OSCA/hca-human-bone-marrow-10x-genomics.html, with an eventual call down to the Annoy libraries in BiocNeighbors.) I was able to recover the results by reverting to RcppAnnoy 0.0.16, so it's probably some algorithmic change in Annoy beyond the seed changes, something that isn't captured by my MRE above. This puts us in a bit of a pickle. I mean, the immediate discrepancy isn't such a big deal, I can just update the results. The bigger issue is how to accommodate these kinds of changes in the future. I'm guessing that Erik isn't particularly concerned about guaranteeing the same output across versions of Annoy - which is understandable, as that's a rather particular requirement of scientific applications. However, it does make it difficult to recommend Annoy if the results of the search are liable to change across releases - especially as the search occurs at a very low level in many analysis steps, so it's hard for users to diagnose what changed. For example, it wouldn't be part of the So. What to do about this. One option is to for RcppAnnoy to have an #if defined(USE_RCPPANNOY_VERSION)
#if USE_RCPPANNOY_VERSION == 0.0.16
#pragma message("This is about to be deprecated!")
#include "archive/0.0.16/mman.h"
#include "archive/0.0.16/annoylib.h"
#include "archive/0.0.16/kissrandom.h"
#endif Then it's just a matter of downstream packages (I've |
Uggh. I need to sleep this over but ex-ante I don't like it. I would not to, on the sly, introduce a meta-versioning mechanism inside an R package. All releases are marked and kept. You can always get them. For example, I personally think roxygen2 7.* is a step backwards (for compiled packages) and it always compiles. So I chose, locally, to keep 6.0.1 in a separate You could do the same and hardwire BiocNeighbors to a particular (old) version of RcppAnnoy without forcing a change onto RcppAnnoy and Annoy. Because as you say, the change of pace there is a little different from what you require in frozen versions. Let's think this over and see if we can find other ways. |
Alll this talk about hardwiring has led me to realize that we are overlooking the simplest and most effective solution: just copy-and-paste the headers in RcppAnnoy's The idea would be for BiocNeighbors to set up a GitHub Action that pulls the source files from CRAN every day (or whatever) and copies them into the appropriate location on In effect, I would be treating CRAN as a kind of CI for RcppAnnoy; once I know it passes the gauntlet, I can transplant the files into BiocNeighbors. The only missing piece is for RcppAnnoy to offer an R-level function indicating its current version of Annoy, so I can determine whether to skip the tests on the BioC-release branch when the versions get out of sync. One could do the same for RcppHNSW, if we ever got these types of changes to the original library. |
That's better. I was also thinking that this might be a use for something like |
Annoy now allows for single- or multithreaded indexing but implemented the change in a somewhat suboptimal API change without a fallback. The diff is pretty trivial (see below) but I just realized that I can't even PR it to you as it would break your CI processes as you point at the current RcppAnnoy -- where this change will be needed once the next RcppAnnoy is on CRAN.
I had that version ready but wanted to wait to get Bioconductor 3.12 out (hi, @LTLA !) . That being the case now, can you signal if/when/how you think we should coordinate the upload? Maybe I should protect my proposed change by a opt-in
#define
...Anyway, the small change is and with it it all builds well with the current GitHub release of RcppAnnoy (which I call 0.16.2). I can put that version into a
drat
repo too ...The text was updated successfully, but these errors were encountered: