Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change for next (Rcpp)Annoy version #69

Closed
eddelbuettel opened this issue Nov 10, 2020 · 30 comments
Closed

Change for next (Rcpp)Annoy version #69

eddelbuettel opened this issue Nov 10, 2020 · 30 comments

Comments

@eddelbuettel
Copy link
Contributor

Annoy now allows for single- or multithreaded indexing but implemented the change in a somewhat suboptimal API change without a fallback. The diff is pretty trivial (see below) but I just realized that I can't even PR it to you as it would break your CI processes as you point at the current RcppAnnoy -- where this change will be needed once the next RcppAnnoy is on CRAN.

I had that version ready but wanted to wait to get Bioconductor 3.12 out (hi, @LTLA !) . That being the case now, can you signal if/when/how you think we should coordinate the upload? Maybe I should protect my proposed change by a opt-in #define...

Anyway, the small change is and with it it all builds well with the current GitHub release of RcppAnnoy (which I call 0.16.2). I can put that version into a drat repo too ...

modified   src/nn_parallel.h
@@ -31,6 +31,12 @@
 
 #include "uwot/matrix.h"
 
+#ifdef ANNOYLIB_MULTITHREADED_BUILD
+  typedef AnnoyIndexMultiThreadedBuildPolicy AnnoyIndexThreadedBuildPolicy;
+#else
+  typedef AnnoyIndexSingleThreadedBuildPolicy AnnoyIndexThreadedBuildPolicy;
+#endif
+
 struct UwotAnnoyEuclidean {
   using Distance = Euclidean;
   using S = int32_t;
@@ -66,7 +72,7 @@ template <typename UwotAnnoyDistance> struct NNWorker {
   std::vector<typename UwotAnnoyDistance::T> dists;
 
   AnnoyIndex<typename UwotAnnoyDistance::S, typename UwotAnnoyDistance::T,
-             typename UwotAnnoyDistance::Distance, Kiss64Random>
+             typename UwotAnnoyDistance::Distance, Kiss64Random, AnnoyIndexThreadedBuildPolicy>
       index;
 
   NNWorker(const std::string &index_name, const std::vector<double> &mat,
@eddelbuettel
Copy link
Contributor Author

Using an optional -D.... in src/Makevars is the way to go.

PR coming in a minute assuming my now-committed change survives one round of Travis CI...

@LTLA
Copy link
Contributor

LTLA commented Nov 10, 2020

Ah excellent. I was just wondering what I should do tonight.

@eddelbuettel
Copy link
Contributor Author

PR coming. I just made the change for you,

@jlmelville
Copy link
Owner

Thanks @eddelbuettel I will look at this ASAP

@eddelbuettel
Copy link
Contributor Author

Appreciate it. The new version is not yet at CRAN so what I mostly need is a nod from you and @LTLA that you could follow-up with a release "shortly" after I send RcppAnnoy to CRAN. All that is not super-urgent, we had the new Annoy release in here for three weeks now but we should probably all move on this and have the option for multithreaded indexing.

@jlmelville
Copy link
Owner

Thanks for the PR @eddelbuettel, I will keep this issue open to track making sure uwot works with the up-coming RcppAnnoy.

With full acknowledgement of the irony of what I am about to say: the hard part of preparing a CRAN release is checking the reverse dependencies haven't broken (I have a feeling you may know a thing or two about that). That usually has to wait for a weekend but I can chip away at some other things before then. A couple of related questions:

  1. Is the current state of the RcppAnnoy github repo a good approximation of what will get submitted to CRAN, so that I can start testing with that?
  2. Do I need to change the compile settings in uwot to be C++14?

@eddelbuettel
Copy link
Contributor Author

Yes, sounds good. Regarding the questions:

  1. Absolutely. The master branch in the candidate; I would only change the version number and release date in DESCRIPTION, update NEWS, ....
  2. Not in general, I think. But if you wanted to adopt the Annoy multithreading then likely. @LTLA and I looked into that briefly and that does of course have a likely side effect (e.g. other packages, other projects like BioConductor) may prefer C++11.

@jlmelville
Copy link
Owner

@eddelbuettel would you say that there are no major surprises so far with the release of RcppAnnoy 0.0.17 and that uwot can start specifying that as the minimum version along with the changes around the RcppAnnoy.h header at some point?

@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Nov 25, 2020

Yes -- version 0.0.17 was released a couple of days ago and seems to be doing just fine with itself and apparently also with reverse depends as we all asserted with prior tests. So yes it was a pretty smooth release and I think you can just depend on that as the minimum version.

"At some point" will I make another release for the not-so-urgent vignette updates we missed (my bad!) for 0.0.17, but nothing else should change now in the header setup. I will double check with you and @LTLA but I think I can also remove the 'interim' #define we added to identify 0.0.16.2.

But I think you both can rely on just the versioned depends as well as on the new version checkers I add, i.e.

#if RCPPANNOY_VERSION >= RcppAnnoyVersion(0,0,17,0)
     // do stuff for newer code
#endif  

Makes sense from your side too?

@jlmelville
Copy link
Owner

Sounds good to me @eddelbuettel, testing in a branch right now. Thanks for making this change relatively painless.

@LTLA
Copy link
Contributor

LTLA commented Nov 26, 2020

Everything looks good on my end, package-wise.

FWIW the first casualty of the underlying Annoy library update is that the search results seem to be different from the results acquired with the previous Annoy version, even after disabling multithreading. Testing suggests that this is because the default seeds have been modified for correct behavior in each thread, which affects the nature of the approximation used by Annoy.

If you're curious, this is my MWE tested on 0.0.16 vs 0.0.17.2:

set.seed(0)
library(RcppAnnoy)
vector_size <- 20
a <- new(AnnoyEuclidean, vector_size)
# a$setSeed(123456789) # <-- uncomment this to get the same result
for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
a$build(50)
output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
digest::digest(output)

It seems that we'll have to do the equivalent of a$setSeed(1234567890987654321ULL) to recover our previous results. Which unfortunately is impossible at multiple levels, most fundamentally because Annoy itself only allows passing on ints into that setter. I had to go in and manually comment out line 1150 in annoylib.h to get my old results back.

Couple of fixes we could go for, all of which would involve patching Annoy upstream - thoughts welcome.

@eddelbuettel
Copy link
Contributor Author

Oh, that's actually a somewhat important issue (I had once run into in another context) but missed here.

Couple of fixes we could go for, all of which would involve patching Annoy upstream - thoughts welcome.

They are receptive to that. I am listed as a contributor because of a few such interface cleanups.

@LTLA
Copy link
Contributor

LTLA commented Nov 26, 2020

There are a couple of levels of fixes that I can forsee:

Short term

The first and immediate-term patch is to modify annoylib.h in RcppAnnoy to do something like:

#ifdef ANNOYLIB_MULTITHREADED_BUILD
    _random.set_seed(seed); // line 1150
#endif 

This means that, for non-multi-threaded builds, it simply reverts to the default seed for Random, which is the old Annoy behavior. For multi-threaded builds, you get a different seed, but that would happen anyway. (FWIW thread-agnostic behavior would be achieved by defining a separate seed for each tree in the serial section; this is how I handle most of my randomization.)

Long term (I)

This would involve some changes to the contract around Random. For starters, we would need to know the integer type expected by Random's set_seed, so we'd need something like a Random::int_type. Then we could redefine:

  virtual void set_seed(typename Random::int_type q) = 0; // line 840
  // ...
  typename Random::int_type _seed; // line 867
  // ...
  void set_seed(typename Random::int_type seed) { // line 1141
    _is_seeded = true;
    _seed = seed;
  }

Maybe we don't need those typenames, I forget. But the point is that an appropriate seed can be passed from C++ without truncation. How to handle this in R is another question, but at least @jlmelville and I can get our old results back by manually calling set_seed with that big number in our C++ code.

Long term (II)

A further fix would be for Random to define a default seed that can be used in AnnoyIndexInterface, i.e.:

  typename Random::int_type _seed = Random::default_seed; // line 867

Then we would get back our results without any modification of uwot or my C++ code, assuming that thread_idx is zero.

@eddelbuettel
Copy link
Contributor Author

Good stuff. Can you open an issue at the upstream repo?

@LTLA
Copy link
Contributor

LTLA commented Nov 27, 2020

Whoops, read that as "open a PR".

Well, there it is. Couldn't figure out a clean way to strip the constness of default_seed.

@eddelbuettel
Copy link
Contributor Author

Wow. Nice. I tend to prefer to "talk first, code second" and prefer issues (as mentioned) over PRs but hey you delivered the goods all at one.

@LTLA
Copy link
Contributor

LTLA commented Dec 2, 2020

BTW, what do you guys get for the digest output above (leaving the set_seed() call commented)? I get:

Running on R 4.0.3, Ubuntu 18.04.5, gcc 7.5.0. I think I get the same on my Mac + clang, though I'd need to check. I've become mildly concerned about the difficulty of reproducing results across different platforms (see the PR for some context).

@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Dec 2, 2020

> setwd('/tmp/')
> set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "fbd2e41b2d3152a58efb69c02bfe7c83"
> 

Ubuntu 20.04, gcc 10, master branch of RcppAnnoy i.e. not your PR yet.

I am not that concerned because I think we laid the groundwork years ago to rely on R's RNG which should shelter us from OS and compiler and whatnot. So for that it looks like your #552 does just that which is nice, and as intended.

@jlmelville
Copy link
Owner

> set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> # a$setSeed(123456789) # <-- uncomment this to get the same result
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "fbd2e41b2d3152a58efb69c02bfe7c83"

Windows 10, gcc 8.3.0 (via Rtools), RcppAnnoy 0.0.17.

@LTLA
Copy link
Contributor

LTLA commented Dec 3, 2020

Thanks all. I was more concerned with the precision of floating point operations, especially across compiler settings. A few years ago I had a problem with reproducing RcppAnnoy's results with BiocNeighbors; this was because the latter was accidentally being compiled with -O3, which introduced a bunch of speed-ups that changed precision and thus results. Variation in AVX support is also a potential problem but I explicitly turn it off in BiocNeighbors so that's that.

@LTLA
Copy link
Contributor

LTLA commented Dec 4, 2020

Looks like the PR got merged. Can we suck it into RcppAnnoy 0.0.17.2? I can run it through the offending pipeline on the weekend, check that everything is back to normal.

@eddelbuettel
Copy link
Contributor Author

I just created 0.17.3 which the changes to the two files we carry over -- and that was pretty seamless.

Currently in a branch. Can you test with that or do I need to PR it into the main branch?

@LTLA
Copy link
Contributor

LTLA commented Dec 5, 2020

That's fine, I can work with a branch. Brief test indicates that I get back my original goodness:

> set.seed(0)
> library(RcppAnnoy)
> vector_size <- 20
> a <- new(AnnoyEuclidean, vector_size)
> # a$setSeed(123456789) # <-- uncomment this to get the same result
> for (i in 1:1000) a$addItem(i - 1, runif(vector_size))
> a$build(50)
> output <- lapply(1:100, function(i) a$getNNsByItem(i-1, 20))
> digest::digest(output)
[1] "0a85b62f0e58107ad5964321113b9b96"

Nice. I'll fire up my actual use case tomorrow while I enjoy my last pre-lockdown outdoor dining experience.

@eddelbuettel
Copy link
Contributor Author

Last release was mid-November so by mid-December I can roll up 0.1.18 which should have all the new conditioning and your RNG fixes to keep us happy for a bit. Hopefully.

@jlmelville
Copy link
Owner

Sorry to be a fly in the ointment, but I will need to release a new version of uwot before the next version of RcppAnnoy comes out (see the PR).

@eddelbuettel
Copy link
Contributor Author

No worries. As mentioned I have some time, and this coordination requirement is precisely why we're talking here :)

@LTLA
Copy link
Contributor

LTLA commented Dec 6, 2020

Well, I have some bad news. I wasn't able to recover results with my actual use-case by using RcppAnnoy 0.0.17.3. (For anyone who's interested, this is http://bioconductor.org/books/release/OSCA/hca-human-bone-marrow-10x-genomics.html, with an eventual call down to the Annoy libraries in BiocNeighbors.) I was able to recover the results by reverting to RcppAnnoy 0.0.16, so it's probably some algorithmic change in Annoy beyond the seed changes, something that isn't captured by my MRE above.

This puts us in a bit of a pickle. I mean, the immediate discrepancy isn't such a big deal, I can just update the results. The bigger issue is how to accommodate these kinds of changes in the future. I'm guessing that Erik isn't particularly concerned about guaranteeing the same output across versions of Annoy - which is understandable, as that's a rather particular requirement of scientific applications. However, it does make it difficult to recommend Annoy if the results of the search are liable to change across releases - especially as the search occurs at a very low level in many analysis steps, so it's hard for users to diagnose what changed. For example, it wouldn't be part of the NEWS files of the packages that depend on BiocNeighbors.

So. What to do about this. One option is to for RcppAnnoy to have an inst/include/archive directory containing old releases of Annoy. And then our new RcppAnnoy.h file could have something like:

#if defined(USE_RCPPANNOY_VERSION)
#if USE_RCPPANNOY_VERSION == 0.0.16
#pragma message("This is about to be deprecated!")
#include "archive/0.0.16/mman.h"
#include "archive/0.0.16/annoylib.h"
#include "archive/0.0.16/kissrandom.h"
#endif

Then it's just a matter of downstream packages #defineing USE_RCPPANNOY_VERSION to get their old results... and whenever they developers are willing to make the switch, they can just bump USE_RCPPANNOY_VERSION to the current version. And of course, if developers want to just use the latest version that's provided by RcppAnnoy, they don't have to do anything extra. We wouldn't have to support this indefinitely; this is just a C++-level deprecation of the old library to the new, to give people a grace period. In my specific case, it would be ~6 months when BioC rolls over to a new release and I can justify changes to the BiocNeighbors output.

(I've #include "mman.h" explicitly so that the compiler looks up the right version; otherwise, left to its own devices, the compiler could accidentally suck in the latest version of mman.h while trying to use the archived version. The explicit include and the subsequent definition of the header guards should prevent any subsequent includes.)

@eddelbuettel
Copy link
Contributor Author

Uggh. I need to sleep this over but ex-ante I don't like it. I would not to, on the sly, introduce a meta-versioning mechanism inside an R package.

All releases are marked and kept. You can always get them. For example, I personally think roxygen2 7.* is a step backwards (for compiled packages) and it always compiles. So I chose, locally, to keep 6.0.1 in a separate .libPath() from where I can call it for lighter/simpler use of roxygen2. Which is then quicker.

You could do the same and hardwire BiocNeighbors to a particular (old) version of RcppAnnoy without forcing a change onto RcppAnnoy and Annoy. Because as you say, the change of pace there is a little different from what you require in frozen versions.

Let's think this over and see if we can find other ways.

@LTLA
Copy link
Contributor

LTLA commented Dec 6, 2020

Alll this talk about hardwiring has led me to realize that we are overlooking the simplest and most effective solution: just copy-and-paste the headers in RcppAnnoy's inst/include into BiocNeighbors's inst/include!

The idea would be for BiocNeighbors to set up a GitHub Action that pulls the source files from CRAN every day (or whatever) and copies them into the appropriate location on master. This ensures that the BioC-devel branch is always up to date with the latest RcppAnnoy. However, once it becomes BioC-release, the daily copies have no effect and thus the package is frozen/stable.

In effect, I would be treating CRAN as a kind of CI for RcppAnnoy; once I know it passes the gauntlet, I can transplant the files into BiocNeighbors. The only missing piece is for RcppAnnoy to offer an R-level function indicating its current version of Annoy, so I can determine whether to skip the tests on the BioC-release branch when the versions get out of sync.

One could do the same for RcppHNSW, if we ever got these types of changes to the original library.

@eddelbuettel
Copy link
Contributor Author

That's better. I was also thinking that this might be a use for something like renv or other "freezing" mechanisms. Vendoring works equally well. I am usually against both and prefer working from @Head but I obviously fully understand your need to reproduce. The issue really may be that the algorithm is too approximate and hence _annoy_ing. I stop the bad puns now....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants