Skip to content

libscran/umappp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A C++ library for UMAP

Unit tests Documentation uwot comparison Codecov

Overview

umappp is a header-only C++ implementation of the Uniform Manifold Approximation and Projection (UMAP) algorithm (McInnes, Healy and Melville, 2018). UMAP is a non-linear dimensionality reduction technique that is most commonly used for visualization of complex datasets. This is achieved by placing each observation on a low-dimensional (usually 2D) embedding in a manner that preserves the neighborhood of each observation from the original high-dimensional space. The aim is to ensure that the local structure of the data is faithfully recapitulated in lower dimensions Further theoretical details can be found in the original UMAP documentation; the implementation here is derived from the C++ code in the uwot R package.

Quick start

Given a pointer to a column-major input array with ndim rows and nobs columns, we use initialize() to start the UMAP algorithm and run() to run it across epochs:

#include "umappp/umappp.hpp"

// Assuming `data` contains high-dimensional data in column-major format,
// i.e., each column is a observation and each row is a dimension.
// We fill it with some random noise for the purposes of this example.
int ndim = 10;
int nobs = 2000;
std::vector<double> data(ndim * nobs);
std::mt19937_64 rng(1000);
std::normal_distribution<double> ndist;
for (auto& x : data) {
    x = ndist(rng);
}

// Configuring the neighbor search algorithm; here, we'll be using an exact
// search based on VP trees with a Euclidean distance metric.
knncolle::VptreeBuilder<int, double, double> vp_builder(
    std::make_shared<knncolle::EuclideanDistance<double, double> >()
);

// Set number of dimensions in the output embedding.
size_t out_dim = 2;
std::vector<double> embedding(nobs * out_dim);

// Initialize the UMAP state:
umappp::Options opt;
auto status = umappp::initialize(
    ndim,
    nobs,
    data.data(),
    vp_builder, 
    out_dim,
    embedding.data(),
    opt
);

// Run UMAP algorithm to completion. This updates the contents
// of the 'embedding' vector supplied to initialize().
status.run(embedding.data());

We can modify parameters in the Options class that is passed to initialize():

opt.num_neighbors = 20;
opt.num_epochs = 200;
opt.min_dist = 0.2;

We can also run the algorithm up to the specified number of epochs, which is occasionally useful for inspecting the intermediate states of the embedding:

auto status2 = umappp::initialize(
    ndim,
    nobs,
    data.data(),
    vp_builder,
    out_dim,
    embedding.data(),
    opt
);

for (int iter = 10; iter < 200; iter += 10) {
    status2.run(embedding.data(), iter);
    // do something with the current contents of 'embedding',
    // e.g., create an animation over iterations.
}

Advanced users can control the neighbor search by either providing the search results directly (as a vector of vectors of index-distance pairs) or by providing an appropriate knncolle subclass to the initialize() function:

knncolle_annoy::AnnoyBuilder<int, double, double, Annoy::Euclidean> annoy_builder;
auto annoy_idx = annoy_builder.build_unique(knncolle::SimpleMatrix(ndim, nobs, data.data()));
auto status_annoy = umappp::initialize(*annoy_idx, 2, embedding.data(), opt);
status_annoy.run(embedding.data());

See the reference documentation for more details.

Building projects

CMake with FetchContent

If you're already using CMake, you can add something like this to your CMakeLists.txt:

include(FetchContent)

FetchContent_Declare(
  umappp 
  GIT_REPOSITORY https://github.com/libscran/umappp
  GIT_TAG master # replace with a pinned release
)

FetchContent_MakeAvailable(umappp)

And then:

# For executables:
target_link_libraries(myexe libscran::umappp)

# For libaries
target_link_libraries(mylib INTERFACE libscran::umappp)

By default, this will use FetchContent to fetch all external dependencies. Applications are advised to pin the versions of all dependencies for stability - see extern/CMakeLists.txt for suggested versions. If you want to install them manually, use -DUMAPPP_FETCH_EXTERN=OFF.

CMake with find_package()

find_package(libscran_umappp CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::umappp)

To install the library, use:

mkdir build && cd build
cmake .. -DUMAPPP_TESTS=OFF
cmake --build . --target install

Again, this will use FetchContent to fetch dependencies, see recommendations above.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This also requires the external dependencies listed in extern/CMakeLists.txt.

References

McInnes L, Healy J, Melville J (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, https://arxiv.org/abs/1802.03426

About

C++ port of the UMAP algorithm

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •