LocalMAP

Introduction

Our work has been published at the 39th Annual AAAI Conference on Artificial Intelligence!

LocalMAP (Pairwise Controlled Manifold Approximation with Local Adjusted Graph) is a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address the challenges of getting a suboptimal graph due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data.

Many existing Dimension Reduction (DR) methods convert the original high-dimensional data into a graph, where each edge represents the similarity or dissimilarity between a pair of data points. However, this graph is frequently suboptimal because high-dimensional distances are unreliable and only limited information is extracted from the high-dimensional data. We therefore introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address these challenges. LocalMAP can identify and separate real clusters within the data that other DR methods may overlook or merge.

Release Notes

Please see the release notes. The release notes are shared with PaCMAP.

Installation

The LocalMAP method is currently included in the PaCMAP package. To try LocalMAP, please install the PaCMAP package.

Install from conda-forge via conda or mamba

You can use conda or mamba to install PaCMAP from the conda-forge channel.

conda:

conda install pacmap -c conda-forge

mamba:

mamba install pacmap -c conda-forge

Install from PyPI via pip

You can use pip to install pacmap from PyPI. It will automatically install the dependencies for you:

pip install pacmap

If you have any problems during the installation of dependencies, such as Failed building wheel for annoy, you can try to install these dependencies with conda or mamba. Users have also reported that in some cases, you may wish to use numba >= 0.57.

conda install -c conda-forge python-annoy
pip install pacmap
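
After installing, one quick way to confirm that the package and the dependencies mentioned above can be found is sketched below. This is a minimal check, not part of the package itself: the helper name is_installed is hypothetical, and the import names annoy and numba are assumed to match the dependencies discussed above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names below are assumptions based on the dependencies named above.
    for name in ("pacmap", "annoy", "numba"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If pacmap is reported as missing, the interpreter running the check is probably not the one the package was installed into.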

Usage

Using LocalMAP in Python

The pacmap package is designed to be compatible with scikit-learn, meaning that its interface is similar to that of the classes in the sklearn.manifold module. To run LocalMAP on your own dataset, install the package following the instructions in the Installation section, and then import the module. The following code snippet shows how to use LocalMAP on the COIL-20 dataset:

from pacmap import LocalMAP
import numpy as np
import matplotlib.pyplot as plt

# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)

# initialize the LocalMAP instance
# setting n_neighbors to None enables the automatic choice described in the "Parameters" section
embedding = LocalMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0)

# fit the data (The index of transformed data corresponds to the index of the original data)
X_transformed = embedding.fit_transform(X, init="pca")

# visualize the embedding
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], cmap="Spectral", c=y, s=0.6)

Benchmarks

The following images are visualizations of two datasets: MNIST (n=70,000, d=784) and USPS (n=9,298, d=256), generated with the PaCMAP package. The two visualizations demonstrate LocalMAP's ability to preserve local and global structure, respectively, and show better separation of true clusters compared to other methods.

MNIST

Figure 1. DR Performance Comparison on MNIST

USPS

Figure 2. DR Performance Comparison on USPS

Parameters

The list of the most important parameters is given below.

  • n_components: the number of dimensions of the output. Defaults to 2.

  • n_neighbors: the number of neighbors considered in the k-Nearest Neighbor graph. Defaults to 10. This parameter can also be set to None to enable automatic selection of the number of neighbors: it is set to 10 for datasets with fewer than 10,000 samples. For large datasets whose sample size (n) is larger than 10,000, the value is 10 + 15 * (log10(n) - 4).

  • MN_ratio: the ratio of the number of mid-near pairs to the number of neighbors, n_MN = n_neighbors * MN_ratio. Defaults to 0.5.

  • FP_ratio: the ratio of the number of further pairs to the number of neighbors, n_FP = n_neighbors * FP_ratio. Defaults to 2.

  • [New for LocalMAP] low_dist_thres: the average low-dimensional distance among all pairs of nearest clusters. Defaults to 10.
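
The arithmetic behind n_neighbors, MN_ratio and FP_ratio can be sketched in a few lines of plain Python. This is an illustration of the formulas quoted in the list above, not the package's own code: the function names are hypothetical, and rounding to the nearest integer is an assumption of this sketch.

```python
import math


def auto_n_neighbors(n_samples: int) -> int:
    """Neighbor count following the rule quoted above for n_neighbors=None."""
    if n_samples <= 10000:
        return 10
    # Rounding to the nearest integer is an assumption of this sketch.
    return int(round(10 + 15 * (math.log10(n_samples) - 4)))


def pair_counts(n_neighbors: int, MN_ratio: float = 0.5, FP_ratio: float = 2.0):
    """Number of mid-near and further pairs per point: (n_MN, n_FP)."""
    n_MN = int(round(n_neighbors * MN_ratio))
    n_FP = int(round(n_neighbors * FP_ratio))
    return n_MN, n_FP


if __name__ == "__main__":
    for n in (5_000, 100_000):
        k = auto_n_neighbors(n)
        print(n, k, pair_counts(k))
```

With the default ratios and 10 neighbors, each point gets 5 mid-near pairs and 20 further pairs.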

The initialization also has an important effect on the result, but it is a parameter of the fit and fit_transform methods rather than of the LocalMAP instance.

  • init: the initialization of the lower dimensional embedding. One of "pca" or "random", or a user-provided numpy ndarray with the shape (N, 2). Defaults to "pca".

Other parameters include:

  • num_iters: the number of iterations. Defaults to 450, which is enough for most datasets to converge.
  • pair_neighbors, pair_MN and pair_FP: pre-specified neighbor pairs, mid-near pairs, and further pairs. These allow users to supply their own graphs. Default to None.
  • verbose: whether to print the progress of the algorithm. Defaults to False.
  • lr: the learning rate of the AdaGrad optimizer. Defaults to 1.
  • apply_pca: whether LocalMAP should apply PCA to the data before constructing the k-Nearest Neighbor graph. Preprocessing the data with PCA can greatly accelerate the DR process without losing much accuracy. Note that this option does not affect the initialization of the optimization process.
  • intermediate: whether LocalMAP should also output the intermediate stages of the optimization of the lower dimensional embedding. If True, the output is a numpy array of size (n, n_components, 13), where each slice is a "screenshot" of the output embedding after a particular number of steps: [0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450].
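
To pick a snapshot out of the intermediate output, the step list given for intermediate can be turned into a small lookup. The sketch below is a hypothetical helper, not part of the package; it assumes the snapshots are stored in the order listed above.

```python
import bisect

# Iteration counts at which snapshots are taken, as listed above.
SNAPSHOT_ITERS = [0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450]


def snapshot_index(iteration: int) -> int:
    """Index of the latest snapshot taken at or before the given iteration."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    # bisect_right counts the snapshots taken at or before `iteration`.
    return bisect.bisect_right(SNAPSHOT_ITERS, iteration) - 1


if __name__ == "__main__":
    for it in (0, 100, 449, 450):
        print(it, snapshot_index(it))
```

Assuming the (n, n_components, 13) layout described above, the embedding after 100 steps would be states[:, :, snapshot_index(100)], where states is the returned array; checking states.shape first is a cheap way to confirm the axis order.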

Methods

Similar to the scikit-learn API, a LocalMAP instance can generate an embedding for a dataset via the fit, fit_transform and transform methods. We currently support the numpy.ndarray format as input. To convert a pandas DataFrame to the ndarray format, please refer to the pandas documentation. For a more detailed walkthrough, please see the demo directory.

How to use user-specified nearest neighbors

We provide an option that allows users to supply their own nearest neighbors when mapping large-scale datasets. Please see the demo for a detailed walkthrough of how to use LocalMAP with user-specified nearest neighbors.
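
As a rough illustration of what "user-specified nearest neighbors" means, the sketch below builds nearest-neighbor pairs by brute force in plain Python. It is only an illustration under stated assumptions: the function knn_pairs is hypothetical, the (i, j) index-pair layout is an assumption, and the exact array format expected by pair_neighbors is the one shown in the demo, not necessarily this one.

```python
import math


def knn_pairs(points, k):
    """Return (i, j) pairs where j is one of the k nearest neighbors of i.

    Uses brute-force Euclidean distances, so it is only suitable for small
    inputs; large datasets would need an approximate nearest-neighbor index.
    """
    pairs = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        pairs.extend((i, j) for _, j in dists[:k])
    return pairs


if __name__ == "__main__":
    toy = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(knn_pairs(toy, k=1))
```

The value of supplying your own pairs is that the neighbor search can use whatever index or distance best suits the dataset.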

Reproducing our experiments

We have provided the code we used to run our experiments for better reproducibility. The code is separated into three parts, in three folders:

  • data, which includes part of the datasets we used, preprocessed into the file format each DR method uses. Some of the datasets are too large to host on GitHub; if you need a specific dataset, please send an email to yiyang.sun@duke.edu.
  • experiments, which includes all the scripts we use to produce DR results.
  • evaluation, which includes all the scripts we use to evaluate DR results.

After downloading the code, you may need to specify some of the paths in the script to make them fully functional.

Citation

LocalMAP will be released on arXiv soon!

License

Please see the license file.
