Robust Clustering on High-Dimensional Data with Stochastic Quantization


by Anton Kozyriev¹, Vladimir Norkin¹,²

  1. Igor Sikorsky Kyiv Polytechnic Institute, National Technical University of Ukraine, Kyiv, 03056, Ukraine
  2. V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, 03178, Ukraine

22 pages, 5 figures, to be published in the International Scientific Technical Journal "Problems of Control and Informatics"

Introduction

This paper addresses the limitations of conventional vector quantization algorithms, particularly K-Means and its variant K-Means++, and investigates the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning tasks. Traditional clustering algorithms often suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we employ a Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both the Stochastic Quantization algorithm and traditional quantization algorithms. Furthermore, we enhance the algorithm's convergence speed by introducing modifications with an adaptive learning rate.
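To make the idea of a stochastic, sample-at-a-time quantization update concrete, here is a minimal sketch. It is not the code from the sq package. It assumes the objective is the mean squared distance from each sample to its nearest center, and that each step draws one random sample and moves only the nearest center toward it with a decaying step size. All function names, parameters, and the step-size schedule are hypothetical and chosen for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_center(centers, x):
    """Return the index of the center closest to the sample x."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], x))


def stochastic_quantization(samples, n_centers, n_steps=2000, lr0=0.5, seed=0):
    """Illustrative sketch: one random sample per step, one center updated per step."""
    rng = random.Random(seed)
    # Assumption: centers start at distinct random samples; other schemes are possible.
    centers = [list(x) for x in rng.sample(samples, n_centers)]
    for t in range(1, n_steps + 1):
        x = rng.choice(samples)          # only this one sample is needed for the step
        k = nearest_center(centers, x)   # the center that "owns" the sample
        lr = lr0 / t ** 0.5              # decaying step size (illustrative schedule)
        # Gradient-style step on the squared distance (constant factor folded into lr).
        centers[k] = [c + lr * (xi - c) for c, xi in zip(centers[k], x)]
    return centers


def distortion(samples, centers):
    """Mean squared distance from each sample to its nearest center."""
    return sum(min(sq_dist(c, x) for c in centers) for x in samples) / len(samples)
```

Because each step touches a single sample, the full dataset never has to be processed at once inside the loop, which is the memory property the paragraph above contrasts with K-Means. Calling `distortion(samples, centers)` before and after training gives a rough way to see whether the centers have improved.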

Getting Started

To get started with this project, follow the instructions below to set up your environment, install the necessary dependencies, and run the code to reproduce the results from our paper.

Dependencies

Installation requires the Conda package manager, which handles third-party dependencies and virtual environments. A step-by-step guide to installing the CLI tool is available on the official website. The third-party dependencies are listed in the environment.yml file, with the corresponding licenses in the NOTICES file.

Installation

Clone the repository (alternatively, you can download the source code as a zip archive):

git clone https://github.com/kaydotdev/stochastic-quantization.git
cd stochastic-quantization

Then create a Conda virtual environment and activate it:

conda env create -f environment.yml
conda activate stochastic-quantization
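After activation, a quick check from Python can confirm that the expected environment is in use. The sketch below relies on the CONDA_DEFAULT_ENV variable that `conda activate` sets. The helper names are hypothetical, and the expected name is taken from the `conda activate` command above. If the environment was created under a custom prefix, the variable may hold a path instead of a name.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the value of CONDA_DEFAULT_ENV, or None if no environment is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def is_project_env_active(environ=os.environ, expected="stochastic-quantization"):
    """Check whether the active Conda environment matches the expected name."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if is_project_env_active():
        print("Conda environment is active.")
    else:
        print("Expected environment is not active; run the activate command first.")
```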

Reproducing the Results

Use the following command to install the core sq package with third-party dependencies, run the test suite, compile LaTeX files, and generate results:

make all

Generated figures and other artifacts (except compiled LaTeX files) are stored in the results directory. To perform the same steps without compiling the LaTeX files, use the following command instead:

make -C code all

To remove all generated results and compiled LaTeX files, use the following command:

make clean

License

This repository contains both software (source code) and an academic manuscript. Different licensing terms apply to these components as follows:

  1. Source Code: All source code contained in this repository, unless otherwise specified, is licensed under the MIT License. The full text of the MIT License can be found in the file LICENSE.code.md in the code directory.

  2. Academic Manuscript: The academic manuscript, including all LaTeX source files and associated content (e.g., figures), is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). The full text of the CC BY-NC-ND 4.0 License can be found in the file LICENSE.manuscript.md in the manuscript directory.

Citation

If you use this work in your research, please cite our paper:

@misc{Kozyriev_Norkin_2024,
    title={Robust Clustering on High-Dimensional Data with Stochastic Quantization}, 
    author={Anton Kozyriev and Vladimir Norkin},
    year={2024},
    eprint={2409.02066},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2409.02066},
}