Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dataset from CAGRA index #1435

Closed
wants to merge 3 commits into from

Conversation

enp1s0
Copy link
Member

@enp1s0 enp1s0 commented Apr 20, 2023

Description

This PR removes the dataset from the current CAGRA index format.

Background

We create some indices using CAGRA, serialize them, and compare their performance. However, the current CAGRA index format includes the dataset, which means that every time we serialize the index, we also store a copy of the dataset with it. As a result, this can take up much storage space.

Test

The current tests can test this change.

@enp1s0 enp1s0 requested a review from a team as a code owner April 20, 2023 10:25
@rapids-bot
Copy link

rapids-bot bot commented Apr 20, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the cpp label Apr 20, 2023
@cjnolet
Copy link
Member

cjnolet commented Apr 20, 2023

/ok to test

@tfeher
Copy link
Contributor

tfeher commented Apr 20, 2023

Thanks @enp1s0 for the PR!

This PR is adressing the following two issues:

  • we create a copy in device memory, which might not be necessary as discussed here
    CAGRA #1375 (comment)

  • If the dataset is already on disk then it is unnecessary to save it with the index. This is particularly problematic if we want to generate multiple graphs for the same dataset to experiment with hyper parameters.

If we want to solve these issues by removing the dataset from the index, then this PR does a clean job in doing so.

But removing the dataset from the index is not the only way to solve these problems. Alternative option:

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

The question is what shall the index represent? For the IVF methods in RAFT, the index is a self contained object that stores everything that is needed for the search (including the dataset). Removing the dataset from the CAGRA index would assign a different meaning to the index: a helper structure for search, which might or might not need the dataset. Both are valid approaches. I slightly prefer the self contained index. What is your opinion @cjnolet?

@tfeher tfeher added improvement Improvement / enhancement to an existing function breaking Breaking change labels Apr 20, 2023
@tfeher
Copy link
Contributor

tfeher commented Apr 20, 2023

To resolve style issues, you could use the pre-commit hooks.

@cjnolet cjnolet assigned cjnolet and enp1s0 and unassigned cjnolet Apr 20, 2023
@anaruse
Copy link
Contributor

anaruse commented Apr 21, 2023

Hi,

In the case of IVFflat and IVFPQ, there is a strong dependency between the dataset and the reset of "index", so I think it is a reasonable solution to keep everything as one file. For example, the order of the dataset changes based on the clustering result, and in the case of PQ, the encoding result changes depending on the PQ code bit length used.

However, in the case of graph-based ANNS, the dependency between the dataset and the graph is not that strong.

For example, in graph-based ANNS, a graph of low degree tends to have low accuracy, but a graph of high degree tends to have low speed. Therefore, we create graphs of various degrees to find the best one that satisfies the performance requirements (size, accuracy, and speed), but we do not need to change the dataset at that time. Creating graphs of different degrees does not require changing the dataset.

Conversely, there is a need to use the same graphs, but to change the dataset. Suppose the data type of the original dataset is fp32. If the data type is changed to lower precision type such as fp16, for example, the speed will increase, but the search accuracy will decrease. How much accuracy loss depends on the dataset, so there is a need to try multiple data types, but there is no need to change the graph just because the data type of the dataset is changed.

Thus, since the dependency between dataset and graph is not that strong in graph-based ANNS, and since there are advantages to keeping them separate, I would prefer to have the dataset and graph as separate files.

@cjnolet
Copy link
Member

cjnolet commented Apr 24, 2023

I slightly prefer the self contained index. What is your opinion @cjnolet?

@tfeher I also prefer the self-contained index.

@anaruse I definitely see the argument that the dataset doesn't have to be coupled completely to the graph but this does still assume the graph is representative of the dataset, correct? Right now coupling is enforced through the index itself having a reference to the dataset. One concern I have with removing the dataset from the index altogether is that it makes it too easy for a user to pass in a completely different dataset to search, which is no longer representative of the graph in the index.

I think we can design this in a way which will be more immediately obvious to users, making it clear to the user what they are intending to do while better avoiding the potential for a user accidentally misuing our APIs. I propose we keep the dataset view on the index but provide a function to take an index and produce a new index with a different dataset. Something like the following:

template<typename T, typename IdxT, typename T_old>
index<T, IdxT> update_dataset(raft::resources const &handle, 
                                                   index<T_old, IdxT> const &index, 
                                                   T const &dataset);

This function will return a new index with all of the members of the old index (copied or referenced directly, depending on needs), but with the new dataset. This makes the user's intention very clear here so when they want to swap out the dataset, they need to make sure the dataset they pass in is representative of the original graph. Otherwise, they can use the search() on whichever index they prefer.

Another thing we can do, which might be even cleaner, is provide a method right on the index itself that can produce a new index with a different dataset:

auto new_index = index.update_dataset(new_dataset);

@anaruse
Copy link
Contributor

anaruse commented Apr 25, 2023

I discussed this with @enp1s0 and we came up with an idea that could solve or at least alleviate various issues, so I think we will close this PR and @enp1s0 will submit an issue or PR based on that idea.

@enp1s0
Copy link
Member Author

enp1s0 commented Apr 25, 2023

@cjnolet @tfeher Thank you for your comments.
I completely agree with you that the current index format is effective in preventing API misuse. Furthermore, the update_dataset function seems to be a suitable solution for modifying the dataset.
However, I would like to address my original concern, which is the need to store a copy of the dataset each time we serialize the index.
As an example, the Microsoft Turing-ANNS 1B benchmark dataset has a size of 400GB, and I believe it would be inconvenient for users to make multiple copies of such a large dataset.
1B scale datasets are typically more than 100GB in size.
(Benchmark dataset examples: https://big-ann-benchmarks.com/#bench-datasets ; DatasetSize=sizeof(Datatype) * Dimension * 1B)
Therefore, I believe separating the graph and dataset is the best solution concerning the storage size.

The alternative approach @anaruse mentioned is to offer the functionality to manipulate the graph degree, dataset types, and other features dynamically after deserializing the index file.
To compare the performance of different graph configurations, we create a large graph initially and serialize it.
Upon deserialization, we measure the performance while altering the configurations using the provided functionality.
This approach requires making only one copy of the dataset, thereby eliminating the need for multiple copies.
If there is no problem with this approach, I will close this PR and submit a new one for the functionality.

@tfeher
Copy link
Contributor

tfeher commented Apr 25, 2023

Thanks everyone for the comments!

However, I would like to address my original concern, which is the need to store a copy of the dataset each time we serialize the index. [...] I believe separating the graph and dataset is the best solution concerning the storage size.

Separating the graph from the dataset can be done while saving / loading the data. Here is an example:

#include "../../bench/ann/src/common/dataset.h"
#include <iostream>
#include <raft/core/detail/mdspan_numpy_serializer.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/serialize.hpp>
#include <raft/neighbors/cagra.cuh>

// Load a dataset using custom dataset loader (e.g. loader for bigANN input type)
raft::bench::ann::BinFile<float> dataset_file("dataset_filename", "r");
size_t n_rows;
int dim;
dataset_file.get_shape(&n_rows, &dims);
auto dataset = make_host_matrix<float, uint32_t>(n_rows, dim);
dataset_file.read(dataset.data_handle());

// train an index
index = cagra::build(res, params, make_const_mdspan(dataset.view());

// Serialize graph only
std::ofstream of("cagra_graph", std::ios::out | std::ios::binary);
serialize_mdspan(res, of, index.graph());
of.close();

// Deserialize graph
std::ifstream is("cagra_graph", std::ios::in | std::ios::binary);
raft::detail::numpy_serializer::header_t header = raft::detail::numpy_serializer::read_header(is);
is.seekg(0);  // rewind
auto knn_graph = make_host_matrix<uint32_t, uint32_t>(header.shape[0], header.shape[1]);
deserialize_mdspan(res, is, knn_graph.view());
is.close();

// Create index from knn_graph and dataset
auto loaded_index = cagra::index<float, uint32_t>((res,
                                                   raft::distance::DistanceType::L2Expanded,
                                                   make_const_mdspan(dataset.view()),
                                                   knn_graph.view());

In the last step, the constructor creates memory copies of the dataset and graph. Ideally, we would only store a reference in case these arrays are already in device memory. In that case constructing the index is would be a cheap operation: just wrapping the pointers (mdspans) into the index structure. This should be addressed in a separate issue, so that we can focus on disk copies here.

To make this nicer, we should have deserialize_host_matrix() utility to encapsulate the steps under deserialize graph.

The example above avoids creating unnecessary disk copies of the dataset. We shall discuss whether we need further convenience functions to simplify these steps. Please let me know what do you think.

alternative approach @anaruse mentioned is to offer the functionality to manipulate the graph degree, dataset types, and other features dynamically after deserializing the index file.

This goes one step further then just optimizing disc copies, and it is definitely an interesting idea to explore. Could you provide more information?

@enp1s0
Copy link
Member Author

enp1s0 commented Apr 25, 2023

@tfeher Thank you for the comment.

This goes one step further then just optimizing disc copies, and it is definitely an interesting idea to explore. Could you provide more information?

I'll make a new issue as it is a bit independent of this PR.

The approach you have proposed can address my concern while not changing the index format on memory. (Sorry for skipping your proposal earlier.)

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

Additionally, we could use the update_dataset function to incorporate the dataset into the index when the index file does not contain one.
What do you think about this approach, @cjnolet ?

@cjnolet
Copy link
Member

cjnolet commented Apr 25, 2023

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

I think I like this approach, but I wonder if we should make it even a little easier for the user and store a boolean in the serialized index format that denotes whether or not the dataset was serialized along with the index. That way a user who might have created a moderately small index could just serialize and deserialize the entire index and dataset while the user with a larger dataset can use the flag in the serialize function to turn off serializing the dataset.

For deserialization, we can provide two functions- one which just accepts the single serialized file and deserializes the dataset if it was serialized. Another implementation of the deserialize funciton can accept two files- a serialized index and dataset. Of course, if a user invokes the deserialize that only accepts the serialized index and it didn't provide the dataset, they could still use update_index to add the dataset to the deserialized index.

@tfeher
Copy link
Contributor

tfeher commented Apr 26, 2023

Another implementation of the deserialize funciton can accept two files- a serialized index and dataset.

@cjnolet I am not convinced that the deserialize function shall load the dataset. I would expect the dataset to be stored in a custom format (e.g. format used by big-ann-benchmarks, or HDF5, etc), and we probably not want to deal with these details.

@cjnolet
Copy link
Member

cjnolet commented Apr 26, 2023

I am not convinced that the deserialize function shall load the dataset

@tfeher, the more I think about this, we will likely never need to serialize the dataset with the index. If the user was able to provide the dataset when training the index, they'll be able to provide the dataset after deserializing the index. I just want to make sure we're keeping the dataset coupled to the index itself and not requiring it as an argument to the search function.

@cjnolet
Copy link
Member

cjnolet commented May 15, 2023

@enp1s0 we are getting close to burndown for 23.06 and I'm just checking in on the status of these changes. I believe we were discussing adding an update_dataset() method to the cagra::index but I'm not sure if you are planning to implement that change in this PR or another one.

@enp1s0
Copy link
Member Author

enp1s0 commented May 16, 2023

@cjnolet yes, we will add update_dataset. I'll make another PR for that so this PR would be unnecessary anymore. Thank you all for participating in the discussion!

@enp1s0 enp1s0 closed this May 16, 2023
rapids-bot bot pushed a commit that referenced this pull request Jul 20, 2023
This PR aims to improve the workflow when dealing with large datasets. When experimenting with different versions of the knn-graph, we might want to construct indices with the same dataset (see #1435 for further discussion).

If the dataset is already in device memory (and rows are properly aligned / padded), then we only store a reference to the dataset, therefore multiple indices can refer to the same dataset. Similarly, when `knn_graph` is a device array, then store only a reference. 

Additionally, this PR adds `update_dataset` and `update_graph` methods to the index.
Closes #1479

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1494
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cpp improvement Improvement / enhancement to an existing function
Projects
Development

Successfully merging this pull request may close these issues.

4 participants