Skip to content

Conversation

@jinsolp
Copy link
Contributor

@jinsolp jinsolp commented Aug 4, 2025

Summary

This PR removed filters from templates in ivfpq. Only applied to compute_similarity_kernel in this PR, but can be applied elsewhere (e.g. ivfflat_interleaved_scan) which will be a follow-up PR to this one.

Perf check

Ran the default setting with bitset flter 20 times to check if this slows anything down.
All numbers are averaged across the 20 runs. Profiing done with nsys.

compute_similarity kernel runtime
Screenshot 2025-08-04 at 1 10 31 PM

Think the extra time is coming from the switch statements inside ivf_to_sample_filter_dev's operator(). But not too much of an overhead.

Binary size reduction

From CI reports: 1107.27 MB ->1040.44 MB

@jinsolp jinsolp requested review from a team as code owners August 4, 2025 20:15
@jinsolp jinsolp self-assigned this Aug 4, 2025
@jinsolp jinsolp moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board Aug 4, 2025
@jinsolp jinsolp added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels Aug 4, 2025
};

_RAFT_HOST_DEVICE inline none_sample_filter::none_sample_filter() = default;
_RAFT_HOST_DEVICE inline none_sample_filter::~none_sample_filter() = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to define them as default here and not in the struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the previous CI run, I could see that the build was failing because it wouldn't let me declare + default with the __device__ annotation. It seems like nvcc ignores __host__ __device__ on default functions when they are defaulted at their first declaration within the struct.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jinsolp, thanks for working on this!
I have a request regarding the synchronization behavior of the filter object. I believe we should strive to avoid the sync at all costs where possible to gradually improve the small-batch and dynamic-batch performance across our algorithms.

@jinsolp
Copy link
Contributor Author

jinsolp commented Aug 7, 2025

image

Thanks @achirkin @lowener for your feedbacks! In the binary size sync today, @divyegala suggested using the enum + creating filter inside kernel approach. This is a much less complicated approach, and also shows similar perf to what we already have in this PR (launching separate kernels to track lifetime of kernels). Latest commit introduces this approach.

@divyegala
Copy link
Member

@jinsolp thanks for the changes! IMO the diff looks much cleaner now.

@achirkin
Copy link
Contributor

achirkin commented Aug 8, 2025

@jinsolp thanks, would you mind to benchmark this for somewhat larger values of k? let's say 64, 128, 256.
The kind of changes introduced in the kernel seem likely to increase the register pressure, which may lead to performance degradation when we already use a lot of registers - that is with a beefier fused top-k component.

const uint32_t* index_list, \
float* query_kths, \
IvfSampleFilterT sample_filter, \
const filtering::base_filter& sample_filter, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertmaynard is this going to impact the update you just made for CUDA 13?

@jinsolp
Copy link
Contributor Author

jinsolp commented Aug 11, 2025

Screenshot 2025-08-11 at 12 50 59 PM @achirkin This is the result of using different k's. Some interesting observations
  1. This approach is faster for a larger k (or at least I think we can say that there are no concerning drops in perf)
  2. Kernel suddenly gets faster for k=256 compared to k=128. I'm not sure why this is the case.

Comment on lines 160 to 161
_RAFT_HOST_DEVICE ivf_filter_dev_args_variant() {}
_RAFT_HOST_DEVICE ~ivf_filter_dev_args_variant() {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get away with no explicitly defined constructor/destructor? I have a feeling this looks more complicated than should :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

@jinsolp jinsolp Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler complains if I remove these explicit constructor/destructor

error: the default constructor of "cuvs::neighbors::filtering::ivf_filter_dev::ivf_filter_dev_args_variant" cannot be referenced -- it is a deleted function

Looks like this is because the bitset_filter_args_t is not trivially constructible. Is there a way to workaround this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What line is this error occurring on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happens when I get rid of the explicit constructor/destructor for the union (these two lines)

_RAFT_HOST_DEVICE ivf_filter_dev_args_variant() {}
_RAFT_HOST_DEVICE ~ivf_filter_dev_args_variant() {}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you report what line the compiler says the error occurs on? Must be somewhere during construction of this struct?

Copy link
Contributor Author

@jinsolp jinsolp Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 159 and 164 of sample_filter.cuh. That would be the ivf_filter_dev constructors.

sample_filter.cuh(159): error: the default constructor of "cuvs::neighbors::filtering::ivf_filter_dev::ivf_filter_dev_args_variant" cannot be referenced -- it is a deleted function
    {
    ^

sample_filter.cuh(164): error: the default constructor of "cuvs::neighbors::filtering::ivf_filter_dev::ivf_filter_dev_args_variant" cannot be referenced -- it is a deleted function
    {
    ^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Artem's comment below is relevant to this. If you simplify the construction it should be possible.


_RAFT_HOST_DEVICE ivf_filter_dev(none_filter_args_t args = {}) : tag_(FilterType::None)
{
new (&args_.none_filter_args) none_filter_args_t(args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just initialize using the args_(args) same as you initialize the tag_ above (after you remove explicit constructor/destructor)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin @divyegala
Changed to use the initializer list. We still need constructors in the union because bitset_filter_args_t is not trivially constructible and therefore does not have a default constructor. But we don't need the destructor anymore.

Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. Can you try to remove all explicit constructors and destructors that are specified as default?

Comment on lines 160 to 161
_RAFT_HOST_DEVICE ivf_filter_dev_args_variant() {}
_RAFT_HOST_DEVICE ~ivf_filter_dev_args_variant() {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


struct none_filter_args_t {};
using bitset_filter_args_t =
std::tuple<const int64_t* const*, cuvs::core::bitset_view<uint32_t, int64_t>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be tuple<const int64_t* const*, filtering::bitset_filter<uint32_t, int64_t>> to construct the bitset_filter object just once instead of every call to operator()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we have problems in each of the following scenarios if we want to do so;
If we were to make the filter once...

  • outside the compute_similairty_kernel: we have to launch separate kernels for the filter construction, which complicates things
  • inside the compute_similairty_kernel: we don't want to dynamic dispatch inside the kernel, so it is difficult to share common code.

Copy link
Contributor

@lowener lowener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jinsolp
Copy link
Contributor Author

jinsolp commented Aug 27, 2025

/merge

@rapids-bot rapids-bot bot merged commit 4f710ec into rapidsai:branch-25.10 Aug 27, 2025
55 checks passed
@jinsolp jinsolp deleted the bs/filter-ivfpq branch August 27, 2025 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Waiting for review

Development

Successfully merging this pull request may close these issues.

5 participants