
use group_by in translate_codec_idx_to_original_idx #1736

Merged · 1 commit merged into main on Dec 26, 2022

Conversation

PSeitz (Contributor) commented Dec 22, 2022

The group_by has some overhead, and for the strictly sparse case it's just worse. But it addresses a pathological case: clusters of dense blocks in a sparse field with many search results in them. I think that's a valid scenario when users start to use the JSON type to ingest and search different kinds of data in a single index.

The index has 1 million docs.

Now
running 5 tests
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit  ... bench:         101 ns/iter (+/- 1)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_10percent_hit         ... bench:      10,805 ns/iter (+/- 682)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_full_scan             ... bench:      47,879 ns/iter (+/- 3,955)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit ... bench:       9,267 ns/iter (+/- 66)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_full_scan            ... bench:  16,630,831 ns/iter (+/- 315,576)


Before
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit  ... bench:          91 ns/iter (+/- 2)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_10percent_hit         ... bench:       6,505 ns/iter (+/- 256)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_full_scan             ... bench:      33,322 ns/iter (+/- 2,482)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit ... bench:      14,599 ns/iter (+/- 225)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_full_scan            ... bench: 295,434,573 ns/iter (+/- 31,193,127)
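The grouping idea behind the change can be sketched as follows. This is a minimal, self-contained illustration with assumed names and an assumed block size of 2^16 — not tantivy's actual code: positions falling into the same block are grouped, so per-block metadata is fetched once per group rather than once per position, which is what helps the "clusters of dense blocks" case described above.

```rust
// Sketch only: group sorted positions by their (hypothetical) block id.
fn group_by_block(positions: &[u32]) -> Vec<(u32, Vec<u32>)> {
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();
    for &pos in positions {
        let block = pos >> 16; // hypothetical block of 2^16 elements
        match groups.last_mut() {
            // Same block as the previous position: extend the current group.
            Some((b, members)) if *b == block => members.push(pos),
            // New block: start a new group.
            _ => groups.push((block, vec![pos])),
        }
    }
    groups
}

fn main() {
    // 1 and 2 share block 0; 70_000 and 70_001 share block 1; 140_000 is block 2.
    let groups = group_by_block(&[1, 2, 70_000, 70_001, 140_000]);
    println!("{}", groups.len()); // prints 3
}
```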

@PSeitz PSeitz force-pushed the sparse_dense_index branch 2 times, most recently from f28713a to 6151c51 Compare December 22, 2022 09:20
codecov-commenter commented Dec 22, 2022

Codecov Report

Merging #1736 (6151c51) into main (7385a8f) will decrease coverage by 0.01%.
The diff coverage is 94.69%.

@@            Coverage Diff             @@
##             main    #1736      +/-   ##
==========================================
- Coverage   94.13%   94.12%   -0.02%     
==========================================
  Files         266      267       +1     
  Lines       50814    50950     +136     
==========================================
+ Hits        47835    47955     +120     
- Misses       2979     2995      +16     
Impacted Files Coverage Δ
common/src/lib.rs 89.33% <ø> (ø)
common/src/group_by.rs 89.88% <89.88%> (ø)
fastfield_codecs/src/null_index/sparse.rs 95.46% <95.58%> (-1.29%) ⬇️
fastfield_codecs/src/null_index/dense.rs 99.16% <100.00%> (+0.10%) ⬆️
stacker/src/expull.rs 98.33% <0.00%> (-0.56%) ⬇️
src/store/index/mod.rs 97.83% <0.00%> (-0.55%) ⬇️
src/indexer/segment_updater.rs 95.44% <0.00%> (+0.13%) ⬆️
src/schema/schema.rs 98.91% <0.00%> (+0.13%) ⬆️


@PSeitz PSeitz force-pushed the sparse_dense_index branch 2 times, most recently from 78fef9f to eb22259 Compare December 22, 2022 13:31
// somehow.
//
// One potential solution would be to replace the iterator approach with something similar.
inner: Rc<RefCell<GroupByShared<I, F, K>>>,
Collaborator

This might be a use case for GAT + LendingIterator? But that's not an Iterator, unfortunately.
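For reference, the lending-iterator shape mentioned here can be sketched with a generic associated type (GAT). This is a minimal illustration, not tied to this PR's code: the lifetime in `Item<'a>` ties each yielded item to the iterator itself, which is exactly why such a type cannot implement the standard `Iterator` trait.

```rust
// A minimal lending-iterator trait using a generic associated type.
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// Toy example: each yielded &str borrows the iterator's own buffer,
// so the item cannot outlive the next call to `next`.
struct Counter {
    buf: String,
    n: u32,
}

impl LendingIterator for Counter {
    type Item<'a> = &'a str where Self: 'a;
    fn next(&mut self) -> Option<&str> {
        if self.n >= 3 {
            return None;
        }
        self.n += 1;
        self.buf = format!("item {}", self.n);
        Some(&self.buf)
    }
}

fn main() {
    let mut c = Counter { buf: String::new(), n: 0 };
    // `while let` works fine; `collect` would not, since items borrow `c`.
    while let Some(item) = c.next() {
        println!("{item}");
    }
}
```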

fulmicoton (Collaborator) left a comment

Let's merge this for the moment, but I suspect there is a problem in the API.

My understanding is that this function (the id conversion) is meant to be used in range queries. The current Column trait's get_range does not work as an iterator.

If we focus on catering to this usage, we do not need an Iterator; the group-by logic can be implemented as plain loops.

That would probably unlock more from your group_by optimisation: the Iterator won't be boxed anymore, and within a given block the loop will use static dispatch.
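The loop-based shape suggested in this comment could look roughly like the following. This is a sketch with assumed names and an assumed block size of 2^16, not the actual tantivy code: the inner loop runs over one block at a time, so the per-block work is statically dispatched and easy for the compiler to optimise.

```rust
// Sketch: replace a boxed group-by iterator with explicit nested loops.
fn translate_in_blocks(positions: &[u32], out: &mut Vec<u32>) {
    let mut i = 0;
    while i < positions.len() {
        let block = positions[i] >> 16; // hypothetical block of 2^16 ids
        // Inner loop: consume every position belonging to `block`.
        while i < positions.len() && positions[i] >> 16 == block {
            // Placeholder for the per-block dense/sparse codec lookup;
            // a real implementation would translate the id here.
            out.push(positions[i]);
            i += 1;
        }
    }
}

fn main() {
    let mut out = Vec::new();
    translate_in_blocks(&[1, 2, 70_000, 70_001], &mut out);
    println!("{:?}", out); // prints [1, 2, 70000, 70001]
}
```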

PSeitz (Contributor, Author) commented Dec 26, 2022

Yes, this is for range queries, but for that we call:

pub fn get_docids_for_value_range(
    &self,
    value_range: RangeInclusive<T>,
    doc_id_range: Range<u32>,
    positions: &mut Vec<u32>,
) {

So we need to translate the positions from the codec index to the original index. We could replace the iterator and pass the positions directly; that would remove the dispatch issue. We could also replace the group_by, since we would then be working on a slice.
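That direction could be sketched as below. The names are hypothetical: `codec_to_orig` merely stands in for the dense/sparse null-index lookup and is not a real tantivy API.

```rust
// Sketch: translate codec-space positions to original doc ids in place
// on the result slice, instead of going through a boxed iterator.
fn translate_positions(positions: &mut [u32], codec_to_orig: impl Fn(u32) -> u32) {
    for p in positions.iter_mut() {
        *p = codec_to_orig(*p);
    }
}

fn main() {
    let mut positions = vec![0u32, 3, 7];
    // Toy mapping for illustration only: original id = codec id * 2.
    translate_positions(&mut positions, |codec_idx| codec_idx * 2);
    println!("{:?}", positions); // prints [0, 6, 14]
}
```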

fulmicoton (Collaborator)

@PSeitz Yes. Let's merge this for the moment.

@PSeitz PSeitz merged commit 45156fd into main Dec 26, 2022
@PSeitz PSeitz deleted the sparse_dense_index branch December 26, 2022 05:13
This was referenced Jan 13, 2023
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023