
use group_by in translate_codec_idx_to_original_idx #1736

Merged · 1 commit merged into main on Dec 26, 2022

Conversation

PSeitz (Contributor) commented Dec 22, 2022

The group_by has some overhead, and for the strictly sparse case it's just worse. But it addresses a pathological case: clusters of dense blocks in a sparse field with many search results in them. I think that's a valid scenario when users start to use the JSON type to ingest and search different kinds of data in a single index.

The index has 1 million docs.

Now
running 5 tests
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit  ... bench:         101 ns/iter (+/- 1)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_10percent_hit         ... bench:      10,805 ns/iter (+/- 682)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_full_scan             ... bench:      47,879 ns/iter (+/- 3,955)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit ... bench:       9,267 ns/iter (+/- 66)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_full_scan            ... bench:  16,630,831 ns/iter (+/- 315,576)


Before
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit  ... bench:          91 ns/iter (+/- 2)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_10percent_hit         ... bench:       6,505 ns/iter (+/- 256)
test null_index::sparse::bench::bench_translate_codec_to_orig_1percent_filled_full_scan             ... bench:      33,322 ns/iter (+/- 2,482)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit ... bench:      14,599 ns/iter (+/- 225)
test null_index::sparse::bench::bench_translate_codec_to_orig_90percent_filled_full_scan            ... bench: 295,434,573 ns/iter (+/- 31,193,127)
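The grouping idea behind the change can be sketched as follows. This is a minimal, self-contained illustration with assumed names and an assumed block size of 2^16 — not tantivy's actual code: positions falling into the same block are grouped, so per-block metadata is fetched once per group rather than once per position, which is what helps the "clusters of dense blocks" case described above.

```rust
// Sketch only: group sorted positions by their (hypothetical) block id.
fn group_by_block(positions: &[u32]) -> Vec<(u32, Vec<u32>)> {
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();
    for &pos in positions {
        let block = pos >> 16; // hypothetical block of 2^16 elements
        match groups.last_mut() {
            // Same block as the previous position: extend the current group.
            Some((b, members)) if *b == block => members.push(pos),
            // New block: start a new group.
            _ => groups.push((block, vec![pos])),
        }
    }
    groups
}

fn main() {
    // 1 and 2 share block 0; 70_000 and 70_001 share block 1; 140_000 is block 2.
    let groups = group_by_block(&[1, 2, 70_000, 70_001, 140_000]);
    println!("{}", groups.len()); // prints 3
}
```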

@PSeitz PSeitz force-pushed the sparse_dense_index branch 2 times, most recently from f28713a to 6151c51 Compare December 22, 2022 09:20
codecov-commenter commented Dec 22, 2022

Codecov Report

Merging #1736 (6151c51) into main (7385a8f) will decrease coverage by 0.01%.
The diff coverage is 94.69%.

@@            Coverage Diff             @@
##             main    #1736      +/-   ##
==========================================
- Coverage   94.13%   94.12%   -0.02%     
==========================================
  Files         266      267       +1     
  Lines       50814    50950     +136     
==========================================
+ Hits        47835    47955     +120     
- Misses       2979     2995      +16     
Impacted Files Coverage Δ
common/src/lib.rs 89.33% <ø> (ø)
common/src/group_by.rs 89.88% <89.88%> (ø)
fastfield_codecs/src/null_index/sparse.rs 95.46% <95.58%> (-1.29%) ⬇️
fastfield_codecs/src/null_index/dense.rs 99.16% <100.00%> (+0.10%) ⬆️
stacker/src/expull.rs 98.33% <0.00%> (-0.56%) ⬇️
src/store/index/mod.rs 97.83% <0.00%> (-0.55%) ⬇️
src/indexer/segment_updater.rs 95.44% <0.00%> (+0.13%) ⬆️
src/schema/schema.rs 98.91% <0.00%> (+0.13%) ⬆️


@PSeitz PSeitz force-pushed the sparse_dense_index branch 2 times, most recently from 78fef9f to eb22259 Compare December 22, 2022 13:31
// somehow.
//
// One potential solution would be to replace the iterator approach with something similar.
inner: Rc<RefCell<GroupByShared<I, F, K>>>,
Collaborator

This might be a use case for GAT + LendingIterator? But that's not an Iterator, unfortunately.
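For reference, the lending-iterator shape mentioned here can be sketched with a generic associated type (GAT). This is a minimal illustration, not tied to this PR's code: the lifetime in `Item<'a>` ties each yielded item to the iterator itself, which is exactly why such a type cannot implement the standard `Iterator` trait.

```rust
// A minimal lending-iterator trait using a generic associated type.
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// Toy example: each yielded &str borrows the iterator's own buffer,
// so the item cannot outlive the next call to `next`.
struct Counter {
    buf: String,
    n: u32,
}

impl LendingIterator for Counter {
    type Item<'a> = &'a str where Self: 'a;
    fn next(&mut self) -> Option<&str> {
        if self.n >= 3 {
            return None;
        }
        self.n += 1;
        self.buf = format!("item {}", self.n);
        Some(&self.buf)
    }
}

fn main() {
    let mut c = Counter { buf: String::new(), n: 0 };
    // `while let` works fine; `collect` would not, since items borrow `c`.
    while let Some(item) = c.next() {
        println!("{item}");
    }
}
```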

fulmicoton (Collaborator) left a comment

Let's merge this for the moment, but I suspect there is a problem in the API.

My understanding is that this function (the id conversion) is meant to be used in range queries. The current Column trait's get_range does not work as an iterator.

If we focus on catering to this usage, we do not need an Iterator; the group-by logic can be implemented as plain loops.

That would probably unlock more from your group_by optimisation: the Iterator won't be boxed anymore, and within a given block the loop will use static dispatch.
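The loop-based shape suggested in this comment could look roughly like the following. This is a sketch with assumed names and an assumed block size of 2^16, not the actual tantivy code: the inner loop runs over one block at a time, so the per-block work is statically dispatched and easy for the compiler to optimise.

```rust
// Sketch: replace a boxed group-by iterator with explicit nested loops.
fn translate_in_blocks(positions: &[u32], out: &mut Vec<u32>) {
    let mut i = 0;
    while i < positions.len() {
        let block = positions[i] >> 16; // hypothetical block of 2^16 ids
        // Inner loop: consume every position belonging to `block`.
        while i < positions.len() && positions[i] >> 16 == block {
            // Placeholder for the per-block dense/sparse codec lookup;
            // a real implementation would translate the id here.
            out.push(positions[i]);
            i += 1;
        }
    }
}

fn main() {
    let mut out = Vec::new();
    translate_in_blocks(&[1, 2, 70_000, 70_001], &mut out);
    println!("{:?}", out); // prints [1, 2, 70000, 70001]
}
```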

PSeitz (Contributor, Author) commented Dec 26, 2022

Yes, this is for range queries, but for that we call:

pub fn get_docids_for_value_range(
    &self,
    value_range: RangeInclusive<T>,
    doc_id_range: Range<u32>,
    positions: &mut Vec<u32>,
) {

So we need to translate the positions from the codec index to the original index. We could replace the iterator and pass the positions directly; that would remove the dispatch issue. We could also replace the group_by, since we would then be working on a slice.
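That direction could be sketched as below. The names are hypothetical: `codec_to_orig` merely stands in for the dense/sparse null-index lookup and is not a real tantivy API.

```rust
// Sketch: translate codec-space positions to original doc ids in place
// on the result slice, instead of going through a boxed iterator.
fn translate_positions(positions: &mut [u32], codec_to_orig: impl Fn(u32) -> u32) {
    for p in positions.iter_mut() {
        *p = codec_to_orig(*p);
    }
}

fn main() {
    let mut positions = vec![0u32, 3, 7];
    // Toy mapping for illustration only: original id = codec id * 2.
    translate_positions(&mut positions, |codec_idx| codec_idx * 2);
    println!("{:?}", positions); // prints [0, 6, 14]
}
```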

fulmicoton (Collaborator)

@PSeitz Yes. Let's merge this for the moment.

@PSeitz PSeitz merged commit 45156fd into main Dec 26, 2022
@PSeitz PSeitz deleted the sparse_dense_index branch December 26, 2022 05:13
This was referenced Jan 13, 2023
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023