
Doc blocks get too large if no fields are stored #1552

Closed
PSeitz opened this issue Sep 26, 2022 · 6 comments · Fixed by #1569
PSeitz commented Sep 26, 2022

If no fields are stored, we store too many documents in a block. The lookup times get very slow, e.g. when merging.
There are two options:

  • Limit the maximum number of docs per block
  • Add a fast path when no fields are stored
@fulmicoton

Nice find!
Can you detail the path that makes it slow on merge? I thought we were caching decompressed blocks.

PSeitz commented Sep 26, 2022

The issue is the linear scan inside a block (fn get_document_bytes_from_block), e.g. when there are millions of docs inside a block. During a merge the issue occurs only with index sorting, but search should also be affected. If we limit the number of docs per block, we can shift more work to the faster skip list and away from the linear scan.

Merge Code

for old_doc_id in doc_id_map.iter_old_doc_ids() {
    let doc_bytes = store_read.get_document_bytes(old_doc_id)?;
    serializer.get_store_writer().store_bytes(&doc_bytes)?;
}

trinity-1686a commented Sep 28, 2022

The default block size appears to be 16384 bytes, which can store up to 8192 empty documents (one byte for the size and one byte for the field count). That's not that big a number imo. However, the merge code is linear-scanning the block n times, so we end up with something in O(n²).

I think StoreReader should get a fn get_many_documents_bytes(&self, doc_ids: &[DocId]) -> crate::Result<Vec<OwnedBytes>> so it can scan only once

PSeitz commented Sep 28, 2022

doc_ids: &[DocId] is not sorted and can be in random order. This is only an issue when the index is sorted by a field; the new sort order is then reflected in doc_ids: &[DocId].

@trinity-1686a

How can I reproduce the slowness you observed?

PSeitz commented Sep 29, 2022

Don't store any fields, enable index sorting, and index a large number of docs so that merging happens. For me, around 50% of the time was spent in `get_document_bytes`. I think that case is slightly unusual, so I'd go with a limit on the number of docs per block as the general solution rather than a special path.
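For anyone reproducing, an outline in tantivy code might look like the following. This is a sketch only, written against the API around tantivy 0.18; the field name, doc counts, and memory budget are illustrative, and the settings names should be checked against the current release.

```rust
// Sketch only: no field is marked STORED, the index is sorted by a fast
// field, and enough docs/commits happen that segment merges kick in.
use tantivy::schema::{Schema, FAST, INDEXED};
use tantivy::{doc, Index, IndexSettings, IndexSortByField, Order};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // fast + indexed, but NOT stored -> each stored document is ~2 bytes
    let val = schema_builder.add_u64_field("val", INDEXED | FAST);
    let schema = schema_builder.build();

    let settings = IndexSettings {
        sort_by_field: Some(IndexSortByField {
            field: "val".to_string(),
            order: Order::Asc,
        }),
        ..Default::default()
    };
    let index = Index::builder()
        .schema(schema)
        .settings(settings)
        .create_in_ram()?;

    let mut writer = index.writer(50_000_000)?;
    for i in 0..10_000_000u64 {
        writer.add_document(doc!(val => i % 1_000))?;
        if i % 1_000_000 == 0 {
            writer.commit()?; // several segments -> merges hit the slow path
        }
    }
    writer.commit()?;
    writer.wait_merging_threads()?;
    Ok(())
}
```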
