
Doc blocks get too large if no fields are stored #1552

Closed
PSeitz opened this issue Sep 26, 2022 · 6 comments · Fixed by #1569
PSeitz commented Sep 26, 2022

If no fields are stored, we store too many documents in a block. The lookup times get very slow, e.g. when merging.
There are two options:

  • Limit the maximum number of docs per block
  • Add a fast path when no fields are stored
@fulmicoton

Nice find!
Can you detail the path that makes it slow on merge? I thought we were caching decompressed blocks.

PSeitz commented Sep 26, 2022

The issue is the linear scan inside a block (fn get_document_bytes_from_block), e.g. when there are millions of docs inside a block. During a merge the issue occurs only with index sorting, but search should also be affected. If we limit the number of docs per block, we can shift more work to the faster skip list and away from the linear scan.

Merge Code

for old_doc_id in doc_id_map.iter_old_doc_ids() {
    let doc_bytes = store_read.get_document_bytes(old_doc_id)?;
    serializer.get_store_writer().store_bytes(&doc_bytes)?;
}

trinity-1686a commented Sep 28, 2022

The default block size appears to be 16384 bytes, which can store up to 8192 empty documents (one byte for the size and one byte for the field count). That's not that big a number imo. However, the merge code is linear-scanning the block n times, so we end up with something in O(n²).

I think StoreReader should get a fn get_many_documents_bytes(&self, doc_ids: &[DocId]) -> crate::Result<Vec<OwnedBytes>> so it can scan only once

PSeitz commented Sep 28, 2022

doc_ids: &[DocId] is not sorted and can be in random order. This is only an issue when the index is sorted by a field; the new sort order is then reflected in doc_ids: &[DocId].

@trinity-1686a

How can I reproduce the slowness you observed?

PSeitz commented Sep 29, 2022

Don't store any fields, enable index sorting, and index a large number of docs so that merging happens. For me, around 50% of the time was spent in `get_document_bytes`. I think that case is slightly unusual, so I'd go with a limit on the number of docs per block as the general solution rather than a special path.
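For anyone reproducing, an outline in tantivy code might look like the following. This is a sketch only, written against the API around tantivy 0.18; the field name, doc counts, and memory budget are illustrative, and the settings names should be checked against the current release.

```rust
// Sketch only: no field is marked STORED, the index is sorted by a fast
// field, and enough docs/commits happen that segment merges kick in.
use tantivy::schema::{Schema, FAST, INDEXED};
use tantivy::{doc, Index, IndexSettings, IndexSortByField, Order};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // fast + indexed, but NOT stored -> each stored document is ~2 bytes
    let val = schema_builder.add_u64_field("val", INDEXED | FAST);
    let schema = schema_builder.build();

    let settings = IndexSettings {
        sort_by_field: Some(IndexSortByField {
            field: "val".to_string(),
            order: Order::Asc,
        }),
        ..Default::default()
    };
    let index = Index::builder()
        .schema(schema)
        .settings(settings)
        .create_in_ram()?;

    let mut writer = index.writer(50_000_000)?;
    for i in 0..10_000_000u64 {
        writer.add_document(doc!(val => i % 1_000))?;
        if i % 1_000_000 == 0 {
            writer.commit()?; // several segments -> merges hit the slow path
        }
    }
    writer.commit()?;
    writer.wait_merging_threads()?;
    Ok(())
}
```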
