Indexing: unnecessary permission documents are created #10597

Open
landreev opened this issue May 30, 2024 · 0 comments

landreev commented May 30, 2024

From slack earlier:

[The] discrepancy between the number of the permission documents and the indexed dvobjects is still there, even on an index created from scratch. So, it cannot be explained by permission documents being left behind when objects are deleted.
curl "http://172.31.88.120:8983/solr/collection1/query?debug=query&q=definitionPointDocId:*&rows=0"
The sum of q=dvObjectType:dataverses, ...:datasets and ...:files is 3715147.

curl "http://172.31.88.120:8983/solr/collection1/query?debug=query&q=*:*&rows=0" returns a total equal to the sum of the 2 numbers above (the permission document count and the dvobject count).
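The comparison above can be scripted. The following is a minimal sketch only, not code from Dataverse: it reuses the host and field names from the curl commands, assumes the `/query` handler returns the usual JSON body with `response.numFound`, and the helper names (`count_url`, `num_found`, `fetch_count`, `extra_permission_docs`) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and collection copied from the curl commands above.
SOLR_QUERY_URL = "http://172.31.88.120:8983/solr/collection1/query"

# Queries taken from the issue text.
PERMISSION_DOCS_QUERY = "definitionPointDocId:*"
DVOBJECT_QUERIES = (
    "dvObjectType:dataverses",
    "dvObjectType:datasets",
    "dvObjectType:files",
)


def count_url(query, base=SOLR_QUERY_URL):
    """Build a rows=0 query URL; only the hit count is of interest."""
    return base + "?" + urlencode({"q": query, "rows": 0})


def num_found(response_body):
    """Extract numFound from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


def fetch_count(query):
    """Run one count query against Solr (hypothetical helper, needs network access)."""
    with urlopen(count_url(query)) as resp:
        return num_found(resp.read().decode("utf-8"))


def extra_permission_docs(permission_docs, dvobject_docs):
    """Permission documents in excess of the indexed dvobject documents."""
    return permission_docs - dvobject_docs


def report():
    """Print both counts and the discrepancy between them."""
    permission_docs = fetch_count(PERMISSION_DOCS_QUERY)
    dvobject_docs = sum(fetch_count(q) for q in DVOBJECT_QUERIES)
    print("permission docs:", permission_docs)
    print("dvobject docs:", dvobject_docs)
    print("extra:", extra_permission_docs(permission_docs, dvobject_docs))
```

Running `report()` periodically on the same index would show whether the surplus keeps growing.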

The number of extra permission docs keeps growing the longer the index is in place.

On a closer look:

none of [these] are literal duplicates; i.e., there are no datasets with multiple definitionPointDocId:dataset_NNNNN permission docs.
However, there are multiple cases of, for example, both
definitionPointDocId:dataset_NNNNN
and
definitionPointDocId:dataset_NNNNN_draft permission docs, when the only indexed dvobject document for it is id:dataset_NNNNN .
I am seeing such cases for both datasets and files. I am also seeing the reverse (i.e., both permission documents exist when only an indexed draft document is present).
This likely means that in some cases we fail to remove the permission doc for the draft when we publish, and in others we create permission docs for published documents while they are still in draft.
For example, we may be creating definitionPointDocId:file_MMMMM permission documents for unpublished files when the parent dataset has published versions.
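To find these cases in bulk, the two sets of ids can be compared offline. This is a rough sketch under assumptions: the ids have already been exported from Solr into plain collections, a permission doc's definitionPointDocId is expected to equal the id of an indexed document, and only the `_draft` suffix mentioned above is handled. The function names are hypothetical.

```python
from collections import defaultdict

DRAFT_SUFFIX = "_draft"


def base_id(doc_id):
    """Strip the draft suffix so dataset_1 and dataset_1_draft group together."""
    if doc_id.endswith(DRAFT_SUFFIX):
        return doc_id[: -len(DRAFT_SUFFIX)]
    return doc_id


def redundant_permission_docs(indexed_ids, definition_point_ids):
    """Map each base id to the definition points that have no indexed document,
    restricted to objects that do have at least one matching indexed document."""
    indexed = set(indexed_ids)
    by_base = defaultdict(set)
    for point in definition_point_ids:
        by_base[base_id(point)].add(point)

    redundant = {}
    for base, points in by_base.items():
        extra = points - indexed
        # Report only when a sibling (draft or published) is actually indexed;
        # fully orphaned permission docs are a different problem.
        if extra and points & indexed:
            redundant[base] = sorted(extra)
    return redundant


# Placeholder ids for illustration, not real data.
indexed = {"dataset_1", "file_2_draft"}
permission_points = [
    "dataset_1", "dataset_1_draft",  # published only, draft perm doc left over
    "file_2", "file_2_draft",        # draft only, published perm doc created early
]
result = redundant_permission_docs(indexed, permission_points)
```

With the placeholder ids, `result` flags `dataset_1_draft` and `file_2`, matching the two patterns described above.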

This does not appear to corrupt the index in a way that would affect the accuracy of search results. But the redundant permission docs are likely slowing down the permission lookups.

We should also experiment with no longer creating permission documents for public indexed dvobjects at all, in combination with the new "avoid expensive solr join" mechanism (#10555). Still debating whether that should be handled as a separate issue.
