Large number of datasets not resolved by db-solr-sync #4355

Closed
FuhuXia opened this issue Jun 14, 2023 · 2 comments
Labels
bug Software defect or bug

Comments

FuhuXia commented Jun 14, 2023

We are seeing a large number of datasets stuck in db-solr-sync. As of today the count is 10757.

The nightly db-solr-sync job resolves discrepancies in each package's harvest_object property between the DB and Solr. Because of network glitches and other factors, it is normal to see a single-digit number of packages that need to be synced each day after harvesting. After the db-solr-sync job finishes, the count should be 0.

A large number of stuck datasets (those not resolved by the db-solr-sync job) means those datasets have other issues that need to be identified and resolved by another process, after which the count of stuck datasets should return to 0.
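
The comparison described above can be sketched roughly as follows. This is only an illustration, not the actual db-solr-sync code: the function name `diff_db_and_solr` and the dictionary shapes (package id mapped to harvest object id) are assumptions made for the example.

```python
from typing import Dict, Optional, Set, Tuple


def diff_db_and_solr(
    db: Dict[str, Optional[str]],
    solr: Dict[str, Optional[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (to_update, to_remove) sets of package ids.

    Both arguments map package id -> harvest object id (hypothetical shape).
    """
    # Packages in the DB that Solr is missing, or that Solr indexes
    # with a different harvest object id, need to be (re)indexed.
    to_update = {
        pkg for pkg, harvest_id in db.items()
        if pkg not in solr or solr[pkg] != harvest_id
    }
    # Packages indexed in Solr that no longer exist in the DB
    # need to be removed from the index.
    to_remove = set(solr) - set(db)
    return to_update, to_remove
```

When both sets come back empty, the DB and Solr agree, which corresponds to the expected count of 0 after a successful run.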

Related to GSA/catalog.data.gov#848

FuhuXia commented Jun 23, 2023

Causes identified for the large number of out-of-sync datasets:

  1. Datasets have multiple harvest objects, but none of them is marked as current.
    db-solr-sync will NOT fix them.
    Solution: the relink script will fix them.

  2. Solr holds an outdated harvest object after each re-harvest, as identified in "unchanged source lost harvest_object_id after each reharvest" #4362.
    db-solr-sync will fix them, but they come back after the next re-harvest.
    Solution: fix the bug.

  3. Datasets have no harvest object in the DB.
    Those with a non-datajson harvest type can be viewed via this API call. All of them can be viewed in the db-solr-sync log.
    db-solr-sync will NOT fix them.
    Solution: skip them in db-solr-sync; purge them via an API call or another fix.

After the above fixes, the number should drop from 10-20K to minimal. If not, we can identify further causes and eventually get the number down to double digits.
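
As a rough illustration of how the three causes above differ, the sketch below sorts a single dataset into one of them. The `HarvestObject` class, the `classify` function, and the returned labels are all hypothetical names chosen for this example; they are not part of db-solr-sync or CKAN.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HarvestObject:
    id: str
    current: bool  # whether this harvest object is marked as current in the DB


def classify(db_objects: List[HarvestObject],
             solr_harvest_object_id: Optional[str]) -> str:
    """Label one dataset with the cause (if any) that keeps it out of sync."""
    if not db_objects:
        # Cause 3: no harvest object in the DB at all.
        return "no_harvest_object"
    current = [obj for obj in db_objects if obj.current]
    if not current:
        # Cause 1: harvest objects exist, but none is marked as current.
        # (This sketch also treats a single non-current object this way.)
        return "no_current_harvest_object"
    if solr_harvest_object_id != current[0].id:
        # Cause 2: Solr still points at an outdated harvest object.
        return "solr_outdated"
    return "in_sync"
```

Only the "solr_outdated" case is something a sync between the DB and Solr can repair by itself; the other two need the relink script or a purge, as listed above.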

FuhuXia commented Jun 26, 2023

After multiple db-solr-sync runs and cleanups, recurring datasets are gone.

total 373590 solr indexed_package
0 packages need to be removed from Solr
0 packages need to be updated/added to Solr
0 packages without harvest_object need to be mannually deleted

We will monitor the daily output as an O&M task; there are now minimal stuck datasets.
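
For the daily monitoring, one possible approach is to total the per-category counts from the summary. The sketch below assumes the summary keeps the "<count> packages ..." line shape shown above; the function name `outstanding_count` is hypothetical.

```python
import re

# Matches summary lines of the form "<count> packages <description>".
_COUNT_LINE = re.compile(r"^\s*(\d+)\s+packages\b")


def outstanding_count(summary: str) -> int:
    """Sum the leading counts of all '<count> packages ...' lines."""
    total = 0
    for line in summary.splitlines():
        match = _COUNT_LINE.match(line)
        if match:
            total += int(match.group(1))
    return total
```

The "total ... solr indexed_package" line is skipped because it does not start with a number, so a healthy run like the one above totals 0.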

@FuhuXia FuhuXia closed this as completed Jun 26, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jun 26, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jul 6, 2023