USGS data clean up #4000
Comments
This is the list of harvest sources we need to clear one by one: |
The |
Query used: Given a source title, find the name and state, make sure the state is deleted
CKAN command to clear the source. Should be run as cf tasks.
Query to check ongoing query execution status:
|
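The exact commands from this comment are not preserved here. As a rough sketch only, this is one way the clear command could be wrapped so it runs as a cf task, one task per harvest source. It assumes cf CLI v7 or later (where `cf run-task` takes a `--command` flag); the app name `catalog-admin` and the helper name `build_clear_commands` are hypothetical and not taken from this thread.

```python
import shlex


def build_clear_commands(source_names, cf_app=None):
    """Return one shell command string per harvest source.

    If cf_app is given, each CKAN command is wrapped in `cf run-task`
    (cf CLI v7+ syntax, an assumption); otherwise the plain CKAN
    command is returned.
    """
    commands = []
    for name in source_names:
        # The clear command quoted in this issue, with the name shell-quoted.
        ckan_cmd = f"ckan harvester source clear {shlex.quote(name)}"
        if cf_app:
            cmd = (
                f"cf run-task {shlex.quote(cf_app)} "
                f"--command {shlex.quote(ckan_cmd)}"
            )
        else:
            cmd = ckan_cmd
        commands.append(cmd)
    return commands
```

Printing the result of `build_clear_commands(["name-of-source"], cf_app="catalog-admin")` gives a command that can be reviewed before anything is run, which seems safer than launching tasks directly against production.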
Note that we have done manual DB manipulation in the past; some of those notes are here and here. Please note that the first has a link to the query we ran for cleaning up harvest tables; it may be out of date compared with the current ckanext-harvest cleanup. |
Found a missing index. Adding the index speeds up a long-running query by a factor of 230, tested in the staging environment with 16 million records in table Deleting 10,000 records from table The index to be added: field Upstream PR created: ckan/ckanext-harvest#514. |
The harvest job page also benefits from this added index and is noticeably faster (from 44 seconds to 12 seconds on page https://catalog-stage-admin-datagov.app.cloud.gov/harvest/gsa-json/job). |
Manually added index on production database:
Killed previous harvest source clear job then re-ran it. |
Another index needs to be added to speed things up.
Speed difference:
Without index: |
With the two indexes added, |
The initial harvest sources have been run through, and we believe this is complete (checking with user before closing)... |
We have a few more items to clear out:
These items seem to be orphaned. We can examine how they became orphaned, and then they need to be removed: I will run the de-dupe process for DOI to see if we can clean up some of the other records that the user noted... |
@jbrown-xentity Just checking that this is actually done? |
USGS got their list of harvest sources down to 4 (see here). Unfortunately, the datasets are still around (the delete must not have cleared the harvest sources, or the clear failed for some reason). You can see the list of normal datasets by harvest source, and collection records by harvest source. We need to manually clear these.
How to reproduce
Expected behavior
Delete would kick off clear
Actual behavior
Datasets are left orphaned
Sketch
We were already able to clear one harvest source. This involves running
ckan harvester source clear name-of-source
for each harvest source that still has data. We expect some issues on the larger (100K-400K) datasets, and may have to run the db queries piecemeal in order to completely clear the data.
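To illustrate what running the db queries piecemeal could look like, here is a minimal sketch of deleting rows in fixed-size batches, committing after each batch so no single statement has to touch hundreds of thousands of rows. It uses an in-memory SQLite table as a stand-in: the production database is not SQLite, the table here has only two columns, and the real clear presumably touches several related tables. The function name `delete_in_batches` and the default batch size are assumptions for illustration.

```python
import sqlite3


def delete_in_batches(conn, source_id, batch_size=10_000):
    """Delete rows for one harvest source in chunks.

    Returns the total number of rows deleted. `harvest_object` here is
    a simplified stand-in table, not the full production schema.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM harvest_object WHERE id IN ("
            " SELECT id FROM harvest_object"
            " WHERE harvest_source_id = ? LIMIT ?)",
            (source_id, batch_size),
        )
        conn.commit()  # keep each transaction small
        if cur.rowcount == 0:
            break  # nothing left for this source
        total += cur.rowcount
    return total
```

Committing per batch trades a longer overall run for shorter locks and the ability to stop and resume, which may matter on the larger sources.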