USGS data clean up #4000

Closed · jbrown-xentity opened this issue Oct 6, 2022 · 12 comments
Labels: bug (Software defect or bug)
@jbrown-xentity (Contributor)

USGS got their list of harvest sources down to 4 (see here). Unfortunately, the datasets are still around: either deleting the harvest sources did not trigger a clear, or the clear failed for some reason. You can see the list of normal datasets by harvest source, and collection records by harvest source. We need to clear these manually.

How to reproduce

  1. Delete a harvest source without clearing it

Expected behavior

Deleting the source would kick off a clear of its datasets

Actual behavior

Datasets are left orphaned

Sketch

We were already able to clear one harvest source. This involves running ckan harvester source clear name-of-source for each harvest source that still has data. We expect some issues on the larger (100K-400K dataset) sources, and may have to run the DB queries piecemeal in order to completely clear the data.
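
To find which sources still have data, a query along these lines should work (a sketch against the ckanext-harvest tables; the current flag marks the harvest object attached to the live dataset, so double-check the column names against the installed schema):

-- List harvest sources that still have current harvest objects.
SELECT hs.title, count(*) AS remaining_objects
FROM harvest_source hs
JOIN harvest_object ho ON ho.harvest_source_id = hs.id
WHERE ho.current
GROUP BY hs.title
ORDER BY remaining_objects DESC;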

@jbrown-xentity (Contributor, Author)

This is the list of harvest sources we need to clear one by one:
usgs-clear-list.txt

@jbrown-xentity jbrown-xentity moved this to In Progress [8] in data.gov team board Oct 6, 2022
@FuhuXia (Member) commented Oct 7, 2022

The source clear task runs OK for harvest sources with fewer than 1k datasets. When there are more than 1k datasets, the DB query starts to take a long time, and the time it spends increases exponentially. The current query has spent 6 hours on harvest source USGS National Hydrography Dataset (NHD) Downloadable Data Collection Harvest Source: 10313 and it's still ongoing.

@FuhuXia (Member) commented Oct 7, 2022

Query used:

Given a source title, find its name and state, and make sure the state is deleted:

 select name, state from package where title='[source title]' and type='harvest';

CKAN commands to clear the source (these should be run as cf tasks):

ckan harvester source show [name] # to double check the state is inactive
ckan harvester source clear [name] # to clear the source

Query to check ongoing query execution status:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
  AND state = 'active';
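
If a clear query has to be aborted (as happens later in this thread), the pid reported by the query above can be cancelled with the standard Postgres admin functions; a sketch, where the interval and the LIKE filter are only illustrations:

-- pg_cancel_backend() asks the backend to cancel its current query;
-- pg_terminate_backend() is the harsher variant that kills the connection.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND (now() - query_start) > interval '1 hour'
  AND query LIKE 'DELETE FROM harvest_object%';  -- illustrative filter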

@jbrown-xentity (Contributor, Author)

Note that we have done manual DB manipulation in the past; some of those notes are here and here. Please note that the first has a link to the query we ran for cleaning up harvest tables; it may be out of date, though, compared with the current ckanext-harvest cleanup.
This is only if we need to run the queries manually...
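
If manual cleanup does become necessary, batched deletes keep each transaction small. A minimal sketch, assuming the ckanext-harvest tables; the '<source-id>' placeholder and the 10000 batch size are illustrative, and each statement is repeated until it affects zero rows:

-- Child rows first: errors pointing at the source's harvest objects.
DELETE FROM harvest_object_error
WHERE id IN (
    SELECT hoe.id
    FROM harvest_object_error hoe
    JOIN harvest_object ho ON ho.id = hoe.harvest_object_id
    WHERE ho.harvest_source_id = '<source-id>'
    LIMIT 10000
);

-- Then the harvest objects themselves.
DELETE FROM harvest_object
WHERE id IN (
    SELECT id FROM harvest_object
    WHERE harvest_source_id = '<source-id>'
    LIMIT 10000
);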

@FuhuXia (Member) commented Oct 10, 2022

Found a missing index. Adding the index makes a long-running query roughly 230 times faster, tested in the staging environment with 16 million records in table harvest_object and 11 million records in table harvest_object_error on a medium-gp-psql DB.

Deleting 10,000 records from table harvest_object:
Without index: Time: 1888580.541 ms (31:28.581)
With index: Time: 8164.541 ms (00:08.165)

The index to be added: column harvest_object_id in table harvest_object_error.
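
For reference, a timing test along these lines reproduces numbers like the above (a sketch; \timing is a psql meta-command, and '<source-id>' is a placeholder):

\timing on
BEGIN;
DELETE FROM harvest_object
WHERE id IN (
    SELECT id FROM harvest_object
    WHERE harvest_source_id = '<source-id>'
    LIMIT 10000
);
ROLLBACK;  -- measure only; leave the data in place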

Upstream PR created: ckan/ckanext-harvest#514.

@FuhuXia (Member) commented Oct 10, 2022

The harvest job page also benefits from the added index; it is noticeably faster (from 44 seconds to 12 seconds on page https://catalog-stage-admin-datagov.app.cloud.gov/harvest/gsa-json/job).

@FuhuXia (Member) commented Oct 10, 2022

Manually added index on production database:

CREATE INDEX harvest_error_harvest_object_id_idx
ON harvest_object_error(harvest_object_id);

CREATE INDEX
Time: 9149.779 ms (00:09.150)
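
A quick way to confirm the new index is in place (pg_indexes is a standard Postgres catalog view, nothing CKAN-specific):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'harvest_object_error';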

Killed the previous harvest source clear job, then re-ran it.

@FuhuXia (Member) commented Oct 10, 2022

Another index needs to be added to speed things up.

CREATE INDEX idx_view_resource_id
ON resource_view(resource_id);

Speed difference:

begin;
explain (analyze,buffers,timing)
delete from resource where package_id = [###];                                  
rollback;

Without index:
Trigger for constraint resource_view_resource_id_fkey: time=68.124 calls=6
With index:
Trigger for constraint resource_view_resource_id_fkey: time=0.420 calls=6

Issue and PR created on CKAN upstream.

@FuhuXia (Member) commented Oct 10, 2022

With the two indexes added, the harvest source clear command is fast now. A job that previously took 10 days can now finish in 10 minutes.

@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 11, 2022
@jbrown-xentity (Contributor, Author)

The initial harvest sources have been run through, and we believe this is complete (checking with the user before closing)...

@jbrown-xentity (Contributor, Author)

We have a few more items to clear out:

These items seem to be orphaned; we can examine what/how they are orphaned, and then they need to be removed.

I will run the de-dupe process for DOI to see if we can clean up some of the other records that the user noted...

@FuhuXia FuhuXia moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Oct 12, 2022
@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Oct 13, 2022
@nickumia-reisys (Contributor)

@jbrown-xentity Just checking that this is actually done?
