USGS data clean up #4000

Closed · jbrown-xentity opened this issue Oct 6, 2022 · 12 comments
Labels: bug (Software defect or bug)
@jbrown-xentity (Contributor)

USGS got their list of harvest sources down to 4 (see here). Unfortunately, the datasets are still around: either deleting the harvest sources did not trigger a clear, or the clear failed for some reason. You can see the list of normal datasets by harvest source, and collection records by harvest source. We need to clear these manually.

How to reproduce

  1. Delete a harvest source without clearing it

Expected behavior

Deleting the source would kick off a clear of its datasets

Actual behavior

Datasets are left orphaned

Sketch

We were already able to clear one harvest source. This involves running ckan harvester source clear name-of-source for each harvest source that still has data. We expect some issues on the larger (100K-400K dataset) sources, and may have to run the DB queries piecemeal in order to completely clear the data.
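
To find which sources still have data, a query along these lines should work (a sketch against the ckanext-harvest tables; the current flag marks the harvest object attached to the live dataset, so double-check the column names against the installed schema):

-- List harvest sources that still have current harvest objects.
SELECT hs.title, count(*) AS remaining_objects
FROM harvest_source hs
JOIN harvest_object ho ON ho.harvest_source_id = hs.id
WHERE ho.current
GROUP BY hs.title
ORDER BY remaining_objects DESC;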

@jbrown-xentity (Contributor, Author)

This is the list of harvest sources we need to clear one by one:
usgs-clear-list.txt

@jbrown-xentity jbrown-xentity moved this to In Progress [8] in data.gov team board Oct 6, 2022
@FuhuXia (Member) commented Oct 7, 2022

The source clear task runs OK for harvest sources with fewer than 1k datasets. When there are more than 1k datasets, the DB query starts to take a long time, and the time it spends increases exponentially. The current query has spent 6 hours on harvest source USGS National Hydrography Dataset (NHD) Downloadable Data Collection Harvest Source: 10313 and it's still ongoing.

@FuhuXia (Member) commented Oct 7, 2022

Query used:

Given a source title, find its name and state, and make sure the state is deleted:

 select name, state from package where title='[source title]' and type='harvest';

CKAN commands to clear the source (these should be run as cf tasks):

ckan harvester source show [name] # to double check the state is inactive
ckan harvester source clear [name] # to clear the source

Query to check ongoing query execution status:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
  AND state = 'active';
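
If a clear query has to be aborted (as happens later in this thread), the pid reported by the query above can be cancelled with the standard Postgres admin functions; a sketch, where the interval and the LIKE filter are only illustrations:

-- pg_cancel_backend() asks the backend to cancel its current query;
-- pg_terminate_backend() is the harsher variant that kills the connection.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND (now() - query_start) > interval '1 hour'
  AND query LIKE 'DELETE FROM harvest_object%';  -- illustrative filter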

@jbrown-xentity (Contributor, Author)

Note that we have done manual DB manipulation in the past; some of those notes are here and here. Please note that the first has a link to the query we ran for cleaning up harvest tables; it may be out of date, though, compared with the current ckanext-harvest cleanup.
This is only if we need to run the queries manually...
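
If manual cleanup does become necessary, batched deletes keep each transaction small. A minimal sketch, assuming the ckanext-harvest tables; the '<source-id>' placeholder and the 10000 batch size are illustrative, and each statement is repeated until it affects zero rows:

-- Child rows first: errors pointing at the source's harvest objects.
DELETE FROM harvest_object_error
WHERE id IN (
    SELECT hoe.id
    FROM harvest_object_error hoe
    JOIN harvest_object ho ON ho.id = hoe.harvest_object_id
    WHERE ho.harvest_source_id = '<source-id>'
    LIMIT 10000
);

-- Then the harvest objects themselves.
DELETE FROM harvest_object
WHERE id IN (
    SELECT id FROM harvest_object
    WHERE harvest_source_id = '<source-id>'
    LIMIT 10000
);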

@FuhuXia (Member) commented Oct 10, 2022

Found a missing index. Adding the index makes a long-running query roughly 230 times faster, tested in the staging environment with 16 million records in table harvest_object and 11 million records in table harvest_object_error on a medium-gp-psql DB.

Deleting 10,000 records from table harvest_object:
Without index: Time: 1888580.541 ms (31:28.581)
With index: Time: 8164.541 ms (00:08.165)

The index to be added: column harvest_object_id in table harvest_object_error.
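
For reference, a timing test along these lines reproduces numbers like the above (a sketch; \timing is a psql meta-command, and '<source-id>' is a placeholder):

\timing on
BEGIN;
DELETE FROM harvest_object
WHERE id IN (
    SELECT id FROM harvest_object
    WHERE harvest_source_id = '<source-id>'
    LIMIT 10000
);
ROLLBACK;  -- measure only; leave the data in place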

Upstream PR created: ckan/ckanext-harvest#514.

@FuhuXia (Member) commented Oct 10, 2022

The harvest job page also benefits from the added index; it is noticeably faster (from 44 seconds to 12 seconds on page https://catalog-stage-admin-datagov.app.cloud.gov/harvest/gsa-json/job).

@FuhuXia (Member) commented Oct 10, 2022

Manually added index on production database:

CREATE INDEX harvest_error_harvest_object_id_idx
ON harvest_object_error(harvest_object_id);

CREATE INDEX
Time: 9149.779 ms (00:09.150)
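
A quick way to confirm the new index is in place (pg_indexes is a standard Postgres catalog view, nothing CKAN-specific):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'harvest_object_error';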

Killed the previous harvest source clear job, then re-ran it.

@FuhuXia (Member) commented Oct 10, 2022

Another index needs to be added to speed things up.

CREATE INDEX idx_view_resource_id
ON resource_view(resource_id);

Speed difference:

begin;
explain (analyze,buffers,timing)
delete from resource where package_id = [###];                                  
rollback;

Without index:
Trigger for constraint resource_view_resource_id_fkey: time=68.124 calls=6
With index:
Trigger for constraint resource_view_resource_id_fkey: time=0.420 calls=6

Issue and PR created on CKAN upstream.

@FuhuXia (Member) commented Oct 10, 2022

With the two indexes added, the harvest source clear command is fast now. A job that previously took 10 days can now finish in 10 minutes.

@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 11, 2022
@jbrown-xentity (Contributor, Author)

The initial harvest sources have been run through, and we believe this is complete (checking with the user before closing)...

@jbrown-xentity (Contributor, Author)

We have a few more items to clear out:

These items seem to be orphaned; we can examine what/how they are orphaned, and then they need to be removed.

I will run the de-dupe process for DOI to see if we can clean up some of the other records that the user noted...

@FuhuXia FuhuXia moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Oct 12, 2022
@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Oct 13, 2022
@nickumia-reisys (Contributor)

@jbrown-xentity Just checking that this is actually done?
