Two harvest sources both harvesting the same dataset causes errors #162

Open
letmaik opened this issue Oct 16, 2015 · 11 comments

@letmaik

letmaik commented Oct 16, 2015

I just had a job stuck in the "running" state, so I wanted to clear everything and rerun it. The data actually got imported successfully, but for some reason the job didn't finish properly. This is probably connected to the error I get when I press "Clear" to redo it:

An error occurred: [(IntegrityError) update or delete on table "package" violates foreign key constraint "harvest_object_package_id_fkey" on table "harvest_object" DETAIL: Key (id)=(1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc) is still referenced from table "harvest_object".

begin; 
update package set state = 'to_delete' where id in (select package_id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from resource_view where resource_id in (select id from resource where package_id in (select id from package where state = 'to_delete' ));
delete from resource_revision where package_id in (select id from package where state = 'to_delete' );
delete from resource where package_id in (select id from package where state = 'to_delete' );    
delete from harvest_object_error where harvest_object_id in (select id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_object_extra where harvest_object_id in (select id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e';
delete from harvest_gather_error where harvest_job_id in (select id from harvest_job where source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_job where source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e';
delete from package_role where package_id in (select id from package where state = 'to_delete' );
delete from user_object_role where id not in (select user_object_role_id from package_role) and context = 'Package';
delete from package_tag_revision where package_id in (select id from package where state = 'to_delete');
delete from member_revision where table_id in (select id from package where state = 'to_delete');
delete from package_extra_revision where package_id in (select id from package where state = 'to_delete');
delete from package_revision where id in (select id from package where state = 'to_delete');
delete from package_tag where package_id in (select id from package where state = 'to_delete');
delete from package_extra where package_id in (select id from package where state = 'to_delete');
delete from package_relationship_revision where subject_package_id in (select id from package where state = 'to_delete');
delete from package_relationship_revision where object_package_id in (select id from package where state = 'to_delete');
delete from package_relationship where subject_package_id in (select id from package where state = 'to_delete');
delete from package_relationship where object_package_id in (select id from package where state = 'to_delete');
delete from member where table_id in (select id from package where state = 'to_delete');
delete from related_dataset where dataset_id in (select id from package where state = 'to_delete');
delete from related where id in ('');
delete from package where id in (select id from package where state = 'to_delete');
commit;

It's a fresh CKAN install/server and I'm just doing some initial testing, but this doesn't look too promising. Any ideas?
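
For reference, assuming direct psql access to the CKAN database, the harvest_object rows that still reference the package from the error can be listed like this:

-- which harvest objects still point at the package from the FK error?
select id, harvest_source_id from harvest_object
where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc';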

@letmaik
Author

letmaik commented Oct 17, 2015

The initial crash was caused by the geoview plugin which threw an exception: ckan/ckanext-geoview#26

However, why can exceptions from random plugins lead to such havoc? Isn't it quite simple to just catch all exceptions? I'm now in a state where I have fixed the geoview plugin and can run "harvester import", but the database is still messed up, and neither purge_queues nor clearsource (which fails with the quoted DB error) can get me out of this situation. Do I now actually have to go into the database and delete rows manually?

@davidread
Contributor

Maik, it would be helpful if you could look at the delete code and the database, and try to work out why the package with id 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc isn't being deleted. I don't think the clear source code is at all complicated. If everyone chips in on improving the occasional thing like this, it is better for everyone.

The original problem looks to be caused by the RDF harvester not always creating a format key in the resource dict, so that should be a simple fix, if you have a chance.


@letmaik
Author

letmaik commented Oct 17, 2015

Well, I certainly can try to debug it, but my first results worry me a bit:

select harvest_source_id, package_id from harvest_object;
          harvest_source_id           |              package_id
--------------------------------------+--------------------------------------
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
 041fbcb6-fb0d-4488-bbaa-b5baa674fe8e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e |
 fd15d7cd-b576-4577-acbb-7246eee9b41e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
 fd15d7cd-b576-4577-acbb-7246eee9b41e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
(7 rows)

Two different harvest_source_ids reference the same package with id "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc". How can this happen? Also, in one case the package_id is empty; I'm not sure if that's supposed to be like that.

@letmaik
Author

letmaik commented Oct 17, 2015

Ok, I think I know how it got into this mess. The two harvest sources delivered datasets with partially identical IDs (I use the RDF harvester, which means the IDs are URLs, and some of the URLs were the same). The harvester then, I guess, just matched it up to the existing dataset from the other source, and violated constraints that are implicitly assumed but not enforced by any SQL key constraint. This case should be caught so that it cannot happen in the first place.

In particular, I found in harvest.logic.action.get.harvest_source_for_a_dataset a "TODO: Deprecated, harvest source id is added as an extra to each dataset automatically". I checked the database and the source id is not added to the package_extra, so this TODO is misleading; it actually means "not implemented yet". If this relationship were explicit for each dataset (and not somehow inferred from the harvest_object table), then these problems could not appear: every source would only be concerned with its own datasets, and even identical global IDs wouldn't matter.

I will now go ahead and delete everything manually and fix the source (which I have control of). Not pleasant... and I expect it to happen again soon.
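
Roughly, the manual cleanup would look something like this (an untested sketch based on the clearsource SQL above; back up the database first). The idea is to remove the harvest_object rows from the other source that still reference the package, so the foreign key constraint no longer blocks the delete:

-- child rows first, then the cross-referencing harvest_object rows
delete from harvest_object_error where harvest_object_id in (
    select id from harvest_object
    where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
      and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e');
delete from harvest_object_extra where harvest_object_id in (
    select id from harvest_object
    where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
      and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e');
delete from harvest_object
where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
  and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e';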

@davidread
Contributor

the harvester just matched it up to the existing dataset from the other source

I agree this is a problem. It's not desirable for two harvest sources to fight over a dataset. I imagine that the most likely thing we want to happen is for the dataset to be 'owned' by the source that created it, and another source can't update it - it would store a warning and skip it.

I think that the code that looks for a dataset with a given guid is not in a central place - it is written into every harvester, so we should centralize that.

the source id is not added to the package_extra

Well, it would depend on the harvester, but I can't think of examples where it does. I think the deprecation warning should be deleted. @amercader what do you think? harvest_source_for_a_dataset seems a good function to be using here.

davidread changed the title from "Hanging job / database error" to "Two harvest sources both harvesting the same dataset causes errors" on Oct 19, 2015
@amercader
Member

In particular, I found in harvest.logic.action.get.harvest_source_for_a_dataset a "TODO: Deprecated, harvest source id is added as an extra to each dataset automatically". I checked the database and the source id is not added to the package_extra, so this TODO is misleading; it actually means "not implemented yet".

It's not misleading: the source id is added to each dataset at the logic layer, on after_show, i.e. it is not actually stored in the database.

Two different harvest_source_ids reference the same package with id "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc". How can this happen? Also, in one case the package_id is empty; I'm not sure if that's supposed to be like that.

It would be good if you could show the current column as well. Old objects might belong to other sources, but only the one marked current=True is actually used. An empty package_id just means that this particular object didn't end up in an update or create, e.g. because of a validation error, or, on some harvesters, because the remote document was not updated.
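
Something like this would show it (assuming direct DB access):

-- include the current flag to see which object is actually in use
select harvest_source_id, package_id, current from harvest_object;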

the harvester just matched it up to the existing dataset from the other source

I agree this is a problem. It's not desirable for two harvest sources to fight over a dataset. I imagine that the most likely thing we want to happen is for the dataset to be 'owned' by the source that created it, and another source can't update it - it would store a warning and skip it.

That should be easy to add as a further check at the import stage, but it would need to be implemented in each separate harvester.

the source id is not added to the package_extra

Well, it would depend on the harvester, but I can't think of examples where it does. I think the deprecation warning should be deleted. @amercader what do you think? harvest_source_for_a_dataset seems a good function to be using here.

See above regarding the harvest_source_id extra, but if the function is needed then yes, we can remove the deprecation notice. The function would need to be updated to take the current field into account though.

@davidread
Contributor

We now have a function in base.py, _find_existing_package(), which a harvester can use. It would be great if this function could ignore an existing dataset if it was harvested by a different (and active) source. (It is currently also searching by package_id, which needs changing to guid so that it can be used by other harvesters.) @neothemachine did you want to try coding this up?
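
In SQL terms, the extra check would boil down to something like this (a sketch; it assumes the guid and current columns on harvest_object and the active flag on harvest_source, and the placeholder values are hypothetical):

-- is this guid already owned by a different, still-active source?
select ho.package_id
from harvest_object ho
join harvest_source hs on hs.id = ho.harvest_source_id
where ho.guid = '<the incoming guid>'             -- hypothetical placeholder
  and ho.current = true
  and hs.active = true
  and ho.harvest_source_id != '<this source id>'; -- hypothetical placeholder

If this returns a row, the harvester would log a warning and skip the dataset instead of updating it.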

@letmaik
Author

letmaik commented Nov 28, 2015

@davidread I won't have time until after New Year. I had a look at _find_existing_package but I don't really understand what it's doing. Is that basically an internal API request? I find "package_show" a bit confusing in that context; "show" to me means UI layer. I think I currently don't understand enough of CKAN's internals to implement this properly.

@valentin0h

I have run into the same issue recently. Any update?

@letmaik any pointers on how to fix this manually? My harvest_source_id doesn't actually seem to exist in harvest_source at all.

@davidread
Contributor

@valeviolin harvest_source_id is just 'id' on the harvest_source object.
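
To list the sources and their ids (assuming direct DB access):

-- every harvest_source_id in harvest_object should match one of these ids
select id, url, title, active from harvest_source;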

metaodi added the bug label on Nov 17, 2017
@gallexme

gallexme commented Aug 17, 2020

sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) update or delete on table "package" violates foreign key constraint "harvest_object_package_id_fkey" on table "harvest_object"
DETAIL:  Key (id)=(043d0c7d-9016-49ec-8038-ab1288f721f1) is still referenced from table "harvest_object".
 [SQL: "begin;
         update package set state = 'to_delete' where id in (
                 select package_id from harvest_object
             where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
         delete from resource_view where resource_id in (
                 select id from resource where package_id in (
                     select id from package where state = 'to_delete'));
         delete from resource_revision where package_id in (
                 select id from package where state = 'to_delete');
         delete from resource where package_id in (
                 select id from package where state = 'to_delete');
         
     delete from harvest_object_error where harvest_object_id in (
             select id from harvest_object
         where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_object_extra where harvest_object_id in (
             select id from harvest_object
         where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_object where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880';
     delete from harvest_gather_error where harvest_job_id in (
             select id from harvest_job where source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_job where source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880';
     delete from package_tag_revision where package_id in (
             select id from package where state = 'to_delete');
     delete from member_revision where table_id in (
             select id from package where state = 'to_delete');
     delete from package_extra_revision where package_id in (
             select id from package where state = 'to_delete');
     delete from package_revision where id in (
             select id from package where state = 'to_delete');
     delete from package_tag where package_id in (
             select id from package where state = 'to_delete');
     delete from package_extra where package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship_revision where subject_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship_revision where object_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship where subject_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship where object_package_id in (
             select id from package where state = 'to_delete');
     delete from member where table_id in (
             select id from package where state = 'to_delete');
      
         delete from package where id in (
                 select id from package where state = 'to_delete');
         
     commit;

How do I fix it and avoid it in the future? There can always be several open data portals that call their dataset just "Population" instead of something globally unique.

I would rather have copies of a dataset from two different harvest sources than have to intervene manually.
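
For diagnosis, this (assuming direct DB access) shows which sources still hold objects for the package from the error above:

-- which sources still reference the package blocking the delete?
select harvest_source_id, count(*)
from harvest_object
where package_id = '043d0c7d-9016-49ec-8038-ab1288f721f1'
group by harvest_source_id;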
