Two harvest sources both harvesting the same dataset causes errors #162

Open
letmaik opened this issue Oct 16, 2015 · 11 comments

@letmaik

letmaik commented Oct 16, 2015

I just had a job stuck in the "running" state, so I wanted to clear everything and rerun it. The data actually got imported successfully, but for some reason the job didn't finish properly. This is probably connected to the error I get when I press "Clear" to redo it:

An error occurred: [(IntegrityError) update or delete on table "package" violates foreign key constraint "harvest_object_package_id_fkey" on table "harvest_object" DETAIL: Key (id)=(1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc) is still referenced from table "harvest_object".

begin; 
update package set state = 'to_delete' where id in (select package_id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from resource_view where resource_id in (select id from resource where package_id in (select id from package where state = 'to_delete' ));
delete from resource_revision where package_id in (select id from package where state = 'to_delete' );
delete from resource where package_id in (select id from package where state = 'to_delete' );    
delete from harvest_object_error where harvest_object_id in (select id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_object_extra where harvest_object_id in (select id from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_object where harvest_source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e';
delete from harvest_gather_error where harvest_job_id in (select id from harvest_job where source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e');
delete from harvest_job where source_id = 'fd15d7cd-b576-4577-acbb-7246eee9b41e';
delete from package_role where package_id in (select id from package where state = 'to_delete' );
delete from user_object_role where id not in (select user_object_role_id from package_role) and context = 'Package';
delete from package_tag_revision where package_id in (select id from package where state = 'to_delete');
delete from member_revision where table_id in (select id from package where state = 'to_delete');
delete from package_extra_revision where package_id in (select id from package where state = 'to_delete');
delete from package_revision where id in (select id from package where state = 'to_delete');
delete from package_tag where package_id in (select id from package where state = 'to_delete');
delete from package_extra where package_id in (select id from package where state = 'to_delete');
delete from package_relationship_revision where subject_package_id in (select id from package where state = 'to_delete');
delete from package_relationship_revision where object_package_id in (select id from package where state = 'to_delete');
delete from package_relationship where subject_package_id in (select id from package where state = 'to_delete');
delete from package_relationship where object_package_id in (select id from package where state = 'to_delete');
delete from member where table_id in (select id from package where state = 'to_delete');
delete from related_dataset where dataset_id in (select id from package where state = 'to_delete');
delete from related where id in ('');
delete from package where id in (select id from package where state = 'to_delete');
commit;

It's a fresh CKAN install/server and I'm just doing some initial testing, but this doesn't look too promising. Any ideas?
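
For reference, assuming direct psql access to the CKAN database, the harvest_object rows that still reference the package from the error can be listed like this:

-- which harvest objects still point at the package from the FK error?
select id, harvest_source_id from harvest_object
where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc';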

@letmaik
Author

letmaik commented Oct 17, 2015

The initial crash was caused by the geoview plugin which threw an exception: ckan/ckanext-geoview#26

However, why can exceptions from random plugins lead to such havoc? Isn't it quite simple to just catch all exceptions? I'm now in a state where I have fixed the geoview plugin and can run "harvester import", but the database is still messed up, and neither purge_queues nor clearsource (which fails with the quoted DB error) can get me out of this situation. Do I now actually have to go into the database and delete rows manually?

@davidread
Contributor

Maik, it would be helpful if you could look at the delete code and the database, and try to work out why the package with id 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc isn't being deleted. I don't think the clear source code is at all complicated. If everyone chips in on improving the occasional thing like this, it is better for everyone.

The original problem looks to be caused by the RDF harvester not always creating a format key in the resource dict, so that should be a simple fix, if you have a chance.


@letmaik
Author

letmaik commented Oct 17, 2015

Well, I certainly can try to debug it, but my first results worry me a bit:

select harvest_source_id, package_id from harvest_object;
          harvest_source_id           |              package_id
--------------------------------------+--------------------------------------
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
 041fbcb6-fb0d-4488-bbaa-b5baa674fe8e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e |
 fd15d7cd-b576-4577-acbb-7246eee9b41e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
 fd15d7cd-b576-4577-acbb-7246eee9b41e | 1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc
 fd15d7cd-b576-4577-acbb-7246eee9b41e | b37d3c77-354a-40ab-aae4-1b43e20b12de
(7 rows)

Two different harvest_source_ids reference the same package with id "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc". How can this happen? Also, in one case the package_id is empty; I'm not sure if that's supposed to be like that.

@letmaik
Author

letmaik commented Oct 17, 2015

Ok, I think I know how it got into this mess. The two harvest sources delivered datasets with partially identical IDs (I use the RDF harvester, which means the IDs are URLs, and some of the URLs were the same). The harvester then, I guess, just matched it up to the existing dataset from the other source, and violated constraints that are implicitly assumed but not enforced by any SQL key constraint. This case should be caught so that it cannot happen in the first place.

In particular, I found in harvest.logic.action.get.harvest_source_for_a_dataset a "TODO: Deprecated, harvest source id is added as an extra to each dataset automatically". I checked the database and the source id is not added to the package_extra, so this TODO is misleading; it actually means "not implemented yet". If this relationship were explicit for each dataset (and not somehow inferred from the harvest_object table), then these problems could not appear: every source would only be concerned with its own datasets, and even identical global IDs wouldn't matter.

I will now go ahead and delete everything manually and fix the source (which I have control of). Not pleasant... and I expect it to happen again soon.
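
Roughly, the manual cleanup would look something like this (an untested sketch based on the clearsource SQL above; back up the database first). The idea is to remove the harvest_object rows from the other source that still reference the package, so the foreign key constraint no longer blocks the delete:

-- child rows first, then the cross-referencing harvest_object rows
delete from harvest_object_error where harvest_object_id in (
    select id from harvest_object
    where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
      and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e');
delete from harvest_object_extra where harvest_object_id in (
    select id from harvest_object
    where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
      and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e');
delete from harvest_object
where package_id = '1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc'
  and harvest_source_id = '041fbcb6-fb0d-4488-bbaa-b5baa674fe8e';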

@davidread
Contributor

the harvester just matched it up to the existing dataset from the other source

I agree this is a problem. It's not desirable for two harvest sources to fight over a dataset. I imagine that the most likely thing we want to happen is for the dataset to be 'owned' by the source that created it, and another source can't update it - it would store a warning and skip it.

I think that the code that looks for a dataset with a given guid is not in a central place - it is written into every harvester, so we should centralize that.

the source id is not added to the package_extra

Well, it would depend on the harvester, but I can't think of examples where it does. I think the deprecation warning should be deleted. @amercader what do you think? harvest_source_for_a_dataset seems a good function to be using here.

davidread changed the title from "Hanging job / database error" to "Two harvest sources both harvesting the same dataset causes errors" on Oct 19, 2015
@amercader
Member

In particular, I found in harvest.logic.action.get.harvest_source_for_a_dataset a "TODO: Deprecated, harvest source id is added as an extra to each dataset automatically". I checked the database and the source id is not added to the package_extra, so this TODO is misleading; it actually means "not implemented yet".

It's not misleading: the source id is added to each dataset at the logic layer, on after_show, i.e. it is not actually stored in the database.

Two different harvest_source_ids reference the same package with id "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc". How can this happen? Also, in one case the package_id is empty; I'm not sure if that's supposed to be like that.

It would be good if you could show the current column as well. Old objects might belong to other sources, but only the one marked current=True is actually used. An empty package_id just means that this particular object didn't end up in an update or create, e.g. because of a validation error, or, on some harvesters, because the remote document was not updated.
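
Something like this would show it (assuming direct DB access):

-- include the current flag to see which object is actually in use
select harvest_source_id, package_id, current from harvest_object;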

the harvester just matched it up to the existing dataset from the other source

I agree this is a problem. It's not desirable for two harvest sources to fight over a dataset. I imagine that the most likely thing we want to happen is for the dataset to be 'owned' by the source that created it, and another source can't update it - it would store a warning and skip it.

That should be easy to add as a further check at the import stage, but it would need to be implemented in each separate harvester.

the source id is not added to the package_extra

Well, it would depend on the harvester, but I can't think of examples where it does. I think the deprecation warning should be deleted. @amercader what do you think? harvest_source_for_a_dataset seems a good function to be using here.

See above regarding the harvest_source_id extra, but if the function is needed then yes, we can remove the deprecation notice. The function would need to be updated to take the current field into account though.

@davidread
Contributor

We now have a function in base.py, _find_existing_package(), which a harvester can use. It would be great if this function could ignore an existing dataset if it was harvested by a different (and active) source. (It is currently also searching by package_id, which needs changing to guid so that it can be used by other harvesters.) @neothemachine did you want to try coding this up?
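
In SQL terms, the extra check would boil down to something like this (a sketch; it assumes the guid and current columns on harvest_object and the active flag on harvest_source, and the placeholder values are hypothetical):

-- is this guid already owned by a different, still-active source?
select ho.package_id
from harvest_object ho
join harvest_source hs on hs.id = ho.harvest_source_id
where ho.guid = '<the incoming guid>'             -- hypothetical placeholder
  and ho.current = true
  and hs.active = true
  and ho.harvest_source_id != '<this source id>'; -- hypothetical placeholder

If this returns a row, the harvester would log a warning and skip the dataset instead of updating it.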

@letmaik
Author

letmaik commented Nov 28, 2015

@davidread I won't have time until after New Year. I had a look at _find_existing_package but I don't really understand what it's doing. Is that basically an internal API request? I find "package_show" a bit confusing in that context; "show" to me means UI layer. I think I currently don't understand enough of CKAN's internals to implement this properly.

@valentin0h

I have run into the same issue recently. Any update?

@letmaik any pointers on how to fix this manually? My harvest_source_id doesn't actually seem to exist in harvest_source at all.

@davidread
Contributor

@valeviolin harvest_source_id is just 'id' on the harvest_source object.
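
To list the sources and their ids (assuming direct DB access):

-- every harvest_source_id in harvest_object should match one of these ids
select id, url, title, active from harvest_source;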

metaodi added the bug label on Nov 17, 2017
@gallexme

gallexme commented Aug 17, 2020

sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) update or delete on table "package" violates foreign key constraint "harvest_object_package_id_fkey" on table "harvest_object"
DETAIL:  Key (id)=(043d0c7d-9016-49ec-8038-ab1288f721f1) is still referenced from table "harvest_object".
 [SQL: "begin;
         update package set state = 'to_delete' where id in (
                 select package_id from harvest_object
             where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
         delete from resource_view where resource_id in (
                 select id from resource where package_id in (
                     select id from package where state = 'to_delete'));
         delete from resource_revision where package_id in (
                 select id from package where state = 'to_delete');
         delete from resource where package_id in (
                 select id from package where state = 'to_delete');
         
     delete from harvest_object_error where harvest_object_id in (
             select id from harvest_object
         where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_object_extra where harvest_object_id in (
             select id from harvest_object
         where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_object where harvest_source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880';
     delete from harvest_gather_error where harvest_job_id in (
             select id from harvest_job where source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880');
     delete from harvest_job where source_id = 'e7beb4c6-d009-4c7e-bec6-a2835e899880';
     delete from package_tag_revision where package_id in (
             select id from package where state = 'to_delete');
     delete from member_revision where table_id in (
             select id from package where state = 'to_delete');
     delete from package_extra_revision where package_id in (
             select id from package where state = 'to_delete');
     delete from package_revision where id in (
             select id from package where state = 'to_delete');
     delete from package_tag where package_id in (
             select id from package where state = 'to_delete');
     delete from package_extra where package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship_revision where subject_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship_revision where object_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship where subject_package_id in (
             select id from package where state = 'to_delete');
     delete from package_relationship where object_package_id in (
             select id from package where state = 'to_delete');
     delete from member where table_id in (
             select id from package where state = 'to_delete');
      
         delete from package where id in (
                 select id from package where state = 'to_delete');
         
     commit;

How do I fix it and avoid it in the future? There can always be several open data portals that call their dataset just "Population" instead of something globally unique.

I would rather have copies of a dataset from two different harvest sources than have to intervene manually.
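
For diagnosis, this (assuming direct DB access) shows which sources still hold objects for the package from the error above:

-- which sources still reference the package blocking the delete?
select harvest_source_id, count(*)
from harvest_object
where package_id = '043d0c7d-9016-49ec-8038-ab1288f721f1'
group by harvest_source_id;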
