Two harvest sources both harvesting the same dataset causes errors #162
I just had a job stuck in "Running", so I wanted to clear everything and rerun it. By the way, the data got successfully imported, but for some reason the job didn't finish properly. This is probably connected to the error I get when I press "Clear" to redo it:

An error occurred: [(IntegrityError) update or delete on table "package" violates foreign key constraint "harvest_object_package_id_fkey" on table "harvest_object" DETAIL: Key (id)=(1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc) is still referenced from table "harvest_object".

It's a fresh CKAN install / server and I'm just doing some first testing, but this doesn't look too promising. Any ideas?

Comments
The initial crash was caused by the geoview plugin, which threw an exception: ckan/ckanext-geoview#26. But why can an exception in a random plugin lead to such havoc? Wouldn't it be quite simple to just catch all exceptions?

I'm now in a state where I have fixed the geoview plugin and can run "harvester import", but the database is still messed up, and neither purge_queues nor clearsource (which fails with the DB error quoted above) gets me out of this situation. Do I now actually have to go into the database and manually delete rows?
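For anyone stuck at this point: a minimal diagnostic sketch (not part of ckanext-harvest) that lists the harvest_object rows still referencing the package named in the IntegrityError, which is what the "harvest_object_package_id_fkey" constraint is complaining about. The connection URL is a placeholder for your CKAN database; table and column names follow ckanext-harvest's schema.

```python
# Diagnostic sketch: show which harvest_object rows still reference
# the package that the foreign key error is complaining about.
from sqlalchemy import create_engine, text

# Placeholder URL: point this at your CKAN database.
engine = create_engine("postgresql://ckan:pass@localhost/ckan_default")

PACKAGE_ID = "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc"  # id from the error above

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT id, guid, harvest_source_id, state "
            "FROM harvest_object WHERE package_id = :pid"
        ),
        {"pid": PACKAGE_ID},
    )
    for row in rows:
        print(row)
```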
Maik, it would be helpful if you could look at the delete code and the […]

The original problem looks to be because the RDF harvester doesn't always […]
Well, I can certainly try to debug it, but my first results worry me a bit:

Two different harvest_source_ids reference the same package with id "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc". How can this happen? Also, in one case the package_id is empty; I'm not sure whether that's supposed to be like that.
OK, I think I know how it got into this mess. The two harvest sources delivered datasets with partially identical IDs (I use the RDF harvester, so the IDs are URLs, and some of the URLs were the same). The harvester then simply matched an incoming dataset up with the existing dataset from the other source, violating constraints that are implicitly assumed but not enforced by any SQL key constraint. This case should be detected and prevented from happening in the first place.

In particular, I found in harvest.logic.action.get.harvest_source_for_a_dataset the comment "TODO: Deprecated, harvest source id is added as an extra to each dataset automatically". I checked the database and the source id is not added to package_extra, so this TODO is misleading; it really means "not implemented yet". If this relationship were explicit for each dataset (rather than somehow inferred from the harvest_object table), these problems could not appear: every source would only be concerned with its own datasets, and it wouldn't matter if some global IDs were identical.

I will now go ahead, delete everything manually, and fix the source (which I have control of). Not pleasant... and I expect it to happen again any time soon.
I agree this is a problem. It's not desirable for two harvest sources to fight over a dataset. I imagine the behaviour we most likely want is for the dataset to be 'owned' by the source that created it, so that another source can't update it - it would store a warning and skip it instead. I also think the code that looks for a dataset with a given guid is not in a central place - it is written into every harvester - so we should centralize that.
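A rough sketch of the ownership check described above, using ckanext-harvest's HarvestObject model. The helper name source_owns_package and the import-stage wiring are hypothetical, not existing API:

```python
# Sketch: before updating a matched dataset, check that its current
# harvest object belongs to the same source as the running job.
from ckanext.harvest.model import HarvestObject

def source_owns_package(session, package_id, source_id):
    """Return True if the package was last harvested by `source_id`
    (or has never been harvested), False if another source owns it."""
    existing = (
        session.query(HarvestObject)
        .filter(HarvestObject.package_id == package_id)
        .filter(HarvestObject.current == True)  # noqa: E712
        .first()
    )
    if existing is None:
        return True
    return existing.harvest_source_id == source_id

# Possible use inside a harvester's import_stage (sketch only):
#
#     if not source_owns_package(model.Session, pkg['id'],
#                                harvest_object.source.id):
#         self._save_object_error(
#             'Dataset %s is owned by another harvest source; skipping'
#             % pkg['id'], harvest_object, 'Import')
#         return False
```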
Well, it would depend on the harvester, but I can't think of examples where it does. I think the deprecation warning should be deleted. @amercader, what do you think? harvest_source_for_a_dataset seems a good function to be using here.
It is not misleading; the source id is added to each dataset at the logic layer level on […]

It would be good if you could show the […]

That should be easy to add as a further check on the import stage, but it needs to be implemented on each separate harvester.

See above regarding the […]
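If the harvest extras are injected at the logic layer rather than stored in the package_extra table, they are only visible through package_show, not in a direct database query, which would explain the earlier observation. A small sketch, assuming the extra key is 'harvest_source_id' as the TODO comment suggests:

```python
# Sketch: read the harvest source id through the API layer, where the
# logic-level code can inject it, instead of querying package_extra.
import ckan.plugins.toolkit as toolkit

def get_harvest_source_id(dataset_id):
    pkg = toolkit.get_action('package_show')({}, {'id': dataset_id})
    for extra in pkg.get('extras', []):
        if extra['key'] == 'harvest_source_id':
            return extra['value']
    return None  # dataset was not harvested, or extra not injected
```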
We now have a function in base.py, _find_existing_package(), which a harvester can use. It would be great if this function could ignore an existing dataset if it was harvested by a different (and active) source. (It currently also searches by package_id, which needs changing to guid so that it can be used by other harvesters.) @neothemachine did you want to try coding this up?
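A sketch of what such a guid-based lookup might look like, assuming ckanext-harvest's HarvestObject and HarvestSource models; find_existing_package_for_guid is a hypothetical name, not the actual _find_existing_package implementation:

```python
# Sketch: look the dataset up by harvest guid, and treat a match owned
# by a different, still-active source as "not found" so the caller can
# warn and skip instead of overwriting.
from ckanext.harvest.model import HarvestObject, HarvestSource

def find_existing_package_for_guid(session, guid, source_id):
    obj = (
        session.query(HarvestObject)
        .filter(HarvestObject.guid == guid)
        .filter(HarvestObject.current == True)  # noqa: E712
        .first()
    )
    if obj is None:
        return None  # no dataset with this guid yet
    if obj.harvest_source_id != source_id:
        source = session.query(HarvestSource).get(obj.harvest_source_id)
        if source is not None and source.active:
            # guid clash with another active source
            return None
    return obj.package_id
```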
@davidread I won't have time until after New Year. I had a look at _find_existing_package but I don't really understand what it's doing. Is that basically an internal API request? I find "package_show" a bit confusing in that context; "show" to me suggests the UI layer. I don't think I currently understand enough of CKAN's internals to implement this properly.
I have run into the same issue recently. Any update? @letmaik, any pointers on how to fix this manually? My […]
@valeviolin harvest_source_id is just 'id' on the harvest_source object. |
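For manual recovery, one option is to detach the harvest_object rows that belong to the source which should not own the dataset, so that clearing the source no longer trips over the foreign key. This is only a sketch under the assumption that detaching those rows is safe in your setup (the thread above notes that package_id can legitimately be empty); back up the database first. The connection URL and WRONG_SOURCE_ID are placeholders; table and column names follow ckanext-harvest's schema.

```python
# Manual-recovery sketch: unlink the contested package from the
# harvest objects of the source that should NOT own it.
from sqlalchemy import create_engine, text

# Placeholder URL: point this at your CKAN database.
engine = create_engine("postgresql://ckan:pass@localhost/ckan_default")

PACKAGE_ID = "1e85a8ef-efcc-4fa1-a40a-3cc1bec5c8bc"
WRONG_SOURCE_ID = "<id of the source that should not own the dataset>"

with engine.begin() as conn:  # runs in a transaction
    conn.execute(
        text(
            "UPDATE harvest_object "
            "SET package_id = NULL, current = false "
            "WHERE package_id = :pid AND harvest_source_id = :sid"
        ),
        {"pid": PACKAGE_ID, "sid": WRONG_SOURCE_ID},
    )
```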
How do I fix it and avoid it in the future? I'd rather have two copies of a dataset, one from each harvest source, than have to intervene manually.