At time of scrape, store a copy of all extensions used #122

Closed
odscjames opened this issue Mar 1, 2019 · 11 comments
Labels
feature Relating to loading data from the web API or CLI command

Comments

@odscjames

Scenario:

  • We scrape some data that uses extensions
  • 3 months pass .......
  • The extension changes, and it's an unversioned extension, so it just changes
  • 3 months pass .......
  • We load the data again because we want to evaluate it, and check it

PROBLEM! The data now fails validation because we are checking old data against a new extension schema, and a bunch of fields are missing, have the wrong type, etc. This is unfair! We really need to be checking the data against the extension as it was at the time of the original scrape!

SOLUTION: When we get data, also get copies of the extensions, schemas, codelists, etc., and save them alongside the files! When we recheck later, use these copies.
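
To make the proposal concrete, here is a minimal sketch of the snapshot step, not an existing Kingfisher feature. It assumes the scraped file is an OCDS package whose `extensions` array lists extension URLs. The function name `snapshot_extensions`, the `fetch` callable and the `manifest.json` layout are all hypothetical. It only stores the documents at the declared URLs; the schemas and codelists those extensions point to would need the same treatment.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


def snapshot_extensions(
    package: dict, dest: Path, fetch: Callable[[str], bytes]
) -> Dict[str, str]:
    """Store a copy of each extension the package declares.

    `fetch` is a hypothetical callable that returns the bytes at a URL
    (in practice an HTTP GET). Returns a mapping of URL to SHA-256 digest.
    """
    dest.mkdir(parents=True, exist_ok=True)
    manifest: Dict[str, str] = {}
    for url in package.get("extensions") or []:
        content = fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        # Name each copy by its digest, so identical content is stored once.
        (dest / f"{digest}.json").write_bytes(content)
        manifest[url] = digest
    # Record which URL mapped to which stored copy at scrape time.
    (dest / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

A later recheck could then read `manifest.json` and use the stored copies instead of fetching the live URLs.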

ARE EXTENSIONS VERSIONED OR NOT? Obviously, if the extension is versioned properly and we trust that versioning to be done well then this won't be a problem at all. So the question is - how many extensions are unversioned? How bad a problem is this?

Realised when reading open-contracting/lib-cove-ocds#9 (comment)

@jpmckinney
Member

Most extensions are unversioned, and that's not likely to change to the point that most are versioned, i.e. we will have to handle the case of unversioned extensions for the foreseeable future.

Thinking through different scenarios: If a publisher has data, then changes its extensions in a backwards-incompatible way, but doesn't update its old data, then that data should fail. It doesn't matter that, at one time, its data and schema matched such that it would pass. We require that any presently accessible data match any presently referenced schema.

So, in the above scenario, if the publisher did go back and change their old data to match the updated extension, we should re-download that old data before re-checking it.

If they didn't go back, and now their old data has errors according to the updated extension, then that is a true error and isn't unfair.

Publishers shouldn't be making backwards-incompatible changes to their extensions, and if they are, then they should at least version them (or publish them at different URLs).

@robredpath

We're planning on keeping a record of pretty much everything we ever do on Kingfisher, right? So, we should still be able to say with confidence that a certain publisher's data passed validation on a certain date, even if we can't now reproduce that because the extensions that it uses have changed?

@jpmckinney
Member

jpmckinney commented Mar 1, 2019

Yes to the second question – I don't think we have a use case for re-checking year-old data against year-old extensions, but we do have a use case to say "publisher X passed validation at time Y" – though, regarding the first question, that doesn't necessarily need database support, as we'll have logged that fact in feedback reports, MEL measurements, etc.

@odscjames
Author

> Thinking through different scenarios:

There are 2 different scenarios here tho.

  • We get all data and check it, 6 months pass, we get the latest copy of all data again and check it again - that's fine. In this case the latest version of the data should match the current version of the extensions, and that will be what is checked.
  • We get all data and check it, 6 months pass, we want to know something about that data so we load the old version locally and check it again. This is where the problem arises.

Maybe the second scenario is very unlikely, but one of our analysts is doing just that right now (because the publisher has stopped publishing).

> We're planning on keeping a record of pretty much everything we ever do on Kingfisher, right?

Are we planning on doing that outside Kingfisher? At the moment if you delete a collection from Kingfisher you delete the check results too.

@jpmckinney
Member

For the second scenario, it makes sense to store the schemas, etc. somewhere. So, we might as well store them for all scenarios.

> Are we planning on doing that outside Kingfisher?

Leave it up to the user. Feedback reports and MEL reports will mention results; granular results can then be discarded. When Kingfisher is used for another purpose, I assume the relevant results will be captured at least as prose somewhere… If anyone uses Kingfisher and never reports any results anywhere else, then I assume that person won't be deleting their collections…

@jpmckinney
Member

jpmckinney commented Jul 22, 2020

With the next version of the Extension Registry Python Package, if Kingfisher Process downloads all unique extensions referenced by packages (e.g. after closing the collection), then ProfileBuilder can use those downloaded extensions to generate an ad-hoc 'profile'. That profile can be made available to other steps (e.g. the check step, if/when lib-cove-ocds allows passing in a schema), so that the extensions don't need to be retrieved at the time the check is performed.
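
As a rough illustration of the "all unique extensions referenced by packages" step, the sketch below collects each extension URL once, in first-seen order, from a collection's packages. It assumes each package is a parsed OCDS package with an optional `extensions` array. `unique_extension_urls` is a hypothetical helper, not part of Kingfisher Process or the Extension Registry Python Package, and the hand-off to ProfileBuilder is deliberately left out.

```python
from typing import Dict, Iterable, List


def unique_extension_urls(packages: Iterable[dict]) -> List[str]:
    """Return every extension URL declared by any package, without repeats."""
    # A dict keeps insertion order, so URLs come back in first-seen order.
    seen: Dict[str, None] = {}
    for package in packages:
        # Treat a missing or null `extensions` field as an empty list.
        for url in package.get("extensions") or []:
            seen.setdefault(url, None)
    return list(seen)
```

The resulting list is what a download step would iterate over once per collection, rather than once per package.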

@jpmckinney
Member

The Extension Registry Python Package can now generate extended package schema (like CoVE).

@jpmckinney
Member

jpmckinney commented Jun 8, 2022

@yolile Do you think this feature is needed? We can do it (e.g. update the collection when the first release is merged, and assume that all releases use the same extensions), but I'm not sure anyone has ever needed this.

@jpmckinney removed this from the V3 milestone Jun 8, 2022
@yolile
Member

yolile commented Jun 8, 2022

> I'm not sure anyone has ever needed this.

I'm not sure either. The only case I can recall is a publisher using an old version of OCDS for PPPs, but in that case, the problem was they were using and declaring the old version of the extension instead of the most recent one.

In general, I think the higher-risk scenario is when a partner uses a community extension, and the extension owner updates the extension for their own purposes and doesn't communicate this to anyone. But this will be a problem for any data user or tool. So maybe it is more a problem with how extensions work than with Kingfisher Process. For us using Kingfisher Process, I guess we can always manually check whether a dataset is failing due to problems with an extension, check the changes to that extension in GitHub or similar, and check whether the issue is the publisher's concern or not, and whether they maybe need to create their own extension now. So for me, it is better to actually raise the validation error rather than kind of hide it.
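
One way to narrow down that manual check is sketched below. It assumes a digest of each extension was recorded at scrape time. The `manifest` mapping (extension URL to SHA-256 hex digest) and the `fetch` callable are hypothetical; nothing in this thread says Kingfisher Process stores either. The validation error is still raised. The function only reports which extensions no longer match what was seen at scrape time, so you know where to look.

```python
import hashlib
from typing import Callable, Dict, List


def changed_extensions(
    manifest: Dict[str, str], fetch: Callable[[str], bytes]
) -> List[str]:
    """Return the URLs whose current content differs from the recorded digest.

    `manifest` maps extension URL to the SHA-256 hex digest seen at scrape time.
    `fetch` returns the bytes currently served at a URL.
    """
    changed = []
    for url, recorded_digest in manifest.items():
        current_digest = hashlib.sha256(fetch(url)).hexdigest()
        if current_digest != recorded_digest:
            changed.append(url)
    return changed
```

An empty result suggests the failure comes from the data itself; a non-empty one points at the extensions whose history is worth reading.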

@jpmckinney
Member

Sounds good 👍

@jpmckinney closed this as not planned Jun 8, 2022