compile-releases: Store warnings (in collection_note table) #222

Closed
pindec opened this issue Nov 20, 2019 · 6 comments
Labels
error handling Relating to how errors are handled
Milestone

Comments

@pindec

pindec commented Nov 20, 2019

As an analyst I need to know whether the compilation script encounters data quality issues in the source data in order to judge whether the compiled collection may be missing data.

Example: in uk_contracts_finder collection 606, all award suppliers have the same id (and there is no parties array). Compiling it into collection 608 removes all but one supplier.

It would be helpful for compile warnings to be stored in the compiled release collection.

This should include a warning about any array with non-unique local ids.

@pindec
Author

pindec commented Nov 20, 2019

@jpmckinney
Member

Compilation doesn't presently check for this, but new warnings can be implemented. These would still have to be captured and reported by Kingfisher.

Can you share a specific release that has this issue, to be sure that I understand the problem?

@pindec
Author

pindec commented Nov 20, 2019

Sure.

In collection 606, the release with ocid = 'ocds-b5fd17-2f6874ab-adc8-11e6-9901-0019b9f3037b' has 37 suppliers with distinct names, all with id = 0. There is no parties array, so this is the deprecated (1.0-style) use of suppliers anyway (the suppliers have contactPoint and address fields). The identifier fields are empty or null. There is just one release with that OCID.

Truncated data:

"suppliers": [
    { "id": 0, "sme": false, "name": "Aquilant Limited", },
    { "id": 0, "sme": false, "name": "Aspen Medical Europe Limited", },
    { "id": 0, "sme": false, "name": "B. Braun Medical Limited", },
    ... etc

Same OCID in compiled_release collection 608 has one supplier (truncated):

"suppliers": [ { "id": 0, "sme": false, "name": "Ypsomed Limited", } ]

This supplier is the final object in the award.suppliers array from collection 606.

Same can be seen from running
select * from views.award_suppliers_summary where ocid = 'ocds-b5fd17-2f6874ab-adc8-11e6-9901-0019b9f3037b'
for each collection_id.
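
To illustrate why only the last supplier survives, here is a minimal sketch of identifier-based merging. It is a simplified illustration and not OCDS Merge's actual implementation: the helper names `merge_by_id` and `duplicate_ids` are hypothetical, and the supplier list is a three-item stand-in for the 37 suppliers described above.

```python
from collections import Counter


def merge_by_id(objects):
    """Keep one object per `id`; a later object replaces an earlier one."""
    merged = {}
    for obj in objects:
        merged[obj["id"]] = obj
    return list(merged.values())


def duplicate_ids(objects):
    """Return the `id` values that occur more than once in the array."""
    counts = Counter(obj["id"] for obj in objects)
    return sorted(id_ for id_, count in counts.items() if count > 1)


# Abbreviated stand-in for the award.suppliers array in collection 606.
suppliers = [
    {"id": 0, "sme": False, "name": "Aquilant Limited"},
    {"id": 0, "sme": False, "name": "Aspen Medical Europe Limited"},
    {"id": 0, "sme": False, "name": "Ypsomed Limited"},
]

compiled = merge_by_id(suppliers)  # a single supplier remains
flagged = duplicate_ids(suppliers)  # ids that would deserve a warning
```

A check like `duplicate_ids` is the kind of test that could back the requested warning about arrays with non-unique local ids.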

@jpmckinney
Member

jpmckinney commented Nov 21, 2019

I've released a new version of ocdsmerge that will issue a Python warning if there are duplicate IDs in a single release.

  • Kingfisher will have to catch those warnings and do something with them.

Warnings are of the class DuplicateIdValueWarning with a message like:

Multiple objects have the `id` value '1' in the `awards` array
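
One way Kingfisher could catch these warnings is with the standard library's `warnings.catch_warnings`. The sketch below is only an assumption about how that might look: `DuplicateIdValueWarning` is redefined locally as a stand-in for the class provided by ocdsmerge, and `fake_merge` and `merge_and_collect_notes` are hypothetical names, not part of Kingfisher or OCDS Merge.

```python
import warnings


class DuplicateIdValueWarning(UserWarning):
    """Local stand-in for the warning class issued by ocdsmerge (assumption)."""


def fake_merge(releases):
    """Hypothetical merge routine that warns the way the issue describes."""
    warnings.warn(
        "Multiple objects have the `id` value '1' in the `awards` array",
        DuplicateIdValueWarning,
    )
    return {"ocid": releases[0]["ocid"]}


def merge_and_collect_notes(merge, releases):
    """Run `merge` and return its result plus any duplicate-id messages."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every occurrence, even if the same message repeats.
        warnings.simplefilter("always", category=DuplicateIdValueWarning)
        compiled = merge(releases)
    # Other recorded warning categories are dropped in this sketch.
    notes = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DuplicateIdValueWarning)
    ]
    return compiled, notes


compiled, notes = merge_and_collect_notes(fake_merge, [{"ocid": "example-ocid"}])
```

The collected `notes` could then be stored alongside the compiled collection, which is what the original request asks for.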

@jpmckinney
Member

jpmckinney commented Nov 21, 2019

I released a new version of ocdsmerge that now offers some alternatives to the default merge routine. See the documentation: https://ocds-merge.readthedocs.io/en/latest/handle-bad-data.html

However, since Kingfisher doesn't implement a routing slip pattern (discussed here) and instead sets global behaviors for all collections, it's not possible to e.g. only have the Contracts Finder collection use a different merge routine.

So, for now, Kingfisher can just use the warnings above.

@jpmckinney jpmckinney added the steps Relating to specific steps (transforms) label Dec 9, 2019
@jpmckinney jpmckinney added the error handling Relating to how errors are handled label Aug 13, 2020
@jpmckinney jpmckinney changed the title Compilation data quality issues need to be flagged compile-releases: Compilation data quality issues need to be flagged Aug 13, 2020
@jpmckinney jpmckinney removed the steps Relating to specific steps (transforms) label Aug 13, 2020
@jpmckinney jpmckinney added this to the V2 (Django) milestone Nov 20, 2020
@jpmckinney jpmckinney modified the milestone: Priority Jan 25, 2023
@jpmckinney jpmckinney changed the title compile-releases: Compilation data quality issues need to be flagged compile-releases: Store warnings (in collection_note table) Jul 4, 2023
@jpmckinney
Member

jpmckinney commented Apr 10, 2024

Contracts Finder seems to have fixed their issue. dominican_republic_api has duplicates, but as far as we know, they are actual duplicates (and therefore only one should survive).

For now, I added a commit so that these warnings are stored in collection_note. If we find a need (open-contracting/data-registry#30), we can allow spiders in Kingfisher Collect to set merge rules, which can be sent to Kingfisher Process in the spider_opened signal and then saved to its Collection.options column.

OCDS Merge's rules are expressed as {('tender', 'tenderers'): ocdsmerge.APPEND}. In Kingfisher Collect, we would probably do: merge_rules = {"tender/tenderers": "APPEND"}, to be JSON-serializable, and Kingfisher Process would similarly store this JSON in the database. Then, in compiler.py, we might first do something like:

rule_overrides = tuple(collection.options.get("merge_rules", {}).items())

since arguments to _get_merger need to be hashable, and then in _get_merger, we'd do:

    rule_overrides = {tuple(path.split("/")): getattr(ocdsmerge, attribute) for path, attribute in rule_overrides}
    return ocdsmerge.Merger(patched_schema, rule_overrides=rule_overrides)
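
The round trip described above (JSON-serializable rules, then a hashable tuple, then a dict keyed by path tuples) can be sketched end to end as follows. This is a self-contained illustration under stated assumptions: `ocdsmerge_stub` stands in for the real `ocdsmerge` module, `get_rule_overrides` is a hypothetical helper rather than the actual `_get_merger`, and the use of `lru_cache` is only a guess at why the arguments need to be hashable.

```python
import json
from functools import lru_cache
from types import SimpleNamespace

# Stand-in for the ocdsmerge module, which exposes constants such as APPEND.
ocdsmerge_stub = SimpleNamespace(APPEND="append-strategy")


@lru_cache(maxsize=None)
def get_rule_overrides(rule_items):
    """Convert (("tender/tenderers", "APPEND"), ...) into tuple-keyed rules."""
    return {
        tuple(path.split("/")): getattr(ocdsmerge_stub, attribute)
        for path, attribute in rule_items
    }


# What Kingfisher Process might read back from Collection.options.
options = json.loads('{"merge_rules": {"tender/tenderers": "APPEND"}}')

# A dict is not hashable, so flatten it to a tuple of pairs before the call.
rule_items = tuple(options.get("merge_rules", {}).items())
overrides = get_rule_overrides(rule_items)
```

With the real library, `overrides` would be passed as `rule_overrides` to `ocdsmerge.Merger`, as in the snippet above.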
