Feature: Republishing, provenance and transformation #464

Open
lgs85 opened this issue Sep 15, 2022 · 4 comments
Comments

@lgs85
Contributor

lgs85 commented Sep 15, 2022

[This ticket helps track progress towards developing a particular feature in BODS where changes or revisions to the standard may be required. It should be placed on the BODS Feature Tracker, under the relevant status column.

See Feature development in BODS in the Handbook.

The title of this GitHub ticket should be 'Feature: XXXXX' where XXXXX is the feature name below. The information in this first post on the thread should be updated as necessary so that it holds up-to-date information. Comments on this ticket can be used to help track high-level work towards this feature or to refine this set of information.]

Feature name: Republishing, provenance and transformation

Feature background

Briefly describe the purpose of this feature

This feature ticket proposes that BODS should provide scope for representing republished data derived from multiple sources, and should enable transparency about the provenance of republished or mapped BODS data, alongside any transformation steps taken.

BODS is tailored almost exclusively towards primary beneficial ownership data, consisting of newly published or updated statements by declaring entities to state bodies. This state-level beneficial ownership data is stored and sometimes published in a register, and usually consists solely of declarations by entities registered in the state which maintains the register.

A number of organizations are already republishing beneficial ownership data from multiple sources. Most notably, the Open Ownership Register combines beneficial ownership data from the United Kingdom, Denmark, Slovakia and Ukraine, reconciles and deduplicates this with data from OpenCorporates, and republishes the reconciled data in BODS format.

The Open Ownership Register is currently being upgraded to publish in BODS v0.2, but neither v0.2 nor v0.3 allows information from multiple sources to be represented, because the section of BODS that deals with source information is designed around the idea of a single source.

What user needs are met by introducing or developing this feature in BODS?

A key reason for developing any data standard is to enable interoperability. With beneficial ownership data, being able to easily represent reconciled and republished statements from multiple sources meets a large and distinct set of user needs. User stories include:

  • As a business register analyst, I want to be able to show how I have combined my register with other sources, so that I can provide a more complete picture of beneficial ownership in my jurisdiction.

  • As a non-governmental organization republishing a beneficial ownership register from multiple sources, I want to be able to demonstrate the provenance of my register, so that my service users have confidence in my product.

  • As an investigative journalist, I want access to global beneficial ownership data with clear provenance so that I can carry out my work while being able to easily verify original sources.

  • As a Know Your Customer (KYC) technical data architect, I want to know whether a set of BODS data is original or republished so that I can decide whether to use this dataset or access and use source data.

  • As a user using a tool like the Open Ownership Register, I want to know whether the beneficial ownership information I'm looking at is confirmed, corroborated or even refuted by additional sources of data about the person or entity whose information I'm interested in.

What impact would not meeting these needs have?

  • Data republishers may release ambiguous or non-standard beneficial ownership data that is difficult to verify, and journalists and other data users may be discouraged from using this data.

  • As the Open Ownership Register expands, it may have to be adapted to use non-BODS data in order to clearly demonstrate provenance.

  • State publishers of beneficial ownership data may be discouraged from reconciling their data with other sources and jurisdictions.

How urgent is it to meet the above needs?

At present there are relatively few public registers of beneficial ownership data. As a result, enhancing and republishing beneficial ownership data in BODS format can be done using slight adaptations to the standard (e.g. using the source.description field). This slightly reduces the urgency of modifying BODS to explicitly accommodate republished data, but the need will become more urgent as more countries start publishing public beneficial ownership registers.
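As a rough illustration of the workaround mentioned above, the sketch below packs several upstream sources into a single source.description string. This is only one possible convention: the helper name `describe_sources`, the "Republished from: ..." layout, and the placeholder register names and dates are all assumptions made for illustration, not part of BODS.

```python
import json


def describe_sources(upstream_sources):
    """Build a BODS-style `source` object whose description lists every upstream source.

    The "name (retrieved date)" layout is a hypothetical convention, not defined by BODS.
    """
    parts = [f"{s['name']} (retrieved {s['retrieved']})" for s in upstream_sources]
    return {"description": "Republished from: " + "; ".join(parts)}


# Placeholder upstream sources; names and dates are illustrative only.
upstream = [
    {"name": "Register A", "retrieved": "2022-09-01"},
    {"name": "Register B", "retrieved": "2022-09-02"},
]

source = describe_sources(upstream)
print(json.dumps(source, indent=2))
```

A free-text description like this is readable by people but hard to parse reliably, which is part of why explicit support for multiple sources would be preferable in the longer term.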

Are there any obvious problems, dependencies or challenges that any proposal to develop this feature would need to address?

  • There is a significant challenge in determining the scope of including information on provenance in republished BODS data, and even more of a challenge if this scope is extended to include information on transformation processes. Trying to incorporate field-level provenance into BODS itself, for example, would result in the standard becoming extremely unwieldy. It is likely, therefore, that accommodating republishing would need to combine smaller-scale schema changes with additional technical guidance.

  • There is a dependency on any proposals to represent change over time using BODS, and careful consideration will need to be given to what change over time means in the context of republished BODS data.

  • There is a dependency on proposals to separate BODS statements into core data and metadata, because provenance information in republished BODS data would result in significant changes to the metadata of statements.

Feature work tracking

@lgs85 lgs85 changed the title Feature: [feature name] Feature: republishing and provenance Sep 15, 2022
@lgs85 lgs85 changed the title Feature: republishing and provenance Feature: Republishing and provenance Sep 15, 2022
@lgs85 lgs85 changed the title Feature: Republishing and provenance Feature: Republishing, provenance and transformation Sep 15, 2022
@StephenAbbott
Member

See openownership/register#223 for republishing issue raised by work on Open Ownership Register

@StephenAbbott
Member

Check ISO 8000-120:2016 which "specifies requirements for the representation and exchange of information about the provenance of master data that consists of characteristic data, and supplements the requirements of ISO 8000‑110"

@StephenAbbott
Member

StephenAbbott commented May 9, 2024

Noting issue #638 raised by @kathryn-ods to consider adding 'cleaning' and 'enrichment' to the motivations codelist which may be relevant to this feature development in future

@kathryn-ods
Contributor

Have been thinking about this in the context of https://standard.openownership.org/en/latest/standard/modelling/dates-guidance.html and making decisions about dates when republishing.

E.g. statementDate could be the date that the data was originally filed by the subject or it could be the date that the publisher first published it.

publicationDate could be the original publicationDate or the date of republication.

It depends on who is classed as the publisher. For the GLEIF mapping I am currently working on, we are treating the dates as though GLEIF is the one making the claim and we are the ultimate publisher. So statementDate is the date of the GLEIF delta file and publicationDate is our date of publication.
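A minimal sketch of that date choice, assuming the upstream provider is treated as the one making the claim and the republisher as the ultimate publisher. The function name `republication_dates`, the nesting of publicationDate under publicationDetails, the ordering check, and the example dates are assumptions for illustration rather than anything specified in this thread.

```python
from datetime import date


def republication_dates(upstream_snapshot: date, republished_on: date) -> dict:
    """Return the date fields for a republished statement.

    statementDate takes the date of the upstream file (e.g. a delta file);
    publicationDate takes the republisher's own publication date.
    """
    # Assumed sanity check: data cannot be republished before it was issued upstream.
    if republished_on < upstream_snapshot:
        raise ValueError("republication date precedes the upstream snapshot date")
    return {
        "statementDate": upstream_snapshot.isoformat(),
        "publicationDetails": {"publicationDate": republished_on.isoformat()},
    }


# Illustrative dates only.
dates = republication_dates(date(2024, 5, 1), date(2024, 5, 9))
print(dates)
```

Under the alternative reading, where the original register stays the publisher, the same function could instead be fed the original filing and publication dates, which is why the choice of who counts as the publisher needs to be documented alongside the data.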

Projects
Status: Research
Development

No branches or pull requests

3 participants