4499 - dataset version pids #9462

poikilotherm · 2023-03-22T15:21:54Z

What this PR does / why we need it:

This pull request adds the long awaited option to generate PIDs for dataset version.

TODOs:

Which issue(s) this PR closes:

Special notes for your reviewer:
This is WIP.

Suggestions on how to test this:
This is WIP.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Some minor additions. Will be included once avail.

Is there a release notes update needed for this change?:
Yes, will be included.

Additional documentation:
None yet.

This commit adds a new scope and setting to the JvmSettings, enabling the configuration of different modes for Dataset Version PIDs. These modes are depicted in VersionPidMode. A test ensures the parsability. In addition, VersionPidMode also contains a fine grained option to change the conduct of Dataverse collections and their datasets for these PIDs.

@transient

…R official Also removing unused import of @transient

…JPA model Enable the database model to carry the configured conduct of a collection. Also enable JSON parser and printer to marshal the setting.

…d conduct IQSS#4499 This commit adds a public method DataverseServiceBean.wantsDatasetVersionPids() that will determine how to deal with a dataset version (which belongs to a dataset that lives within a collection) in terms of "should a PID be registered/updated?". The background is: when a dataset is published, there will be the context of the owning Dataverse collection. It's important to take into account the configured conduct for the collection in the decision how to go ahead with a version's PID.

These are placeholders for now, to be filled with actual code.

These are placeholders for now, to be extended with real code.

poikilotherm · 2023-03-22T15:57:07Z

Hi @mreekie, this is not yet ready to be drawn into any other column other than my own. I did add a size, but obviously this is not the size of the estimated work for me, but my expectation of time spent on review, testing and Q&A at IQSS. Hope that's alright!

Making it NON NULL in database and Bean Validation.

…reusable Removing the unnecessary transformation via the ValidationError class and at the same time making it available to use from API endpoints to create nicely formatted JSON (error) responses.

For a first set of attributes (name, alias, description and, most important for the PR about IQSS#4499, the dataset version PID conduct), make an endpoint available that allows changes via simple PUT HTTP commands.

… DB migration

…migration

coveralls · 2023-03-31T12:14:57Z

coverage: 20.399% (+0.07%) from 20.332% when pulling fa8d544 on poikilotherm:4499-version-doi into fd190a3 on IQSS:develop.

By adding a NotImplementedException, a subclass of UnsupportedOperationException, we can describe methods not yet implemented. This status might change in the future (or not). The important part about this introduction: this runtime exception is flagged as an "application exception", which makes the EJB exception inspection not handle it as a system exception (which is the default), resulting in rolling back transactions. The idea is to enable handling perfectly fine situations within the command engine when some other component of Dataverse - e.g. the EJB of a PID provider - throws this exception. This way we can catch the exception and deal e.g. with dataset unlocks.

…ier()

…ServiceBean With the addition of NotImplementedException we can better express what's going on without rollback exceptions in EJB. The methods that will reach out to some provider now throw checked IOExceptions, as the communication might go sideways and the command engine needs to act accordingly. (This is the same behaviour now as for the normal object methods) Also extending JavaDoc descriptions and expectations a bit.

Now returning the same identifier for a minor version that has been assigned to the adjacent major version. As before, this can be changed by a provider. The tests have been changed accordingly.

… versions Identifiers for versions are done in two steps: 1. create it (if not existing) and 2. make it findable. The second step is done by GlobalIdServiceBean.publicizeIdentifier, and this commit now adds the interface method for the first step.

…iceBean

…ication With this commit, a first draft to create and release a version identifier from the command engine is added. FinalizeDatasetPublicationCommand is used after someone hit publish in the UI and currently takes care of the creation and publishing for dataset identifiers and files, now extended to look after version PIDs as well.

…base As minor versions might carry the identifier of their major version, we cannot set a unique constraint on the column. Removing this from the model as well as the Flyway migration.

poikilotherm · 2023-05-02T20:07:29Z

Key takeaways from first feedback during tech hour:

Put the business logic for no PID / major version only / major + minor into the decision in wantVersionPid() and make it a three way thing. Don't put the decision into the PID provider.
Think about overrides: current design says collection decides. The storage option for example can be overridden per dataset via API. Do we need this here?
Tap into moving a dataset to new owner: give warning / abort / react to the new owner not having versions enabled when the dataset has identifiers for versions (which would be very quite trap)
Give some thought into possibilities of "going back" or at least send warnings if for example the version was publicized but the dataset was not, which creates inconsistencies.
Let the idea of "make this PR small scoped and tucked away behind feature flag" instead of "create a one big boom PR madness" sink in and discuss again later.

Thanks @qqmyers @pdurbin and everyone else for commenting and sharing your thoughts!

poikilotherm · 2023-05-11T07:59:46Z

After an elaborate chat with @landreev yesterday (thanks again!), let me write down our key insights we got out of it:

Make the JVM setting dataverse.pid.version.mode three way: off (default), allow-major, allow-minor. That way, a sysadmin can restrict choice in collections, as this is cost related.
Enable collection curators to choose for a collection: inherit (default), skip, major, minor. This way they inherit from parent collection, have an opt-out (skip) and can choose between major only or also create for minors. The selection will be limited by the global setting.
We'll change the order of creating and publicizing version PIDs to reduce the risks of incoherence at DataCite and database.
1. We will create the PID after all file PIDs are publicized but before we publicize the dataset PID. This might be a long list and the chance of things going sideways is highest there.
2. We will publicize the version PID after the dataset PID. If that update goes sideways, the version PID is registered, but not yet charged for and can be reused on next publication attempt. We can already include the version PID in the dataset PID metadata because we created it before (see 1).
3. There is a tricky edge case. Assume version PIDs to be a suffix of the dataset PID. If someone tries to publish and says "major update" and things go sideways before the version PID is done but after the dataset PID is publicized, on the next publication attempt they might choose "minor update". As we did not yet publicize the major one, we can create a minor one and go with it. The major one can be reused later (it's not findable/resolvable anyway). The business logic to create an identifier for a version using suffixes needs to be OK with a PID already existing. (This is different for datasets, where we need to create a new identifier!!!)
We were in mutual agreement that we should make this PR smaller scoped. This PR should only be about the "inner core", enabling the general concept to be realized. It should not yet touch the UI, unnecessary API changes or PID provider support. This way, reviewing and quality assurance are much easier to handle without endangering a release hitting the public. Any documentation included in this PR should state clearly this is not yet ready and should only be used in development and experimentation.

qqmyers · 2023-05-11T13:01:09Z

A couple minor points:

you say 'off' in 1, and 'skip' in 2 - those are the same thing?
~3.2, 3.3 - PIDs are only non-deletable once they are publicized (I think that's true for all current PID providers) so I don't think there's ever a need to keep an un-published PID around after a failure, i.e. those can just be deleted, and new ones or the same PID recreated later if/when needed. One would have to make those calls to delete as part of a rollback action. Version PIDs are definitely a new case as the major/minor status and whether those categories get PIDs can change.

poikilotherm · 2023-05-11T15:28:28Z

Thanks for continuing the feedback @qqmyers!

Ad 1)

Yes, on purpose. The selection in the collection will be disabled if the feature is turned off globally (by an admin). No one can turn it on by accident.

"Skip" on the other hand is an opt-out. If your parent collection has it on, you might want to turn it off for a certain subcollection.

The decision on the collection is by curators! This is a difference to selecting storage per collection.

Ad 3.3)

Yes, one could try to make that happen in rollback. Let me add a note here that we don't do this for dataset PIDs or File PIDs, so maybe it's fine not to do it for version as well.

Also, in case things go sideways because the provider is out shopping 500s, it might be hard to ask for a deletion.

Of course, we could add a service that retries deleting failed registrations. Might be neat for dataset PIDs and file PIDs as well. But sounds like beyond scope for this PR to me.

qqmyers · 2023-05-11T16:30:44Z

re: 1) I'm still a bit confused - I understand off/no version PIDs allowed as a setting (though just not having setting defined could be 'off'), but does the setting then enforce whether major/minor pids are allowed at all? If not, the setting can just be binary. If it does, the the curator option either shouldn't allow minor if minor isn't turned on, etc.
re 3: I agree rollback could fail. My point was not to add scope though, Instead, I thought the difference from dataset and file pids could help you avoid having to deal with the version corner cases. Removing reserved PIDs is done when the dataset/file is deleted and isn't needed in a rollback because they don't go away if publish fails, whereas, as you point out, the version as a fixed vX.Y does go away if publication fails (and the maleable draft version remains). Because of all that, I think if you basically remove the local PID entry for a draft version in rollback (and try to tell the PID provider), you can avoid having to track it - just regenerate a new one when you publish again. (If the PID provider call fails, there would still be an abandoned registered PID that could/should be removed someday). Hopefully that would save you time/effort relative to handling the corner cases.

src/main/java/edu/harvard/iq/dataverse/api/Dataverses.java

…being included in next release

Instead of providing an initialized value and always saving a value, make the model itself return "inherit" as default when the DB value is null. This is better for backward compat and searches.

See IQSS#9462 (comment) See IQSS#9462 (comment)

…s with info about minor or major version

…tVersionPids - Incorporate feedback from first review - Switch to model where admin sets limits but the collections can override but not exceed the limit - Use admin settings as defaults if collection chain does not provide a choice now - Also add unit testing for business logic See also IQSS#9462 (comment) See also IQSS#9462 (comment)

pdurbin · 2024-07-19T20:05:49Z

@poikilotherm this PR is on your wishlist for 6.4 but there are a lot of merge conflicts.

Also, do you still feel that a size of 30 is right?

Thanks.

vera · 2024-10-24T11:09:47Z

We are also interested in this feature. I saw above this feature was (potentially) planned for 6.4. Is it now planned for a later version?

poikilotherm added 3 commits March 22, 2023 16:08

refactor(model): switch JPA constraints from Hibernate internal to JS…

d4a3ed3

…R official Also removing unused import of @transient

feat(model): add dataset version PID conduct to Dataverse collection …

3cb82b4

…JPA model Enable the database model to carry the configured conduct of a collection. Also enable JSON parser and printer to marshal the setting.

poikilotherm self-assigned this Mar 22, 2023

poikilotherm added 3 commits March 22, 2023 16:44

feat(pid): add initial extension points for dataset version PIDs

b6eb3be

These are placeholders for now, to be filled with actual code.

feat(api): add stub API endpoints for version pid conduct in collections

45dd5de

These are placeholders for now, to be extended with real code.

poikilotherm mentioned this pull request Mar 23, 2023

DOIs for Dataset versions #4499

Open

poikilotherm added 7 commits March 24, 2023 18:41

fix(model): make Dataverse.datasetVersionPidConduct require value

a503718

Making it NON NULL in database and Bean Validation.

refactor(api): simplify ConstraintViolationExceptionHandler and make …

f340e12

…reusable Removing the unnecessary transformation via the ValidationError class and at the same time making it available to use from API endpoints to create nicely formatted JSON (error) responses.

feat(api): make some Dataverse collection attributes changeable

db272fa

For a first set of attributes (name, alias, description and, most important for the PR about IQSS#4499, the dataset version PID conduct), make an endpoint available that allows changes via simple PUT HTTP commands.

Merge branch 'develop' into 4499-version-doi

21fd755

Merge branch 'develop' into 4499-version-doi

d94bc31

fix(model): add missing default for collection version pid conduct in…

8a37a50

… DB migration

feat(model): add PID field to DatasetVersion model and include in DB …

2dc83b7

…migration

mreekie removed the GB02 label Apr 8, 2023

Merge branch 'develop' into 4499-version-doi

d51d3bd

poikilotherm added 8 commits May 2, 2023 16:52

fix(pid): simple typo in javadoc of GlobalIdServiceBean.createIdentif…

4eac492

…ier()

feat(pid): restructure default version identifier generation

3c18c73

Now returning the same identifier for a minor version that has been assigned to the adjacent major version. As before, this can be changed by a provider. The tests have been changed accordingly.

feat(pid): add fake version identifier methods to FakePidProviderServ…

afc23d1

…iceBean

fix(pid): make DatasetVersion.persistentIdentifier non-unique in data…

288e987

…base As minor versions might carry the identifier of their major version, we cannot set a unique constraint on the column. Removing this from the model as well as the Flyway migration.

landreev mentioned this pull request May 10, 2023

Feature Request: Enable file PIDs at Dataverse Collection Level #8889

Closed

landreev reviewed May 18, 2023

View reviewed changes

src/main/java/edu/harvard/iq/dataverse/api/Dataverses.java Show resolved Hide resolved

landreev mentioned this pull request May 24, 2023

8889 file-level PIDs configuration in individual collections #9614

Merged

poikilotherm added 8 commits June 13, 2023 17:07

Merge branch 'develop' into 4499-version-doi

e5dd1a6

fix(db): rename version PID db migration because of conflict and not …

877c728

…being included in next release

refactor(pid): align collection conduct to be more aligned to JPA

c17cf27

Instead of providing an initialized value and always saving a value, make the model itself return "inherit" as default when the DB value is null. This is better for backward compat and searches.

refactor(pid): align global and collection modes to review feedback

77c9e5a

See IQSS#9462 (comment) See IQSS#9462 (comment)

feat(pid): update DataverseServiceBean.wantsDatasetVersionPids() call…

0b51e1b

…s with info about minor or major version

docs(pid): fix Javadoc in VersionPidMode

d8f8534

Merge branch 'develop' into 4499-version-doi

4b9a3ba

cmbz mentioned this pull request Mar 15, 2024

Epic: Extend and Enhance Support for PIDs and Handles IQSS/dataverse-pm#195

Open

6 tasks

jggautier mentioned this pull request Aug 19, 2024

Epic: Support the 5 core persistent identifiers in Dataverse IQSS/dataverse-pm#19

Closed

vkush mentioned this pull request Oct 7, 2024

PIDs for dataset and file versions needed nfdi4cat/repo4cat#5

Open

pdurbin added the Type: Feature a feature request label Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

4499 - dataset version pids #9462

4499 - dataset version pids #9462

poikilotherm commented Mar 22, 2023 •

edited

Loading

poikilotherm commented Mar 22, 2023

coveralls commented Mar 31, 2023 •

edited

Loading

poikilotherm commented May 2, 2023 •

edited

Loading

poikilotherm commented May 11, 2023

qqmyers commented May 11, 2023

poikilotherm commented May 11, 2023 •

edited

Loading

qqmyers commented May 11, 2023

pdurbin commented Jul 19, 2024

vera commented Oct 24, 2024 •

edited

Loading

4499 - dataset version pids #9462

Are you sure you want to change the base?

4499 - dataset version pids #9462

Conversation

poikilotherm commented Mar 22, 2023 • edited Loading

poikilotherm commented Mar 22, 2023

coveralls commented Mar 31, 2023 • edited Loading

poikilotherm commented May 2, 2023 • edited Loading

poikilotherm commented May 11, 2023

qqmyers commented May 11, 2023

poikilotherm commented May 11, 2023 • edited Loading

qqmyers commented May 11, 2023

pdurbin commented Jul 19, 2024

vera commented Oct 24, 2024 • edited Loading

poikilotherm commented Mar 22, 2023 •

edited

Loading

coveralls commented Mar 31, 2023 •

edited

Loading

poikilotherm commented May 2, 2023 •

edited

Loading

poikilotherm commented May 11, 2023 •

edited

Loading

vera commented Oct 24, 2024 •

edited

Loading