Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4499 - dataset version pids #9462

Draft
wants to merge 41 commits into
base: develop
Choose a base branch
from

Conversation

poikilotherm
Copy link
Contributor

@poikilotherm poikilotherm commented Mar 22, 2023

What this PR does / why we need it:

This pull request adds the long awaited option to generate PIDs for dataset version.

TODOs:

  • Add tests for JsonPrinter/JsonParser of Collection JSON
  • Implement API endpoints to set collection conduct (might need new engine command)
  • Add tests for DataverseServiceBean.wantsDatasetVersionPids()
  • Add GlobalIdServiceBean.publicizeIdentifier(DatasetVersion datasetVersion) and implementations
    • DataCite (also needs metadata template adaption)
    • PermaLink
    • Fake (stub only)
    • Handle: leave unsupported for now
    • EZID: leave unsupported for now
  • Add identifier generation for dataset versions (add generator to interface GlobalIdServiceBean and impl. classes)
  • Extend DatasetVersion model to include the identifier (so we can generate random ones as extension)
  • Think about pre-registring the version PID like we do for datasets/files No - this doesn't make sense, as we never know a-priori if a version will be minor or major, and only major versions shall have a PID.
  • Extend with creating and publicizing the identifiers:
    • UpdateDvObjectPIDMetadataCommand
    • RegisterDvObjectCommand
    • FinalizeDatasetPublicationCommand
  • Extend to modify existing version records with updates for minors (e.g. in UpdateDatasetTargetURLCommand)
  • Extend Version UI to show the PIDs as links
  • Extend Data Citation UI component to include the version DOI (Configuration?)
  • Write out the version PID in Dataset JSON parts
  • Parse the version PID in the JSON Parser for Datasets
  • Adapt DatasetVersion.getJsonLd() to indicate this is a version, use version identifier and add relation to dataset via concept PID
  • Extend Export API to accept PIDs of versions & export accordingly
  • Write API tests for new endpoints
  • Write tests to make sure this whole thing works as designed (it needs adaption of current tests as well)
  • Docs for the API calls
  • Docs for the extension and implied default in the Collection JSON / JPA model
  • Docs for the configuration and default conduct

Which issue(s) this PR closes:

Closes #4499

Special notes for your reviewer:
This is WIP.

Suggestions on how to test this:
This is WIP.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Some minor additions. Will be included once avail.

Is there a release notes update needed for this change?:
Yes, will be included.

Additional documentation:
None yet.

This commit adds a new scope and setting to the JvmSettings,
enabling the configuration of different modes for Dataset Version PIDs.
These modes are depicted in VersionPidMode. A test ensures the
parsability.

In addition, VersionPidMode also contains a fine grained option
to change the conduct of Dataverse collections and their datasets
for these PIDs.
…JPA model

Enable the database model to carry the configured conduct of a collection.
Also enable JSON parser and printer to marshal the setting.
@poikilotherm poikilotherm added Feature: API Feature: Publishing & Versions Feature: API Guide User Role: Curator Curates and reviews datasets, manages permissions User Role: API User Makes use of APIs User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh User Role: Depositor Creates datasets, uploads data, etc. HERMES related to @hermes-hmc work on Dataverse code NIH OTA: 1.3.1 3 | 1.3.1 | Support software metadata | 5 prdOwnThis is an item synched from the product planning... Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) GB02 labels Mar 22, 2023
@poikilotherm poikilotherm self-assigned this Mar 22, 2023
…d conduct IQSS#4499

This commit adds a public method DataverseServiceBean.wantsDatasetVersionPids()
that will determine how to deal with a dataset version (which belongs to a dataset
that lives within a collection) in terms of "should a PID be registered/updated?".

The background is: when a dataset is published, there will be the context of the
owning Dataverse collection. It's important to take into account the configured
conduct for the collection in the decision how to go ahead with a version's PID.
These are placeholders for now, to be filled with actual code.
These are placeholders for now, to be extended with real code.
@poikilotherm
Copy link
Contributor Author

Hi @mreekie, this is not yet ready to be drawn into any other column other than my own. I did add a size, but obviously this is not the size of the estimated work for me, but my expectation of time spent on review, testing and Q&A at IQSS. Hope that's alright!

Making it NON NULL in database and Bean Validation.
…reusable

Removing the unnecessary transformation via the ValidationError class and
at the same time making it available to use from API endpoints to create nicely
formatted JSON (error) responses.
For a first set of attributes (name, alias, description and, most
important for the PR about IQSS#4499, the dataset version PID conduct),
make an endpoint available that allows changes via simple PUT
HTTP commands.
@coveralls
Copy link

coveralls commented Mar 31, 2023

Coverage Status

coverage: 20.399% (+0.07%) from 20.332% when pulling fa8d544 on poikilotherm:4499-version-doi into fd190a3 on IQSS:develop.

@mreekie mreekie removed the GB02 label Apr 8, 2023
By adding a NotImplementedException, a subclass of UnsupportedOperationException,
we can describe methods not yet implemented. This status might change in the future (or not).

The important part about this introduction: this runtime exception is flagged
as an "application exception", which makes the EJB exception inspection not handle it
as a system exception (which is the default), resulting in rolling back transactions.

The idea is to enable handling perfectly fine situations within the command engine when
some other component of Dataverse - e.g. the EJB of a PID provider - throws this exception.
This way we can catch the exception and deal e.g. with dataset unlocks.
…ServiceBean

With the addition of NotImplementedException we can better express what's going on without
rollback exceptions in EJB.

The methods that will reach out to some provider now throw checked IOExceptions,
as the communication might go sideways and the command engine needs to act accordingly.
(This is the same behaviour now as for the normal object methods)

Also extending JavaDoc descriptions and expectations a bit.
Now returning the same identifier for a minor version that has been assigned to the adjacent major version.
As before, this can be changed by a provider. The tests have been changed accordingly.
… versions

Identifiers for versions are done in two steps: 1. create it (if not existing)
and 2. make it findable. The second step is done by GlobalIdServiceBean.publicizeIdentifier,
and this commit now adds the interface method for the first step.
…ication

With this commit, a first draft to create and release a version identifier
from the command engine is added. FinalizeDatasetPublicationCommand is used
after someone hit publish in the UI and currently takes care of the creation
and publishing for dataset identifiers and files, now extended to look after
version PIDs as well.
…base

As minor versions might carry the identifier of their major version,
we cannot set a unique constraint on the column. Removing this from the
model as well as the Flyway migration.
@poikilotherm
Copy link
Contributor Author

poikilotherm commented May 2, 2023

Key takeaways from first feedback during tech hour:

  • Put the business logic for no PID / major version only / major + minor into the decision in wantVersionPid() and make it a three way thing. Don't put the decision into the PID provider.
  • Think about overrides: current design says collection decides. The storage option for example can be overridden per dataset via API. Do we need this here?
  • Tap into moving a dataset to new owner: give warning / abort / react to the new owner not having versions enabled when the dataset has identifiers for versions (which would be very quite trap)
  • Give some thought into possibilities of "going back" or at least send warnings if for example the version was publicized but the dataset was not, which creates inconsistencies.
  • Let the idea of "make this PR small scoped and tucked away behind feature flag" instead of "create a one big boom PR madness" sink in and discuss again later.

Thanks @qqmyers @pdurbin and everyone else for commenting and sharing your thoughts!

@poikilotherm
Copy link
Contributor Author

After an elaborate chat with @landreev yesterday (thanks again!), let me write down our key insights we got out of it:

  1. Make the JVM setting dataverse.pid.version.mode three way: off (default), allow-major, allow-minor. That way, a sysadmin can restrict choice in collections, as this is cost related.
  2. Enable collection curators to choose for a collection: inherit (default), skip, major, minor. This way they inherit from parent collection, have an opt-out (skip) and can choose between major only or also create for minors. The selection will be limited by the global setting.
  3. We'll change the order of creating and publicizing version PIDs to reduce the risks of incoherence at DataCite and database.
    1. We will create the PID after all file PIDs are publicized but before we publicize the dataset PID. This might be a long list and the chance of things going sideways is highest there.
    2. We will publicize the version PID after the dataset PID. If that update goes sideways, the version PID is registered, but not yet charged for and can be reused on next publication attempt. We can already include the version PID in the dataset PID metadata because we created it before (see 1).
    3. There is a tricky edge case. Assume version PIDs to be a suffix of the dataset PID. If someone tries to publish and says "major update" and things go sideways before the version PID is done but after the dataset PID is publicized, on the next publication attempt they might choose "minor update". As we did not yet publicize the major one, we can create a minor one and go with it. The major one can be reused later (it's not findable/resolvable anyway). The business logic to create an identifier for a version using suffixes needs to be OK with a PID already existing. (This is different for datasets, where we need to create a new identifier!!!)
  4. We were in mutual agreement that we should make this PR smaller scoped. This PR should only be about the "inner core", enabling the general concept to be realized. It should not yet touch the UI, unnecessary API changes or PID provider support. This way, reviewing and quality assurance are much easier to handle without endangering a release hitting the public. Any documentation included in this PR should state clearly this is not yet ready and should only be used in development and experimentation.

@qqmyers
Copy link
Member

qqmyers commented May 11, 2023

A couple minor points:

  • you say 'off' in 1, and 'skip' in 2 - those are the same thing?
  • ~3.2, 3.3 - PIDs are only non-deletable once they are publicized (I think that's true for all current PID providers) so I don't think there's ever a need to keep an un-published PID around after a failure, i.e. those can just be deleted, and new ones or the same PID recreated later if/when needed. One would have to make those calls to delete as part of a rollback action. Version PIDs are definitely a new case as the major/minor status and whether those categories get PIDs can change.

@poikilotherm
Copy link
Contributor Author

poikilotherm commented May 11, 2023

Thanks for continuing the feedback @qqmyers!

Ad 1)

Yes, on purpose. The selection in the collection will be disabled if the feature is turned off globally (by an admin). No one can turn it on by accident.

"Skip" on the other hand is an opt-out. If your parent collection has it on, you might want to turn it off for a certain subcollection.

The decision on the collection is by curators! This is a difference to selecting storage per collection.

Ad 3.3)

Yes, one could try to make that happen in rollback. Let me add a note here that we don't do this for dataset PIDs or File PIDs, so maybe it's fine not to do it for version as well.

Also, in case things go sideways because the provider is out shopping 500s, it might be hard to ask for a deletion.

Of course, we could add a service that retries deleting failed registrations. Might be neat for dataset PIDs and file PIDs as well. But sounds like beyond scope for this PR to me.

@qqmyers
Copy link
Member

qqmyers commented May 11, 2023

re: 1) I'm still a bit confused - I understand off/no version PIDs allowed as a setting (though just not having setting defined could be 'off'), but does the setting then enforce whether major/minor pids are allowed at all? If not, the setting can just be binary. If it does, the the curator option either shouldn't allow minor if minor isn't turned on, etc.
re 3: I agree rollback could fail. My point was not to add scope though, Instead, I thought the difference from dataset and file pids could help you avoid having to deal with the version corner cases. Removing reserved PIDs is done when the dataset/file is deleted and isn't needed in a rollback because they don't go away if publish fails, whereas, as you point out, the version as a fixed vX.Y does go away if publication fails (and the maleable draft version remains). Because of all that, I think if you basically remove the local PID entry for a draft version in rollback (and try to tell the PID provider), you can avoid having to track it - just regenerate a new one when you publish again. (If the PID provider call fails, there would still be an abandoned registered PID that could/should be removed someday). Hopefully that would save you time/effort relative to handling the corner cases.

Instead of providing an initialized value and always saving a value,
make the model itself return "inherit" as default when the DB value is
null. This is better for backward compat and searches.
…tVersionPids

- Incorporate feedback from first review
- Switch to model where admin sets limits but the collections can override but not exceed the limit
- Use admin settings as defaults if collection chain does not provide a choice now
- Also add unit testing for business logic

See also IQSS#9462 (comment)
See also IQSS#9462 (comment)
@pdurbin
Copy link
Member

pdurbin commented Jul 19, 2024

@poikilotherm this PR is on your wishlist for 6.4 but there are a lot of merge conflicts.

Also, do you still feel that a size of 30 is right?

Thanks.

@vera
Copy link
Contributor

vera commented Oct 24, 2024

We are also interested in this feature. I saw above this feature was (potentially) planned for 6.4. Is it now planned for a later version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: API Guide Feature: API Feature: Publishing & Versions HERMES related to @hermes-hmc work on Dataverse code NIH OTA: 1.3.1 3 | 1.3.1 | Support software metadata | 5 prdOwnThis is an item synched from the product planning... Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) Type: Feature a feature request User Role: API User Makes use of APIs User Role: Curator Curates and reviews datasets, manages permissions User Role: Depositor Creates datasets, uploads data, etc. User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh
Projects
Status: Interested
Status: No status
Development

Successfully merging this pull request may close these issues.

DOIs for Dataset versions
7 participants