Feature Request/Idea: Standardize standard license configuration #8512
Comments
FWIW:
|
Thanks for your comments, @qqmyers. Just a short reply to your first bullet point. I think the way you have done this in v5.10 is already in line with my suggestion; cf.
I think we only should use the SPDX URI when there is no (authoritative) URI provided by the license issuer. |
Thank you, @philippconzett, for starting this discussion. This is related to proper future software support (important for our project HERMES), so I'm taking the liberty of joining it. As 5.10 included the first iteration of multi-license support, I think we should be very careful when taking the next steps. Some context about interoperability:
@qqmyers removed the generation of this JSON-LD part from the code and replaced it with the URL only (which is perfectly valid schema.org syntax). The RO-Crate @qqmyers: looking at https://github.com/spdx/license-list-data/blob/master/json/licenses.json, there are licenses marked as deprecated - maybe we need to open an issue at https://github.com/spdx/license-list-XML and talk to them about PDDC being deprecated by upstream (there isn't an issue for this yet).
(Future: IMHO it would be great to have a summary in our UI, so people don't need to look at license texts. Maybe grabbing the quick summaries from https://tldrlegal.com helps?) |
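For anyone who wants to check the deprecation flags mentioned in the comment above, here is a minimal sketch. It assumes that `licenses.json` has a top-level `licenses` array whose entries carry `licenseId` and `isDeprecatedLicenseId`; the inline sample entries and their flag values are invented for illustration and are not copied from the SPDX list.

```python
import json

# Illustrative sample only: the layout follows the assumed structure of
# SPDX's licenses.json, but these entries and flag values are invented.
SAMPLE = json.loads("""
{
  "licenses": [
    {"licenseId": "EXAMPLE-1.0", "name": "Example License 1.0", "isDeprecatedLicenseId": false},
    {"licenseId": "EXAMPLE-OLD", "name": "Example Old License", "isDeprecatedLicenseId": true}
  ]
}
""")


def deprecated_ids(spdx_data):
    """Return the sorted licenseId values that are flagged as deprecated."""
    return sorted(
        entry["licenseId"]
        for entry in spdx_data.get("licenses", [])
        if entry.get("isDeprecatedLicenseId", False)
    )


if __name__ == "__main__":
    print(deprecated_ids(SAMPLE))
```

To run this against the real list, one would load the downloaded `licenses.json` with `json.load` instead of using `SAMPLE`.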
Thanks for your feedback, @poikilotherm! I wasn't aware that RO-Crate had already addressed this issue. My main concern was just to make sure that standard licenses are described in the same way across Dataverse installations. |
When I mentioned that my suggestion was meant to improve interoperability between Dataverse installations and beyond Dataverse installations, I first of all had in mind that license information from Dataverse installations should be made harvestable in a way that complies with recommendations. I'm not sure about the status of RO-Crate, but a standard that is already implemented and widely used is the DataCite Metadata Schema. The current version of this schema, v.4.4 (cf. https://schema.datacite.org/meta/kernel-4.4/), says the following about license information:
As the license identifier, DataCite requires "a short, standardized version of the license name", and they suggest using the SPDX identifier. Based on the DataCite recommendations, I've updated the Google spreadsheet (see tab "English v.0.2") and the JSON files for the standard licenses I suggest we should provide on GitHub; see this Google folder. As far as I can see, none of the standard licenses I suggest we should provide on GitHub are obsolete, so this shouldn't be a show stopper. Also pinging @janvanmansum for feedback. |
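To make the DataCite recommendation above more concrete, here is a rough sketch of what one rights entry could look like when the SPDX identifier is used as the license identifier. The property names follow the DataCite 4.4 schema; the helper `datacite_rights` is hypothetical and is not part of Dataverse or of any DataCite library.

```python
def datacite_rights(name, uri, spdx_id=None):
    """Build one rights entry using DataCite 4.4 property names.

    Hypothetical helper for illustration; the SPDX-related properties are
    added only when an SPDX identifier is known for the license.
    """
    entry = {"rights": name, "rightsURI": uri}
    if spdx_id:
        entry["rightsIdentifier"] = spdx_id
        entry["rightsIdentifierScheme"] = "SPDX"
        entry["schemeURI"] = "https://spdx.org/licenses/"
    return entry


if __name__ == "__main__":
    print(
        datacite_rights(
            "Creative Commons Zero v1.0 Universal",
            "https://creativecommons.org/publicdomain/zero/1.0/",
            "CC0-1.0",
        )
    )
```

The URI passed in here is the one from the license issuer, in line with the earlier point that the SPDX URI should only be used when the issuer provides no authoritative one.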
Here are two JSON examples created following the suggested workflow above:
@qqmyers @pdurbin I guess we might have to change back some of the field names so that this doesn't mess up your current setup, e.g., rightsName >> name? I don't know what needs to be done to discuss this further, but I'd be happy to contribute as suggested above. For example, if you create a suitable place on GitHub, I could create and upload the JSON files once we've agreed on what they should look like. Thanks! |
@philippconzett I'm not sure either. Perhaps we can try to make the problem more concrete with a scenario and a screenshot. Imagine a future where you're harvesting datasets from another Dataverse installation with slightly different names. Also imagine that there's a search facet called "License" that makes these differences obvious at a glance: Once the data is in a facet like this, it's obvious that there's a problem, that counts of the same license should be combined. |
Thanks, @pdurbin, and sorry for my late reply. The scenario you described above is definitely an example of what might be an undesired result of the current way of configuring standard licenses. A similar situation could arise in search engines supporting search/filtering based on license information, e.g., in the advanced search of BASE (https://www.base-search.net/Search/Advanced); cf. this mock-up screenshot: In general, I think we should aim at providing license information in line with the recommendations of DataCite. I'd be happy to create a pull request, but I need some help:
I suggest we make this a prioritized PR because the longer we wait, the more likely it becomes that installations configure multi-license support with the current setup, which means that they would have to do some cleanup to align their license information with the standardized way suggested in this issue. |
@philippconzett thanks. If the goal is to keep the Dataverse community together perhaps the best place for the JSON files is where they already are, in the main repo. That way, they seem more official, they can be part of the guides, and if the JSON structure needs to evolve (new fields/columns like you say), it can happen in the same pull request as the code and database changes. I guess what I'm saying is, what if we consider the licenses in the main repo official already? And if we don't like something about them (they need more or different fields), what if we let them evolve in the main repo, at least for a while? There are currently 453 licenses in your spreadsheet. If we were to start adding more licenses to the main repo, would you want all of them at once? (Do you plan to present all 453 to your users?) A subset? How many? Thanks. For others, here's a link to your spreadsheet: https://docs.google.com/spreadsheets/d/1f_-z6vWijOvIc0tI1ezWeDEgM3U9w5qynllfyNqWYU8/edit?usp=sharing |
Thanks, Phil! Keeping the JSON files in the main repo sounds reasonable. As for the number of licenses/JSON files, I only suggest starting with a small selection, as described above; see point 4 in the first posting. These 28 licenses are all marked with "true" in column M (=active) in the spreadsheet. I have now sorted the spreadsheet to make them appear on top. The JSON files of these licenses are in the folder "JSON files v.0.2" in the shared Google folder: https://drive.google.com/drive/folders/11BF5tZ9K_S0rxrWErFQYgSCX_geQtHtq?usp=sharing. |
Thanks for pinging me @philippconzett. This issue reminds me of that "things, not strings" saying, which I think is usually used when talking about knowledge graphs, but it makes sense here. I think your idea in this issue will improve the chances that most Dataverse installations will use the same strings to describe the same things. I'm less sure it would improve interoperability "beyond Dataverse installations". What if, when a Dataverse repository that prefers displaying a "CC-0" license as "CC 0" harvests metadata from a source that uses "CC0", the Dataverse software could figure out that "CC0" is the same thing as "CC-0" and use that when displaying search results (like as facets)? Since the Dataverse software doesn't have facets for the Terms metadata, this problem isn't as noticeable now, so maybe we can cross that bridge when we get to it. |
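The "figure out that CC0 is the same thing as CC-0" idea in the comment above could be sketched roughly as follows. This is only an illustration of the approach, not how the Dataverse software works: the alias table, the preferred label "CC0 1.0", and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical alias table: normalized spelling -> preferred display label.
ALIASES = {
    "CC0": "CC0 1.0",
    "CC010": "CC0 1.0",
}


def canonical_license(label):
    """Map a harvested license label to a preferred label when an alias is known.

    The label is upper-cased and stripped of everything except letters and
    digits before the lookup; unknown labels are returned unchanged.
    """
    key = re.sub(r"[^A-Z0-9]", "", label.upper())
    return ALIASES.get(key, label)


def facet_counts(labels):
    """Count datasets per license after collapsing known spelling variants."""
    return Counter(canonical_license(label) for label in labels)


if __name__ == "__main__":
    harvested = ["CC0", "CC-0", "CC 0", "CC0 1.0", "CC BY 4.0"]
    print(facet_counts(harvested))
```

A lookup keyed on the license URI or SPDX identifier would be more robust than string normalization, which is one argument for standardizing those fields at the source.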
Hi all! I hope everyone is doing well. I noted a similar problem in a different community, and just as a point of information it may be interesting to follow how they solve it: huggingface/datasets#4298 |
Thanks, @jggautier + @djbrooke! @jggautier I'm not sure I agree with you on interoperability beyond Dataverse installations. In my understanding, the main point of the DataCite Metadata Schema recommendations is to make harvested metadata interoperable. Of course, Dataverse, Dataverse installations or DataCite could create crosswalks/scripts to transform the exposed metadata into the desired DataCite format, but why not make the metadata available in a DataCite-aligned way to start with? I now realize that starting a discussion like this on GitHub is not a good idea, as only a few people in the community systematically review GitHub issues. I'll raise the issue in the Dataverse Google group, because I think DataCite-aligned metadata is important for many Dataverse installations. Thanks! |
Please note, as I recently learned, that the DataCite metadata export exposed via OAI-PMH is not valid XML. The export also uses an outdated schema and only a subset of the schema's possibilities (an example is #7077). I agree with you that we should discuss this somewhere else to include more people's views. |
I've raised the issue in the Dataverse Google group: https://groups.google.com/u/1/g/dataverse-community/c/4qSr0mkcyOw. |
I'm adding another illustration of why this feature request should be prioritized: Metadata from Dataverse-based repositories are currently not correctly harvested by DataCite. This includes the license information. For comparison, if you look at a DataCite metadata record from, say, Pangaea, e.g., https://search.datacite.org/works/10.1594/pangaea.940188, you can download the metadata in different formats, and you'll find correct license information:
Based on this license information, the metadata are then harvested and indexed in other discovery services, e.g., Primo (see this discussion thread in the Dataverse Google group). On the other hand, Dataverse-based repositories do not expose license information in the way DataCite expects, and thus the DataCite metadata records from Dataverse-based repositories are lacking license information. Here's an example from DataverseNO, and here's one from DataverseNL (@janvanmansum @4tikhonov), here one from the Australian Data Archive (@stevenmce), here one from Harvard Dataverse (@pdurbin @jggautier), here one from Jülich DATA (@poikilotherm), here one from Odum (@donsizemore), and here one from Scholars Portal (@amberleahey @kaitlinnewson @meghangoodchild). As you see (cf. the DataCite JSON file), the rightslist is empty:
As a result, if you search for data in Dataverse-based repositories in discovery services like Primo, you'll be told that you cannot access these datasets. The reason is that these services don't have access to the license information of these datasets and assume they are not Open Access. |
Dataverse does not send any rights information to DataCite - I believe it is the same as the datacite.xml metadata export. If we sent what we have now, it would be an improvement. |
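A quick way to see the difference described above is to check whether a record's rights list carries anything usable. The sketch below works on invented sample records; it assumes the DataCite JSON uses a `rightsList` array with keys such as `rights`, `rightsUri` and `rightsIdentifier`, and the function name `has_license_info` is hypothetical.

```python
def has_license_info(record):
    """Return True if the record has at least one non-empty rights entry.

    Assumes a DataCite-style JSON object with a "rightsList" array; a missing
    or empty list counts as "no license information".
    """
    rights_list = record.get("rightsList") or []
    return any(
        entry.get("rights") or entry.get("rightsUri") or entry.get("rightsIdentifier")
        for entry in rights_list
    )


# Invented sample records for illustration only.
WITH_RIGHTS = {
    "rightsList": [
        {
            "rights": "Creative Commons Attribution 4.0 International",
            "rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
            "rightsIdentifier": "cc-by-4.0",
        }
    ]
}
WITHOUT_RIGHTS = {"rightsList": []}

if __name__ == "__main__":
    print(has_license_info(WITH_RIGHTS), has_license_info(WITHOUT_RIGHTS))
```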
@philippconzett don't worry, your PR is still on the global backlog board: |
JP and I just wrote some guidance on adding licenses in the future: #10426 (comment) Please take a look and let us know what you think! |
Closing this issue in favor of #10883. |
Overview of the Feature Request
With version 5.10, the long-awaited multiple-license support was released (see release notes). Thanks to all contributors!
To better support interoperability between Dataverse installations and beyond Dataverse installations, I'd like to suggest standardizing the way standard license configuration is managed using multiple-license support as follows:
JSON elements:
Content:
Code - Permissive:
Code - Copyleft:
Code - Other:
Following the suggested guidelines above, I have created a Google spreadsheet containing the necessary information to create JSON files, and I created those files by running a bash file. All these documents are available in this Google folder (you might need to log in to access it).
At a later stage, this could of course be automated by retrieving information directly from SPDX and license issuers, possibly via a controlled vocabulary hosted on SKOSMOS.
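As a rough illustration of the spreadsheet-to-JSON step described above (the original files were generated with a bash file that is not reproduced here), the following sketch turns a CSV export into one JSON document per license. The column names, the output field names (`name`, `uri`, `shortDescription`, `active`) and the file-naming scheme are assumptions made for this example, not the actual spreadsheet layout.

```python
import csv
import io
import json

# Invented sample export; the column names are assumptions for this sketch.
SAMPLE_CSV = "\n".join(
    [
        "spdxId,name,uri,shortDescription,active",
        "CC0-1.0,CC0 1.0,https://creativecommons.org/publicdomain/zero/1.0/,"
        "Creative Commons CC0 1.0 Universal Public Domain Dedication.,true",
    ]
)


def licenses_from_csv(csv_text):
    """Return a dict mapping a file name to the JSON text for each license row."""
    files = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            "name": row["name"].strip(),
            "uri": row["uri"].strip(),
            "shortDescription": row["shortDescription"].strip(),
            # The spreadsheet marks active licenses with "true".
            "active": row["active"].strip().lower() == "true",
        }
        files[row["spdxId"].strip() + ".json"] = json.dumps(record, indent=2)
    return files


if __name__ == "__main__":
    for filename, content in licenses_from_csv(SAMPLE_CSV).items():
        print(filename)
        print(content)
```

Generating the files from a single tabular source like this keeps the JSON consistent across licenses, which is the point of the standardization proposed here.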
What kind of user is the feature intended for?
The suggested feature is primarily intended for Sysadmins who need to install licenses on their Dataverse installation.
What inspired the request?
The implementation of multiple license support released in v5.10.
What existing behavior do you want changed?
The different Dataverse installations adding the same standard license with (slightly) different license information.
Any brand new behavior do you want to add to Dataverse?
No, thanks.
Any related open or closed issues to this feature request?
Multiple licences feature proposal #7440