Add attribute citation_id #160
Comments
Is it really o.k. to assign a doi to a document or a file (or dataset) and subsequently change the metadata? If it's not, I think we need to make that clear, and it would seem impossible to include the doi in a file to which it was assigned because that would involve altering the file. |
Just to be clear, by metadata I mean here the record in the doi database, which is not explicitly shown in the data file. Yes, it is ok to modify the metadata associated with a doi later; an easy example is adding a new description in another language, or adding a new reference that cites the dataset in question. A doi can be used to identify one very specific version of a dataset, maybe for reproducibility purposes, but that is not the only possible use. A doi can also be assigned to all data collected from one project or a group of numerical simulations, even if that dataset is split into multiple files. Note that a doi is not a checksum. If one finds a typo in the creator email, in the file or in the doi database, I don't see that as a reason to generate a new doi to reflect that change. |
Having DOI for the dataset is great. It provides an easier way to capture metrics of the data usage, provided that the data producer or repository has a way of minting and registering DOIs. At NASA ESDIS they register the DOI, but do not create it. If we work with the provider early enough we can "hold" a DOI for them so they can populate the field before they produce the whole dataset. As long as you don't make the field required it should be added in. I would also suggest that a citation field be added as well, so a user can see how the data should be cited. |
Zenodo.org also allows you to reserve a DOI before uploading your data, but this may not always be possible. At the CEDA archive, for instance, we like to verify the data before issuing a DOI: it is really up to the publisher. Would you consider an alternative, slightly more flexible approach such as:
This would also support other forms of permanent identifiers. CMIP, for instance, is using the Handle System (which is also the system used to maintain the DOIs, but has wider use). For massive numbers of files used by CMIP, DOIs for each file are not appropriate, but we do have a closely related identifier in each file. At the moment this uses a global attribute defined by CMIP .. it would be nice to have it brought into the CF standard (for use in CMIP7). The CMIP version would then be something like:
(to resolve this, paste it into the text box at http://proxy.handle.net/ ). PS: .. or we could use |
I second Martin's suggestion to broaden the scope a bit and allow both "doi:" and "hdl:" as prefixes, and also think it would be very wise in the long term to name the field ":resource_identifier" instead of just ":doi" to keep flexibility. There would be little technical drawback in doing this: the DOI System is technically based on the Handle System, so both are compatible. Putting a "doi:" or "hdl:" in front of the string will not cause problems with resolution of either, as both DOI and Handle resolvers understand these well. But recording the difference can make implementation easier for tools that need to treat these cases differently. |
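As an illustration of how a tool might treat the two prefixes differently, here is a minimal Python sketch. The helper name `resolver_url` and the choice of `https://doi.org/` and `https://hdl.handle.net/` as resolver base URLs are assumptions made for this example; nothing in this thread prescribes them.

```python
# Hypothetical lookup table: prefix -> resolver base URL (an assumption for
# illustration, not part of any proposal in this thread).
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def resolver_url(identifier):
    """Return (scheme, url) for an identifier such as 'doi:10.2345/A5B3038'."""
    # Split only on the first colon so the rest of the string is left intact.
    scheme, sep, value = identifier.strip().partition(":")
    scheme = scheme.lower()
    if not sep or not value or scheme not in RESOLVERS:
        raise ValueError(f"unsupported or missing prefix in {identifier!r}")
    return scheme, RESOLVERS[scheme] + value
```

A string without a recognised prefix raises `ValueError` here, which is one way a tool could flag identifiers it does not know how to resolve.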
There is, for the DOI, a question as to whether the DOI should be verifiable. This is a problem if you want to use reserved DOIs: the CF checker would not be able to validate a reserved DOI until after the file is published and the DOI is finally released. This creates a validation loophole which would not be serious if you are dealing with one or two files, but if you are processing hundreds, let alone millions as in CMIP6, this would not be acceptable. This could be avoided by using a collection DOI, as described by @castelao. If you want a file to include a resolvable string that references the file itself, the Handle is really a much better approach than the DOI, because you can build a robust system supporting validation before publication (whether people actually implement that is another question, but I believe the standard should at least support validation). @castelao : would your use-case be supported if the use of a DOI was recommended to be only for collection DOIs which can be validated before the file is published? |
Thanks @jhausman! I agree with the importance of a citation instruction. Citing data is a new thing, and there is much confusion on the best way to do that. At the moment I only suggest how to cite the Spray data (my work) in the landing page, but you're right, I should include it somehow in the netCDF. I would be more inclined to have that information added in one of the text fields like summary or comment. Maybe in the field references, if clearly stated that it is the reference for the dataset itself. The main reason that I'm suggesting a field for the doi is to make it easier for machine reading, while the 'how to cite' would certainly be for human reading, which could easily understand the embedded text. If the idea is to automate the actual citation text, doi.org has an API that returns that in different standards. In Python I use that like:
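The snippet from the original comment is not reproduced here. What follows is only a sketch of the kind of request being described, assuming DOI content negotiation with the `text/x-bibliography` media type; the function names are hypothetical and the DOI in the usage note is the illustrative one quoted elsewhere in this thread.

```python
import urllib.parse
import urllib.request


def build_citation_request(doi, style="apa"):
    """Build (but do not send) a request for a formatted citation."""
    doi = doi.strip()
    # Accept both a bare DOI name and the "doi:" prefixed form.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    # Content negotiation: ask the resolver for citation text, not the landing page.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi, style="apa", timeout=10):
    """Send the request and return the citation text (requires network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_citation("doi:10.2345/A5B3038")` would only succeed for a DOI that is actually registered; the request-building step is kept separate so it can be inspected without network access.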
|
Thanks @martinjuckes , I wasn't aware of that standard for CMIP7. I like very much the idea of generalizing it. In that case, would it make sense for a single file to have both a doi and an hdl at the same time? If so, should the resource_identifier allow a list of identifiers? About the checker, I think the solution is to have different levels of alerts for the checker. The production level checker would require a valid doi and/or hdl, while a development level would only create a warning if it couldn't resolve that. I've been using only a collection level DOI, but I think we should not restrict others to that. Yes @TobiasWeigel, I agree. In this case we should include the "doi:" or "hdl:". @jhausman, does NASA use hdl? If so, how do you include them in the files? |
Allowing a list of identifiers looks like a good idea to me. There may be cases where people wish to record a collection level identifier and a file level identifier. I'm not keen on the idea of different outputs from the checker for different stages of production: I feel that this would be difficult to implement without getting into a discussion of the many different workflows which may be used to generate datasets with embedded identifiers, which could be a rather open-ended discussion. It may be better to point out that users may need to filter the checker output if they are running it on data with unpublished identifiers .. the nature of the filter would depend on their workflow. |
I think that CMIP7 is argument enough to change to resource_identifier, but I would like to wait a few days in case someone has a good argument against that or any other idea. @kenkehoe, what is your opinion about using resource_identifier instead? |
I think the checker shouldn't be expected to verify the contents of a resource_identifier attribute. That is a significant increase in scope for the checker. |
The proposed resource_identifier attribute is, in fact, functionally identical to the ACDD id attribute. I think we should leave it to ACDD to manage, rather than add it to CF. The same goes for cite metadata. If you feel that there is a need to add to, expand on, or improve the existing ACDD attributes, you should take it up with ESIP. |
Thanks for your input @JimBiardCics, but that is not correct. While ACDD defined the id as "An identifier for the data set, provided by and unique within its naming authority.", I wrote above:
I don't think that the attribute id should be changed, there is value for that as it is. I would rather use something that already exists, but I can't find any adequate one to assign DOIs. |
I agree that the id attribute is written in a way that precludes the use case of "using the same dataset DOI in multiple files". However, I believe Jim is correct when he says the proposed resource_identifier attribute is identical to the id attribute of ACDD. The difficulty is that the text defining the doi, "Digital object identifier (DOI) of the dataset." is easily misinterpreted to mean "the DOI of the entire dataset represented by this netCDF file." In which case many would expect the DOI to be unique to this dataset. (Though I don't know that there is any requirement that a DOI has a 1-to-1 relationship with a dataset, maybe it's OK to mint as many as you want for a particular dataset?) (Guilherme, was this also proposed elsewhere? I was thinking I responded to it previously, but can't find that.) Originally I also wanted a more general solution (sometimes other things are used for identifiers, or even citations), but I think ACDD's 'id' is that more general solution. If the DOI is also the id, it can and should appear in both attributes. Given that, 'doi' is exactly the right name for this attribute. While I have minor issues with some of the justification and background arguments, overall I agree with the principle that it is helpful to have a specific place where the doi can be found. The two things I'd like to see changed:
|
@graybeal, those are all good points, thank you very much! I'm not familiar with the hdl identifier, and couldn't find much documentation on that. If hdl is required to be unique in the sense that it can be used only once, to identify a single file, it would be a legitimate use for ACDD:id, and in that case I would be wrong in suggesting anything different for hdl than using the ACDD:id. I don't know hdl. I'm not aware of any other proposal for an attribute to address the DOI of the dataset that would allow convenient machine reading. In theory, one could mint several DOIs for the same object (N->1), but I can think of only a couple of situations that would justify it. More common is the possibility of one single DOI used for a collection of datasets (1->N), or subsets, which violates the uniqueness required by the ACDD:id. I strongly agree with your last two points, and I'll modify my proposal to include/clarify those issues. Based on #150 I guess I should edit my very first post?! If you have time, I would be interested in learning what your minor issues with some of the justification and background are, so I could also address them in my proposal. |
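@castelao I've got to disagree with your interpretation of the uniqueness concept in the ACDD id attribute. I also believe that naming an attribute doi is overly limiting. Regarding the uniqueness of the ACDD id, it seems to me that you are interpreting a dataset as being equal to a file. I don't think that is a valid assumption, and I know that the ACDD id attribute is being used with the assumption that a dataset is composed of many files. Regarding the name, there are numerous competing schemes for providing persistent identifiers, doi being just one of them. There is an existing, widely used scheme for differentiating them: prefixing with <namespace>:, where <namespace> is "doi", "ark", "purl", "urn", etc., as @TobiasWeigel pointed out, so it seems to me that it is overly limiting to create a specific doi attribute. It is not difficult for humans or machines to scan an attribute for a string like "doi:10.2345/A5B3038" and know that it is a doi. Again, I don't think CF is the right place for this attribute. I think we should leave this sort of metadata to ACDD. I'm sure the ESIP ACDD people would be happy to update or modify the id attribute or come up with a new attribute as needed. They are more knowledgeable, as a group, about the persistent identifier topic than we are. |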
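To make the point about scanning an attribute concrete, here is a minimal Python sketch. The regular expression is an assumption (a "doi:" prefix followed by a name of the form `10.<registrant>/<suffix>`) and is deliberately loose; it is not a validator, and `find_dois` is a hypothetical helper name.

```python
import re

# Assumed pattern: "doi:" followed by "10.", a numeric registrant code, a slash
# and a suffix that runs to the next whitespace.
DOI_PATTERN = re.compile(r"\bdoi:(10\.\d{4,9}/\S+)", re.IGNORECASE)


def find_dois(text):
    """Return DOI names found in free text, with trailing punctuation stripped."""
    return [match.rstrip(".,;)\"'") for match in DOI_PATTERN.findall(text)]
```

Because the suffix is matched up to the next whitespace, sentence punctuation that directly follows a DOI has to be stripped afterwards, which is the kind of ambiguity a dedicated attribute or a strict prefix convention avoids.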
Jim, I think these are good points, I think long ago I wrote the same points but didn't post them. I landed at a slightly different answer by assuming that the intent was to provide a limited service, that is, a service just for DOI entries, and that specifying how that should be done was a useful proposition.
While I agree that ACDD ID explicitly must be unique for every file, I do not think the intent is to make the doi attribute unique in the same way. (See @castelao's previous post and a narrow reading of the original description.)
I think someone who wants to can use any of those ID types, including DOIs, in the ACDD ID—but they have to be unique to each file.
The thing that arguably makes DOI deserving of its own field is that it is explicitly a citation mechanism, which is different than an identification mechanism. I'm not in love with DOIs as a perfect citation mechanism, and it is not the only citation mechanism (IRIs are accepted in many journals), but I think the science community has spoken to the value of DOIs, through their wide adoption. (And most of my long-ago objections to DOIs have been addressed.)
Should it be in CF? To be clear, I do *not* want to modify the ACDD id attribute, it seems exactly correct as specified. Adding this as a new attribute to ACDD would be fine in my view. But I think the ACDD is less widely used than CF and less well-known than CF, and so from that standpoint there's a benefit in including it in CF.
Of course, the id should be in CF too, even more so. It's a bit of a puzzle to me why it is not; I think that is a significant weakness in CF. (I could say it is not FAIR, but I can see the hackles going up all over the community at the F-word....) I would go so far as to say I'm uncomfortable putting the doi in CF, without putting the id in CF. Doing so would confuse the purpose of the doi attribute.
John
========================
John Graybeal
Technical Program Manager
Center for Expanded Data Annotation and Retrieval /+/ NCBO BioPortal
Stanford Center for Biomedical Informatics Research
650-736-1632
|
@castelao @graybeal It seems to me that CF is focused much more on usage metadata than discovery metadata, whereas ACDD is focused much more on discovery. As such, I think it is a much better fit, even if it is less used. CF could direct people to ACDD rather than duplicate the effort. If the majority view is that we need an attribute like this in CF, then at the very minimum, let's not call the attribute |
@castelao Here are a few minor points.
This implies there can only be one DOI for the data in the data object. But imagine 3 layers of nested data -- at each layer a different DOI may apply, so the lowest-level data has 3 DOIs, in some sense. I might say instead 'designate a Digital Object Identifier (DOI) for the CF data object'. (I think the data object is a less ambiguous concept than the 'data' or 'data set'?) If there can be multiple DOIs, as suggested at the very end of the post, that needs to be part of the description, e.g., 'designate one or more Digital Object Identifiers...'. I think you should also explicitly declare the purpose of the DOI, namely it is for citation. Which suggests it should be unique (only point to one file, or else how do you know what data exactly are being cited, and which or how many data sets have that DOI?), but use cases may vary.
'on the exact' => 'in the exact'
delete 'start to', make 'in its' => 'in their'
The proxy part should not be dropped. I know it is not explicitly required for tracking, but however the original DOI is created and distributed should be supported for this attribute. It communicates information to the reviewer, and it allows exact comparison with all the other uses of the exact same DOI (many of which will be syntactical comparisons of the entire string, is my bet).
The cited policy does not require just DOIs, it explicitly says 'doi or URL' (and see the examples at the end of the document, that include a URL).
With this in mind, you may wish to alter your request to make it "Add citation attribute", and allow the content to be either DOI or IRI (which can be easily distinguished by people and computers, especially if they use the full string instead of clipping off the doi:// part).
'chunks of the dataset assigned by the DOI' is a bit confusing, I think you mean here 'different chunks of the dataset assigned the same DOI'? Which is redundant but clearer.
Again, not quite the standard and not the only practice.
If multiple DOIs are allowed, this should be specified throughout the documentation. |
@JimBiardCics I'm good with the principle. But since it isn't a unique ID, I would strongly encourage a different name. (In fact the relationship is many-to-many, in the request: One ID can attach to many data sets, and many IDs can attach to one data set.) Space-separated lists in character arrays are already in use. Since a few weird identifiers can have commas (ick), let's stick with the space approach. |
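A minimal Python sketch of the space-separated approach follows, assuming each entry carries a `<scheme>:` prefix as discussed earlier in the thread. The helper name `split_identifiers` and the sample values are hypothetical.

```python
def split_identifiers(attr_value):
    """Split a blank-separated attribute value into (scheme, value) pairs."""
    pairs = []
    # str.split() with no argument splits on any run of whitespace, so commas
    # inside an identifier are left untouched.
    for token in attr_value.split():
        scheme, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"missing '<scheme>:' prefix in {token!r}")
        pairs.append((scheme.lower(), value))
    return pairs
```

Splitting on whitespace keeps an identifier such as `doi:10.1000/a,b` in one piece, which a comma-separated list could not do.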
Persistently identifying each digital object has benefits independent from the citation case. In CMIP6, each file has received an individual hdl: identifier, and this helps with tracking file versions, replication, and makes it possible to slice the whole CMIP6 data space differently for different purposes using collections. These are important benefits at the cyberinfrastructure level that also provide indirect benefits for users through new services. Citation is a specific use case for persistent identification, and there are implications from assigning DOIs such as having citation metadata and ensuring persistence of the objects themselves that would not fit the CMIP6 file identification case. An underlying question is that of object granularity. In some cases, it may be totally fine to have a doi at the file level (identifying only that single file), but this is at least not the case for CMIP. In CMIP6 we now have the precedent of constructing collections of files as datasets, and those two levels each bear their own identifiers, and the identifiers are linked via metadata to describe the collection structure. This way, we can work with identifiers for (possibly not finalized, citable) objects, but can also assign DOIs at the level of granularity that is most appropriate for citation. In consequence, I would again motivate to
A narrower solution bears the risk that it would only apply to some cases where CF is relevant and this would certainly not be wise in the long term. |
I heartily agree with @TobiasWeigel. If we are going to have this attribute in CF, let's take this route. It might be good to make the attribute name plural to reflect the possibility of multiple elements. |
There does appear to be some overlap with the CMIP6 Handle use case and the ACDD It is clear that the ACDD |
Martin, on what basis do you conclude the following? I have never thought the form of the ACDD id, at least, was constrained in any way. (And examples in ACDD should definitely not be considered exhaustive!)
On Jun 13, 2019, at 9:20 AM, Martin wrote:
There is a distinction, however, in that ACDD is designed around identifiers of the form <naming_authority>:<code>, where <code> is a unique identifier within the namespace managed by <naming_authority>.
Re the most recent suggestions: I think the proposal was for a citation device, not an identifier. If that is true, then to be clear, arguing that we should change it to an identifier is not 'taking a route', it is opposing the original request.
If we want an identifier that is in some way more specific in its resolvability/parseability than ACDD's id, then let's put in a new proposal for that. But if I recall correctly, this was discussed at the time for the 'id' (I was hoping the id could be resolvable) and it did not receive support, perhaps for backwards compatibility reasons.
John
|
Hello @graybeal , sorry if I have mis-understood the original purpose of the proposal. @castelao accepted the idea of generalising the proposed new attribute to support at least In other words, generalising to support an additional identifier is not opposing the original intent of supporting the use of Unlike the My interpretation of the ACDD |
Here is an overview of the persistent identifiers in common use right now. From this page.
Here's another, more in-depth discussion. They mention other persistent identifiers that have narrower focus, such as LSIDs and XRIs. It's worth noting that DOI is a profile of Handle. A survey of web discussions shows that this whole domain is in a certain degree of flux, with some people particularly advocating for ARK over DOI for data. This is part of the reason I think we really need to go with a more generic name, such as |
I'm not sure what we (CF community) are adding by 'defining' this field. It is currently allowed in CF-compliant files, and unless I missed something, mentioning it in the CF docs doesn't make it any more useful. The proposal specifies syntax details that may not be appropriate in some cases. "For simplicity, the proxy part of the DOI is dropped, so it is composed by the suffix plus the prefix only" - this is putting an unnecessary restriction on the use of DOIs in CF files, since other standards may call for something different. As stated in the original proposal, the use of a DOI is required by some publishers, and so it's being used when appropriate (or when required, even if inappropriate, I guess). I believe everyone knows what it means, and how it is used - why does CF need to be involved at all? |
I'm sorry for the slow response. Thank you for all the inputs, I do appreciate them. As some of you already noted, I believe that we are mixing things here (id x citation) and I was probably responsible for starting that confusion with my initial proposal. Let me try to walk this through. Do we agree on the importance of a unique persistent identifier, whatever your preferred solution is? That would be something that could allow designating one specific file. It does not necessarily mean that everyone must use it, nor that there is one best solution for all, but is it OK to assume that everyone agrees that we must have some good solution to assign an id to a file? For that purpose, the ACDD:id seems to be generic enough to accept the several possibilities discussed here (hdl, ark, urn, ...) as defended by @JimBiardCics. I agree with Jim that if we can use something that already exists, let's just do it. @martinjuckes and @TobiasWeigel, your approach to managing CMIP data makes total sense to me. With so many variables, versions, members, etc, one must have a solid way to tag each file. I believe that hdl does not violate the ACDD:id definition, but ACDD:id brings the benefit of allowing less robust id systems for other smaller datasets that don't require a sophisticated identification. Some people will do a better job on that than others, but it is for the data providers to decide their path (and also pay the price of bad choices). I support Jim's suggestion of using the field ACDD:id for the file identifier. Note that DOI could be used to track individual files. I'll limit myself to saying that I don't do that: I do not use DOI to track my data files. Another interesting point was raised by @graybeal: why does CF not have an id equivalent attribute? My impression is that such an id is an operational matter, more than just discovery metadata. Thus it would be a legitimate case for CF according to @JimBiardCics's distinction between CF and ACDD.
I'm not sure that I want to suggest that, but if this is the case, we do have a precedent. All the global attributes in CF are duplicated in ACDD (title, institution, source, history ...). If CF adopted the global attribute id, that would require much caution to avoid conflicts and guarantee backward compatibility. Maybe by using the exact same definition, which allows a quite broad spectrum of possibilities? Or it might require a new attribute like 'persistent_ids', as suggested by @JimBiardCics. I don't have an opinion yet on whether that would be worth the redundancy, but I agree that CF lacks that. Now we are finally getting to my point. The goal of my proposal was to support efficient citation and consequently the tracking of scientific impact. The natural choice was to use what already exists and is well established in our scientific community, the DOI, so that's what I did. But I learned that there are other options for citation, so of course, it should be a generic field that allows different standards instead of being restricted to doi. I believe that was the idea of @jhausman in her comment early in the discussion, but I didn't get it at that point. @graybeal suggested using a generic 'citation'; maybe 'citation_id' would be more explicit. As a generic solution, the proxy part would obviously be required again. @TobiasWeigel and @martinjuckes, do I understand it right that you track your files with hdl, and recommend that users cite the doi of the data collection? One thing is the file identification, which would be a 1-1 tag; another thing is the citation, which is 1-N (with some edge cases of N-N), and I don't think that we can resolve the two of them at the same time without a high price. So I'll change my proposal to a generic field that would contain the citation identification. OK, now why add a citation identification?
We could use the citation as text in one of the available attributes like many already do, and that could even include the DOI (as text) in it, and with a sufficiently long regexp one could find it. Also, CF indeed gives everyone the freedom to do it as they want. That is already happening, as I mentioned in my proposal. I already saw some variations: doi, DOI, digital_object_identifier, doi_url, ... Well, the same argument for why we use a standard table of names, or why we suggest using only 'comment' instead of 'Comment', 'COMMENT' or 'comments', is the reason why it would make a difference to define a standard that accepts the DOI. I cannot understand the argument of letting each one decide how to do it at the same time that we defend CF, ACDD, and so many other standards. I can see someone thinking that a citation id is not relevant enough to have an attribute, but I can't imagine a PI who wrote a proposal to produce some sort of data being against a better way to cite that data and receive credit for it. This matter is not about ego, but about acknowledging the funding and efforts that went into that data, and seeking the chance to keep doing more. I think it is incoherent to approve the need to assign a DOI to the CF-Conventions document but neglect a DOI for the data. I believe that a citation field is as fundamental as knowing the title of that dataset. I think that CF should support recognition for everyone, especially the funding agencies, responsible for producing each netCDF-CF dataset. I'll change my proposal to a generic citation identifier that should allow options other than DOI. There are many other good points and ideas that I didn't mention in this comment, but that I intend to include in the new version of the proposal. I'm very interested in hearing your feedback. |
@castelao : thanks for an excellent review. The revised/clarified focus on There is some overlap the CMIP handle and the ACDD I agree that traceability of data usage is becoming important, and the DOI is the best available mechanism, so there is a good case for having a specific place in a netCDF file and recommending that people use it. regards, |
@JimBiardCics, thanks. I like that distinction in concept, but as you pointed out, there is some overlap and there are some cases that are not so clear to me. I have that dual feeling about citation_id. Before I go further on this argument, I have one basic question: is ACDD active? I've been using ACDD-1.3. It looks like the last update was in early 2015, is that correct? I know that there are some people here who are closely involved with it. If it is active, what is the process for proposing changes? |
Great summary @JimBiardCics . @castelao It is my sense, based on questions and personal exchanges, that ACDD remains heavily used. Several large data provider communities adopted ACDD long ago and I'm pretty sure they all continue to use it. There are occasional questions on the list for managing ACDD, and those get discussed by multiple people and answered. There have only been a few requests for changes during that period, none of them are actively moving forward that I know of. (IIRC, one was going to be quite complicated and the author dropped it, a few others were considered either not appropriate or not likely to be adoptable by responders, and were likewise dropped I assume.) I like to imagine that there are not many requests because it's such a well-decided and well-structured specification… So while you won't see a lot of traffic about it, I think that doesn't mean it's not 'active' as a standard. And I think if a request like this comes to ACDD, the responses would give you an idea whether it would go through easily and quickly, only after some discussion, or not for a good while. Feel free to contact me off-line if you want to discuss. |
Just to clarify, I think that ACDD is widely used. At least I use it myself together with CF. My question is how one could propose something new to ACDD?
|
I've just caught up on this (I think). There are a couple of comments early on that I don't agree with, concerning the (intended) use of DOIs and the persistence of the object to which a DOI points. Leaving aside the syntax (who owns the ID), there are some semantic and policy issues to discuss: Let's be clear, a DOI only points to a
I could write more, but you can see where I am going ... My preferred solution is to ensure that there is a UUID in the file, and a well known service which maps the UUID back to one or more DOIs ... |
@bnlawrence Thanks for chiming in! I expect that the CF community (and the ACDD community) is not all that interested in standing up a persistent UUID mapping service. I think the ACDD Do you see any problem with adding persistent identifiers to files if they are dereferencing to landing pages for collections? It seems to me that a file containing a persistent identifier and a creation date would provide sufficient information to allow a user to sort out the appropriate information about the collection/dataset at the web site when the DOI was dereferenced. In the nested / cross-cutting example you mentioned, it still seems that a well-rounded set of ACDD metadata with a DOI or DOIs wouldn't pose any real difficulty. But this is yet another reason I'm in favor of leaving this to ACDD. They have more people involved who know more about this. |
@castelao To propose something new re ACDD, send email to the ESIP Documentation list at esip-documentation@lists.esipfed.org (list info at http://lists.deltaforce.net/mailman/listinfo/esip-documentation). The ESIP Documentation cluster has the role of managing this specification. @bnlawrence A few nits below, but may I ask you to spell out where you are going? Are you saying you don't like the idea of a citation_id, or of allowing DOIs as the citation_id, or something else? Other thoughts:
Summary of A-B-C-D-E categories from https://doi.org/10.1098/rsta.2008.0237:
|
@JimBiardCics Yes, I'm strongly in favour of adding persistent identifiers to files, but not C or D metadata. I think that should be dealt with by web pages, and/or services, that bundle identifiers together and make the necessary links. I do want data to be citable and publishable ... @castelao I don't like the idea of putting something which carries "D" semantics in files, which means I don't like the name citation_id, and I don't like the idea of a DOI being in the file. I do however very much like the idea of putting in place the necessary information to construct that information post-fact. (What I like is of course not the end game here, so thanks for bringing it up, as always, I'll go for the consensus, even if I'm on the other side :-) --- in this context, I think the right community to discuss it might be ACDD ... ) My historical thinking on these issues (in the context of climate modelling) is on my blog: Streaming data is interesting. I don't think streams should have a DOI, but they should certainly have identifiers. Over the years I may not be winning on this ... but this usage conflicts with the notion of a digital object (singular) identifier ... and the use of DOI as a publication entity, not just an identifier (a la some of the discussion on my blog). It all comes down to what we think a DOI is for. Persistent identifiers in files - big yes, DOIs, no :-). |
@JimBiardCics said: As Bryan Lawrence points out in the blog posts he references in the github issue there is some conflation of purposes for persistent identifiers. I tend to see two top-level purposes for persistent identifiers within a netCDF file.
There are likely others, but these are the ones that occur to me. Within the second purpose I see a few different, related uses (and there are probably more):
It seems to me that it's worthwhile to provide a means to accomplish both top-level purposes within netCDF files. So what about DOIs in relation to the more general topic of persistent identifiers in netCDF files? The International DOI Foundation (https://www.doi.org/index.html) says this about DOIs in Section 1.6.1 (https://www.doi.org/doi_handbook/1_Introduction.html#1.6.1) of their Handbook:
DOI is an acronym for "digital object identifier", meaning a "digital identifier of an object". A DOI name is an identifier (not a location) of an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI name can be assigned to any entity — physical, digital or abstract — primarily for sharing with an interested user community or managing as intellectual property. The DOI system is designed for interoperability; that is to use, or work with, existing identifier and metadata schemes. DOI names may also be expressed as URLs (URIs).
...
Unique identifiers (names) are essential for the management of information in any digital environment. Identifiers assigned in one context may be encountered, and may be re-used, in another place (or time) without consulting the assigner, who cannot guarantee that his assumptions will be known to someone else. Persistence of an identifier can be considered an extension of this concept: interoperability with the future. Further, since the services outside the direct control of the issuing assigner are by definition arbitrary, interoperability implies the requirement of extensibility. Hence the DOI system is designed as a generic framework applicable to any digital object, providing a structured, extensible means of identification, description and resolution. The entity assigned a DOI name can be a representation of any logical entity.
Based on this description of DOIs, it seems to me that a DOI is a valid, if poor choice for the first top-level purpose that I mentioned. It also seems to me that DOIs are well-suited for accomplishing the second purpose uses. They aren't the only way to accomplish these ends, but they certainly represent a way to do so. Grace and peace, Jim |
So I agree that you could use DOIs, but I think this is a lot of baggage at write time, and sends the wrong message about what a per file identifier is intended to achieve. You can have a unique identifier at write time via the uuid mechanism, and for the data workflow, I think that is all you want and need. All these other use cases nearly always deal with aggregations of files, and attaching a DOI to aggregations is fine by me. The CMIP tracking id in each file gives us all we need. I can see a case for having a CF tracking_id which would accomplish the same result. |
NOAA is including the DOI in the NetCDF files when the collection has a DOI. Each NetCDF file is a part of the associated collection and therefore is considered to be under that DOI's umbrella. I'll find a specific example for reference. Ideally, when users use a subset of this collection, they will cite the resource with the DOI and provide some context as to the subset used (fileIDs, extent...). |
Isn't there a recursion problem? If you issue the DOI for the file and then
embed it, the file checksum will not match the one that got the DOI....
V. Balaji Office: +1-609-452-6516
Head, Modeling Systems Division, GFDL Mobile: +1-917-273-9824
Princeton University Email: balaji@princeton.edu
https://www.gfdl.noaa.gov/v-balaji-homepage
|
@balaji-gfdl, as I mentioned before, DOI is not a checksum, so this is not a problem. It is possible to register a DOI even before creating the data file. |
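The exchange above about checksums can be illustrated with a small sketch. It assumes a made-up DOI (`10.1234/example`) and stand-in byte strings rather than real netCDF content; the only point is that embedding an identifier changes the file's checksum while the identifier itself stays the same, which is why reserving the DOI before writing the file avoids the recursion.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical DOI, reserved before any file exists (an assumption for this sketch).
doi = "10.1234/example"

# Stand-in for file contents; a real netCDF file would be binary.
original = b"temperature: 12.1 12.3 12.2\n"

# Embedding the DOI as an attribute alters the bytes of the file.
with_doi = b'doi: "' + doi.encode("ascii") + b'"\n' + original

checksum_before = sha256_hex(original)
checksum_after = sha256_hex(with_doi)

# The checksum differs, but the DOI is unchanged: it identifies, it does not verify.
checksum_changed = checksum_before != checksum_after
```

If byte-level verification is needed, a checksum can be published alongside the DOI record rather than being expected from the DOI itself.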
@amilan17, if I understood it correctly, we do the same thing for Spray underwater gliders. Each data file goes with the DOI of the collection, so it is feasible for the users to cite the data. How do you include the DOI in the file? We use a global attribute named 'doi', and that is it, as simple as that. What I would like to achieve here is a consensus so we all do it in the same way: doi, citation_id, or anything else, but let's follow the same procedure and take advantage of easy automation. |
First, @castelao my apologies for not following this thread. That is my error. Second, I feel this proposal has gone too far into the weeds. I think the original intent was to provide a reserved attribute name to indicate a DOI. I think we should stick to that scope. Anything dealing with what the DOI references or how the DOI is created or if the checker resolves the DOI to check if correct is outside the scope of this proposal. All the DOI attribute should do is specify that the text is a DOI link or "list" of DOIs. I don't think CF should dictate how a DOI should be used across all programs. That should be a decision for the data provider or DOI site. The linking and searching of the DOI can be done with some other tool with as much complexity as needed to resolve to the precision needed. I think the question of using an attribute of "doi" or "resource_identifier" or something else comes down to how we want to use it. If the intent is to just put the information into the file we can push everything into "references" and then have software parse and just figure it out. But that is not nice to data consumers. So for example using existing CF standards we could do something like this:
and then expect the user to figure it out. But we can also make life easier on the user by turning things into key:value pairs
so the user does not need to parse the text looking for a "doi:" keyword. I think that is the spirit of this proposal. If we also want to add "hdl" or some other attribute name to the reserved attribute list, we can do that. I suggest we make the proposal simply that there is a reserved attribute named "doi" that has a value of character array (for now since we have a string argument happening elsewhere). The Character array will be a space separated list (same as all other character array attributes) that can list one or more DOIs.
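To make the contrast concrete, here is a minimal sketch of what a data consumer would have to do in each case. The attribute values, the function names, and the regular expression are all assumptions for illustration (the pattern is a common heuristic for DOI names, not part of any CF text), and the DOIs are made up.

```python
import re
from typing import List

# Heuristic pattern for DOI names; an assumption, not a CF or DOI Foundation rule.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def dois_from_attribute(value: str) -> List[str]:
    """Dedicated 'doi' attribute: a space-separated list, so a split is enough."""
    return value.split()


def dois_from_references(text: str) -> List[str]:
    """Free-text 'references' attribute: scan for DOI-like tokens and trim punctuation."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


def to_url(doi: str) -> str:
    """Express a DOI name as a resolvable URL."""
    return "https://doi.org/" + doi


# Hypothetical attribute values.
doi_attr = "10.1234/abc 10.5678/def"
references_attr = "Smith et al. (2019), doi:10.1234/abc. See also doi:10.5678/def"

from_attr = dois_from_attribute(doi_attr)
from_refs = dois_from_references(references_attr)
urls = [to_url(d) for d in from_attr]
```

The dedicated attribute needs no guessing about punctuation or keywords, which is the convenience argued for above; the free-text path works only as well as the heuristic does.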
I also suggest we place no restriction requiring it to be a global attribute. It can be used on a variable or, now that we have groups, in as many groups as needed. The standard precedence between variable and global attributes applies: if a DOI is defined at both the variable and the global level, the variable DOI supersedes the global value for that variable. |
@kenkehoe If the proposal is for an attribute named |
Dear Gui @castelao This issue had a vigorous discussion, but did not come to a consensus or conclusion, last comment in July 2019. Should it be pursued, do you think? Best wishes and thanks Jonathan |
@JonathanGregory , thanks for checking. I still think it is an important issue but indeed it didn't come to a consensus. It is probably better to close this issue. I suggest just holding it for a few days. Maybe somebody else might be interested in moving/saving these discussions to another convention or standard. Thanks! |
It's time to close this one. Thank you all for your contributions. @justinbuck, @vturpin, @emmerbodc, @jenseva, some ideas and discussions here might be of your interest. |
THIS IS OUTDATED. I'm editing this proposal to reflect the discussions so far, but I'll save a copy of this original proposal.
Title: DOI attribute
Moderator: to be defined
Requirement Summary: Optional DOI attribute in section Description of file contents (2.6.2).
Technical Proposal Summary: Add a new optional attribute to designate the Digital Object Identifier (DOI) of the data contained in the CF data object.
Benefits: DOIs allow easy automation for tracking the scientific impact of the data in exactly the same way that scientific publications are tracked with DOIs. Anyone involved in the resulting data can be recognized, including funding agencies.
Status Quo: An increasing number of scientific journals are starting to require a DOI for the dataset used in the publication. Many groups already include a DOI as an attribute in their NetCDF-CF datasets, but without a standard, so it is hard to automate.
Detailed Proposal: The only modification required would be in section 2.6.2: Description of file contents. At the bottom, after the item comment, the following would be added:
As mentioned in the 2.6.2 section, all attributes are optional, and the doi would follow the same rule.
This proposal was developed with the help of @kenkehoe
Reasoning:
DOI is a de facto standard to track academic publications, thus providing the foundation for some measurement of scientific impact. There is a clear intention by the scientific community to also track the scientific impact of data and software, thus giving proper credit to those who make them available. The strategy adopted by AMS journals, and more recently by AGU, was to require that the DOI of any dataset used in a publication be cited in the reference list (https://www.ametsoc.org/ams/index.cfm/publications/authors/journal-and-bams-authors/formatting-and-manuscript-components/references/dataset-references/).
The use of DOIs for datasets will increase. A few groups already include the dataset DOI in their NetCDF-CF data files, but without a standard, it is hard for a machine to keep track of that.
Justification:
Tiny background on DOI:
A dataset DOI can have a record referring to the DOI of the NetCDF-CF documentation, thus giving a metric of the impact of the cf-conventions document. Adding metadata to the DOI record, or adding further DOIs in the future, does not change the DOI and allows for future updates without needing to update or reprocess the netCDF data file.
Details
Example