How should the database be versioned and referenced? #9
Regarding the version number, we need to decide on a format.
It seems like DOI is the way to go to reference a formal release like this, and that seems easy to set up with Zenodo. This raises some related questions:
FYI, I actually already made a formal release of the database in August 2017 using Zenodo (the GitHub-Zenodo integration is set up to mint a DOI with each release here): 10.5281/zenodo.838833. (This was for a conference paper.)
So it looks like you used |
I did, and I agree this might be a reasonable approach. Perhaps we can just issue a new monthly release at the end of a given month if there were any changes made during that month? In the (probably rare) cases where someone wants to cite the database before that release has been archived, it would be like citing the dev version of software that hasn't been released... seems fine to me.
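As a rough illustration of the monthly rule suggested above, here is a minimal Python sketch. The function name and the `YYYY.MM` label format are assumptions made for this example; the thread has not settled on a version format.

```python
from datetime import date


def monthly_release_label(change_dates, year, month):
    """Return a label such as '2017.08' if any change landed in that month, else None.

    The 'YYYY.MM' format is an assumption for illustration only.
    """
    changed = any(d.year == year and d.month == month for d in change_dates)
    return f"{year:04d}.{month:02d}" if changed else None


# Hypothetical change dates, e.g. taken from commit timestamps.
changes = [date(2017, 8, 3), date(2017, 8, 21)]
print(monthly_release_label(changes, 2017, 8))  # a release is due for August
print(monthly_release_label(changes, 2017, 9))  # nothing changed in September
```

A month with no changes yields no release, so consecutive labels may skip months.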
This still leaves open my question about a CHANGELOG. To me, it makes sense to have an easy-to-read document detailing the differences between releases. Yes, this info is in the git commit messages (though sometimes not in particularly easy-to-find ways), but many people who we want to use and cite the database may not be familiar with git, or may find it challenging to navigate, while the CHANGELOG messages are associated with the release itself in plain writing.
So the CHANGELOG would essentially be
[Unreleased]
* Added Methyl Valerate data
[August 2017]
* Added butanol data
That seems fine, although I'm not sure what people would do with that information, because I don't care when the data was added, just that it was added. I guess it would also indicate when we made database version upgrades touching all the files, which would probably be valuable. However, I still foresee scaling issues as the database grows, and I think we should put some more thought into how people should reference individual files or subsets of the files, if they want to. See here (and the links at the bottom) for more info about referencing data files: http://libguides.lib.msu.edu/citedata (I haven't looked through those references yet, but will do so soon).
Ugh... my preference would be just to reference a release of the database, since generally that would be done when using one or a subset of the files anyway. To me, this would be analogous to citing SciPy when using one or two subroutines: you mention what you used, but cite the whole package/library. In our case, the recommended practice could be to cite the database, and then the reference(s) associated with the file(s) used.
The case that I'm concerned about is
Do you think that situation is going to occur decently often, or do you think it'll be an outlier? Personally, I can see that happening fairly frequently once we get a decent-sized database (and decent-sized user base...). I would prefer to fix problems like this now rather than when it becomes a real problem, particularly since I think part of the solution is our permanent identifier scheme, which we really, really shouldn't change once it's in place (and it would probably be better not to change the versioning scheme either).
The question then is: how should the user cite the database version in that case? Should the user retool their application to support the newer format so they can cite the most recent database version? Would the CHANGELOG be detailed enough for the user to go back and look? (Probably not, if their file has been through several revisions before they downloaded it.) They could go back through the git history and find the release that most closely matches their version of the file (probably using something like git blame), but that seems error-prone and requires users to be fairly expert git users.
Unfortunately, I don't really have a good idea for a solution... which means I'm just being a pain in the butt. Ugh, I hate it when I do that. I'll try to come up with something (and maybe one of those references I linked will have something...). And, FWIW, I don't think that assigning a DOI or similar identifier to individual files will work (like some of the other databases do), because then a user would have to cite every one of those, which really doesn't scale for even moderately sized datasets.
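One way to take the git expertise out of the "which release does my file come from?" step is to compare content hashes instead of reading history. The Python sketch below is only an illustration of that idea, not something proposed in this thread. It computes the same SHA-1 that git assigns to a file's contents (a blob) and looks it up in a per-release index. The release tags, file path, and the `releases` index are hypothetical; in practice such an index could be built from the output of `git ls-tree -r <tag>`.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash bytes the way git hashes a file: sha1(b"blob <size>" + NUL + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def releases_containing(content: bytes, path: str, releases: dict) -> list:
    """Return the release tags whose recorded blob hash for `path` matches `content`."""
    wanted = git_blob_sha1(content)
    return [tag for tag, files in releases.items() if files.get(path) == wanted]


# Hypothetical index: release tag -> {file path -> blob hash}.
releases = {
    "2017.08": {"butanol/example.yaml": git_blob_sha1(b"file-version: 1\n")},
    "2017.09": {"butanol/example.yaml": git_blob_sha1(b"file-version: 2\n")},
}

# A user holding the first version of the file finds the release(s) to cite.
print(releases_containing(b"file-version: 1\n", "butanol/example.yaml", releases))
```

An exact match only works if the user's copy is byte-for-byte identical to a released version; a locally modified file would match nothing, which is arguably the right answer for citation purposes.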
Even DataCite recognizes this isn't an easy problem to solve. On Pg. 11 of the PDF here: https://schema.datacite.org/meta/kernel-4.1/doc/DataCite-MetadataKernel_v4.1.pdf they mention
I guess there are two things here that are related: How do we define a version number for the database? and How should people reference/cite the database?