introduction/scope: storage vs publication #17
Comments
Both here and on the original SWC issue I sense that we're conflating issues of data storage versus data publication. If your entire career's data is the size of iris, the distinction between the two matters less: you can trivially keep local copies in multiple locations while also maintaining a public-facing GH repo, figshare, whatever. But it's a very different situation when you've got data on the order of tens to hundreds of GB, up into the TB+ range. Especially when (god forbid) you're not interested in publishing your data, or not allowed to.
I think it would be useful to set out, both here and in the introduction, what exactly we're hoping to accomplish with this set of rules.
@stephenturner I agree with you completely. It's difficult to tease apart rules for storing data and rules for publishing data. I think it's important that we don't just cast issues like metadata and discoverability aside, but this is the space to consider issues like size, privacy, etc., and give them much greater prominence.
I had the same thoughts as @stephenturner re scope: is the focus on storing, publishing, or both? Either way, I think our audience should be ordinary scientists, for whom dataset size is usually not an issue (at least in my field). As context, we recently published a data paper (in press at Ecology) with data on plant biomass from over 170 different studies. By disciplinary standards this is a big dataset, but all up the final csv files are < 10 MB. We'll have copies on GitHub, Ecological Archives, and possibly elsewhere (see #10). Those who have really big data (in terms of size) face more challenges, but they are far fewer and possibly better equipped to deal with those challenges. Plus, the technology of really big data is moving very quickly, so an article on possible solutions may date faster than it can be published.
@dfalster I'd be careful judging which scientists are ordinary based on whether they have issues with data size :). In all seriousness though, I have to disagree with you: I don't think this article should focus on data that's 10 MB. Sure, big data is relative, and "big" often lies in the data's complexity, not its size in bytes. But if data is small enough that it can be effortlessly mirrored, backed up, and versioned using rsync or similar across all of Dropbox, GitHub, Google Docs, Figshare, Zenodo, S3, and cheap flash drives, then I don't think we can write anything that hasn't already been written. Agreed, we definitely run the risk of writing something that will be obsolete within a year or two, but that's the nature of this area.
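To make concrete how low the bar is at that scale, here is a minimal Python sketch of checksum-verified mirroring of a small file to several destinations. The file name and destination paths are hypothetical stand-ins for a Dropbox folder, a GitHub working copy, and a mounted flash drive; in practice rsync or each service's own client would do the copying.

```python
# Minimal sketch: copy a small dataset to several destinations and verify
# each copy with a SHA-256 checksum. All paths below are hypothetical.
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(dataset: Path, destinations: list[Path]) -> None:
    """Copy `dataset` into each destination directory and verify the copy."""
    expected = sha256(dataset)
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(dataset, dest / dataset.name))
        if sha256(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")


# Example usage with hypothetical file and mount points; substitute your own.
if Path("biomass.csv").exists():
    mirror(Path("biomass.csv"), [
        Path.home() / "Dropbox" / "data",       # cloud-synced folder
        Path.home() / "repos" / "biomass-data",  # git working copy
        Path("/media/usb/backup"),               # flash drive
    ])
```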
It's not unusual for models to generate outputs on the order of GB. I've had one reach the 1 TB bar in a few days. This poses all sorts of issues for storing the data, both locally (I don't have a 1 TB hard drive on my laptop) and remotely. I think we should emphasize that even though most data are small (in size), it's going to be increasingly common to deal with large datasets, so we should prepare ourselves (and our students).
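As a rough illustration of the kind of workflow that problem forces, here is a minimal Python sketch of spooling output into compressed chunks as it is produced, so that each finished chunk can be shipped to remote storage and deleted locally before the raw stream ever has to fit on one disk. The model generator is a hypothetical stand-in, not any particular simulation.

```python
# Minimal sketch: write model output in gzip-compressed chunks as it is
# generated, rather than materializing the full raw file locally first.
import gzip
import itertools


def model_output():
    """Stand-in generator for a simulation emitting one CSV record per step."""
    for step in itertools.count():
        yield f"{step},{0.5 * step}\n"


def spool(records, n_chunks, records_per_chunk=100_000):
    """Write `n_chunks` compressed chunks of the record stream to disk."""
    stream = iter(records)
    for i in range(n_chunks):
        with gzip.open(f"output_{i:04d}.csv.gz", "wt") as f:
            f.writelines(itertools.islice(stream, records_per_chunk))
        # At this point output_{i:04d}.csv.gz is complete: it can be
        # uploaded to remote storage and removed locally before the
        # next chunk is written.


spool(model_output(), n_chunks=2)
```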
I'd also point out that this paper is going to be submitted to PLOS Computational Biology, whose field routinely deals with data >> 10 MB (though whether computational biologists can be considered "ordinary" is a separate question 😉).
One thing that we don't want to lose here is that a lot of our rules will likely apply equally to <10 MB and >>10 MB data. For instance, lousy (or non-existent) metadata (#22 and #11) is a problem for both small and large datasets, and if you fail to back up your data (#10), losing your entire dataset is just as big a loss whether it is small or large. In short, I don't think the size of a dataset should be used as a criterion for our rules or the scope of the discussion.
@jhollist is right: why not save this discussion for the conclusion of the paper?
Thanks for sharing your thoughts! I should have added that I am of course …