introduction/scope: storage vs publication #17
Comments
Both here and on the original SWC issue I sense that we're conflating issues of data storage versus data publication. If your entire career's data is the size of iris, the distinction between the two matters less: you can trivially keep local copies in multiple locations while also maintaining a public-facing GH repo, figshare, whatever. But it's a very different situation when you've got data on the order of tens to hundreds of GB, up into the TB+ range. Especially when (god forbid) you're not interested in publishing your data, or not allowed to.
I think it would be useful to set out, both here and in the introduction, what exactly we're hoping to accomplish with this set of rules.
@stephenturner I agree with you completely. It's difficult to tease apart rules for storing data and rules for publishing data. I think it's important that we don't just cast issues like metadata and discoverability aside, but this is the space to consider issues like size, privacy, etc., and give them much greater prominence.
I had the same thoughts as @stephenturner re scope: is the focus on storing, publishing, or both? Either way, I think our audience should be ordinary scientists, for whom dataset size is usually not an issue (at least in my field). As context, we recently published a data paper (in press at Ecology) with data on plant biomass from over 170 different studies. By disciplinary standards this is a big dataset, but all up the final csv files are < 10 MB. We'll have copies on GitHub, Ecological Archives, and possibly elsewhere (see #10). Those who have really big data (in terms of size) face more challenges, but they are far fewer and possibly better equipped to deal with those challenges. Plus, the technology of really big data is moving very quickly, so an article on possible solutions may date faster than it can be published.
@dfalster I'd be careful judging which scientists are ordinary based on whether they have issues with data size :). In all seriousness though, I have to disagree with you: I don't think this article should focus on data that's 10 MB. Sure, big data is relative, and "big" often lies in the data's complexity, not its size in bytes. But if data is small enough that it can be effortlessly mirrored, backed up, and versioned using rsync or similar across all of Dropbox, GitHub, Google Docs, Figshare, Zenodo, S3, and cheap flash drives, then I don't think we can write anything that hasn't already been written. Agreed, we definitely run the risk of writing something that will be obsolete within a year or two, but that's the nature of this area.
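To make concrete how low the bar is at that scale, here is a minimal Python sketch of checksum-verified mirroring of a small file to several destinations. The file name and destination paths are hypothetical stand-ins for a Dropbox folder, a GitHub working copy, and a mounted flash drive; in practice rsync or each service's own client would do the copying.

```python
# Minimal sketch: copy a small dataset to several destinations and verify
# each copy with a SHA-256 checksum. All paths below are hypothetical.
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(dataset: Path, destinations: list[Path]) -> None:
    """Copy `dataset` into each destination directory and verify the copy."""
    expected = sha256(dataset)
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(dataset, dest / dataset.name))
        if sha256(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")


# Example usage with hypothetical file and mount points; substitute your own.
if Path("biomass.csv").exists():
    mirror(Path("biomass.csv"), [
        Path.home() / "Dropbox" / "data",       # cloud-synced folder
        Path.home() / "repos" / "biomass-data",  # git working copy
        Path("/media/usb/backup"),               # flash drive
    ])
```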
It's not unusual for models to generate outputs on the order of GB. I've had one reach the 1 TB bar in a few days. This poses all sorts of issues for storing the data, both locally (I don't have a 1 TB hard drive on my laptop) and remotely. I think we should emphasize that even though most data are small (in size), it's going to be increasingly common to deal with large datasets, so we should prepare ourselves (and our students).
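As a rough illustration of the kind of workflow that problem forces, here is a minimal Python sketch of spooling output into compressed chunks as it is produced, so that each finished chunk can be shipped to remote storage and deleted locally before the raw stream ever has to fit on one disk. The model generator is a hypothetical stand-in, not any particular simulation.

```python
# Minimal sketch: write model output in gzip-compressed chunks as it is
# generated, rather than materializing the full raw file locally first.
import gzip
import itertools


def model_output():
    """Stand-in generator for a simulation emitting one CSV record per step."""
    for step in itertools.count():
        yield f"{step},{0.5 * step}\n"


def spool(records, n_chunks, records_per_chunk=100_000):
    """Write `n_chunks` compressed chunks of the record stream to disk."""
    stream = iter(records)
    for i in range(n_chunks):
        with gzip.open(f"output_{i:04d}.csv.gz", "wt") as f:
            f.writelines(itertools.islice(stream, records_per_chunk))
        # At this point output_{i:04d}.csv.gz is complete: it can be
        # uploaded to remote storage and removed locally before the
        # next chunk is written.


spool(model_output(), n_chunks=2)
```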
I'd also point out that this paper is going to be submitted to PLOS Computational Biology, whose field routinely deals with data >> 10 MB (though whether computational biologists can be considered "ordinary" is a separate question 😉).
One thing that we don't want to lose here is that a lot of our rules will likely apply equally to <10 MB and >>10 MB data. For instance, lousy (or non-existent) metadata (#22 and #11) is a problem for both small and large datasets, and if you fail to back up your data (#10), losing your entire dataset is just as big a loss whether it is small or large. In short, I don't think the size of a dataset should be used as a criterion for our rules or the scope of the discussion.
@jhollist is right: why not save this discussion for the conclusion of the paper?
Thanks for sharing your thoughts! I should have added that I am of course …