
introduction/scope: storage vs publication #17

Closed
stephenturner opened this issue Feb 20, 2015 · 9 comments
@stephenturner (Collaborator)

Both here and on the original SWC issue I sense that we're conflating issues of data storage versus data publication. If your entire career's data is the size of iris, the distinction between the two is smaller. You can trivially keep local copies in multiple locations while having a public-facing GH repo, figshare, whatever. But it's a very different situation when you've got data in the tens to hundreds of GB, or the TB+ range. Especially when (god forbid) you're not interested in, or not allowed to, publish your data.

I think it would be useful to set out both here and in the introduction what exactly we're hoping to accomplish with this set of rules.

@PBarmby (Collaborator) commented Feb 20, 2015

Seems to me that the "10 Simple Rules" paper linked by @karawoo in #3 covers a lot of the publication or near-publication issues. (Those folks are mostly astronomers for whom privacy/patents/etc are not important.) Suggests that storage should be the focus here?

@emhart (Owner) commented Feb 24, 2015

@stephenturner I agree with you completely. It's difficult to tease apart rules for storing data and rules for publishing data. I think it's important that we don't just cast issues like metadata and discoverability aside, but this is the space to consider issues like size, privacy, etc, and give them much greater prominence.

@dfalster (Contributor)

I had the same thoughts as @stephenturner re scope: is the focus on storing, publishing, or both?

Either way, I think our audience should be ordinary scientists, for whom dataset size is usually not an issue (at least in my field). As context, we recently published a data paper (in press at Ecology) with data on plant biomass from over 170 different studies. By disciplinary standards this is a big dataset, but all up the final csv files are < 10 MB. We'll have copies on github, Ecological Archives, and possibly elsewhere (see #10). Those with really big data (in terms of size) face more challenges, but they are far fewer and possibly better equipped to deal with those challenges. Plus the technology of really big data is moving very quickly, so an article on possible solutions may date faster than it can be published.

@stephenturner (Collaborator, Author)

@dfalster I'd be careful judging which scientists are ordinary based on whether they have issues with data size :).

In all seriousness though, I have to disagree with you: I don't think this article should focus on data that's 10 MB. Sure, big data is relative, and "big" often lies in the data's complexity, not its size in bytes. But if data is small enough that it can be effortlessly mirrored, backed up, and versioned using rsync or similar across all of Dropbox, GitHub, Google Docs, Figshare, Zenodo, S3, and cheap flash drives, then I don't think we can write anything that hasn't already been written. Agreed, we definitely run the risk of writing something that will be obsolete within a year or two, but that's the nature of this area.

@tpoisot (Collaborator) commented Feb 26, 2015

It's not unusual for models to generate outputs in the GB range. I've had one reach the 1 TB mark in a few days. This poses all sorts of issues for storing the data, both locally (I don't have a 1 TB hard drive on my laptop) and remotely.

I think we should emphasize that even though most data are small (in size), it's going to be increasingly common to deal with large datasets, so we should prepare ourselves (and our students).

@karawoo (Collaborator) commented Feb 26, 2015

I'd also point out that this paper is going to be submitted to PLOS Computational Biology, a field where data >>10 MB is very common (though whether computational biologists can be considered "ordinary" is a separate question 😉).

@jhollist (Collaborator)

One thing we don't want to lose here is that a lot of our rules will likely apply equally to <10 MB and >>10 MB data. For instance, lousy (or non-existent) metadata (#22 and #11) is a problem for both small and large datasets, and if you fail to back up your data (#10), losing your entire dataset is just as big a loss whether it is small or large. In short, I don't think the size of a dataset should be used as a criterion for our rules or the scope of the discussion.

@tpoisot (Collaborator) commented Feb 26, 2015

@jhollist is right -- why not keep this discussion for the conclusion of the paper?

@dfalster (Contributor)

Thanks for sharing your thoughts! I should have added that I am of course happy if the article focuses on the challenges of storing larger data sets; that is indeed more novel, even if the potential audience is smaller.


@emhart closed this as completed Oct 17, 2015
7 participants