Where should scientists store their data? #797
Comments
The two main non-NCBI/EBI places in bio are figshare and dryad. I'm not sure either is appropriate here. A bigger worry for me is whether it's even legal to make this public! In the US, I think not -- shotgun data from humans can't truly be "anonymized" AFAIK, unless it's actually a mixture of samples, but I'm not an expert there. This might be of interest: https://d0.awsstatic.com/whitepapers/architecting-for-genomic-data-security-and-compliance-in-aws-executive-overview.pdf although I haven't read it myself yet. If it's just a matter of storing ~large amounts of data, figshare has indicated interest in doing so; happy to introduce. I personally prefer AWS S3, which gives you public links and access control; this is how I make data public. Within the lab, we had good luck storing things as EBS snapshots on EC2, which gave everybody read-only access to the raw data. I really liked this as a way to share during the analysis period. iPlant has a data storing system but I'm not sure how open it is to biomedical data. I'm interested in seeing what other people have to say! |
Zenodo (www.zenodo.org) is a great option - it's a publicly funded project with a commitment to preservation. Free and takes all kinds of research outputs. Plus you get a DataCite DOI and there's github integration! |
There is not a one-size-fits-all solution. Things we try to consider (not a complete list, but...):
|
What exactly are we talking about here, depositing data in a public place to make it accessible to others, or storing data locally for analysis, etc? The first seems like a solved problem to me, upload to GEO/SRA/dbGaP/whatever public repository is appropriate, with metadata etc on something like figshare. I also struggle with the second - how to store my data locally in the most cost-effective way, thinking about backups, and then sharing data with collaborators, and choosing what data to share publicly, if any. I use my university's paid storage, but it isn't ideal because it's difficult to share/access outside the institution. AWS S3 is expensive. I don't want to manage my own hardware. I'd love to hear what others are doing. |
I can explain the solution that I have encountered here at the University of Oslo, although it might not always be transferable to other institutions. We have an HPC unit at the main IT unit, and they administer both compute and storage resources. You can either apply for access (which is free) or buy access to their resources. This ensures closeness between data and compute facilities, and they also do everything regarding backups and other sysadmin things. There is also a specially designed solution for sensitive data, which other institutions can adopt if they are interested. TL;DR: check whether your institution has access to HPC resources; if so, talk to them. |
@stephenturner If you are generating your own data, especially in a medical environment, I'm not sure you can do anything other than manage your own hardware. You need fast access to data to analyze it, thus hard drives near a cluster. In both of my institutes (Institut Curie and Mines ParisTech), we manage our own hardware. The data is too sensitive in many cases to store elsewhere. The Curie Institute (a research facility focused on cancer research with a very large bioinformatics department) has a team dedicated to storage (a couple of PB), while setting this up at Mines ParisTech (where no one else but our team does any data analysis on more than a couple of GB) was more complicated, but we ended up buying a reasonable number of TB drives to plug into our cluster. In all cases, sharing is done through GEO/dbGaP (except for patient data) after publication if the data allows it, or by sending hard drives by mail if the data are too large. |
My data load is generally small enough that it can fit on a couple of hard drives, suitably backed up both locally and remotely. In astronomy we have the luxury of not having to worry about privacy! For distributing data, one option for people who are astronomers, and/or in Canada, is VOSpace, run by CANFAR. I use it for stuff that I want to distribute to other people either publicly or within a group; saves me from having to run a server or share hundred-MB files via Dropbox. I'm not exactly sure what their usage policies are but will check and report back. I have more to say about this; a blog post may be in the works. |
My question to @stephenturner is: do you suspect that AWS is padding the costs significantly? If so, how much? And if not, why should we be expecting free stuff? :) |
I completely agree with @seandavi. Coming from an institute (TGAC, UK) which, by its very nature, needs to run and have access to a mixed model of data availability and persistence, we have to be flexible with options. We are a large-scale data generator so have a large 5PB+ centralised storage system which we make use of internally and externally in a collaborative fashion, but we also see the need to deposit in public repositories, not only for the benefits of long-term dissemination but also for policy compliance (taxpayers' money rightly equals open data ASAP). As for large datasets, I would rather see better integration with existing repositories that are facilitated by key players, in a similar vein to the NCBI/EBI/DDBJ mirroring agreement but with better support for discovery and interoperability between them. Depositing data is one thing, but finding and using it after the event is strangely difficult. iPlant/Galaxy etc have made headway in this space, but again the mixed model needs to be taken into account - how long, how fast, privacy? |
Illumina (+ PacBio) shop at The University of Queensland, Australia. WGS data for about 1400 bacterial pathogens. For our important^ data (raw data, PBS scripts, publication data) - 1 copy on hdd/thumb drive it arrived on, 1 copy on HPC (https://ncisf.org/barrinehpc), 1 copy mirrored locally on our NAS. 1 copy on University tape mirrored across two sites. 1 copy on RDSI (https://www.rdsi.edu.au). Around 600 strains under embargo at the ENA (had an automatic/programmatic submission tool developed). That is about 7 copies! What a waste of space. Looking at better tools for efficient data management. Think git annex MAY fit the bill - https://git-annex.branchable.com/ Current worry: we need about 20X more space to store PacBio compared to Illumina. ^important will vary BUT we are confident in the reproducibility (DR testing) of our analysis pipelines and thus prefer replication over backing up everything. |
Most research intensive universities in the UK now have some form of research data facility with resources and staff to help with this sort of thing (because EPSRC, one of our biggest science funders, says "thou shalt not have money without an institutional data plan": http://www.epsrc.ac.uk/about/standards/researchdata/). For me this has boiled down to the need to write a short document each time I've written a grant application outlining what data I'm going to collect (so that, e.g. legal constraints can be identified), writing some funds for data storage into the grant, and then having access to some file space that gets backed up without much thought on my part. This does not replace the need to think about how to make the data public, but does make the basic storage easier and means that the data is somewhere where we can process it without too much hassle. A concrete example I was only tangentially involved with is seismic data collected as part of various multi-year deployments in Africa. This data is collected from the instruments during field work campaigns every six months or so and arrives from the field on portable hard drives. This data is then copied into a "raw data" directory in the institutional repository. Before any work can be done on the data it must be converted into a commonly used data format (removing complexities such as the way the instrument responds to ground acceleration and timing glitches) and have metadata added. Data in this common format is also retained in the repository alongside data we have downloaded from public data servers. This is the starting point for much analysis and modelling. Finally, the processed data in the common format is uploaded to a public data repository at http://www.iris.edu/ - this is necessary as a condition of funding, an expectation for publication, and helps boost the impact of the research. 
The big advantage with this approach is that scientists don't have to worry about dealing with procurement or management of the storage hardware, and it's easy to grow the storage space as needed. The costs have always appeared reasonable (more expensive than just buying disks, but cheaper than commercial rates) and the data is available when needed (as long as you need it on a machine inside the University). Getting copies to people or machines offsite tends to be more of a hassle and multiple copies at other institutions and on national HPC facilities seems to be inevitable, but that's probably OK as there are usually institutional storage resources there. |
I definitely recommend reviewing sample storage systems that facilitate unique machine readable identification of samples, e.g. by using storage containers with QR codes printed on them. Also, resist the temptation of making up some identifier scheme that encodes various descriptive aspects (researcher, sample type, etc.). Descriptions will change (which is good), but identifiers need to be stable, so keep the concerns of identification and description separate and use identifiers that are "nondescript". (Apologies if I'm writing the obvious here, but I've dealt with legacy systems where primary key formats have changed over time, causing nightmarish ambiguities, and if this paragraph prevents such designs in the future I consider writing this paragraph to be good use of bandwidth.) I entirely agree with the suggestions by others above, talk to the local IT / bioinformatics units (using any support they can offer almost certainly saves you from wasting time by re-inventing solutions) and develop a data management plan. An important part of this strategy is safeguarding the original primary data -- all analyses should be carried out on copies of these. (Again apologies for stating something possibly obvious but I've seen USB drives supposedly containing master copies of data that had been changed / edited, accidentally, inadvertently or even deliberately.) |
Climate impact community here - from very small to hundreds of TB climate sim output. Commercial 3rd party storage not accepted here. Company might go away, might limit access, might not backup as promised. Also what you believe is anonymous might well not be -> big trouble ahead. First choice is storage space provided by local institution (dedicated for that purpose - not FTP) (or associated HPC center). With backups you don't have to care about, in a professionally managed data-center. That is also worth a bit of your budget. In-lab for small volumes say < 8TB have a little server, 2 x 4TB disks, mirrored, Linux or BSD + ZFS, possibly UPS, firewalled. E.g. this one. For non-replicable data that would only be acceptable with additional off-site backup. E.g. encrypt it and push it to Amazon Glacier. |
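The separation of identification and description suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `registry` dictionary, the function names, and the example fields are hypothetical and stand in for whatever sample database a lab actually uses.

```python
import uuid

# Hypothetical in-memory registry: opaque identifier -> mutable description.
registry = {}


def new_sample_id() -> str:
    """Return an opaque identifier that encodes nothing about the sample."""
    return uuid.uuid4().hex


def register_sample(description: dict) -> str:
    """Store a description under a fresh nondescript identifier."""
    sample_id = new_sample_id()
    registry[sample_id] = dict(description)
    return sample_id


# The description can be corrected later without touching the identifier.
sid = register_sample({"researcher": "A. Example", "sample_type": "blood"})
registry[sid]["sample_type"] = "plasma"
```

Because the identifier carries no meaning, relabelling a sample never forces a change to the key printed on the container.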
At an institute level, we don't have 'proper' centralised scientific computing, so how and where we store data depends on the project, and who is managing/analysing it. We do not (but should, IMO) have a common data storage policy, central money to underpin storage for multiple groups, nor a common pipeline for managing or integrating NGS/other suitable data. We don't have the bandwidth (or budget) to put large datasets on S3/Glacier or similar, so that's not an option we consider often. As a result, the best-managed bioinformatics (as opposed to the GIS and other projects) data goes onto a local, large central NAS (IIRC ≈110TB, GlusterFS) with a sensible - but manually managed - filesystem naming convention. What happens in subdirectories within that is project/scientist-dependent. We back up/archive the NAS locally (≈40TB 'live' read-only, and to tape). Some of this data is made available for downstream analysis via Galaxy but, to my knowledge, most is not. An unknown amount of data is acquired by individual PIs/groups, and may not make it onto the NAS or any other common storage - or, at least, there is no account of what does and does not make it there, and currently no requirement for PIs to inform IT/bioinformaticians about what has been obtained. I don't think that this is good - but it's the situation at the moment that we have to work with. On a personal level - I keep reference datasets and analytical results on the central NAS, working with copies of these on the local cluster/desktop/laptop as appropriate. Especially recently, I try to make as much as possible Makefile-friendly for reproducibility, with Makefiles/documentation and results stored on the NAS. I also keep backup copies of critical data/results on a number of rotated external hard drives, and a second, smaller 5TB NAS that is separate from the central storage/archiving. |
Several people have mentioned looking at your university's HPC center (don't forget to look at regional and national centers), so I'll just say take a look at your library too. Many are setting up institutional repositories for these kinds of data (may not be published, may have security/privacy issues, may want to embargo the data until a certain date so you have time to publish, etc.). It's possible that they aren't quite ready for multiple TB's of data for every publication from every prof on campus, but it's certainly worth asking. If they can ingest the data -- unfortunately, that's a big if -- they should have a full "archive-quality" back-end ready to manage it (they'll run checksums, make copies at remote data centers, periodically verify those copies, migrate data as needed, etc.) -- you wouldn't have to be involved with any of that. (I just transitioned from HPC to a research computing effort in our university's library system, so I'm diving head-first into this area) |
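For anyone who has to do the "run checksums, periodically verify those copies" part themselves, here is a minimal sketch using only the Python standard library. The function names and the choice of SHA-256 are assumptions made for illustration; this is not a description of what any particular library repository runs.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict, copy_root: Path) -> list:
    """Return the relative paths that are missing or differ in the copy."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = copy_root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

A manifest built once from the master copy can be stored next to it and re-checked against every backup location on a schedule.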
Though a few people have said "check your library", I'd rephrase to say "ask your librarian". They can help identify an appropriate repository, institutional or otherwise. In my experience librarians can be incredibly helpful (also in locating existing data), but many researchers do not realize or regularly turn to this exceptional resource. |
Many people have mentioned AWS S3 for storing data. There are a number of AWS-compatible solutions for storing data which work with a private cloud, if data privacy requirements are such: |
A relevant tangential question is how should scientists store data?
From Ten Simple Rules for Reproducible Computational Research:
In the future, it would be great if we could discover datasets from a URL (e.g. with HTML containing extra RDFa tags), retrieve the data and its metadata, infer that our columns match (because they are labelled with URIs), determine when and how the data was collected, and automatically perform unbiased statistical analysis; RDF, JSON-LD, and CSVW make this possible. |
Storing physical units and quantities for each factor could also be helpful.
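As a rough illustration of columns "labelled with URIs", the sketch below builds a small CSVW-style metadata document in Python. The file name, the `http://example.org/vocab/...` URIs, and the way the unit is recorded (in a `dc:description` string) are all assumptions made for the example, not a recommended vocabulary.

```python
import json

# Hypothetical CSVW metadata describing a two-column CSV file.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "measurements.csv",  # hypothetical data file
    "dc:title": "Example measurements",
    "tableSchema": {
        "columns": [
            {
                "name": "sample_id",
                "titles": "sample_id",
                "datatype": "string",
                "propertyUrl": "http://example.org/vocab/sampleId",
            },
            {
                "name": "temperature",
                "titles": "temperature",
                "datatype": "decimal",
                "propertyUrl": "http://example.org/vocab/temperature",
                # Unit recorded as free text here; an assumption, not a standard.
                "dc:description": "Temperature in degrees Celsius",
            },
        ]
    },
}


def column_uris(meta: dict) -> dict:
    """Map each column name to the URI that says what the column means."""
    return {c["name"]: c["propertyUrl"] for c in meta["tableSchema"]["columns"]}


serialized = json.dumps(metadata, indent=2)
```

Two datasets whose columns resolve to the same URIs can then be matched by comparing the output of `column_uris`, regardless of the column titles each author chose.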
|
@ctb I think you raise a great point in your question to @stephenturner; I suppose one could pose the same question to many of the institutional data facilities mentioned in this thread: are they truly out-competing Amazon in value-for-dollar? If so, does that mean Amazon is padding its prices or serving a different market or something else? If not, what does that mean for the institutional facilities? Or maybe some such facilities are already using S3 as a backend? For instance, Dropbox uses S3 as its backend data store, but charges less than $10 / month for 1 TB with no data transfer fee, while S3 costs > $30 / TB / month. (Of course most users aren't using their full TB, but presumably an institutional facility could follow similar logic). Nonetheless @stephenturner's comments resonate with me too: It's easy to feel like ~ $30 per TB per month plus data transfer charges sounds pricey compared to the cost of disk drives today. After you add redundancy, backup power, housing, cooling, hardware replacement etc the difference seems smaller, but it's easy to feel like you're overpaying either way. I'd be curious to hear from the perspective of those working at institutional level facilities if they feel that the calculus is obvious as @ctb hints at (which I read as S3 prices approximate the minimum cost) or whether there's a lot to be gained by pooling hardware for data storage at the institutional level. (p.s. clearly just focusing on the cost issues of bit wise storage here, I think others have quite rightly addressed the more relevant and overlooked issues of discoverability, metadata, reuse, and so forth which are largely separate elements). |
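The "most users aren't using their full TB" argument can be made concrete with a little arithmetic. The sketch below rounds the figures quoted in the comment to $30 per TB per month for S3 and $10 per month for a 1 TB flat plan, ignores data transfer charges, and assumes the flat-plan provider pays the quoted S3 price; all of these are simplifying assumptions.

```python
S3_USD_PER_TB_MONTH = 30.0      # "> $30 / TB / month", rounded down
FLAT_PLAN_USD_PER_MONTH = 10.0  # "less than $10 / month for 1 TB", rounded up
PLAN_QUOTA_TB = 1.0


def backend_cost(stored_tb: float) -> float:
    """Monthly backend cost for the data a user actually stores."""
    return stored_tb * S3_USD_PER_TB_MONTH


def break_even_utilisation() -> float:
    """Fraction of the quota at which backend cost equals the flat fee."""
    return FLAT_PLAN_USD_PER_MONTH / (S3_USD_PER_TB_MONTH * PLAN_QUOTA_TB)


print(f"break-even utilisation: {break_even_utilisation():.0%}")
print(f"backend cost at 0.25 TB stored: ${backend_cost(0.25):.2f}")
```

Under these assumptions the flat plan covers its storage cost as long as the average user fills less than about a third of the quota, which is the same pooling logic an institutional facility could apply.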
Re talking to your librarian, @dlebauer, I took your advice and showed this thread to MacKenzie Smith, the head librarian at UC Davis. She pointed me at http://www.re3data.org/, "a global repository that lists more than 1000 qualified research data repositories". So that validates your point about talking to librarians, I think :) And @cboettig, I don't actually know for sure which way the calculus goes. A few years back it was firmly on the side of institutions and not Amazon, as long as the institutional system was at high occupancy and you ignored the impact of indirect costs on $$ spent on Amazon (which is quite expensive in the end, I think). However, competition may have eroded any difference significantly. The real point to make is the one you made much more thoroughly than me: scientists, academics, and academic institutions are typically horrible at uptime and backups and reliability over any period of time longer than 3-5 years (and I'm being generous with 3-5 yrs). We simply don't value that stuff until it goes badly. On the flip side, Amazon and Google and Rackspace etc have people who specialize in stability, and they pay them well, and I think they've gotten quite good at it (<- understatement of the decade). So I'm tremendously skeptical of the academic calculations of costs - apart from funny money calculations and more advanced money laundering techniques that skew the numbers towards institutions by ignoring true costs, you have to toss in a > 1% chance of "you rolled the dice wrong and you lost EVERYTHING." |
This is an interesting talk: https://www.youtube.com/watch?v=ImrYe0CBL-E describing the economics of a home-built OpenStack Swift (think local S3). Note that one needs to account for lots of details, including personnel, when thinking about storage and any IT alternative to pay-for-service. |
https://github.com/ckan/ckan (Python) https://github.com/ckan/ckan/blob/master/Dockerfile |
Useful, thanks! It would seem reasonable for journals to provide data repository services in exchange for their fees and free content. |
Since this thread has drifted at times toward data publication as well as https://www.globus.org/data-publication
|
@ctb I'd clarify that while this would be true for many academic servers, any institutional resource considered an 'archival repository' (e.g. the California Digital Library) should be sufficiently reliable, with perhaps a better prognosis for decadal longevity, if not 100% uptime, which is less essential. I think many of the DataONE member nodes will fall in this category. |
Just keep in mind, if you want to analyze that data we are talking about, your compute-server needs to be well connected to the storage location (e.g. via RAID controller:). Any cloudy location can only be backup (unless, maybe, you use EC2 and go totally Amazon). |
@hvwaldow you've taken the words about co-location of data and computing right out of my ... keyboard[*], I thought of this when I wrote about storing original copies safely. I think that while holding master copies on slow storage is not a proper protection mechanism, it can in some cases deter people from trying to process directly on original data where they really should work with copies. What do others think -- is this a bad idea, in the same league as "security through obscurity", or is it ok to use speed as an additional argument to encourage good practices of keeping master copies and copies for processing separate? [*] if we assume that that's where words are left that we thought of and then didn't type ;-) |
@jttkim Well, I'd favour "master copies" on a read-only filesystem ;). |
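Short of a genuinely read-only filesystem, a weaker approximation is to strip the write permission bits from every file in the master-copy directory. The sketch below assumes a POSIX system; the function name is made up for illustration, and the file owner (or root) can still restore write access, so this guards against accidents rather than deliberate edits.

```python
import os
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(root: Path) -> int:
    """Clear the write bits on every regular file under root; return the count."""
    count = 0
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            mode = stat.S_IMODE(path.stat().st_mode)
            os.chmod(path, mode & ~WRITE_BITS)
            count += 1
    return count
```

Analyses would then run on copies made from this directory, with the originals left untouched.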
Great comments so far, lots to think about. Specifically to the question of how to store data that you don't want to make public but needs to be available for collaboration within a lab (or between labs). Like @gvwilson said I have Tbs not Pbs ... yet.
|
It seems to me that concerns about 'read-only' and concerns about privacy are both best addressed with encryption, as several folks have already suggested with regards to cloud storage. Of course there is some performance hit for this, but in my experience this would be practical at least for data on the order of TBs. I'm far from an expert on security and perhaps others on this thread can weigh in, but my understanding is that whether or not you are using cloud servers, strong encryption tools (and securely managed keys) are essential to data security. This way you are not relying on login-credentials alone and the storage provider cannot un-encrypt the data without the encryption keys. I think you get read-only as a bonus here, in that collaborators would have to download and un-encrypt the data before they could make changes, and encryption routines usually have good mechanisms in place to verify that the data have not changed. Without encryption I think there would be valid security questions in storing data on dropbox, S3, or others. Even with it there may be regulatory concerns: it's my understanding that the federal government recognizes Google as an approved cloud source for certain data (e.g. NOAA) but no other cloud provider. I don't know if the regulation would permit data storage on S3 even if it were in encrypted form. I think you'd have to ask your institution regarding your institution-level questions, but if you're willing to manage encryption I think S3 or similar alternatives should give you & your collaborators complete control without having to manage dedicated hardware. Re 2: You can use dropbox in place of S3 as a storage location for encrypted data, though I think that approach has scaling limitations. Dropbox also tends to promote the opposite philosophy of read-only; in that it syncs automatically and thus most users expect to work on the dropbox files directly, rather than first copy from dropbox and then compute. 
Re 3: I agree with the Github/bitbucket approach for scripts and pipelines, and I also recognize it's definitely a barrier to contributing (though not nearly as high a barrier as it was pre-Github with a sourceforge-style workflow!). I've found a heterogeneous workflow where only some collaborators handle the git side of things can actually work okay, but that's a topic for another thread. There's also the possible question of security here too: can these scripts be public? If not, are "private" repos really a sufficient level of security? (e.g. will sensitive data find its way into these scripts?) |
The following draft guide on where to keep your data landed in my mail box yesterday. It's written by the UK's Digital Curation Centre. They are asking for feedback. Maybe some of you can contribute, but I think it can also be useful for the data storage discussion. |
Ecologists can add their data to EcoData Retriever, as one option. Also, the Panton Principles prescribe an ethos about sharing and curating open data. They link to a protocol for implementing open data as well as a great reference table for open data licenses. |
Since in the comments above there was quite some mention of publicly available repositories for scientific data, this might be interesting: |
@ctb and others have raised the question about costs of data storage on the cloud vs the true cost of institutional repositories storing & serving data. @mbjones and colleagues have looked into this carefully with respect to the DataONE repository network, and have found at present that costs of the cloud are roughly 3-4 times that of managing the hardware directly, largely due to the network charges associated with moving data in & out of the cloud provider. @mbjones can perhaps clarify if I didn't really get this right. |
@emhart and others are planning to turn this into a paper - please move discussion to https://github.com/emhart/10-simple-rules-data-storage. |
"References" emhart/10-simple-rules-data-storage#3 @emhart |
To address @jstearns's specific questions:
As a platform, I believe Amazon and the big cloud providers are more secure
Institution dependent, but I would expect "yes", especially if it was generated
I question the question :). What I like about cloud is that they simply don't
-1 on Dropbox for security; do some googling for some hair-raising stories.
yes, and yes :) I do wonder what you're doing that you're worried about people swiping your Also, like car accidents and murder, most data breaches and stealing occur |
I’d like to add my 2c. I work at an HPC centre (http://www.cyfronet.krakow.pl/en) in Poland dedicated to helping Polish scientists in their research. We and other similar HPC centres in Poland provide computing and storage services for science. For example our centre hosts git and svn servers for user repositories and of course some database solutions.
I’d go to an academic HPC centre that is located nearby; they’re usually quite big, have a security team, and quite good security of services. In many ways they could be more willing to help you with setting up your environment and adjusting some services to your needs.
In many situations you could get an SLA (service level agreement) or MoA in which you could specify for how long data needs to be stored and accessible for users.
Probably some DB solution could be good. It depends on size of your datasets.
.-) Best regards, Klemens |
A post-doc who is setting up a new bioinformatics lab asked, "Where should I store the data?" Right now, her group has samples in a freezer and sequence data from them archived on a couple of portable hard drives (with copies of some of that data on lab members' laptops). The data is from human subjects, but has been anonymized, and they're expecting to get more (but "more" means terabytes, not petabytes, at least in the near future). Options being discussed include everything from a paid Dropbox account to FTP space on the university's secure server.
Please use this issue to tell us what to do and why. (If you don't have permission to comment, please send Greg Wilson your GitHub username, and he'll fix that.) Once we've collected some advice, we'll turn it into a blog post that we can point to from our lessons.
And please note that comments from people who aren't bioinformaticians are equally welcome. What do people do in ecology? In economics? In astronomy? How well does it work? When doesn't it work, and why?