Sharing data on 3rd party infrastructure #111
Creating some space for a section that should showcase how to share a dataset across more than a shared file system. Easiest to start with might be dropbox or similar.
The key to this is to use (in almost all cases) the appropriate git-annex special remote implementation to first push and later obtain files from a 3rd-party service. How to do that exactly depends very much on the service. For example:

box.com/own-cloud/sciebo
They are accessible through a WebDAV interface, which git-annex has built-in support for. 3rd-party services often limit the file size -- git-annex can "chunk" files arbitrarily and can hence easily circumvent such constraints -- on download, the chunks are merged back together to provide a seamless experience.

rclone (Google, Amazon, MS, Dropbox, ...)
The majority of commercial providers are made accessible via rclone.

General workflow
Examples of how things can be set up can be found in the demos here: https://www.datalad.org/for/data-publication

The key is to realize that the remote 3rd-party storage services do not get to see the actual repository content (real filenames, directories, etc.) -- those only live in the dataset/repo, which can be hosted at another place or service (github/lab, ...). What is stored on dropbox and friends is merely a tree of annex objects. Only this design makes it possible to chunk files into smaller units, optionally encrypt content on its way from a local machine to a storage service, and avoid leakage of information via filenames. Therefore these places are not something a real person would take a look at; instead they are only meant to be managed and accessed via datalad/git-annex (and in the case of encryption that is pretty much required). This can be a downside, as it will not be possible to pass a dropbox link to such a project to a random stranger who is not using datalad. But....

Related
datalad also has some support for "exporting" data to other services. For example, see https://git-annex.branchable.com/git-annex-export/
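To make the "tree of annex objects" and chunking ideas above more concrete, here is a minimal Python sketch. It is an illustration only, not git-annex's implementation: the helper names are hypothetical, the key string is merely modeled on the naming style of git-annex's SHA256E backend, and the way git-annex actually names and stores chunks on a remote differs.

```python
import hashlib


def annex_style_key(content, extension=""):
    """Derive a storage name from file content alone (hypothetical helper).

    Modeled on the style of git-annex SHA256E keys; the original
    filename is never part of the result.
    """
    digest = hashlib.sha256(content).hexdigest()
    return "SHA256E-s{}--{}{}".format(len(content), digest, extension)


def split_into_chunks(content, chunk_size):
    """Split content into pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [content[i:i + chunk_size]
            for i in range(0, len(content), chunk_size)]


def merge_chunks(chunks):
    """Reassemble the original content from its ordered chunks."""
    return b"".join(chunks)


# A 2500-byte stand-in for a file called, say, "secret_results.dat".
content = b"x" * 2500
key = annex_style_key(content, ".dat")
chunks = split_into_chunks(content, 1000)
restored = merge_chunks(chunks)

print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
print(restored == content)               # True
```

The storage service would only ever see names like `key` and the chunk payloads, which is why a Dropbox folder managed this way is not meaningful to someone browsing it without datalad/git-annex.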
I spent the day wrangling
...one thought: This could become a use case (it's very step-by-steppy...).
if we do #326, this needs minor changes as well.
- link existing git-annex walk-throughs and instructions for building a webserver
- reorder footnotes
- improve readability
add another snippet to summary
ping @mih
…r. git annex docs for more
Beautifully written, and a much needed addition to the documentation!
I left a bunch of comments that should be easy to address. I will push some more minor changes shortly.
The code in these sections is objectively ugly, but I think you managed to find the simplest path -- given the present state of publish and friends. I hope that, over the next 6 months, we can make substantial improvements to the workflows described here.
s) Set configuration password
q) Quit config
n/s/q> n
name> dropbox-remote
What is the reason to use the -remote suffix? No other remote has it.
More or less arbitrary, but it helped to make the link between "you need to configure a special remote" and running the rclone config command, and it distinguishes the name of the remote from the name of the storage type (dropbox). Especially in the git annex initremote command, it makes it less ambiguous whether target=dropbox-remote is the created name or an existing value for this option. I've tried it without the suffix and it works (with slightly more ambiguity), but personally I'm not annoyed enough by the -remote suffix to sacrifice the clarity... I briefly thought about renaming it to "my-dropbox", but that would not make sense to someone I could be sharing with... I think I will leave that as is.
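As a small illustration of where the configured name shows up, the following Python sketch only assembles and prints a git annex initremote call for the rclone special remote; it does not execute anything. The function name and the prefix value my_dataset are hypothetical, and the parameters follow the pattern documented for git-annex-remote-rclone, which would need to be installed for the printed command to work in a real dataset.

```python
import shlex


def rclone_initremote_command(annex_name, rclone_target, prefix,
                              chunk="50MiB", encryption="none"):
    """Assemble (but do not run) a 'git annex initremote' call.

    annex_name    -- name the special remote gets inside the dataset
    rclone_target -- name of the remote created earlier via 'rclone config'
    prefix        -- directory on the storage service (assumed value)
    """
    return [
        "git", "annex", "initremote", annex_name,
        "type=external",
        "externaltype=rclone",
        "chunk={}".format(chunk),
        "encryption={}".format(encryption),
        "target={}".format(rclone_target),
        "prefix={}".format(prefix),
    ]


# In the thread both names are identical: the first "dropbox-remote" is the
# newly created special remote name, the one after target= refers to the
# existing rclone configuration.
command = rclone_initremote_command(
    annex_name="dropbox-remote",
    rclone_target="dropbox-remote",
    prefix="my_dataset",  # hypothetical directory name
)
print(shlex.join(command))
```

Keeping the two arguments separate in the helper makes it visible that the special remote name and the rclone target are independent choices, even when they happen to share a value.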
In that case, I'd say it is still best to qualify whose dropbox this is, or what it is for. There could be any number of additional dropbox remotes configured in a real dataset. For example, one could use two dropboxes, one shared with a select few and one that is semi-public (connecting to the audience aspect you bring up in the chapter). Including some kind of discriminating identifier in the name would be good.
Ok, true, I'll change it to dropbox-for-friends.
`OwnCloud <https://git-annex.branchable.com/tips/owncloudannex/>`__, and many more.
Here is the complete list: `git-annex.branchable.com/special_remotes/ <https://git-annex.branchable.com/special_remotes/>`_.

For Dropbox, the relevant special-remote to configure is
Everything that follows would benefit a lot from better support on the datalad end: datalad/datalad#1784
I think we should try to stick to a uniform identity throughout the book. After I started to replace `adswa` and `adina`, I saw that other places are inconsistent too -- so I stopped again.
Tests passed, I will merge, and then release it into the open!
I'm creating some space for a section that should showcase how to share a dataset across more than a shared file system. Easiest to start with might be dropbox or similar, but I still have to read up on how to do that.
This fixes #186.
TODO