Sharing data on 3rd party infrastructure #111

Merged
merged 52 commits into from
Jan 11, 2020

Conversation

adswa
Contributor

@adswa adswa commented Aug 19, 2019

I'm creating some space for a section that should showcase how to share a dataset across more than a shared file system. The easiest starting point might be Dropbox or similar, but I still have to read up on how to do that.

This fixes #186.

TODO

  • integrate in narrative
  • add summary

Creating some space for a section that should showcase how to share a dataset across more than a shared file system. Easiest to start with might be dropbox or similar.
@adswa adswa mentioned this pull request Aug 19, 2019
@mih
Collaborator

mih commented Aug 25, 2019

The key to this is to use (in almost all cases) the appropriate git-annex special remote implementation to first push and later obtain files from a 3rd-party service. How exactly to do that depends very much on the service. For example:

box.com/ownCloud/sciebo

They are accessible through a WebDAV interface, which git-annex has built-in support for:

git annex initremote box.com \
   type=webdav url=https://dav.box.com/dav/team/project_one \
   chunk=50mb encryption=none

3rd-party services often limit the file size. git-annex can "chunk" files into pieces of arbitrary size and can thus easily work around such constraints. On download, the chunks are merged back together to provide a seamless experience.
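The chunking idea itself can be illustrated without git-annex. The following is only a conceptual stand-in, assuming `split`, `cat`, `cmp`, and `head -c` are available; the 50-byte chunk size and all file names are made up for illustration, and git-annex's actual chunk handling differs in detail.

```shell
#!/bin/sh
# Conceptual sketch of chunking, NOT git-annex's implementation:
# split a file into fixed-size pieces, then merge them back.
workdir=$(mktemp -d)
original="$workdir/original.dat"

# Create a 120-byte example file (hypothetical stand-in for a large data file).
head -c 120 /dev/urandom > "$original"

# "Upload": split into chunks of at most 50 bytes (chunk.aa, chunk.ab, ...).
split -b 50 "$original" "$workdir/chunk."

# "Download": concatenate the chunks in name order to restore the file.
cat "$workdir"/chunk.* > "$workdir/restored.dat"

chunk_count=$(ls "$workdir"/chunk.* | wc -l | tr -d ' ')
echo "chunks: $chunk_count"

# cmp exits non-zero if the restored file differs from the original.
cmp "$original" "$workdir/restored.dat" && echo "restored file is identical"
```

With a 120-byte file and 50-byte chunks, this produces three pieces (50, 50, and 20 bytes), none of which exceeds the size limit.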

rclone (Google, Amazon, MS, Dropbox, ...)

Most commercial providers are accessible via rclone (https://git-annex.branchable.com/special_remotes/rclone/; https://github.com/DanielDent/git-annex-remote-rclone).
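A rough sketch of what setting up such a remote could look like, wrapped in a function so that nothing runs until it is called. It assumes git-annex, rclone, and git-annex-remote-rclone are installed, and that an rclone remote was already created with `rclone config`. The function name and both arguments are hypothetical; the `initremote` parameters follow the pattern shown in the git-annex-remote-rclone README.

```shell
#!/bin/sh
# Sketch only: assumes git-annex, rclone, and git-annex-remote-rclone are
# installed, and that `rclone config` was already used to create an rclone
# remote with the name passed as the first argument.
setup_rclone_special_remote() {
    rclone_name=$1   # name chosen in `rclone config` (hypothetical example value)
    remote_dir=$2    # hypothetical directory on the service for annex objects

    # type=external + externaltype=rclone selects the rclone special remote;
    # target= must match the rclone remote name, prefix= is the remote directory.
    git annex initremote "$rclone_name" \
        type=external externaltype=rclone \
        target="$rclone_name" prefix="$remote_dir" \
        chunk=50MiB encryption=none
}
```

Inside a dataset, this might be called as `setup_rclone_special_remote dropbox-for-friends my_dataset`.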

general workflow

Examples of how things can be set up can be found in the demos here: https://www.datalad.org/for/data-publication

The key is to realize that the remote 3rd-party storage services do not get to see the actual repository content (real filenames, directories, etc.) -- those only live in the dataset/repo, which can be hosted at another place or service (GitHub/GitLab, ...). What is stored on Dropbox and friends is merely a tree of annex objects. Only this design makes it possible to chunk files into smaller units, optionally encrypt content on its way from a local machine to a storage service, and avoid leaking information via filenames. These places are therefore not something a real person would take a look at; instead, they are only meant to be managed and accessed via datalad/git-annex (and in the case of encryption that is pretty much required). This can be a downside, as it will not be possible to pass a Dropbox link to such a project to a random stranger who is not using datalad. But....
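To make the "tree of annex objects" idea concrete, here is a conceptual sketch, not git-annex's real code. It assumes `sha256sum` is available; the file name, the key layout (only loosely modeled on git-annex's SHA256 keys), and the `.keymap` file are all hypothetical.

```shell
#!/bin/sh
# Conceptual sketch, NOT git-annex's real code: store a file on a "remote"
# under a content-derived key, so the remote never sees the original name.
workdir=$(mktemp -d)
mkdir -p "$workdir/dataset/sub-01" "$workdir/remote"

# A file with a meaningful (hypothetical) name inside the dataset.
src="$workdir/dataset/sub-01/patient_scan.txt"
printf 'secret measurement\n' > "$src"

# Build a key from size and checksum.
size=$(wc -c < "$src" | tr -d ' ')
hash=$(sha256sum "$src" | cut -d ' ' -f 1)
key="SHA256-s${size}--${hash}"

# The "remote" only receives the key-named object.
cp "$src" "$workdir/remote/$key"

# The mapping from file name to key stays in the repository (here: a text file).
printf '%s %s\n' "sub-01/patient_scan.txt" "$key" > "$workdir/dataset/.keymap"

# Listing the remote shows only the opaque key, not the file name.
ls "$workdir/remote"
```

Anyone browsing the "remote" directory sees a single opaque object name; the real path is only recoverable through the mapping kept in the dataset.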

Related

datalad also has some support for "exporting" data to other services, for example export-to-figshare. The main difference is that this moves data out of versioned and decentralized tracking, and essentially "throws it over the wall". Alternatively, git-annex provides "export/import" functionality that can be used to read and write from/to a particular "human-facing" representation (which is not a git repo), for example the content of a particular version of a particular branch.

https://git-annex.branchable.com/git-annex-export/
https://git-annex.branchable.com/git-annex-import/
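A minimal sketch of the export route, assuming git-annex is installed and the function is called inside a git-annex repository. It uses a plain directory special remote as the export target purely for illustration; the function name, remote name, and target directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes git-annex is installed and the current directory is a
# git-annex repository. Remote name and target directory are hypothetical.
export_branch_to_directory() {
    remote_name=$1   # e.g. public-copy (hypothetical)
    target_dir=$2    # existing directory that will receive the plain files
    treeish=$3       # branch or tag to export, e.g. master

    # exporttree=yes makes the remote hold a human-readable file tree;
    # the git-annex documentation pairs it with encryption=none.
    git annex initremote "$remote_name" \
        type=directory directory="$target_dir" \
        exporttree=yes encryption=none

    # Write the files of the given tree to the remote under their real names.
    git annex export "$treeish" --to "$remote_name"
}
```

Unlike a regular special remote, the target then contains files under their actual names, so it can be handed to someone who does not use datalad or git-annex.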

@adswa
Contributor Author

adswa commented Nov 27, 2019

I spent the day wrangling rclone special remotes, but now I have a draft for how to publish a dataset and its annexed content with a publication dependency on Dropbox. At this point, I need to think of a way to work this into the narrative...
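For orientation, a publication dependency could be sketched roughly as follows. This is not the draft from the pull request: it assumes DataLad is installed, that the function is called inside a dataset, and that both siblings already exist. The function and sibling names are hypothetical, and it uses `datalad push`, whereas at the time of this thread the corresponding command was `datalad publish`.

```shell
#!/bin/sh
# Sketch only: assumes DataLad is installed, the current directory is a
# dataset, and both siblings already exist. Sibling names are hypothetical.
publish_with_dependency() {
    git_sibling=$1       # sibling hosting the Git history, e.g. github
    storage_sibling=$2   # special remote holding annexed content

    # Declare that publishing to the Git sibling must first publish
    # annexed content to the storage sibling.
    datalad siblings configure \
        --name "$git_sibling" \
        --publish-depends "$storage_sibling"

    # Push to the Git sibling; the dependency is handled first.
    datalad push --to "$git_sibling"
}
```

The point of the dependency is that a single publish to the Git-hosting sibling also uploads file content to the storage service, so the two never get out of sync.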

@adswa
Contributor Author

adswa commented Nov 27, 2019

...one thought: This could become a use case (it's very step-by-steppy...).

@adswa adswa changed the title from "[WIP] Placeholder for sharing data on 3rd party infrastructure" to "[WIP] Sharing data on 3rd party infrastructure" Dec 10, 2019
@adswa
Contributor Author

adswa commented Dec 22, 2019

if we do #326, this needs minor changes as well.

- link existing git-annex walk-throughs and instructions for building a webserver
- reorder footnotes
- improve readability
@adswa adswa changed the title from "[WIP] Sharing data on 3rd party infrastructure" to "Sharing data on 3rd party infrastructure" Jan 8, 2020
@adswa
Contributor Author

adswa commented Jan 8, 2020

ping @mih

Collaborator

@mih mih left a comment

Beautifully written, and a much needed addition to the documentation!

I left a bunch of comments that should be easy to address. I will push some more minor changes shortly.

The code in these sections is objectively ugly, but I think you managed to find the simplest path -- given the present state of publish and friends. I hope that, over the next 6 months, we can make substantial improvements to the workflows described here.

docs/basics/101-138-sharethirdparty.rst Outdated
s) Set configuration password
q) Quit config
n/s/q> n
name> dropbox-remote
Collaborator

What is the reason to use the -remote suffix? No other remote has it.

Contributor Author

More or less arbitrary, but it helped to make the link between "you need to configure a special remote" and running the rclone config command, and it distinguishes the name of the remote from the name of the storage type (dropbox). Especially in the git annex initremote command, it makes it less ambiguous whether target=dropbox-remote is the created name or an existing value for this option. I've tried it without the suffix and it works (with slightly more ambiguity), but personally I'm not annoyed enough by the -remote suffix to sacrifice the unambiguity... I briefly thought about renaming it to "my-dropbox", but that would not make sense to someone I could be sharing with... I think I will leave that as is.

Collaborator

In that case, I'd say it is still best to qualify whose dropbox this is, or what it is for. There could be any number of additional dropbox remotes configured in a real dataset. For example, one could use two dropboxes, one shared with a select few and one that is semi-public (connecting to the audience aspect you bring up in the chapter). Including some kind of discriminating identifier in the name would be good.

Contributor Author

Ok, true, I'll change it to dropbox-for-friends.

`OwnCloud <https://git-annex.branchable.com/tips/owncloudannex/>`__, and many more.
Here is the complete list: `git-annex.branchable.com/special_remotes/ <https://git-annex.branchable.com/special_remotes/>`_.

For Dropbox, the relevant special-remote to configure is
Collaborator

Everything that follows would benefit a lot from better support on the datalad end (datalad/datalad#1784).

docs/basics/101-138-sharethirdparty.rst
docs/basics/101-138-sharethirdparty.rst Outdated
docs/basics/101-139-gin.rst
docs/basics/101-139-gin.rst Outdated
docs/basics/101-139-gin.rst Outdated
docs/basics/101-139-gin.rst Outdated
docs/basics/101-140-summary.rst Outdated
@adswa
Contributor Author

adswa commented Jan 11, 2020

Thanks much for the comments! I created #344 in response to your commit message in 576f785.

@adswa
Contributor Author

adswa commented Jan 11, 2020

Tests passed, I will merge, and then release it into the open!

@adswa adswa merged commit a5a58c6 into master Jan 11, 2020
@adswa adswa deleted the share branch January 11, 2020 10:42
Development

Successfully merging this pull request may close these issues.

Use case: dataset hosting on GIN
2 participants