diff --git a/kbi/0007/index.rst b/kbi/0007/index.rst index 91d1db7..b3dccc6 100644 --- a/kbi/0007/index.rst +++ b/kbi/0007/index.rst @@ -1,6 +1,8 @@ .. index:: single: datalad; addurls +.. _kbi0007: + KBI0007: Create a DataLad dataset from a published collection of files ====================================================================== diff --git a/kbi/0028/index.rst b/kbi/0028/index.rst new file mode 100644 index 0000000..b2a0cf4 --- /dev/null +++ b/kbi/0028/index.rst @@ -0,0 +1,221 @@ +.. index:: + single: datalad; addurls + single: special remote; uncurl + +KBI0028: Create a DataLad dataset from Nextcloud (Sciebo) public share links +============================================================================ + +:authors: MichaƂ Szczepanik +:discussion: https://github.com/psychoinformatics-de/knowledge-base/pull/104 +:keywords: nextcloud, sciebo, webdav, sharing, addurls +:software-versions: datalad_0.19.2, datalad-next_1.0.0b3, webdav4_0.9.8, fsspec_2023.6.0, sciebo_10.12.2 + +A DataLad dataset can be created directly from an existing collection +of files in a cloud storage, using share URLs to provide file +access. `Nextcloud`_ storage platform (and, by extension, `Sciebo`_, a +Nextcloud-based regional university service) allows generation of +folder share URLs with optional password protection and expiration +time. Creating such share links, as well as granting access to +specific Nextcloud users, is an option for sharing data with managed +permissions. In such use case, DataLad is an optional method of +accessing and indexing data. + +This document deals specifically with files that were deposited in +Nextcloud without using DataLad. For publishing DataLad datasets to +Nextcloud, see the documentation of DataLad-next's +`create-sibling-webdav`_ command instead. + +This document extends the ``addurls``-based approach described in +:ref:`KBI0007` in two areas: it introduces the `uncurl`_ special +remote for transforming URLs and using credentials, and focuses on +Nextcloud-specific URL patterns. + +.. _nextcloud: https://nextcloud.com/ +.. _sciebo: https://hochschulcloud.nrw/ +.. _create-sibling-webdav: https://docs.datalad.org/projects/next/en/latest/generated/man/datalad-create-sibling-webdav.html +.. _uncurl: https://docs.datalad.org/projects/next/en/latest/generated/datalad_next.annexremotes.uncurl.html#module-datalad_next.annexremotes.uncurl + +Nextcloud URL patterns +---------------------- + +There are three primary ways in which a Nextcloud folder can be +shared. These will determine the URL patterns which can be used. + +Public share link, no password +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In a special (and simplest) case, if the sharing link for a folder is +created without password protection, links to individual files can be +created by appending ``/download?path=&files=`` (where +``path`` is a relative path to a directory, and ``name`` is the file +name). However, if the sharing link is password protected, such URL +would not work, as it would redirect to a login page (html document) +and not to the file content. + +In a general case (share links with or without password, as well as +sharing with named users), `Nextcloud's webdav access`_ can be +used. The remainder of the document only covers WebDAV URLs. + +.. _nextcloud's webdav access: https://docs.nextcloud.com/server/20/user_manual/en/files/access_webdav.html + +Named user share +^^^^^^^^^^^^^^^^ + +If a folder is shared with a named user, they will see it in their own +account like any other folder. In principle, access for a share +recipient would be analogous to that of an owner, and use an URL +starting with: + +.. code-block:: none + + https://example.com/nextcloud/remote.php/dav/files/USERNAME/ + +However, with Nextcloud (Sciebo) being a federated service, each user +may have a different instance URL to access their data. Additionally, +the URL includes the username, and each user may place the shared +directory in a different place within their home directory. + +Public share, password protected +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For a folder shared with a password-protected link, the access URLs +would start with: + +.. code-block:: none + + https://example.com/nextcloud/public.php/webdav + +The share token (part of the share link) needs to be provided as +username, and the (optional) share password as password. Note that +these are sent as credentials in the http(s) request header, and are not included in the +URL. + +URL pattern - summary +^^^^^^^^^^^^^^^^^^^^^ + +In summary, it is useful to represent the WebDAV URL as a combination +of the following components: + +.. code-block:: none + + /// + +where: + +* ```` is the instance URL + (``https://example.com/nextcloud/`` in given examples) +* ```` is either ``remote.php/dav/files/USERNAME/`` or + ``public.php/webdav`` +* ```` is the path to the shared folder in user's home + directory (none for public shares) +* ```` is the path to a particular file relative to the + shared folder (````) + +Listing files +------------- + +For generating the dataset using the ``addurls`` command, a list of file names (relative paths) and +their respective URLs is needed. These can be generated automatically, +e.g. with the `webdav4`_ and `fsspec`_ Python libraries. + +An example script is given below, using inline comments for explanations. + +The example assumes that user's webdav credentials are already known +to DataLad under the name ``webdav-mycred`` (if not, these can be +added with ``datalad credentials add``, or provided to the script in a +different way, e.g. as environment variables). + +.. _webdav4: https://pypi.org/project/webdav4/ +.. _fsspec: https://pypi.org/project/fsspec/ + +.. literalinclude:: list_files.py + :language: python + +This would produce the following csv file: + +.. code-block:: none + + name,href + file1.dat,/remote.php/dav/files/USERNAME/sharing/example/file1.dat + foo/file2.dat,/remote.php/dav/files/USERNAME/sharing/example/foo/file2.dat + ... + +Creating the dataset +-------------------- + +In a DataLad dataset, the process of accessing files that were added +via download URLs is handled by a `git-annex special remote`_. The +uncurl remote, available in the `DataLad-next`_ extension, provides +both the ability to reconfigure URLs and the access to DataLad-next's +credential workflow. It can be initialized as follows (optionally with +``autoenable=true``) inside a DataLad dataset that has been created: + +.. _git-annex special remote: https://git-annex.branchable.com/special_remotes/ +.. _DataLad-next: https://github.com/datalad/datalad-next + +.. code-block:: none + + git annex initremote uncurl type=external externaltype=uncurl encryption=none + +With a known URL pattern (see above), a match expression for the uncurl special remote can be defined upfront. Defining a match expression allows us to isolate identifiers (such as ``dirpath``, ``filepath``, etc) in the URL pattern, which becomes particularly useful when URLs need to be transformed in future. + +The regular expression below is relatively generic, with only the +``dirpath`` being given explicitly, and specific to the given +example. Note that if ``dirpath`` included spaces, they would have to +be `url-encoded`_; otherwise, the uncurl remote would split the +expression into two. Websites like `regex101`_ can be helpful in +building and understanding the expression: + +.. code-block:: none + + git annex enableremote uncurl match="(?Phttps://[^/]+)/(?Premote\.php/dav/files/[^/]+|public\.php/webdav)/(?Psharing/example)/(?P.*)" + +Finally, files are added to the dataset with ``datalad addurls`` using the previously generated csv file: + +.. code-block:: none + + datalad addurls listing.csv https://example.com/nextcloud{href} {name} + +.. _regex101: https://regex101.com +.. _url-encoded: https://www.w3schools.com/tags/ref_urlencode.asp + +Transforming URLs +----------------- + +Assuming the same user moves the folder in their Nextcloud account to +``some/other/place/``, access to the files in the same DataLad dataset +can be retained by setting the URL template of the uncurl remote. The +URL template has access to the same identifiers isolated previously +with the match expression, and in the case of this example can use +these defined parts with only ``dirpath`` having to change: + +.. code-block:: none + + git annex enableremote uncurl url='{instance}/{accesspath}/some/other/place/{filepath} + +A different user with whom the dataset is shared would have to +additionally replace ``accesspath``, and (possibly) ``instance``. + +A user with whom the access was shared via a link would need to change +``accesspath``, and would not be using ``dirpath``: + +.. code-block:: none + + git annex enableremote uncurl url='{instance}/public.php/webdav/{filepath} + +Credential caveats +------------------ + +Regardless of whether the files are accessed via the +``remote.php/dav/files/USERNAME/`` or ``public.php/webdav`` path, the +authentication realm for the given Nextcloud instance is the +same. This means users who already have DataLad credentials saved for +the given realm would see their requests for password-protected +links refused. As long as ``get`` does not support explicit +credentials, this can be circumvented by unsetting the credential +realm. + +If a share link is not password protected, the webdav access via +``public.php/webdav`` can still be used. However, this requires +creating a DataLad credential with the token as username, and a +nonempty password (e.g. a single space or ``xyz``) that would not be used. diff --git a/kbi/0028/list_files.py b/kbi/0028/list_files.py new file mode 100644 index 0000000..5beb36c --- /dev/null +++ b/kbi/0028/list_files.py @@ -0,0 +1,41 @@ +import csv +from pathlib import PurePosixPath + +from datalad.api import credentials +from webdav4.fsspec import WebdavFileSystem + +# Retrieve Nextcloud credentials from DataLad +cred = credentials( + "get", + name="webdav-mycred", + return_type="item-or-list", +) + +# Create a fsspec filesystem object, with user's Nextcloud home as root +fs = WebdavFileSystem( + "https://example.com/nextcloud/remote.php/dav/files/USERNAME/", + auth=(cred["cred_user"], cred["cred_secret"]), +) + +# Shared directory, contents of which should be listed +DIRNAME = "sharing/example" + +# List files in the shared directory, writing outputs to a csv file for addurls +with open("listing.csv", "wt") as urlfile: + writer = csv.writer(urlfile, delimiter=",") + writer.writerow(["name", "href"]) + + for dirpath, dirinfo, fileinfo in fs.walk(DIRNAME, detail=True): + # fileinfo is a dict, with file names as keys, + # and dicts with actual file info as values; + # we need path ({"name": "..."}) + # and URL compnent ({"href": "remote.php/dav/..."}) + for f in fileinfo.values(): + name = f["name"] + href = f["href"] + + # reported path is relative to root of fs object, + # what we need is relative to the directory that we walk + relpath = PurePosixPath(name).relative_to(DIRNAME) + + writer.writerow([relpath, href])