diff --git a/docs/basics/101-102-populate.rst b/docs/basics/101-102-populate.rst
index ed114911d..1659ae24a 100644
--- a/docs/basics/101-102-populate.rst
+++ b/docs/basics/101-102-populate.rst
@@ -175,6 +175,8 @@ the directory ``books/``, and thanks to that commit message we have a nice
 human-readable summary of that action.
 
 .. findoutmore:: DOs and DON'Ts for commit messages
+   :name: fom-commitmessage
+   :float:
 
    **DOs**
 
diff --git a/docs/basics/101-110-run2.rst b/docs/basics/101-110-run2.rst
index 6ea2517d3..1f5768192 100644
--- a/docs/basics/101-110-run2.rst
+++ b/docs/basics/101-110-run2.rst
@@ -463,7 +463,7 @@ the ``--input`` and ``--output`` specification done before.
    can be accessed with an integer index, e.g., ``{inputs[0]}`` for the very
    first input.
 
-.. findoutmore:: ... wait, what if I need a { or } character in my datalad run call?
+.. findoutmore:: ... wait, what if I need a curly bracket in my datalad run call?
 
    If your command call involves a ``{`` or ``}`` character, you will need to escape
    this brace character by doubling it, i.e., ``{{`` or ``}}``.
diff --git a/docs/basics/101-115-symlinks.rst b/docs/basics/101-115-symlinks.rst
index 4566e2227..8c01edc8f 100644
--- a/docs/basics/101-115-symlinks.rst
+++ b/docs/basics/101-115-symlinks.rst
@@ -86,6 +86,8 @@ This process is often referred to as a file being *annexed*, and the object
 tree is also known as the *annex* of a dataset.
 
 .. windowsworkarounds:: What happens on Windows?
+   :name: woa_objecttree
+   :float:
 
    Windows has insufficient support for :term:`symlink`\s and revoking write
    :term:`permissions` on files. Therefore, :term:`git-annex` classifies it as a
    :term:`crippled filesystem` and has to stray from its default behavior.
@@ -93,7 +95,7 @@ tree is also known as the *annex* of a dataset.
    **Why is that?** Data *needs* to be in the annex for version control and transport logistics -- the annex is able to store all previous versions of the data, and manage the transport to other storage locations if you want to publish your dataset.
-   But as the Findoutmore at the end of this section will show, the :term:`annex` is a non-human readable tree structure, and data thus also needs to exist in its original location.
+   But as the :ref:`Findoutmore in this section <fom-objecttree>` will show, the :term:`annex` is a non-human-readable tree structure, and data thus also needs to exist in its original location.
    Thus, it exists in both places: it's moved into the annex, and copied back into its original location.
    Once you edit an annexed file, the most recent version of the file is available in its original location, and past versions are stored and readily available in the annex.
    If you reset your dataset to a previous state (as is shown in the section :ref:`history`), the respective version of your data is taken from the annex and copied to replace the newer version, and vice versa.
@@ -202,6 +204,7 @@ to manage the file system in a DataLad dataset (:ref:`filesystem`).
 
 .. findoutmore:: more about paths, checksums, object trees, and data integrity
+   :name: fom-objecttree
 
    But why does the target path to the object tree need to be so cryptic?
    Does someone want to create
diff --git a/docs/basics/101-123-config2.rst b/docs/basics/101-123-config2.rst
index e46589ba3..a400061b9 100644
--- a/docs/basics/101-123-config2.rst
+++ b/docs/basics/101-123-config2.rst
@@ -455,6 +455,7 @@ with a dot, and finally converting all letters to lower case. The ``datalad.log.level``
 configuration option thus is the environment variable ``DATALAD_LOG_LEVEL``.
 
 .. 
findoutmore:: Some more general information on environment variables
+   :name: fom-envvar
 
    Names of environment variables are often all-uppercase.
    While the ``$`` is not part of the name of the environment variable,
    it is necessary to *refer* to the environment
diff --git a/docs/basics/101-124-procedures.rst b/docs/basics/101-124-procedures.rst
index cf77c835e..921334d80 100644
--- a/docs/basics/101-124-procedures.rst
+++ b/docs/basics/101-124-procedures.rst
@@ -75,7 +75,7 @@ only modify ``.gitattributes``, but can also populate a dataset with
 particular content, or automate routine tasks such as synchronizing
 dataset content with certain siblings.
 What makes them a particularly versatile and flexible tool is
-that anyone can write their own procedures. If a workflow is
+that anyone can write their own procedures (find a tutorial :ref:`here <fom-procedures>`). If a workflow is
 a standard in a team and needs to be applied often, turning it into
 a script can save time and effort. By pointing DataLad
 to the location the procedures reside in they can be applied, and by
@@ -184,6 +184,8 @@ was applied.
 
 .. findoutmore:: Write your own procedures
+   :name: fom-procedures
+   :float:
 
    Procedures can come with DataLad or its extensions, but anyone can
    write their own ones in addition, and deploy them on individual machines,
diff --git a/docs/basics/101-130-yodaproject.rst b/docs/basics/101-130-yodaproject.rst
index ffe6b26ca..d0ca063b3 100644
--- a/docs/basics/101-130-yodaproject.rst
+++ b/docs/basics/101-130-yodaproject.rst
@@ -12,10 +12,11 @@ In principle, you can prepare YODA-compliant data analyses in any programming
 language of your choice. But because you are already familiar
 with the `Python `__ programming language, you decide
 to script your analysis in Python. Delighted, you find out that there is even
-a Python API for DataLad's functionality that you can read about in the hidden
-section below:
+a Python API for DataLad's functionality that you can read about in :ref:`a Findoutmore <fom-pythonapi>`.
 
 .. findoutmore:: DataLad's Python API
+   :name: fom-pythonapi
+   :float:
 
    .. _python:
 
@@ -159,9 +160,11 @@ For the purpose of this analysis, the DataLad handbook provides an
 ``iris_data`` dataset at
 `https://github.com/datalad-handbook/iris_data <https://github.com/datalad-handbook/iris_data>`_.
 You can either use this provided input dataset, or find out how to create an
-independent dataset from scratch in the hidden section below.
+independent dataset from scratch in a :ref:`dedicated Findoutmore <fom-iris>`.
 
 .. findoutmore:: Creating an independent input dataset
+   :name: fom-iris
+   :float:
 
    If you acquire your own data for a data analysis, it will not magically
    exist as a DataLad dataset that you can simply install from somewhere -- you'll have
@@ -186,16 +189,9 @@ independent dataset from scratch in the hidden section below.
 
       $ datalad create iris_data
 
    and subsequently got the data from a publicly available
-   `GitHub Gist `_ with a
+   `GitHub Gist `_, a code snippet or other short standalone information (more on Gists `here `__), with a
    :command:`datalad download-url` command:
 
-   .. findoutmore:: What are GitHub Gists?
-
-      GitHub Gists are a particular service offered by GitHub that allow users
-      to share pieces of code snippets and other short/small standalone
-      information. Find out more on Gists
-      `here `__.
-
    .. runrecord:: _examples/DL-101-130-102
      :workdir: dl-101
      :language: console
@@ -765,9 +761,11 @@ an additional :command:`git push` [#f6]_ with the ``--tags`` option is required:
 Yay! Consider your midterm project submitted!
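+As a quick illustration of this last step, here is a minimal command sketch for
+publishing a tagged dataset state. It assumes a hypothetical sibling named
+``github`` and a tag named ``1.0`` -- adjust both to your own setup::
+
+   $ git tag -a 1.0 -m "midterm submission"  # tag the submitted state
+   $ datalad push --to github                # publish dataset history and annexed contents, as configured
+   $ git push github --tags                  # tags are not pushed by default and need an extra push
+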
 Others can now install your dataset and check out your data science project -- and even better: they can
-reproduce your data science project easily from scratch!
+reproduce your data science project easily from scratch (take a look into the :ref:`Findoutmore <fom-midtermclone>` to see how)!
 
 .. findoutmore:: On the looks and feels of this published dataset
+   :name: fom-midtermclone
+   :float:
 
    Now that you have created and published such a YODA-compliant dataset, you
    are understandably excited how this dataset must look and feel for others.
diff --git a/docs/basics/101-132-advancednesting.rst b/docs/basics/101-132-advancednesting.rst
index 8e0d1b18c..b20f79c9e 100644
--- a/docs/basics/101-132-advancednesting.rst
+++ b/docs/basics/101-132-advancednesting.rst
@@ -52,9 +52,11 @@ this *in the superdataset* in order to have a clean superdataset status.
 This point in time in DataLad-101 is a convenient moment to dive a bit deeper
 into the functions of the :command:`datalad status` command. If you are
-interested in this, checkout the hidden section below.
+interested in this, check out the :ref:`dedicated Findoutmore <fom-status>`.
 
 .. findoutmore:: More on datalad status
+   :name: fom-status
+   :float:
 
    First of all, let's start with a quick overview of the different content
    *types* and content *states* various :command:`datalad status` commands in the course
diff --git a/docs/basics/101-135-help.rst b/docs/basics/101-135-help.rst
index 76a781cdb..d5bede58b 100644
--- a/docs/basics/101-135-help.rst
+++ b/docs/basics/101-135-help.rst
@@ -200,7 +200,7 @@ To get extensive information on what :command:`datalad status` does underneath t
       $ datalad status
       $ ...
 
-   You can find out a bit more on environment variable :ref:`in this footnote `.
+   You can find out a bit more on environment variables :ref:`in this Findoutmore <fom-envvar>`.
 
    The configuration variable can be used to set the log level on a user (global) or
    system-wide level with the :command:`git config` command::
diff --git a/docs/basics/101-141-push.rst b/docs/basics/101-141-push.rst
index 05c7b6882..1c82b4c68 100644
--- a/docs/basics/101-141-push.rst
+++ b/docs/basics/101-141-push.rst
@@ -243,13 +243,7 @@ for more information).
   In practice, this default will most likely lead to the same outcome as when specifying
   ``none``: only your dataset's history, but no annexed contents will be published.
-
-  .. gitusernote:: Impact of auto-mode on git-annex
-
-     On a technical level, the ``auto`` option leads to adding ``auto`` to the underlying
-     ``git annex copy`` command, which in turn publishes annexed contents based on the
-     `git-annex preferred content configuration `_
-     of the sibling.
+  On a technical level, the ``auto`` option leads to adding ``auto`` to the underlying ``git annex copy`` command, which in turn publishes annexed contents based on the `git-annex preferred content configuration `_ of the sibling.
 
   In order to publish all annexed contents, one needs to specify ``--transfer-data all``.
   Alternatively, adding paths to the ``publish`` call will publish the specified
diff --git a/docs/beyond_basics/101-147-riastores.rst b/docs/beyond_basics/101-147-riastores.rst
index d20e95c6b..3931c1fe4 100644
--- a/docs/beyond_basics/101-147-riastores.rst
+++ b/docs/beyond_basics/101-147-riastores.rst
@@ -402,6 +402,7 @@ a single :command:`datalad push` to the RIA sibling suffices:
 
 .. 
runrecord:: _examples/DL-101-147-108
    :language: console
    :workdir: dl-101/DataLad-101
+   :lines: 1-25, 38-
 
    $ tree /home/me/myriastore
 
@@ -426,7 +427,7 @@ As a demonstration, we'll do it for the ``midterm_project`` subdataset:
 
    With creating a RIA sibling to the RIA store and publishing the contents
    of the ``midterm_project`` subdataset to the store, a second dataset has
    been added to the datastore. Note how it is represented on the same hierarchy
-   level as the previous dataset, underneath its dataset ID:
+   level as the previous dataset, underneath its dataset ID (the output below is cut off for readability):
 
    .. runrecord:: _examples/DL-101-147-111
 
@@ -438,6 +439,7 @@ As a demonstration, we'll do it for the ``midterm_project`` subdataset:
 
    .. runrecord:: _examples/DL-101-147-112
      :language: console
      :workdir: dl-101/DataLad-101
+     :lines: 1-25, 38-58
 
      $ tree /home/me/myriastore
 
diff --git a/docs/beyond_basics/101-171-enki.rst b/docs/beyond_basics/101-171-enki.rst
index b806c798e..abdcb4d76 100644
--- a/docs/beyond_basics/101-171-enki.rst
+++ b/docs/beyond_basics/101-171-enki.rst
@@ -182,7 +182,7 @@ This initial sketch serves to highlight key differences and adjustments due to t
       # job handler should clean up workspace
 
 Just like the general script from the last section, this script can be submitted to any job scheduler -- here with a subject ID as a ``$subid`` command line variable and a job ID as environment variable as identifiers for the fMRIprep run and branch names.
-At this point, the workflow misses a tweak that is necessary in fMRIprep to enable re-running computations.
+At this point, the workflow misses a tweak that is necessary in fMRIprep to enable re-running computations (the complete file is in :ref:`this Findoutmore <fom-enki>`).
 
 .. findoutmore:: Fine-tuning: Enable re-running
@@ -195,95 +195,100 @@ At this point, the workflow misses a tweak that is necessary in fMRIprep to enab
 
       (cd freesurfer && rm -rf fsaverage "$subid")
 
    With this in place, the only things missing are a :term:`shebang` at the top of the script, and some shell settings for robust scripting with verbose log files (``set -e -u -x``).
-   You can find the full script with rich comments in the next findoutmore.
+   You can find the full script with rich comments in :ref:`this Findoutmore <fom-enki>`.
 
 .. findoutmore:: See the complete bash script
-
-   This script is placed in ``code/fmriprep_participant_job``:
+   :name: fom-enki
+   :float: p
+
+   This script is placed in ``code/fmriprep_participant_job``.
+   For technical reasons (rendering of the handbook), we break it into several blocks of code::
+
+      #!/bin/bash
+
+      # fail whenever something is fishy, use -x to get verbose logfiles
+      set -e -u -x
+
+      # we pass in "sourcedata/sub-...", extract subject id from it
+      subid=$(basename $1)
+
+      # this is all running under /tmp inside a compute job, /tmp is a performant
+      # local filesystem
+      cd /tmp
+      # get the output dataset, which includes the inputs as well
+      # flock makes sure that this does not interfere with another job
+      # finishing at the same time, and pushing its results back
+      # importantly, we clone from the location that we want to push the
+      # results to
+      flock --verbose $DSLOCKFILE \
+        datalad clone /data/project/enki/super ds
+
+      # all following actions are performed in the context of the superdataset
+      cd ds
+      # obtain all first-level subdatasets:
+      # dataset with fmriprep singularity container and pre-configured
+      # pipeline call; also get the output dataset to prep them for output
+      # consumption, we need to tune them for this particular job, sourcedata
+      # important: because we will push additions to the result datasets back
+      # at the end of the job, the installation of these result datasets
+      # must happen from the location we want to push back to
+      datalad get -n -r -R1 .
+      # let git-annex know that we do not want to remember any of these clones
+      # (we could have used an --ephemeral clone, but that might deposit data
+      # of failed jobs at the origin location, if the job runs on a shared
+      # filesystem -- let's stay self-contained)
+      git submodule foreach --recursive git annex dead here
 
    .. code-block:: bash
 
-      #!/bin/bash
-
-      # fail whenever something is fishy, use -x to get verbose logfiles
-      set -e -u -x
-
-      # we pass in "sourcedata/sub-...", extract subject id from it
-      subid=$(basename $1)
-
-      # this is all running under /tmp inside a compute job, /tmp is a performant
-      # local filesystem
-      cd /tmp
-      # get the output dataset, which includes the inputs as well
-      # flock makes sure that this does not interfere with another job
-      # finishing at the same time, and pushing its results back
-      # importantly, we clone from the location that we want to push the
-      # results too
-      flock --verbose $DSLOCKFILE \
-        datalad clone /data/project/enki/super ds
-
-      # all following actions are performed in the context of the superdataset
-      cd ds
-      # obtain all first-level subdatasets:
-      # dataset with fmriprep singularity container and pre-configured
-      # pipeline call; also get the output dataset to prep them for output
-      # consumption, we need to tune them for this particular job, sourcedata
-      # important: because we will push additions to the result datasets back
-      # at the end of the job, the installation of these result datasets
-      # must happen from the location we want to push back too
-      datalad get -n -r -R1 . 
- # let git-annex know that we do not want to remember any of these clones - # (we could have used an --ephemeral clone, but that might deposite data - # of failed jobs at the origin location, if the job runs on a shared - # filesystem -- let's stay self-contained) - git submodule foreach --recursive git annex dead here - - # checkout new branches in both subdatasets - # this enables us to store the results of this job, and push them back - # without interference from other jobs - git -C fmriprep checkout -b "job-$JOBID" - git -C freesurfer checkout -b "job-$JOBID" - # create workdir for fmriprep inside to simplify singularity call - # PWD will be available in the container - mkdir -p .git/tmp/wdir - # pybids (inside fmriprep) gets angry when it sees dangling symlinks - # of .json files -- wipe them out, spare only those that belong to - # the participant we want to process in this job - find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"'*' -delete - - # next one is important to get job-reruns correct. We remove all anticipated - # output, such that fmriprep isn't confused by the presence of stale - # symlinks. Otherwise we would need to obtain and unlock file content. - # But that takes some time, for no reason other than being discarded - # at the end - (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) - (cd freesurfer && rm -rf fsaverage "$subid") + # checkout new branches in both subdatasets + # this enables us to store the results of this job, and push them back + # without interference from other jobs + git -C fmriprep checkout -b "job-$JOBID" + git -C freesurfer checkout -b "job-$JOBID" + # create workdir for fmriprep inside to simplify singularity call + # PWD will be available in the container + mkdir -p .git/tmp/wdir + # pybids (inside fmriprep) gets angry when it sees dangling symlinks + # of .json files -- wipe them out, spare only those that belong to + # the participant we want to process in this job + find sourcedata -mindepth 2 -name '*.json' -a ! -wholename "$1"'*' -delete + + # next one is important to get job-reruns correct. We remove all anticipated + # output, such that fmriprep isn't confused by the presence of stale + # symlinks. Otherwise we would need to obtain and unlock file content. + # But that takes some time, for no reason other than being discarded + # at the end + (cd fmriprep && rm -rf logs "$subid" "$subid.html" dataset_description.json desc-*.tsv) + (cd freesurfer && rm -rf fsaverage "$subid") + + .. code-block:: bash - # the meat of the matter, add actual parameterization after --participant-label - datalad containers-run \ - -m "fMRIprep $subid" \ - --explicit \ - -o freesurfer -o fmriprep \ - -i "$1" \ - -n code/pipelines/fmriprep \ - sourcedata . 
participant \
-        --n_cpus 1 \
-        --skip-bids-validation \
-        -w .git/tmp/wdir \
-        --participant-label "$subid" \
-        --random-seed 12345 \
-        --skull-strip-fixed-seed \
-        --md-only-boilerplate \
-        --output-spaces MNI152NLin6Asym \
-        --use-aroma \
-        --cifti-output
-      # selectively push outputs only
-      # ignore root dataset, despite recorded changes, needs coordinated
-      # merge at receiving end
-      flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin
-      flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin
-
-      # job handler should clean up workspace
+      # the meat of the matter, add actual parameterization after --participant-label
+      datalad containers-run \
+        -m "fMRIprep $subid" \
+        --explicit \
+        -o freesurfer -o fmriprep \
+        -i "$1" \
+        -n code/pipelines/fmriprep \
+        sourcedata . participant \
+        --n_cpus 1 \
+        --skip-bids-validation \
+        -w .git/tmp/wdir \
+        --participant-label "$subid" \
+        --random-seed 12345 \
+        --skull-strip-fixed-seed \
+        --md-only-boilerplate \
+        --output-spaces MNI152NLin6Asym \
+        --use-aroma \
+        --cifti-output
+      # selectively push outputs only
+      # ignore root dataset, despite recorded changes, needs coordinated
+      # merge at receiving end
+      flock --verbose $DSLOCKFILE datalad push -d fmriprep --to origin
+      flock --verbose $DSLOCKFILE datalad push -d freesurfer --to origin
+
+      # job handler should clean up workspace
 
 Pending modifications to paths provided in clone locations, the above script and dataset setup is generic enough to be run on different systems and with different job schedulers.
@@ -295,9 +300,11 @@ Job submission
 
 Job submission now only boils down to invoking the script for each participant with a participant identifier that determines on which subject the job runs, and setting two environment variables -- one the job ID that determines the branch name that is created, and one that points to a lockfile created beforehand once in ``.git``.
 Job schedulers such as HTCondor have syntax that can identify subject IDs from consistently named directories, for example, and the submit file can thus be lean even though it queues up more than 1000 jobs.
-You can find the submit file used in this analyses in the findoutmore below.
+You can find the submit file used in this analysis in :ref:`this Findoutmore <fom-condor>`.
 
 .. findoutmore:: HTCondor submit file
+   :name: fom-condor
+   :float:
 
    The following submit file was created and saved in ``code/fmriprep_all_participants.submit``:
 
diff --git a/docs/glossary.rst b/docs/glossary.rst
index bd1719d24..aca4b6ca3 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -135,7 +135,7 @@ Glossary
    environment variable
       A variable made up of a name/value pair. Programs using a given
       environment variable will use its associated value for their execution.
-      You can find out a bit more on environment variable :ref:`in this footnote `.
+      You can find out a bit more on environment variables :ref:`in this Findoutmore <fom-envvar>`.
 
    ephemeral clone
       dataset clones that share the annex with the dataset they were cloned from,
       without :term:`git-annex` being aware of it.
diff --git a/docs/intro/narrative.rst b/docs/intro/narrative.rst
index 0040104d0..9982874c6 100644
--- a/docs/intro/narrative.rst
+++ b/docs/intro/narrative.rst
@@ -53,6 +53,7 @@ in the example below (it shows the creation of a DataLad dataset):
 When copying code snippets into your own terminal, do not copy the leading
 ``$`` -- this only indicates that the line is a command, and would lead
 to an error when executed.
+Don't worry :ref:`if you do not want to code along <fom-lazy>`, though.
 The book is split into different parts.
 The upcoming chapters are the *Basics* that intend to show you the core DataLad functionality
@@ -90,6 +91,7 @@ You can decide for yourself whether you want to check them out:
 
 .. only:: latex
 
    .. findoutmore:: For curious minds
+      :name: fom-intro
 
       Sections like this contain content that goes beyond the basics
       necessary to complete a challenge.
@@ -197,6 +199,8 @@ share and publish with DataLad.
    :width: 70%
 
 .. findoutmore:: I can not/do not want to code along...
+   :name: fom-lazy
+   :float:
 
    If you do not want to follow along and only read, there is a showroom dataset
    of the complete DataLad-101 project at
diff --git a/docs/usecases/HCP_dataset.rst b/docs/usecases/HCP_dataset.rst
index 79cf8dd63..0e282943f 100644
--- a/docs/usecases/HCP_dataset.rst
+++ b/docs/usecases/HCP_dataset.rst
@@ -434,7 +434,7 @@ retrieve data right away.
       keyring.set_password("datalad-hcp-s3", "secret_id", )
 
    Alternatively, one can set their credentials using environment variables.
-   For more details on this method, :ref:`see this footnote `.
+   For more details on this method, :ref:`see this Findoutmore <fom-envvar>`.
 
    .. code-block:: bash
 
diff --git a/docs/usecases/reproducible_neuroimaging_analysis.rst b/docs/usecases/reproducible_neuroimaging_analysis.rst
index c4b82a962..28fc5026f 100644
--- a/docs/usecases/reproducible_neuroimaging_analysis.rst
+++ b/docs/usecases/reproducible_neuroimaging_analysis.rst
@@ -698,6 +698,7 @@ is wrapped into a ``datalad containers-run`` command with appropriate
 .. runrecord:: _examples/repro2-120
    :language: console
    :workdir: usecases/repro2/glm_analysis
+   :lines: 1-12, 356-
 
    $ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
      --input sub-02/1stlvl_design.fsf \
@@ -777,6 +778,7 @@ time steps.
 
 .. runrecord:: _examples/repro2-124
    :language: console
    :workdir: usecases/repro2/glm_analysis
+   :lines: 1-17, 362-
 
    $ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
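+To make the scope of this ``rerun`` more tangible, here is a minimal sketch of
+the invocation together with one hypothetical way to inspect the outcome --
+the branch names stem from the command above, and we assume the original
+analysis lives on ``master``::
+
+   $ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
+   $ git log --oneline ready4analysis..verify   # list only the re-executed commits
+   $ git diff master verify                     # an empty diff means the results are identical
+
+``--since ready4analysis`` selects every commit after that state for re-execution,
+``--onto`` determines the starting point, and ``--branch`` collects the recomputed
+results on a new branch for comparison.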