Merge pull request #790 from datalad-handbook/mnt-dvc
Maintenance - Speed up sections with imagenette dataset
adswa authored Nov 30, 2021
2 parents dbca2b0 + a59b8cd commit 68d4475
Showing 41 changed files with 136 additions and 145 deletions.
2 changes: 1 addition & 1 deletion docs/basics/101-116-sharelocal.rst
Original file line number Diff line number Diff line change
@@ -95,7 +95,7 @@ dataset. Here is how it looks like:


$ cd mock_user
$ datalad clone ../DataLad-101 --description "DataLad-101 in mock_user"
$ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101

This will install your dataset ``DataLad-101`` into your room mate's home
directory. Note that we have given this new
42 changes: 17 additions & 25 deletions docs/beyond_basics/101-168-dvc.rst
@@ -41,6 +41,10 @@ If you have never used DVC, `its technical docs <https://dvc.org/doc/command-ref
Be mindful: DVC (as DataLad) comes with a range of commands and concepts that have the same names, but differ in functionality to their Git namesake.
Make sure to read the `DVC documentation <https://dvc.org/doc/command-reference>`_ for each command to get more information on what it does.

.. importantnote:: Running this tutorial requires DataLad version 0.13.4 or higher

Running this tutorial requires DataLad version 0.13.4 or higher

Setup
^^^^^

@@ -172,24 +176,23 @@ As they are only *staged* but not *committed*, we need to commit them (into Git)

The DVC project is now ready to version control data.
In the tutorial, data comes from the "Imagenette" dataset.
First, the data needs to be `downloaded from an Amazon S3 bucket <https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz>`_ as a compressed tarball and extracted into the ``data/raw/`` directory of the repository.
This data is available `from an Amazon S3 bucket <https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz>`_ as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the :term:`Open Science Framework (OSF)`.
We'll download it and extract it into the ``data/raw/`` directory of the repository.

.. runrecord:: _examples/DL-101-168-109
:workdir: DVCvsDL/DVC
:language: console

### DVC
# download the data
$ curl -s https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz \
-O imagenette2-160.tgz
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ cp -r imagenette2-160/train data/raw/
$ cp -r imagenette2-160/val data/raw/
# remove the archive and extracted folder
$ rm -rf imagenette2-160
$ rm imagenette2-160.tgz
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz
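
The two command sets in this hunk differ in where ``train/`` and ``val/`` end up after extraction: one copies them out of an ``imagenette2-160/`` directory, the other moves them from the top level. A minimal sketch of an extraction step that tolerates either layout is shown below. It is an illustration only, not part of the commit: the mock archive, the ``extracted`` scratch directory, and the ``root`` variable are assumptions made so that the example runs offline.

```shell
#!/bin/sh
# Sketch: place train/ and val/ under data/raw/ regardless of archive layout.
# The mock archive built here is a stand-in (assumption) for imagenette2-160.tgz.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny mock archive with train/ and val/ at the top level.
mkdir -p src/train src/val
echo "a" > src/train/img1.txt
echo "b" > src/val/img2.txt
tar -czf imagenette2-160.tgz -C src train val
rm -rf src

# Extract into a scratch directory instead of the working directory.
mkdir -p data/raw extracted
tar -xzf imagenette2-160.tgz -C extracted

# If train/ is not at the top level, look for a single wrapping directory
# (for example imagenette2-160/) and use that as the root instead.
root=extracted
if [ ! -d "$root/train" ]; then
    for d in extracted/*/; do
        if [ -d "${d}train" ]; then
            root=${d%/}
        fi
    done
fi

# Move the data into place, then remove the scratch directory and the archive.
mv "$root/train" "$root/val" data/raw/
rm -rf extracted imagenette2-160.tgz

ls data/raw
```

Extracting into a scratch directory keeps the working tree clean if the archive contains unexpected top-level entries, at the cost of one extra cleanup step.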


The data directories in ``data/raw`` are then version controlled with the :command:`dvc add` command that can place files or complete directories under version control by DVC.
@@ -266,11 +269,10 @@ Here, we stick to the project organization of DVC though.

### DVC-DataLad
$ cd ../DVC-DataLad
# Requires >= 0.13.4!
$ datalad download-url \
--archive \
--message "Download Imagenette dataset" \
'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' \
https://osf.io/d6qbz/download \
-O 'data/raw/'

At this point, the data is already version controlled [#f6]_, but the directory structure doesn't resemble that of the DVC dataset yet -- the extracted directory adds one unnecessary directory layer::
@@ -281,11 +283,10 @@ At this point, the data is already version controlled [#f6]_, but the directory
│   └── [...]
├── data
│   └── raw
│    └── imagenette-160
│   ├── train
│    │   ├──[...]
│    └── val
│   ├── [...]
│   ├── train
│    │   ├──[...]
│    └── val
│   ├── [...]
├── metrics
└── model

@@ -294,15 +295,6 @@ At this point, the data is already version controlled [#f6]_, but the directory
To make the scripts work, we move the raw data up one level.
This move needs to be saved.

.. runrecord:: _examples/DL-101-168-115
:workdir: DVCvsDL/DVC-DataLad
:language: console
:realcommand: mv data/raw/imagenette2-160/* data/raw/ && rmdir data/raw/imagenette2-160 && datalad save -m "Move data into preferred locations" | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'

### DVC-DataLad
$ mv data/raw/imagenette2-160/* data/raw/ && rmdir data/raw/imagenette2-160
$ datalad save -m "Move data into preferred locations"

.. find-out-more:: How does DataLad represent modifications to data?

As DataLad always tracks files individually, :command:`datalad status` (or, alternatively, :command:`git status` or :command:`git annex status`) will show modifications on the level of individual files::
@@ -524,7 +516,7 @@ Currently, the dataset can thus be shared via :term:`GitHub` or similar hosting
:language: console
:realcommand: datalad get data/raw/val | grep -v '^\(copy\|get\|drop\|add\|delete\)(ok):.*(file)'
### DVC-DataLad
### DVC-DataLad2
$ datalad get data/raw/val
The data was retrieved by re-downloading the original archive from S3 and extracting the required files.
1 change: 0 additions & 1 deletion docs/beyond_basics/_examples/DL-101-168-103
@@ -3,7 +3,6 @@ $ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Creating a new annex repo at /home/me/DVCvsDL/DVC-DataLad
[INFO] Scanning for unlocked files (this may take some time)
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-108
@@ -1,6 +1,6 @@
### DVC
$ git commit -m "initialize dvc"
[master ec6de77] initialize dvc
[master 38729a0] initialize dvc
9 files changed, 515 insertions(+)
create mode 100644 .dvc/.gitignore
create mode 100644 .dvc/config
12 changes: 5 additions & 7 deletions docs/beyond_basics/_examples/DL-101-168-109
@@ -1,12 +1,10 @@
### DVC
# download the data
$ curl -s https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz \
-O imagenette2-160.tgz
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ cp -r imagenette2-160/train data/raw/
$ cp -r imagenette2-160/val data/raw/
# remove the archive and extracted folder
$ rm -rf imagenette2-160
$ rm imagenette2-160.tgz
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-110
@@ -8,4 +8,4 @@ To track the changes with git, run:

To track the changes with git, run:

git add data/raw/val.dvc data/raw/.gitignore
git add data/raw/.gitignore data/raw/val.dvc
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-113
@@ -1,6 +1,6 @@
### DVC
$ git commit -m "control data with DVC"
[master 597708c] control data with DVC
[master baac3ef] control data with DVC
3 files changed, 12 insertions(+)
create mode 100644 data/raw/train.dvc
create mode 100644 data/raw/val.dvc
12 changes: 7 additions & 5 deletions docs/beyond_basics/_examples/DL-101-168-114
@@ -1,19 +1,21 @@
### DVC-DataLad
$ cd ../DVC-DataLad
# Requires >= 0.13.4!
$ datalad download-url \
--archive \
--message "Download Imagenette dataset" \
'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' \
https://osf.io/d6qbz/download \
-O 'data/raw/'
[INFO] Downloading 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' into '/home/me/DVCvsDL/DVC-DataLad/data/raw/'
[INFO] Downloading 'https://osf.io/d6qbz/download' into '/home/me/DVCvsDL/DVC-DataLad/data/raw/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): data/raw/imagenette2-160.tgz (file)
save(ok): . (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad)
[INFO] Initiating special remote datalad-archives
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 13394, renamed: 13394, +annex: 13394
[INFO] Initiating special remote datalad-archives
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 2701, renamed: 2701, +annex: 2701
[INFO] Finished extraction
add-archive-content(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
add (ok: 1)
add-archive-content (ok: 1)
download_url (ok: 1)
save (ok: 1)
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-124
@@ -1,5 +1,5 @@
### DVC
$ git add .dvc/config
$ git commit -m "add local remote"
[master de96b97] add local remote
[master 8f24d9f] add local remote
1 file changed, 4 insertions(+)
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-125
@@ -1,3 +1,3 @@
### DVC
$ dvc push
13396 files pushed
2703 files pushed
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-130
@@ -1,4 +1,4 @@
### DVC
$ cd DVC-2
$ dvc fetch data/raw/val.dvc
3925 files fetched
789 files fetched
3 changes: 2 additions & 1 deletion docs/beyond_basics/_examples/DL-101-168-131
@@ -10,6 +10,7 @@ $ cd DVC-DataLad-2
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/DVCvsDL/DVC-DataLad-2)
[INFO] Scanning for unlocked files (this may take some time)
[INFO] scanning for annexed files (this may take some time)
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] https://github.com/datalad-handbook/DVC-DataLad.git/config download failed: Not Found
install(ok): /home/me/DVCvsDL/DVC-DataLad-2 (dataset)
8 changes: 5 additions & 3 deletions docs/beyond_basics/_examples/DL-101-168-132
@@ -1,5 +1,7 @@
### DVC-DataLad
### DVC-DataLad2
$ datalad get data/raw/val
[INFO] To obtain some keys we need to fetch an archive of size 98.9 MB
[INFO] To obtain some keys we need to fetch an archive of size 15.1 MB
[INFO] datalad-archives special remote is using an extraction cache under /home/me/DVCvsDL/DVC-DataLad-2/.git/datalad/tmp/archives/8f2938add6. Remove it with DataLad's 'clean' command to save disk space.
get(ok): data/raw/val (directory)
action summary:
get (ok: 3926)
get (ok: 790)
6 changes: 5 additions & 1 deletion docs/beyond_basics/_examples/DL-101-168-141
@@ -3,11 +3,15 @@ $ datalad push --to mysibling
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Transfer data
[INFO] Update availability information
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
[INFO] Start resolving deltas
[INFO] Finished push of Dataset(/home/me/DVCvsDL/DVC-DataLad)
publish(ok): . (dataset) [refs/heads/git-annex->mysibling:refs/heads/git-annex a80a9a74c..75118e5a7]
publish(ok): . (dataset) [refs/heads/git-annex->mysibling:refs/heads/git-annex 363fc913..a52bcf0f]
publish(ok): . (dataset) [refs/heads/master->mysibling:refs/heads/master [new branch]]
action summary:
copy (ok: 2701)
publish (ok: 2)
3 changes: 2 additions & 1 deletion docs/beyond_basics/_examples/DL-101-168-142
@@ -1,4 +1,5 @@
### DVC-DataLad
$ datalad drop data/raw/val
drop(ok): data/raw/val (directory)
action summary:
drop (ok: 3925)
drop (ok: 790)
3 changes: 3 additions & 0 deletions docs/beyond_basics/_examples/DL-101-168-143
@@ -1,2 +1,5 @@
### DVC-DataLad
$ datalad get data/raw/val
get(ok): data/raw/val (directory)
action summary:
get (ok: 790)
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-151
@@ -12,4 +12,4 @@ Updating lock file 'dvc.lock'

To track the changes with git, run:

git add dvc.yaml data/prepared/.gitignore dvc.lock
git add data/prepared/.gitignore dvc.yaml dvc.lock
6 changes: 3 additions & 3 deletions docs/beyond_basics/_examples/DL-101-168-154
@@ -4,9 +4,9 @@ prepare:
cmd: python src/prepare.py
deps:
- path: data/raw
md5: bc61c2dd1230bb5f1bb2cbcf9e21fe87.dir
size: 106551216
nfiles: 13397
md5: d39907b06425b95b440a692eb1af5ba4.dir
size: 16711927
nfiles: 2704
- path: src/prepare.py
md5: ef804f358e00edcfe52c865b471f8f55
size: 1231
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-155
@@ -4,7 +4,7 @@ $ dvc run -n train \
python src/train.py
Running stage 'train' with command:
python src/train.py
/home/adina/env/handbook2/lib/python3.9/site-packages/sklearn/linear_model/_stochastic_gradient.py:574: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
/home/adina/env/handbook2/lib/python3.9/site-packages/sklearn/linear_model/_stochastic_gradient.py:570: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn("Maximum number of iteration reached before "
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-157
@@ -9,4 +9,4 @@ Updating lock file 'dvc.lock'

To track the changes with git, run:

git add dvc.lock dvc.yaml
git add dvc.yaml dvc.lock
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-159
@@ -1,4 +1,4 @@
### DVC
$ dvc metrics show
metrics/accuracy.json:
accuracy: 0.6920152091254753
accuracy: 0.8022813688212928
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-160
@@ -6,7 +6,7 @@ $ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd-pipeline -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline c1db0bd] Add SGD pipeline
[sgd-pipeline 0268557] Add SGD pipeline
5 files changed, 71 insertions(+)
create mode 100644 dvc.lock
create mode 100644 dvc.yaml
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-165
@@ -6,7 +6,7 @@ $ git push --set-upstream origin random-forest
$ git tag -a random-forest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random_forrest 01c7106] Train Random Forrest classifier
[random_forrest 6890032] Train Random Forrest classifier
3 files changed, 11 insertions(+), 17 deletions(-)
Everything up-to-date
fatal: tag 'random-forest' already exists
2 changes: 1 addition & 1 deletion docs/beyond_basics/_examples/DL-101-168-166
@@ -2,7 +2,7 @@
$ dvc metrics show -T
workspace:
metrics/accuracy.json:
accuracy: 0.8073510773130546
accuracy: 0.8048162230671736
random-forest:
metrics/accuracy.json:
accuracy: 0.8187579214195184
6 changes: 3 additions & 3 deletions docs/beyond_basics/_examples/DL-101-168-178
@@ -1,13 +1,13 @@
$ datalad rerun --branch="randomforrest" -m "Recompute classification with random forrest classifier" ready-for-analysis..SGD
[INFO] checkout commit 7854e1a;
[INFO] run commit b997d05; (Train an SGD clas...)
[INFO] checkout commit 2f50499;
[INFO] run commit 152599e; (Train an SGD clas...)
[INFO] Making sure inputs are available (this may take some time)
unlock(ok): model/model.joblib (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
add(ok): model/model.joblib (file)
save(ok): . (dataset)
[INFO] run commit d991bf9; (Evaluate SGD clas...)
[INFO] run commit fbc0ddf; (Evaluate SGD clas...)
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
6 changes: 3 additions & 3 deletions docs/beyond_basics/_examples/DL-101-168-179
@@ -1,10 +1,10 @@
$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index ea992fb10..a137ca8f2 100644
index f847bc7c..38044953 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7782002534854245}
-{"accuracy": 0.752851711026616}
\ No newline at end of file
+{"accuracy": 0.8174904942965779}
+{"accuracy": 0.8136882129277566}
\ No newline at end of file
3 changes: 3 additions & 0 deletions docs/glossary.rst
@@ -250,6 +250,9 @@ Glossary
git-annex concept: The place where :term:`git-annex` stores available file contents. Files that are annexed get
a :term:`symlink` added to :term:`Git` that points to the file content. A different word for :term:`annex`.

Open Science Framework (OSF)
An open source software project that facilitates open collaboration in science research.

pager
A `terminal pager <https://en.wikipedia.org/wiki/Terminal_pager>`_ is a program to view file contents in the :term:`terminal`. Popular examples are the programs ``less`` and ``more``. Some terminal output can be opened automatically in a pager, for example the output of a :command:`git log` command. You can use the arrow keys to navigate and scroll in the pager, and the letter ``q`` to exit it.

1 change: 0 additions & 1 deletion docs/usecases/_examples/ml-101
@@ -1,4 +1,3 @@
$ datalad create imagenette
[INFO] Creating a new annex repo at /home/me/usecases/imagenette
[INFO] scanning for unlocked files (this may take some time)
create(ok): /home/me/usecases/imagenette (dataset)
9 changes: 6 additions & 3 deletions docs/usecases/_examples/ml-102
@@ -3,14 +3,17 @@ $ cd imagenette
$ datalad download-url \
--archive \
--message "Download Imagenette dataset" \
'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz'
[INFO] Downloading 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' into '/home/me/usecases/imagenette/'
'https://osf.io/d6qbz/download'
[INFO] Downloading 'https://osf.io/d6qbz/download' into '/home/me/usecases/imagenette/'
[INFO] Adding content of the archive /home/me/usecases/imagenette/imagenette2-160.tgz into annex AnnexRepo(/home/me/usecases/imagenette)
[INFO] Initiating special remote datalad-archives
[INFO] Finished adding /home/me/usecases/imagenette/imagenette2-160.tgz: Files processed: 13397, +annex: 13397
[INFO] Finished adding /home/me/usecases/imagenette/imagenette2-160.tgz: Files processed: 2701, +annex: 2701
[INFO] Finished extraction
download_url(ok): /home/me/usecases/imagenette/imagenette2-160.tgz (file)
save(ok): . (dataset)
add-archive-content(ok): /home/me/usecases/imagenette (dataset)
action summary:
add (ok: 1)
add-archive-content (ok: 1)
download_url (ok: 1)
save (ok: 1)
1 change: 0 additions & 1 deletion docs/usecases/_examples/ml-103
@@ -1,7 +1,6 @@
$ cd ../
$ datalad create -c text2git -c yoda ml-project
[INFO] Creating a new annex repo at /home/me/usecases/ml-project
[INFO] scanning for unlocked files (this may take some time)
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
2 changes: 1 addition & 1 deletion docs/usecases/_examples/ml-104
@@ -5,7 +5,7 @@ $ datalad clone -d . ../imagenette data/raw
[INFO] Cloning dataset to Dataset(/home/me/usecases/ml-project/data/raw)
[INFO] Attempting to clone from ../imagenette to /home/me/usecases/ml-project/data/raw
[INFO] Completed clone attempts for Dataset(/home/me/usecases/ml-project/data/raw)
[INFO] scanning for unlocked files (this may take some time)
[INFO] scanning for annexed files (this may take some time)
install(ok): data/raw (dataset)
add(ok): data/raw (file)
add(ok): .gitmodules (file)