Merge pull request #6 from iterative/get-started-updates
finish restructuring and get-started generation updates
shcheklein authored Aug 21, 2019
2 parents 903feb7 + d174cae commit b5bf116
Showing 6 changed files with 85 additions and 63 deletions.
5 changes: 0 additions & 5 deletions .gitignore
@@ -1,8 +1,3 @@
# Custom
*.zip
/tmp


# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
24 changes: 16 additions & 8 deletions README.md
@@ -9,24 +9,32 @@ Started](https://dvc.org/doc/get-started) and other sections of the DVC docs.
Please make sure you have these available on the environment where these scripts
will run:

- Git
- Python (with `pip`)
- [Git](https://git-scm.com/)
- [Python](https://www.python.org/) 3 (with `python3` and [pip](https://pypi.org/project/pip/) commands)
- [Virtualenv](https://virtualenv.pypa.io/en/stable/)

## Scripts

Each example DVC project is in each of the root folders:
Each example DVC project lives in its own root directory (listed below). `cd`
into that directory before running the desired script, for example:

```console
$ cd example-get-started
$ ./deploy.sh
```

<!-- ### dataset-registry -->

### example-get-started

- `generate.sh` - generates the `example-get-started` DVC project from
- `deploy.sh`: Builds a code archive from
[example-get-started/code](example-get-started/code) and deploys it to S3. This
is the archive that `generate.sh` downloads.
> Requires AWS CLI and write access to `s3://dvc-public/code/get-started/`.
- `generate.sh`: Generates the `example-get-started` DVC project from
scratch. A source code archive is downloaded from S3 the same way as in
[Connect Code and Data](https://dvc.org/doc/get-started/connect-code-and-data).

> If you change the [source code](code/src/) files in this repo, run
> `deploy.sh` first, to make sure that the `code.zip` archive is up to date.
- `deploy.sh` - deploys code archive that is downloaded as part of the
`generate.sh` to S3.
> Requires AWS CLI and write access to `s3://dvc-share/get-started/`.
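
The prerequisites listed in the README could be checked before running either script. The following is a minimal sketch, not part of this commit: `check_prereqs` is a hypothetical helper name, and the command list is taken from the README's prerequisites (the AWS CLI is only needed for `deploy.sh`, so it is left out here).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this repo): report every missing command
# and return non-zero if at least one is missing.
check_prereqs() {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "Missing prerequisite: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Commands named in the README's prerequisites list.
if check_prereqs git python3 pip virtualenv; then
  echo "All prerequisites found."
fi
```

Reporting every missing command in one pass, instead of stopping at the first, saves a round trip when several tools are absent.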
3 changes: 3 additions & 0 deletions example-get-started/.gitignore
@@ -0,0 +1,3 @@
# Custom
*.zip
/tmp
11 changes: 6 additions & 5 deletions example-get-started/code/README.md
@@ -6,11 +6,11 @@ Please report any issues in

![](https://dvc.org/static/img/example-flow-2x.png)

Get Started is a step by step introduction into basic DVC concepts. It doesn't
_Get Started_ is a step by step introduction into basic DVC concepts. It doesn't
go into details much, but provides links and expandable sections to learn more.

The idea of the project is a simplified version of the
[tutorial](https://dvc.org/doc/tutorial). It explores the natural language
[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
For example, we want one classifier which can predict a post that is about the
Python language by tagging it `python`.
@@ -19,7 +19,7 @@ Python language by tagging it `python`.

Start by cloning the project:

```dvc
```console
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
```
@@ -28,7 +28,7 @@ Now let's install the requirements. But before we do that, we **strongly**
recommend creating a virtual environment with a tool such as
[virtualenv](https://virtualenv.pypa.io/en/stable/):

```dvc
```console
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r src/requirements.txt
@@ -127,7 +127,8 @@ but right after you for Git clone and [`dvc pull`](https://man.dvc.org/pull) to
download files that are under DVC control, the structure of the project should
look like this:

```sh
```console
$ tree
.
├── auc.metric # <-- DVC metric compares baseline and bigrams
├── data # <-- Directory with raw and intermediate data
18 changes: 11 additions & 7 deletions example-get-started/deploy.sh
@@ -18,14 +18,18 @@ pushd $PACKAGE_DIR
zip -r $PACKAGE src/*
popd

# Requires AWS CLI and write access to `s3://dvc-share/get-started/`.
# Requires AWS CLI and write access to `s3://dvc-public/code/get-started/`.
mv $PACKAGE_DIR/$PACKAGE .
aws s3 cp --acl public-read $PACKAGE s3://dvc-share/get-started/$PACKAGE
aws s3 cp --acl public-read $PACKAGE s3://dvc-public/code/get-started/$PACKAGE

# Testing
wget https://dvc.org/s3/get-started/$PACKAGE -O $TEST_PACKAGE
# Sanity check
wget https://code.dvc.org/get-started/$PACKAGE -O $TEST_PACKAGE
unzip $TEST_PACKAGE -d $TEST_DIR
# TODO: Print some info. on what to look for here.
cmp $PACKAGE $TEST_PACKAGE

# printf is used because bash's builtin echo does not expand \n by default.
printf '\nNo output should be produced by the following cmp and diff commands:\n\n'

cmp $PACKAGE $TEST_PACKAGE # Expected output: nothing
rm -f $TEST_PACKAGE
diff -r $PACKAGE_DIR $TEST_DIR
cp -f $PACKAGE_DIR/README.md $TEST_DIR
diff -r $PACKAGE_DIR $TEST_DIR # Expected output: nothing
rm -fR $TEST_DIR
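
The sanity check above relies on `cmp` and `diff` staying silent when everything matches. A variation that states the result explicitly might look like the sketch below. `verify_copy` is a hypothetical helper, not something `deploy.sh` defines, and the throwaway files only stand in for `$PACKAGE` and `$TEST_PACKAGE`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in deploy.sh): compare an archive with its
# downloaded copy and say explicitly whether they match.
verify_copy() {
  local original="$1" downloaded="$2"
  if cmp -s "$original" "$downloaded"; then
    echo "OK: $downloaded matches $original"
  else
    echo "MISMATCH: $downloaded differs from $original" >&2
    return 1
  fi
}

# Self-contained demo: temporary files stand in for $PACKAGE and $TEST_PACKAGE.
demo_dir="$(mktemp -d)"
printf 'archive bytes\n' > "$demo_dir/code.zip"
cp "$demo_dir/code.zip" "$demo_dir/code-test.zip"
verify_copy "$demo_dir/code.zip" "$demo_dir/code-test.zip"
rm -rf "$demo_dir"
```

`cmp -s` suppresses its own output and reports only through the exit status, which lets the caller decide what to print.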
87 changes: 49 additions & 38 deletions example-get-started/generate.sh
@@ -5,16 +5,16 @@
# x Print commands and their arguments as they are executed.
set -eux

THIS="$( cd "$(dirname "$0")" ; pwd -P )"
HERE="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="example-get-started"
REPO_PATH="../$REPO_NAME"
REPO_PATH="$HERE/build/$REPO_NAME"

if [ -d "$REPO_PATH" ]; then
echo "Repo $REPO_NAME already exists, remove it first"
echo "Repo $REPO_PATH already exists, remove it first."
exit 1
fi

mkdir $REPO_PATH
mkdir -p $REPO_PATH
pushd $REPO_PATH

git init
@@ -25,37 +25,42 @@ source .env/bin/activate
echo '.env/' >> .gitignore

git add .
git commit -a -m "initialize Git"
git tag -a "0-empty" -m "Git is initialized"
git commit -m "Initialize Git repository"
git tag -a "0-empty" -m "Git initialized"

pip install dvc[s3]

dvc init
git commit -m "initialize DVC"
git tag -a "1-initialize" -m "DVC is initialized"
git commit -m "Initialize DVC project"
git tag -a "1-initialize" -m "DVC initialized."

# Remote active on this environment only for writing to HTTP redirect above.
dvc remote add -d --local storage s3://dvc-public/remote/get-started

# Actual remote for generated project (read-only). Redirect of S3 bucket below.
dvc remote add -d storage https://remote.dvc.org/get-started
dvc remote add -d --local storage s3://dvc-storage/get-started
git commit -a -m "add default HTTP remote"
git tag -a "2-remote" -m "remote initialized"

git add .
git commit -a -m "Configure default HTTP remote (read-only)"
git tag -a "2-remote" -m "Read-only remote storage configured."

mkdir data
wget https://dvc.org/s3/get-started/data.xml -O data/data.xml
wget https://data.dvc.org/get-started/data.xml -O data/data.xml
dvc add data/data.xml
git add data/.gitignore data/data.xml.dvc
git commit -m "add raw data to DVC"
git tag -a "3-add-file" -m "data file added"
git commit -m "Add raw data to project"
git tag -a "3-add-file" -m "Data file added."
dvc push

mkdir src
wget https://dvc.org/s3/get-started/code.zip
wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip
echo "dvc[s3]" >> src/requirements.txt
cp $THIS/code/README.md $REPO_PATH
cp $HERE/code/README.md $REPO_PATH
git add .
git commit -m 'add source code'
git tag -a "4-sources" -m "source code added"
git commit -m 'Add source code files to repo'
git tag -a "4-sources" -m "Source code added."

pip install -r src/requirements.txt

@@ -64,8 +69,8 @@ dvc run -f prepare.dvc \
-o data/prepared \
python src/prepare.py data/data.xml
git add data/.gitignore prepare.dvc
git commit -m "add data preparation stage"
git tag -a "5-preparation" -m "first transformation stage added"
git commit -m "Create data preparation stage"
git tag -a "5-preparation" -m "First pipeline stage (data preparation) created."
dvc push

dvc run -f featurize.dvc \
@@ -74,54 +79,60 @@ dvc run -f featurize.dvc \
python src/featurization.py \
data/prepared data/features
git add data/.gitignore featurize.dvc
git commit -m "add featurization stage"
git tag -a "6-featurization" -m "featurization stage added"
git commit -m "Create featurization stage"
git tag -a "6-featurization" -m "Featurization stage created."
dvc push

dvc run -f train.dvc \
-d src/train.py -d data/features \
-o model.pkl \
python src/train.py data/features model.pkl
git add .gitignore train.dvc
git commit -m "add train stage"
git tag -a "7-train" -m "train stage added"
git commit -m "Create training stage"
git tag -a "7-train" -m "Training stage created."
dvc push

dvc run -f evaluate.dvc \
-d src/evaluate.py -d model.pkl -d data/features \
-M auc.metric \
python src/evaluate.py model.pkl data/features auc.metric
git add .gitignore evaluate.dvc auc.metric
git commit -m "add evaluation stage"
git tag -a "baseline-experiment" -m "baseline experiment"
git tag -a "8-evaluation" -m "evaluation stage added"
git commit -m "Create evaluation stage"
git tag -a "baseline-experiment" -m "Baseline experiment"
git tag -a "8-evaluation" -m "Baseline evaluation stage created."
dvc push

sed -e s/max_features=5000\)/max_features=6000\,\ ngram_range=\(1\,\ 2\)\)/ -i "" \
src/featurization.py

dvc repro evaluate.dvc
git commit -a -m "try using bigrams"
git tag -a "bigrams-experiment" -m "bigrams experiment"
git tag -a "9-bigrams" -m "bigrams version added"
git commit -a -m "Reproduce evaluation stage using bigrams"
git tag -a "bigrams-experiment" -m "Bigrams experiment"
git tag -a "9-bigrams" -m "Bigrams evaluation stage created."
dvc push

popd

echo "`cat <<EOF-
Install 'hub' and run:
hub create iterative/example-get-started -d "Get started DVC project" \
-h "https://dvc.org/doc/get-started"
if you'd like to create the repository from scratch.
The Git repo generated by this script is intended to be published on
https://github.com/iterative/example-get-started. Make sure the GitHub repo
exists first.
Make sure to delete the existing one on GitHub, save the tags and put them back
via the UI when you are done.
To create it with https://hub.github.com/ for example, run:
Run these commands manually in the generated get-started repo to rewrite the
existing repo:
hub create iterative/example-get-started -d "Get Started DVC project" \
-h "https://dvc.org/doc/get-started"
If the GitHub repo already exists, run these commands to rewrite it:
cd build/example-get-started
git remote add origin git@github.com:iterative/example-get-started.git
git push --force origin master
git push --force origin --tags
You may remove the generated repo with:
rm -fR build
`"
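
One portability note on the `sed ... -i ""` call in `generate.sh`: the separate empty argument after `-i` is the BSD/macOS form. GNU sed takes `-i` without an argument and treats the `""` as an input file name, so that line is likely to fail on Linux. A form that avoids `-i` altogether is sketched below. `apply_bigrams` is a hypothetical helper name, and the sample line is only a stand-in for the real content of `src/featurization.py`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in generate.sh): apply the same substitution
# without sed -i, by writing to a temporary file and moving it into place.
apply_bigrams() {
  local file="$1"
  sed -e 's/max_features=5000)/max_features=6000, ngram_range=(1, 2))/' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demo on a stand-in file; the line below is illustrative only.
work_dir="$(mktemp -d)"
printf 'vectorizer = make_vectorizer(max_features=5000)\n' > "$work_dir/featurization.py"
apply_bigrams "$work_dir/featurization.py"
cat "$work_dir/featurization.py"
# prints: vectorizer = make_vectorizer(max_features=6000, ngram_range=(1, 2))
rm -rf "$work_dir"
```

Quoting the sed expression in single quotes also removes the need for the backslash escapes used in the script.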
