Data Version Control Tutorial

Example repository for the Data Version Control With Python and DVC article on Real Python.

To use this repo as part of the tutorial, you first need to get your own copy. Click the Fork button in the top-right corner of the screen, and select your private account in the window that pops up. GitHub will create a forked copy of the repository under your account.

Clone the forked repository to your computer with the git clone command:

git clone git@github.com:YourUsername/data-version-control.git

Make sure to replace YourUsername in the above command with your actual GitHub username.

Happy coding!

Custom features for this fork

This is a fork of a tutorial example repo. I'm adding some new features, such as a script to classify images using the generated model. For more details, read the original tutorial.

Tutorial: https://realpython.com/python-data-version-control/
Original repo: https://github.com/realpython/data-version-control

Install

git clone git@github.com:josecelano/data-version-control.git
cd data-version-control
conda create --name dvc python=3.8.2 -y
conda activate dvc
conda config --add channels conda-forge
conda install dvc scikit-learn scikit-image pandas numpy

Alternatively, you can create the conda environment from the environment.yml file:

conda env create --file environment.yml
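
To confirm that the environment has what the scripts need, a small check like the one below can help. It is only a sketch: the list of import names is our assumption, derived from the packages in the install command (scikit-learn and scikit-image are imported as sklearn and skimage).

```python
from importlib.util import find_spec

# Import names for the packages installed above (assumed list);
# scikit-learn and scikit-image are imported as "sklearn" and "skimage".
REQUIRED_MODULES = ["dvc", "sklearn", "skimage", "pandas", "numpy"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Run it with the dvc environment activated; an empty result means every listed module can be imported.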

Run

conda activate dvc

Generate the CSV files:

python3 src/prepare.py
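
This is not a copy of prepare.py. As a rough sketch of what building such a CSV index can look like, the code below assumes that every image sits in a subfolder named after its label (as in data/raw/train/n03888257/) and that the CSV has filename and label columns. The function names and the column names are hypothetical.

```python
import csv
from pathlib import Path


def list_labeled_images(split_dir):
    """Return sorted (path, label) pairs, one per JPEG under split_dir.

    Assumption: each image sits in a subfolder whose name is its label.
    """
    split_dir = Path(split_dir)
    return [
        (path.as_posix(), path.parent.name)
        for path in sorted(split_dir.glob("*/*.JPEG"))
    ]


def write_index(rows, csv_path):
    """Write the (path, label) pairs to csv_path with a header row."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])  # hypothetical column names
        writer.writerows(rows)
```

The label written here is the raw folder name; mapping it to a readable class name is left out of the sketch.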

Resize images to 100x100 and convert them to PNG format:

python3 src/prepare_images.py
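
The resize and conversion step can be sketched with scikit-image as follows. This is an illustration under assumptions, not the contents of prepare_images.py: the helper names and the data/resized output folder used in the usage note are hypothetical.

```python
from pathlib import Path

TARGET_SHAPE = (100, 100)  # rows, cols, as stated above


def png_destination(src, raw_root, out_root):
    """Mirror src (located under raw_root) below out_root with a .png suffix."""
    relative = Path(src).relative_to(raw_root)
    return Path(out_root) / relative.with_suffix(".png")


def resize_to_png(src, dst, shape=TARGET_SHAPE):
    """Read src, resize it to shape, and save it as an 8-bit PNG at dst."""
    # Imported here so png_destination works without scikit-image installed.
    from skimage import io, transform, util

    image = io.imread(str(src))
    # resize returns floats in [0, 1]; a channel axis, if present, is preserved.
    resized = transform.resize(image, shape, anti_aliasing=True)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    io.imsave(str(dst), util.img_as_ubyte(resized))
```

For example, png_destination("data/raw/train/n03888257/n03888257_24024.JPEG", "data/raw", "data/resized") yields data/resized/train/n03888257/n03888257_24024.png, which can then be passed to resize_to_png as dst.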

The model can be trained on either the raw images or the pre-processed (resized) images. For the time being, the choice is hardcoded in the train.py file. The default is to train on the resized images.

Train the model:

python3 src/train.py

Evaluate the model with the test set:

python3 src/evaluate.py && cat metrics/accuracy.json
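
If you would rather read the metric from Python than with cat, a minimal sketch follows. It assumes that metrics/accuracy.json holds a JSON object with the value stored under an "accuracy" key; adjust the key if the file is laid out differently.

```python
import json
from pathlib import Path


def read_accuracy(path="metrics/accuracy.json", key="accuracy"):
    """Return the accuracy stored in the metrics file as a float.

    Assumption: the file holds a JSON object with the value under `key`.
    """
    data = json.loads(Path(path).read_text())
    return float(data[key])
```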

Use the model to classify an image:

python3 src/predict.py

Sample output for the predict.py script:

(dvc) josecelano@josecelano:~/Documents/github/josecelano/data-version-control$ python src/predict.py -i /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG
Predicting for image: " /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG "
['parachute']

You can use the following as a golden test after you change something:

python src/train.py && python src/evaluate.py && cat metrics/accuracy.json

Accuracy should be almost the same between runs. It will not be identical, because the training process is not deterministic.
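
Because the result varies slightly from run to run, a golden test should compare accuracies with a tolerance rather than for equality. A minimal sketch, where the function name and the default tolerance of 0.02 are arbitrary assumptions to tune for your own runs:

```python
import math


def accuracy_matches(baseline, current, tolerance=0.02):
    """Return True when current is within tolerance of baseline (absolute)."""
    return math.isclose(baseline, current, rel_tol=0.0, abs_tol=tolerance)
```

With made-up numbers, accuracy_matches(0.80, 0.81) passes, while accuracy_matches(0.80, 0.70) does not.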

Run workflow locally

We are using act to run GitHub Actions locally.

act usage:

act -h

Run workflow locally:

act -j build --secret-file .env
act -j show_changed_images --secret-file .env
...

With the -j option you can run a single job.

Don't forget to add your Azure Blob Storage credentials so that images can be pulled from the remote DVC storage. Otherwise, you will get this error:

| ERROR: failed to pull data from the cloud - Authentication to Azure Blob Storage requires either account_name or connection_string.
| Learn more about configuration settings at <https://man.dvc.org/remote/modify>
[Build the model/build]   ❌  Failure - Pull dataset from remote

Run workflow on GitHub

You need to add the secrets in the .env.ci file:

AZURE_STORAGE_ACCOUNT='YOUR_STORAGE_ACCOUNT_NAME'
AZURE_STORAGE_KEY='YOUR_STORAGE_KEY'
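
Before running a workflow, it can be worth checking that a dotenv-style file really defines both names, since a missing value leads to the authentication error shown earlier. The sketch below is a simplified parser written for this check only (it is not how act or DVC read the file), and the function names are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_secrets(env_path, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the file."""
    values = parse_env(Path(env_path).read_text())
    return [key for key in required if not values.get(key)]
```

For example, missing_secrets(".env") returns an empty list when both values are set.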
