This project contains many personal Data Science tips and tricks.
See the quickstart section to try them out yourself.
Features:
- develop and test code in the production environment (code server in a container)
- high code reproducibility with deterministic environments (container & poetry)
- code review automation (pre-commit, black & isort, flake8, pydocstyle and mypy)
- personal Data Science paradigms
- data schema validity
- test setup
- interactive development based on jupyter via vs code's python extension
Quickstart:
Tested on linux and on windows via wsl2.
- Install docker & docker-compose.
- Clone the repository.
- Build the docker image (`ide` service):

  ```
  DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1 docker-compose build ide
  ```

- Start the `ide` service:

  ```
  DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1 docker-compose up ide
  ```

- Access code server via browser: `localhost:8123`. You are now using vs code inside the container.
  Tip: use a chrome-based browser and install the `code-server` app from that browser. Your vs code experience then comes close to running native vs code (e.g. nearly 100% working keyboard shortcuts).
- Optional: rename the existing `.vscode/settings.json.example` to `.vscode/settings.json` in the IDE to quickly apply the suggested vs code configuration.
- Use your host's ssh credentials in the container by adding your key to your host's ssh-agent before starting the container:

  ```
  ssh-add
  ```

- Provide your host's git credentials to the container:

  ```
  git config --global user.name "Your Name"
  git config --global user.email "youremail@yourdomain.com"
  ```

  Note: if there are no git credentials on your host system, this can lead to ugly mounting behavior. Remove the `.gitconfig` mount in the `docker-compose.yaml` file to disable this feature.
Benefits:
- develop either on your local system or on any remote machine you want (remote development only requires port forwarding to your local system, e.g. via `ssh`)
- no "does not run on my machine" problems
- deployed code (container) behaves the same way as it does during development (shared deterministic environment)
- combines common ide features (like debugging and testing functionality) with - often essential - interactive `jupyter` development (a short sketch follows this list)
- fully open source solution, not closed source like vs code's remote extensions
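To illustrate the jupyter point: vs code's python extension treats `# %%` comments in plain `.py` files as jupyter-style cells, so interactive development stays inside the IDE. A minimal sketch (file name and data are made up):

```python
# example.py -- hypothetical file; run the cells via vs code's python extension.
# %%
# First cell: build some toy data.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.0, 6.0]})

# %%
# Second cell: inspect it interactively, just like a jupyter notebook cell.
df.describe()
```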
Downsides:
- harder (but possible) customization when working with multiple people.
- lagging a little (1-2 months) behind the latest vs code releases, because the code server project has to integrate each release
High code reproducibility is very important when:
- checking out and testing new feature branches in development
- deploying code from development into a productive environment
- debugging productive code in a safe development environment
Deterministic environments are mainly achieved by two technologies used in this project:
- docker for deterministic virtualized images and containers
- poetry as a deterministic python package manager
Both technologies can be swapped for drop-in replacements, such as podman for docker and pipenv for poetry.
Letting you focus on your main project goals and achieve them is our main goal.
One way to free up time in code reviews is to automate and enforce code standards.
First you need to decide what code standards you want to apply in your project. This project contains the following tools:
- black & isort for automated code formatting
- flake8 & mypy for linting and preventing common errors
- pydocstyle for documentation
Afterwards you can automate them and thus enforce them in your project.
With pre-commit you can run your tool suite before every git commit, making it easy for everyone to follow the standards.
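As a small illustration (not taken from this repo) of what the automated formatting enforces, here is a snippet before and after `black` and `isort` run via pre-commit:

```python
# Before (as a contributor might write it):
#   import sys, os
#   import pandas as pd
#   def load(path):return pd.read_csv(path,sep=",")

# After `isort` (imports split and sorted) and `black` (layout normalized):
import os
import sys

import pandas as pd


def load(path):
    # pydocstyle would additionally ask for a docstring here.
    return pd.read_csv(path, sep=",")
```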
The project shows how to use pandera for defining data schemas with automatic schema checks.
Since the schemas are configured to be strict, the data schema is known at every major step of the data pipeline.
This makes debugging and testing straightforward, and easier for anyone unfamiliar with the project.
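A minimal sketch of such a strict schema with pandera (column names and checks are made up for illustration):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema; strict=True rejects any column not declared below.
schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(int, pa.Check.ge(0)),
        "revenue": pa.Column(float, pa.Check.ge(0.0)),
    },
    strict=True,
)

df = pd.DataFrame({"customer_id": [1, 2], "revenue": [9.99, 0.0]})
validated = schema.validate(df)  # raises a SchemaError on any violation
```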
Separating tests into the following aspects has worked in multiple projects (a minimal unit-test sketch follows the list):
- unit: deterministic function tests
- integration: especially checking schema validity incl. external sources
- end2end: quick run tests for more complex data pipelines (e.g. model training and scoring)
- infrastructure: end2end test including tests for infrastructure connectivity (e.g. querying hosted API)
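For the unit aspect, a sketch of a deterministic function test (the function and file layout are illustrative, not part of this project):

```python
# tests/unit/test_example.py -- hypothetical layout.
import pandas as pd


def revenue_per_item(df: pd.DataFrame) -> pd.Series:
    """Toy transformation standing in for real pipeline code."""
    return (df["revenue"] / df["items"]).rename("revenue_per_item")


def test_revenue_per_item_is_deterministic():
    df = pd.DataFrame({"revenue": [10.0, 9.0], "items": [2, 3]})
    expected = pd.Series([5.0, 3.0], name="revenue_per_item")
    pd.testing.assert_series_equal(revenue_per_item(df), expected)
```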
tbd