Project features

This project collects personal Data Science tips and tricks.
See the Quickstart section to try them out yourself.

Features:

develop and test code in a production environment (code server in a container)
high code reproducibility with deterministic environments (container & poetry)
code review automation (pre-commit, black & isort, flake8, pydocstyle and mypy)
personal Data Science paradigms
- data schema validity
- test setup
- interactive development based on jupyter with vs code's python extension

1. Quickstart aka. developing in a container

Tested on Linux and on WSL2 on Windows.

Prerequisite

Install docker & docker-compose
Clone repository

Start container

Build docker image (ide service)

DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1 docker-compose build ide

Start ide service

DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1 docker-compose up ide

Access code server in your browser at localhost:8123.
You are now working with vs code inside the container.
Tip: use a Chrome-based browser and install the code-server app from that browser.
This brings the experience close to running native vs code (e.g. nearly all keyboard shortcuts work).
Optional: rename the existing .vscode/settings.json.example to .vscode/settings.json in the IDE to quickly apply the suggested vs code configuration.

Built-in functionality

Use your host's ssh credentials in the container.
Add your key to your host's ssh-agent before starting the container:
```
ssh-add
```
Provide your host's git credentials to the container by configuring git on the host:
```
git config --global user.name "Your Name"
git config --global user.email "youremail@yourdomain.com"
```
Note: If you have no git credentials on your host system, this can lead to ugly mounting behavior. To disable this feature, remove the .gitconfig mount from the docker-compose.yaml file.

Developing in a container: benefits and downsides

Benefits:

develop either on your local system or on any remote machine you want (remote development only requires forwarding a port to your local system, e.g. via ssh).
no "does not run on my machine" problems.
deployed code (container) behaves the same way as during development, because both share one deterministic environment.
combines common ide features (like debugging and testing) with the often essential interactive jupyter development.
fully open source solution, unlike vs code's closed source remote extensions.

Downsides:

customization is harder (but possible) when working with multiple people.
code server lags a little (1-2 months) behind the latest vs code releases, because each release first has to be integrated by the code server project.

2. High code reproducibility with deterministic environments

High code reproducibility is very important when:

checking out and testing new features (branches) in development
deploying code from development into a production environment
debugging production code in a safe development environment

Deterministic environments are mainly achieved with two technologies used in this project:

docker for deterministic virtualized images and containers
poetry as a deterministic python package manager

Both technologies can be swapped for alternatives, such as podman for docker and pipenv for poetry.

3. Code review automation

The aim is to let you focus on your main project goals and achieve them.
One way of freeing up time in code reviews is to automate and enforce code standards.

First you need to decide which code standards you want to apply in your project. This project contains the following tools:

black & isort for automated code formatting
flake8 & mypy for linting and preventing common errors
pydocstyle for documentation

Afterwards you can automate and thus enforce them in your project.
With pre-commit you can run your tool suite before every git commit, making it easy for everyone to follow the standards.
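
To make the standards more concrete, here is a small sketch of a function written in the style this tool suite pushes towards: sorted imports (isort), consistent formatting (black), type hints (mypy) and docstrings (pydocstyle). The module and the function `no_show_rate` are hypothetical and not taken from the project; whether every check passes also depends on the configuration in your repository.

```python
"""Hypothetical module illustrating the enforced code standards."""

from statistics import mean
from typing import Sequence


def no_show_rate(flags: Sequence[bool]) -> float:
    """Return the share of appointments flagged as no-show.

    An empty sequence yields 0.0 instead of raising an error.
    """
    if not flags:
        return 0.0
    # Map each flag to 1.0 / 0.0 so the mean is the no-show share.
    return mean(1.0 if flag else 0.0 for flag in flags)
```

Without the type hints, mypy could not flag a call such as `no_show_rate("yes")`, and without the docstrings pydocstyle would report the module and the function as undocumented.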

4. Personal Data Science paradigms

Data Schema Validity

The project shows how to use pandera to define data schemas with automatic schema checks.
Since the schemas are configured to be strict, we know the data schema at every major step in the data pipeline.
This makes debugging and testing code straightforward, and easier for everyone not familiar with the project.

Test Setup

Separating tests into the following categories has worked well in multiple projects:

unit: deterministic function tests
integration: especially checking schema validity incl. external sources
end2end: quick run tests for more complex data pipelines (e.g. model training and scoring)
infrastructure: end2end test including tests for infrastructure connectivity (e.g. querying hosted API)
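
The difference between the first two categories can be sketched as follows. The functions `add_age_group` and `load_records` and the expected columns are hypothetical stand-ins; the tests are plain functions with `assert` statements, in the style that pytest collects, so the sketch needs no extra dependencies.

```python
from typing import Dict, List

Record = Dict[str, object]
EXPECTED_COLUMNS = {"patient_id", "age", "no_show"}  # hypothetical schema


def add_age_group(record: Record) -> Record:
    """Deterministic transformation under test (hypothetical)."""
    age = record["age"]
    group = "senior" if isinstance(age, int) and age >= 65 else "adult"
    return {**record, "age_group": group}


def load_records() -> List[Record]:
    """Stand-in for an external source; a real test would query it."""
    return [{"patient_id": 1, "age": 70, "no_show": False}]


# Unit test: same input always gives the same output, no I/O involved.
def test_add_age_group_is_deterministic() -> None:
    record: Record = {"patient_id": 1, "age": 70, "no_show": False}
    assert add_age_group(record) == add_age_group(record)
    assert add_age_group(record)["age_group"] == "senior"


# Integration test: loaded data still matches the expected schema.
def test_loaded_records_match_schema() -> None:
    for record in load_records():
        assert set(record) == EXPECTED_COLUMNS
```

Keeping the categories apart lets you run the fast unit tests on every change and reserve the slower tests, which depend on external sources or infrastructure, for less frequent runs.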

VS Code extensions

tbd

