MLentory Deployment

This folder contains all the necessary configuration files and scripts to deploy the MLentory system using Docker containers.

Structure

deployment/
├── docker-compose.yml                    # Main container orchestration file
├── hf_etl/                               # HuggingFace ETL service
│ ├── Dockerfile.gpu
│ ├── Dockerfile.no_gpu
│ └── run_extract_transform_load.py
├── scheduler/                            # Airflow scheduler configuration
│ ├── dags/
│ ├── logs/
│ ├── plugins/
│ ├── scripts/
│ └── requirements.txt
└── requirements.txt

Prerequisites

Docker and Docker Compose installed
NVIDIA Container Toolkit (for GPU support)
At least 8GB of RAM
(Optional) NVIDIA GPU with CUDA support

For details on configuring your machine to run the MLentory system, see the Installing prerequisites section.

Quick Start

Create the required Docker network:

docker network create mlentory_network
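Running `docker network create` a second time fails because the network already exists. A minimal sketch of an idempotent variant, assuming only the network name used in this README; `ensure_network` is a hypothetical helper name, not something shipped with MLentory:

```shell
# Create the Docker network only if it is missing.
# ensure_network is a hypothetical helper, not part of the MLentory repo.
ensure_network() {
  network_name=${1:-mlentory_network}
  if docker network inspect "$network_name" >/dev/null 2>&1; then
    echo "network $network_name already exists"
  else
    docker network create "$network_name" >/dev/null &&
      echo "network $network_name created"
  fi
}

# Usage: ensure_network            # defaults to mlentory_network
```

This makes the step safe to repeat in setup scripts.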

Choose your deployment profile:

Run the following commands from the deployment folder.

For GPU-enabled deployment:

docker-compose --profile gpu up -d

For Docker Compose version 2.0 or higher, run:

docker compose --profile gpu up -d

For CPU-only deployment:

docker-compose --profile no_gpu up -d

For Docker Compose version 2.0 or higher, run:

docker compose --profile no_gpu up -d
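The choice between the two profiles can be scripted. The sketch below assumes that a working `nvidia-smi` on the host is a good enough signal for the `gpu` profile; it does not check that the NVIDIA Container Toolkit is configured. `pick_profile` and the `GPU_PROBE` variable are hypothetical names introduced for illustration:

```shell
# Print the Compose profile that matches the host: "gpu" when the GPU probe
# succeeds, "no_gpu" otherwise. GPU_PROBE is a hypothetical override that
# defaults to nvidia-smi; it exists only to make the sketch easy to test.
pick_profile() {
  if "${GPU_PROBE:-nvidia-smi}" >/dev/null 2>&1; then
    echo gpu
  else
    echo no_gpu
  fi
}

# Usage (from the deployment folder):
#   docker compose --profile "$(pick_profile)" up -d
```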

Running ETL Jobs

The ETL process can be triggered through Airflow or manually using the provided Python script:

docker exec hf_gpu python3 /app/hf_etl/run_extract_transform_load.py [options]

Available options:

--save-extraction: Save extraction results
--save-transformation: Save transformation results
--save-load-data: Save load data
--from-date YYYY-MM-DD: Download models from a specific date
--num-models N: Number of models to process
--output-dir DIR: Output directory for results
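The options can be combined in one call. The wrapper below is a sketch, not part of the repository: `run_etl` is a hypothetical name, and the date, model count, and output path in the usage comment are placeholder values. Because the script runs inside the container, the output directory is resolved inside the container as well.

```shell
# Forward any ETL options to the script inside a running ETL container.
# run_etl is a hypothetical wrapper; the container name is passed explicitly.
run_etl() {
  container=$1
  shift
  docker exec "$container" python3 /app/hf_etl/run_extract_transform_load.py "$@"
}

# Usage: process 10 models from an example date and keep the extraction
# output (date, count, and output path are placeholder values).
#   run_etl hf_gpu --save-extraction --from-date 2024-01-01 --num-models 10 --output-dir /tmp/etl_output
```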

Services

The system consists of several containerized services:

Airflow Components:
- Scheduler (Port 8794)
- Webserver (Port 8080)
- PostgreSQL Database (Port 5442)
ETL Service (either GPU or no-GPU):
- HuggingFace model extraction
- Data transformation
- Data loading
Storage Services:
- PostgreSQL (Port 5432)
- Virtuoso RDF Store (Ports 1111, 8890)
- Elasticsearch (Ports 9200, 9300)

Accessing Services

Airflow UI: http://localhost:8080 (default credentials: admin/admin)
Virtuoso SPARQL endpoint: http://localhost:8890/sparql
Elasticsearch: http://localhost:9200
PostgreSQL: localhost:5432
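To confirm that the HTTP endpoints above respond, a small probe along these lines may help. It is a sketch that assumes `curl` is installed and that the services answer without authentication; `check_http` is a hypothetical helper name. PostgreSQL is not an HTTP service, so it is left out.

```shell
# Report whether an HTTP endpoint answers without an error status.
# check_http is a hypothetical helper built on curl.
check_http() {
  name=$1
  url=$2
  if curl -fs -o /dev/null --max-time 5 "$url"; then
    echo "$name: up"
  else
    echo "$name: unreachable"
  fi
}

# Usage:
#   check_http "Airflow UI" http://localhost:8080
#   check_http "Virtuoso SPARQL" http://localhost:8890/sparql
#   check_http "Elasticsearch" http://localhost:9200
```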

Installing prerequisites

If you are on a machine with a Unix-based operating system, you only need to install Docker and Docker Compose.

If you are on Windows, we recommend installing the Windows Subsystem for Linux (WSL 2) with Ubuntu 20.04. The idea is to have a Linux machine inside Windows so that everything runs smoothly. The Windows service for Docker can become troublesome, particularly when working with machine learning libraries.

Setting up Docker on Linux

The commands below use apt and target Ubuntu 20.04 (focal). Other distributions such as Debian or CentOS need the matching Docker repository and package manager.

Update your existing list of packages:

sudo apt update

Install a few prerequisite packages which let apt use packages over HTTPS:

sudo apt install apt-transport-https ca-certificates curl software-properties-common

Add the GPG key for the official Docker repository:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Add the Docker repository to APT sources:

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

Update the package database with the Docker packages:

sudo apt update

Install Docker:

sudo apt install docker-ce

Verify the installation:

sudo docker run hello-world
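The repository line in the steps above hard-codes the `focal` codename. On a different Ubuntu release, the codename can be read from the system instead. This is a sketch assuming `lsb_release` is installed; `docker_repo_line` is a hypothetical helper name:

```shell
# Build the APT source line for the Ubuntu release that is actually installed.
# docker_repo_line is a hypothetical helper; lsb_release -cs prints the
# release codename (for example "focal" on Ubuntu 20.04).
docker_repo_line() {
  codename=$(lsb_release -cs)
  echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu $codename stable"
}

# Usage:
#   sudo add-apt-repository "$(docker_repo_line)"
```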

Manage Docker as Non-root User

If you don't want to write sudo before every command, do the following:

Create the docker group if it does not exist:

sudo groupadd docker

Add your user to the docker group:

sudo usermod -aG docker ${USER}

Log out and log back in for changes to take effect.
Verify you can run Docker commands without sudo:

docker run hello-world
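If the command still asks for sudo, the current session may not have picked up the new group yet. A quick check, sketched with a hypothetical helper name (`in_docker_group`):

```shell
# Succeed when the current login session already carries the docker group.
# in_docker_group is a hypothetical helper; id -nG without a user name lists
# the groups of the running session, which only change after a new login.
in_docker_group() {
  id -nG | tr ' ' '\n' | grep -qx docker
}

# Usage:
#   in_docker_group && echo "docker group active" || echo "log out and back in"
```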

Install Docker Compose

Run this command to download Docker Compose 1.29.2:

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

Apply executable permissions to the binary:

sudo chmod +x /usr/local/bin/docker-compose

Verify the installation:

docker-compose --version
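Depending on how Compose was installed, it is invoked either as `docker-compose` (standalone v1 binary, as installed above) or as `docker compose` (v2 plugin). The following sketch prints whichever form is available; `compose_cmd` is a hypothetical helper name:

```shell
# Print the Compose invocation available on this machine: the v2 plugin
# ("docker compose") if present, otherwise the standalone binary.
# compose_cmd is a hypothetical helper.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "no Docker Compose found" >&2
    return 1
  fi
}

# Usage (from the deployment folder):
#   $(compose_cmd) --profile no_gpu up -d
```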

Setup NVIDIA GPUs

A GPU is not required to run the backend, but it makes the pipeline run faster.
To set up NVIDIA GPUs in WSL, you can follow the guide at https://docs.nvidia.com/cuda/wsl-user-guide/index.html.
In general, you need the GPU drivers, the NVIDIA Container Toolkit, and the CUDA Toolkit installed.
If you are using Windows with WSL, install the GPU drivers in Windows; otherwise, install the drivers in your host OS.
- In Windows you can check the NVIDIA GPU drivers at: https://www.nvidia.com/Download/index.aspx
- In Ubuntu you can check how to download the drivers at: https://ubuntu.com/server/docs/nvidia-drivers-installation
- Remember to restart your system after installation.

If you don't have the CUDA Toolkit installed to use your GPU for ML development, you can follow the instructions here: https://developer.nvidia.com/cuda-downloads
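Once the drivers and the NVIDIA Container Toolkit are in place, it is worth checking that a container can actually see the GPU. The sketch below assumes Docker's `--gpus` flag is available and uses the stock `ubuntu` image; `gpu_visible_in_docker` is a hypothetical helper name:

```shell
# Run nvidia-smi inside a throwaway container to confirm that Docker can
# reach the GPU. gpu_visible_in_docker is a hypothetical helper; the check
# assumes the NVIDIA Container Toolkit exposes nvidia-smi to the container.
gpu_visible_in_docker() {
  if docker run --rm --gpus all ubuntu nvidia-smi >/dev/null 2>&1; then
    echo "GPU is visible inside containers"
  else
    echo "GPU is NOT visible inside containers"
    return 1
  fi
}
```

If the check fails while `nvidia-smi` works on the host, the Container Toolkit configuration is the likely place to look.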

Update the default Docker DNS server

If you are using WSL or a Linux distribution as your OS, configure the following so that the private container network can resolve external hostnames and reach the internet.

Install dnsmasq and resolvconf.

sudo apt update
sudo apt install dnsmasq resolvconf

Find your docker IP (in this case, 172.17.0.1):

root@host:~# ifconfig | grep -A2 docker0
docker0   Link encap:Ethernet  HWaddr 02:42:bb:b4:4a:50
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
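On systems where `ifconfig` is not installed, the same address can be read with `ip` from iproute2. A minimal sketch, with `docker0_ip` as a hypothetical helper name:

```shell
# Print the IPv4 address of the docker0 bridge using iproute2.
# docker0_ip is a hypothetical helper. With -o, ip prints one line per
# address and the fourth field holds "address/prefix".
docker0_ip() {
  ip -4 -o addr show docker0 | awk '{print $4}' | cut -d/ -f1
}

# Usage:
#   echo "listen-address=$(docker0_ip)"
```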

Edit /etc/dnsmasq.conf and add these lines:

sudo nano /etc/dnsmasq.conf

interface=docker0
bind-interfaces
listen-address=172.17.0.1

Create or edit /etc/resolvconf/resolv.conf.d/tail (you can use vim or nano) and add the line below. The example uses the public resolver 8.8.8.8; change it to the IP of your default network interface (eth0) if that is the address you want to use:

nameserver 8.8.8.8

Re-read the configuration files and regenerate /etc/resolv.conf.

sudo resolvconf -u

Restart your OS. If you are using WSL, run the following in your Windows terminal:

wsl.exe --shutdown
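After the restart, name resolution can be checked from inside a container. The sketch below assumes the `busybox` image can be pulled; `container_dns_ok` is a hypothetical helper name, and the default hostname and network are only examples:

```shell
# Resolve a hostname from inside a throwaway container to confirm that
# container DNS works. container_dns_ok is a hypothetical helper.
#   $1: hostname to resolve (example default: huggingface.co)
#   $2: Docker network to attach to (default: bridge)
container_dns_ok() {
  host=${1:-huggingface.co}
  network=${2:-bridge}
  if docker run --rm --network "$network" busybox nslookup "$host" >/dev/null 2>&1; then
    echo "containers on $network can resolve $host"
  else
    echo "containers on $network cannot resolve $host"
    return 1
  fi
}

# Usage:
#   container_dns_ok huggingface.co mlentory_network
```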

Troubleshooting

If services fail to start, check:
- Docker daemon is running
- Required ports are available
- Sufficient system resources
- Network mlentory_network exists
For GPU-enabled deployment:
- Verify NVIDIA drivers are installed
- Check NVIDIA Container Toolkit is properly configured
- Run nvidia-smi to confirm GPU access
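The port check from the list above can be scripted. This is a sketch that assumes the `ss` utility is available and that the ports listed in the Services section are the ones published on the host; `port_in_use` and `busy_ports` are hypothetical helper names:

```shell
# Succeed when something is already listening on the given TCP port.
# port_in_use is a hypothetical helper that parses `ss -ltn` output, where
# the fourth column is the local address and port.
port_in_use() {
  ss -ltn | awk 'NR > 1 {print $4}' | grep -q ":$1\$"
}

# Report every port from the Services section that is already taken.
busy_ports() {
  for port in 8080 8794 5442 5432 1111 8890 9200 9300; do
    if port_in_use "$port"; then
      echo "port $port is already in use"
    fi
  done
}
```

Running `busy_ports` before starting the stack prints nothing when all ports are free.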
