Reduce ray job cold time #825

Open · wants to merge 27 commits into main from agpituk/694-reduce-ray-job-cold-time
Changes from 19 commits (27 commits total)

Commits
0ea7a00
Initial push to reduce loading time for bart
agpituk Feb 7, 2025
2100cfc
Improved model download
agpituk Feb 7, 2025
cac5994
merged main
agpituk Feb 7, 2025
cf4aad0
merged main
agpituk Feb 10, 2025
bc691ad
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 11, 2025
c52b25c
Multiplatform build for cache image - added readme info
agpituk Feb 13, 2025
047193b
Merged main
agpituk Feb 13, 2025
0373e5c
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 13, 2025
edb2899
Increase lumigator runner size
agpituk Feb 14, 2025
dad7db7
Fixed missing s
agpituk Feb 14, 2025
dc77037
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 14, 2025
640f7df
Merged Main
agpituk Feb 14, 2025
93fbe69
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 14, 2025
c201dcb
Adding removed label to the redis volume
agpituk Feb 14, 2025
dddf260
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 14, 2025
8ed4552
Removing docker-compose volume for ray to check if it's introducing t…
agpituk Feb 17, 2025
62ba702
Deleting volume in Dockerfile + adding cache volume to Ray
macaab26 Feb 17, 2025
eff24b9
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 17, 2025
018b808
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 17, 2025
c74bb1e
precommit fixes
agpituk Feb 19, 2025
5601431
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 19, 2025
24a37f2
Moved new vars into the new build system, out of docker-compose
agpituk Feb 19, 2025
a1794d9
Reduce unnecesary comments
agpituk Feb 19, 2025
e5de144
Merged main
agpituk Feb 19, 2025
535d4eb
Fix some comments
agpituk Feb 19, 2025
4bf78ff
Merge branch 'main' into agpituk/694-reduce-ray-job-cold-time
agpituk Feb 21, 2025
3b1c63c
Fixes to enable model preloading (#973)
aittalam Feb 21, 2025
6 changes: 3 additions & 3 deletions .github/workflows/lumigator_pipeline.yaml
@@ -58,7 +58,7 @@ jobs:

integration-tests:
name: Integration tests (SQLite)
- runs-on: ubuntu-latest
+ runs-on: lumigator-integration-tests-runner
needs: lint
if: ${{ needs.lint.result == 'success' }}
strategy:
@@ -105,7 +105,7 @@ jobs:

integration-tests-postgres:
name: Integration tests (PostgreSQL)
- runs-on: ubuntu-latest
+ runs-on: lumigator-integration-tests-runner
needs: lint
if: ${{ needs.lint.result == 'success' }}
strategy:
@@ -153,7 +153,7 @@ jobs:

notebook-integration-test:
name: Notebook integration tests
- runs-on: ubuntu-latest
+ runs-on: lumigator-integration-tests-runner
needs: lint
if: ${{ needs.lint.result == 'success' }}
steps:
10 changes: 5 additions & 5 deletions Makefile
@@ -111,7 +111,7 @@ endef
# Launches Lumigator in 'development' mode (all services running locally, code mounted in)
local-up: config-generate-env
uv run pre-commit install
- RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(DEV_DOCKER_COMPOSE_FILE) up --watch --build
+ RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) ARCH=${ARCH} COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(DEV_DOCKER_COMPOSE_FILE) up --watch --build

local-down: config-generate-env
docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f ${DEV_DOCKER_COMPOSE_FILE} down
@@ -130,18 +130,18 @@ start-lumigator: config-generate-env

# Launches lumigator with no code mounted in, and forces build of containers (used in CI for integration tests)
start-lumigator-build: config-generate-env
- RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) up -d --build
+ RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) ARCH=${ARCH} COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) up -d --build

# Launches lumigator with no code mounted in, and forces build of containers (used in CI for integration tests)
start-lumigator-build-postgres: config-generate-env
- RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(POSTGRES_DOCKER_COMPOSE_FILE) up -d --build
+ RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) ARCH=${ARCH} COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(POSTGRES_DOCKER_COMPOSE_FILE) up -d --build

# Launches lumigator without local dependencies (ray, S3)
start-lumigator-external-services: config-generate-env
- docker compose --env-file "$(CONFIG_BUILD_DIR)/.env"$(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) up -d
+ ARCH=${ARCH} docker compose --env-file "$(CONFIG_BUILD_DIR)/.env"$(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) up -d

stop-lumigator: config-generate-env
- RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(POSTGRES_DOCKER_COMPOSE_FILE) down
+ ARCH=${ARCH} RAY_ARCH_SUFFIX=$(RAY_ARCH_SUFFIX) COMPUTE_TYPE=$(COMPUTE_TYPE) docker compose --env-file "$(CONFIG_BUILD_DIR)/.env" --profile local $(GPU_COMPOSE) -f $(LOCAL_DOCKERCOMPOSE_FILE) -f $(POSTGRES_DOCKER_COMPOSE_FILE) down
$(call remove_config_dir)

clean-docker-buildcache:
1 change: 1 addition & 0 deletions README.md
@@ -45,6 +45,7 @@ need to have the following prerequisites installed on your machine:
- On Linux, you need to follow the
[post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
- The system Python (version managers such as uv should be deactivated)
+ - At least 10 GB available on disk and allocated for Docker, since some small language models will be pre-downloaded
Member:

I think it'd be great if we added in the docs (1) what models are downloaded (right now it's bart alone, right? I'd suggest roberta-large too for the bertscore metric), (2) their exact size (bart+roberta are less than 3GB), and (3) how this can be disabled if e.g. someone has no intention of ever running bart. WDYT?

Contributor Author:

Bart has to run right now to generate GT, which is why I added it as a kind of mandatory model. As it is, we can't disable it (apart from manually removing the service from docker-compose, which is not very user-friendly, I'd say).
We could maybe add a variable listing the models you want to pre-download into Ray's cache. Would that work?

Member:

Yes, I think that'd be great! For instance, right now we are using roberta-large for bertscore evaluations, so the models are already two, and having a list we could point users to makes it easy for them to customise it. Thank you!


You can run and develop Lumigator locally using Docker Compose. This creates four container
services networked together to make up all the components of the Lumigator application:
19 changes: 19 additions & 0 deletions cache/Dockerfile.model-inference
@@ -0,0 +1,19 @@
# Dockerfile.huggingface-cache
FROM python:3.11-slim

# Install required packages: transformers and huggingface_hub
RUN pip install --no-cache-dir transformers huggingface_hub

# Ensure the cache directory exists (snapshot_download will create its own subfolders)
RUN mkdir -p /home/ray/.cache/huggingface/hub

# Use the huggingface_hub API to download the model exactly as the Hub does.
# This will create a folder with the proper structure (e.g. blobs, refs, snapshots).
RUN python -c "\
from huggingface_hub import snapshot_download; \
model_path = snapshot_download('facebook/bart-large-cnn', cache_dir='/home/ray/.cache/huggingface/hub'); \
print('Model downloaded to:', model_path)\
"

Member:

If we want to have more than one model, perhaps we could have something like

model_names = ["model/1", "model/2", ...]
for model_name in model_names:
    model_path = ...
    print(f"Model {model_name} downloaded to: {model_path}")

WDYT?
(also, I think model_path is relative to the container and might be misleading as the actual path is different)

Contributor Author:

Definitely happy with the addition to have more than 1 model (look at my comment above). Not sure I follow about the path.
In the docker-compose, in this line
- huggingface_cache_vol:/home/ray/.cache/huggingface
we use the same path inside Ray (I did a few tests around this to get it right)

Member:

My bad, sorry, I did not explain it properly!
What I meant is that we are printing "Model downloaded to " with that Python code, and that will be the directory inside the container... which makes no sense to the user, because it is not where they will look for the model if they need it (that is, the volume or the local path, not the container one).
As an example, let's say that we are storing this in the classical HF_HOME path. The user will see a message "Model blahblah downloaded to /home/ray/.cache/huggingface", but that is the folder in the container, not on their host.

# Exit immediately (this container’s only job is to populate the cache)
CMD ["/bin/true"]
23 changes: 19 additions & 4 deletions docker-compose.yaml
@@ -2,6 +2,16 @@ name: lumigator

services:

+  inference-model:
+    build:
+      context: .
+      dockerfile: cache/Dockerfile.model-inference
+    platform: linux/${ARCH}
+    command: /bin/true
+    volumes:
+      - huggingface_cache_vol:/home/ray/.cache/huggingface
+    profiles:
+      - local
minio:
labels:
ai.mozilla.product_name: lumigator
@@ -68,6 +78,8 @@ services:
depends_on:
redis:
condition: service_healthy
+      inference-model:
+        condition: service_completed_successfully
Member:

As this will take a while, what are we planning to do with the other services in the meantime? Options:

  • make all of them depend on ray (might not be needed, we can e.g. still upload datasets or directly check previous experiment results)
  • have something to prevent us from running ray-dependent workflows until ray is up (see the sketch below)
  • none of the above, but clearly communicate to the users that they'll have to wait a bit before running anything that requires ray (not ideal for beginners IMO)
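A minimal sketch of the second option, purely illustrative: poll the Ray dashboard over HTTP and only submit ray-dependent workflows once it answers. It assumes the dashboard is reachable from the caller at `http://localhost:8265` (in this compose file the published port is `${RAY_DASHBOARD_PORT}`), and treats any HTTP 200 as "Ray is up".

```python
# ray_readiness.py (hypothetical) -- block until the Ray dashboard responds.
import time
import urllib.error
import urllib.request

RAY_DASHBOARD_URL = "http://localhost:8265"  # assumption: ${RAY_DASHBOARD_PORT} published as 8265


def wait_for_ray(url: str = RAY_DASHBOARD_URL, timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    """Poll the dashboard URL until it returns 200, or give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # dashboard not up yet; keep polling
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    if not wait_for_ray():
        raise SystemExit("Timed out waiting for Ray to come up")
    print("Ray is up; ray-dependent workflows can be submitted")
```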

ports:
- "6379:6379"
- "${RAY_DASHBOARD_PORT}:${RAY_DASHBOARD_PORT}"
@@ -83,18 +95,18 @@
- -c
- |
set -eaux
- mkdir -p /tmp/ray_pip_cache
+ mkdir -p /home/ray/.cache/ && mkdir -p /tmp/ray_pip_cache
+ sudo chmod -R 777 /home/ray/.cache/ && sudo chmod -R 777 /tmp/ray_pip_cache/ || true
+ RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES=1 RAY_REDIS_ADDRESS=redis:6379 ray start --head --dashboard-port=${RAY_DASHBOARD_PORT} --port=6379 --dashboard-host=0.0.0.0 --ray-client-server-port 10001
- # If the file was mounted in a volume instead of
- # a shared dir, permissions need to be setup
- # ... || true allows this to fail (-e is set)
- sudo chmod -R 777 /tmp/ray_pip_cache/ || true
- RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES=1 RAY_REDIS_ADDRESS=redis:6379 ray start --head --dashboard-port=${RAY_DASHBOARD_PORT} --port=6379 --dashboard-host=0.0.0.0 --ray-client-server-port 10001
mkdir -p /tmp/ray/session_latest/runtime_resources/pip
rmdir /tmp/ray/session_latest/runtime_resources/pip/ && ln -s /tmp/ray_pip_cache /tmp/ray/session_latest/runtime_resources/pip
sleep infinity
shm_size: 2g
volumes:
- - ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
+ - huggingface_cache_vol:/home/ray/.cache/huggingface
Member:

Is there any strong reason for moving this to a volume? This makes lumigator's cache not interoperable with the HF cache (that might already reside on users' machines).

Contributor Author:

This is definitely one of the problems this PR may introduce (on top of slower CI times). I moved this to a volume because we create that volume beforehand, so we ensure the bart model is already there (reducing the time to the first experiment). Without it being a volume, I'm not sure how I could add this to Ray's cache.
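Whichever location wins, the cache contents can be inspected from inside the ray container with `huggingface_hub`'s cache scanner; a small sketch, assuming `huggingface_hub` is importable in that container and the cache is mounted at the path used in this compose file:

```python
# Run inside the ray container (e.g. via `docker compose exec ray python`).
from huggingface_hub import scan_cache_dir

# Path follows the compose mount huggingface_cache_vol:/home/ray/.cache/huggingface
cache = scan_cache_dir(cache_dir="/home/ray/.cache/huggingface/hub")
for repo in cache.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB on disk")
```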

- ray-pip-cache:/tmp/ray_pip_cache
deploy:
resources:
@@ -242,6 +254,9 @@ volumes:
redis-data:
labels:
ai.mozilla.product_name: lumigator
+  huggingface_cache_vol:
+    labels:
+      ai.mozilla.product_name: lumigator
ray-pip-cache:
labels:
ai.mozilla.product_name: lumigator