Contributor

@allenwang28 allenwang28 commented Oct 14, 2025

Most of the usage info is in the README

Example run: https://www.internalfb.com/mlhub/pipelines/runs/mast/allencwang-forge-mni5vl?job_attempt=0&version=0&tab=logs&env=PRODUCTION

Only tested with 1.7B, 8B, and 32B. Please let me know if it doesn't work for you.

@meta-cla meta-cla bot added the CLA Signed label Oct 14, 2025
@allenwang28 allenwang28 requested a review from LucasLLC October 16, 2025 17:51
@allenwang28 allenwang28 marked this pull request as ready for review October 16, 2025 17:51
@allenwang28 allenwang28 changed the title from "[wip] MAST client mode" to "Enable MAST client mode" Oct 16, 2025
@codecov-commenter

Codecov Report

❌ Patch coverage is 11.36364% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.43%. Comparing base (633b219) to head (d33662b).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/forge/controller/launcher.py | 11.36% | 39 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
+ Coverage   64.69%   73.43%   +8.73%     
==========================================
  Files          79       81       +2     
  Lines        7775     7829      +54     
==========================================
+ Hits         5030     5749     +719     
+ Misses       2745     2080     -665     
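
For readers checking the figures, the percentages in this report follow from the hit and line counts. The short sketch below reproduces them. The patch total of 44 lines does not appear in the report itself; it is inferred from 39 missing lines at 11.36364% coverage.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit."""
    return 100.0 * hits / lines


base = coverage_pct(5030, 7775)     # main: hits / lines from the diff table
head = coverage_pct(5749, 7829)     # this PR: hits / lines from the diff table
patch = coverage_pct(44 - 39, 44)   # inferred: 5 of 44 changed lines covered

print(f"base {base:.2f}%  head {head:.2f}%  patch {patch:.5f}%")
```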

Contributor

@LucasLLC LucasLLC left a comment

This is incredible. I love how all the manual steps have explanations and scripts. Really solid.

Comment on lines +99 to +125
```bash
export HF_HOME=/mnt/wsfuse/teamforge/hf

# Download the model (replace with your desired model)
huggingface-cli download Qwen/Qwen3-8B --local-dir /mnt/wsfuse/teamforge/hf_artifacts/qwen3_8b
```

#### 2. Hydrate the HuggingFace Cache

After downloading the weights, you need to hydrate the HuggingFace cache so that the transformers library can find the model metadata:

```bash
# Set HF_HOME to the OilFS path
export HF_HOME=/mnt/wsfuse/teamforge/hf

# Hydrate the cache for the model
python .meta/mast/hydrate_cache.py --model-id Qwen/Qwen3-8B
```

This ensures that when MAST runs with `HF_HUB_OFFLINE=1`, the transformers library can locate all necessary files from the cache.
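
As a rough sanity check before launching with `HF_HUB_OFFLINE=1`, the sketch below looks for a model's snapshot in the standard Hugging Face hub cache layout (`$HF_HOME/hub/models--<org>--<name>/snapshots/<revision>/`). It is not a description of what `hydrate_cache.py` does. The helper names and the choice of `config.json` as the file to look for are assumptions made for illustration.

```python
import os
from pathlib import Path


def repo_cache_dir(hf_home: Path, model_id: str) -> Path:
    """Return the hub cache folder for a model id such as 'Qwen/Qwen3-8B'."""
    # The hub cache stores each model repo under hub/models--<org>--<name>.
    return hf_home / "hub" / ("models--" + model_id.replace("/", "--"))


def cached_snapshots(
    hf_home: Path, model_id: str, required: str = "config.json"
) -> list[Path]:
    """List snapshot folders that contain `required` (an assumed marker file)."""
    snapshots = repo_cache_dir(hf_home, model_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if (p / required).exists())


if __name__ == "__main__":
    # Fall back to the default cache location when HF_HOME is unset.
    hf_home = Path(os.environ.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    found = cached_snapshots(hf_home, "Qwen/Qwen3-8B")
    print("cached snapshots:", [str(p) for p in found] or "none")
```

An empty result suggests the cache has not been hydrated for that model, or that `HF_HOME` points somewhere else.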

### Directory Structure

Cache and model files are stored in separate directories:
- **Cache**: `/mnt/wsfuse/teamforge/hf` (set via `HF_HOME`)
- **Model weights**: `/mnt/wsfuse/teamforge/hf_artifacts/<model_name>`

Make sure your MAST config files point to the correct paths in `hf_artifacts`.
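
One way to catch a mispointed path before submitting a job is to compare each model path from a config against the artifacts root. The sketch below assumes the paths have already been pulled out of the config as plain strings. `misplaced_paths` and the hard-coded root are hypothetical and not part of this PR.

```python
from pathlib import PurePosixPath

# Assumed artifacts root, taken from the download command above.
ARTIFACTS_ROOT = PurePosixPath("/mnt/wsfuse/teamforge/hf_artifacts")


def misplaced_paths(
    paths: list[str], root: PurePosixPath = ARTIFACTS_ROOT
) -> list[str]:
    """Return the paths that are not located under `root` (lexical check only)."""
    bad = []
    for raw in paths:
        try:
            # relative_to raises ValueError when `raw` is outside `root`.
            PurePosixPath(raw).relative_to(root)
        except ValueError:
            bad.append(raw)
    return bad


if __name__ == "__main__":
    candidates = [
        "/mnt/wsfuse/teamforge/hf_artifacts/qwen3_8b",
        "/mnt/wsfuse/teamforge/hf/qwen3_8b",
    ]
    print(misplaced_paths(candidates))
```

With the two candidates above, only the path under `hf/` is reported. The check is purely lexical: it neither touches the filesystem nor resolves `..` components.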
Contributor

Why can't you automate this?

Contributor Author

this is the million dollar question Vidhya lol

- Only worker roles (GPU hosts) are launched in MAST
- Client connects to workers remotely via provisioner
2. Detached mode (detached=True):
Contributor

NICE.

@allenwang28 allenwang28 merged commit cf55407 into meta-pytorch:main Oct 16, 2025
9 checks passed

4 participants