Commit c379849

Add missing FUSE_SRC to mast.py (#224)
I was getting this error:

```
facebook.hpc_scheduler.hpcscheduler.types.HpcSchedulerServiceException: WarmStorageSpec in task group 'torchrun' has a non-empty directory field ('checkpoint/infra') but has missing clusters field. If job requires warm storage cluster mount, please configure storage mount options by specifying region or conda storage mount environment variables. For more details please see wiki: https://www.internalfb.com/code/fbsource/fbcode/conda/mast/mount/README.md .If you need further assistance, please make a post in https://fb.workplace.com/groups/mast.users., error: HpcSchedulerErrorCode::INVALID_JOB_DEFINITION (errorCode: 402)
```

Looking at the wiki, I guessed that the 'cluster' info was missing from the FUSE_SRC var. We already specify this var in mount.py in torchtitan, so I copied the value from there to mast.py, and it works.
1 parent 9aebf3b commit c379849

1 file changed: +1 -0 lines changed


mast/mast.py

Lines changed: 1 addition & 0 deletions
@@ -167,6 +167,7 @@ def train(
         mast_env.pop(env_var, None)
 
     if not mast_env.get("ENABLE_AIRSTORE"):
+        mast_env["FUSE_SRC"] = "ws://ws.ai.nha0genai/checkpoint/infra"
         mast_env["FUSE_SRC_PATH"] = "checkpoint/infra"
 
     # Ensure that a dump dir is available
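
For illustration only, here is a minimal sketch of the patched branch together with a local check that could catch the same mistake before submission. The helper names `apply_warm_storage_env` and `check_warm_storage_env` are hypothetical and not part of mast.py. The check assumes that `FUSE_SRC` has the shape `ws://<cluster>/<directory>`, which is inferred from the value in this commit rather than from the scheduler's documentation.

```python
from urllib.parse import urlparse

# Values copied from this commit.
FUSE_SRC = "ws://ws.ai.nha0genai/checkpoint/infra"
FUSE_SRC_PATH = "checkpoint/infra"


def apply_warm_storage_env(mast_env: dict) -> dict:
    """Hypothetical helper mirroring the patched branch in train()."""
    if not mast_env.get("ENABLE_AIRSTORE"):
        mast_env["FUSE_SRC"] = FUSE_SRC
        mast_env["FUSE_SRC_PATH"] = FUSE_SRC_PATH
    return mast_env


def check_warm_storage_env(mast_env: dict) -> None:
    """Hypothetical pre-submit check: a directory needs a matching cluster."""
    directory = mast_env.get("FUSE_SRC_PATH")
    if not directory:
        return  # No warm storage directory requested, nothing to verify.

    # Assumption: FUSE_SRC looks like ws://<cluster>/<directory>.
    parsed = urlparse(mast_env.get("FUSE_SRC", ""))
    if parsed.scheme != "ws" or not parsed.netloc:
        raise ValueError(
            f"FUSE_SRC_PATH is {directory!r} but FUSE_SRC names no cluster"
        )
    if parsed.path.lstrip("/") != directory:
        raise ValueError(
            f"FUSE_SRC directory {parsed.path!r} does not match {directory!r}"
        )


if __name__ == "__main__":
    env = apply_warm_storage_env({})
    check_warm_storage_env(env)  # Passes once both variables are set.
```

Calling `check_warm_storage_env({"FUSE_SRC_PATH": "checkpoint/infra"})` raises `ValueError`, which corresponds to the state of the environment before this commit.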
