@tianyu-l
Contributor

Previously we had two redundant fields, job_config.model.name and train_spec.name, and we implicitly required them to be the same.

This ambiguity was exacerbated in #1740, after which the two fields were allowed to differ for no good reason.

This PR removes train_spec.name.

For models inside torchtitan:

  • we rely on the folder path as the implicit source of truth for "registering" models.
  • job_config.model.name will be used to search for models in the torchtitan/models folder and the torchtitan/experiments folder.

For models outside torchtitan:

  • users can still register a TrainSpec, but must specify its name explicitly so that job_config.model.name can be used to fetch the TrainSpec.
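
The lookup described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: the `TrainSpec` stand-in, the `_extra_train_specs` dict, the `search_packages` parameter, the order in which the two sources are consulted, and the assumption that each model package exposes a `get_train_spec()` function are all hypothetical.

```python
import importlib
from dataclasses import dataclass


@dataclass
class TrainSpec:
    # Hypothetical stand-in: note there is no `name` field.
    # The lookup key lives outside the spec.
    model_cls: type


# Specs registered by users for models living outside the repository.
_extra_train_specs: dict[str, TrainSpec] = {}


def register_train_spec(name: str, spec: TrainSpec) -> None:
    """Register an out-of-tree spec under an explicitly given name."""
    if name in _extra_train_specs:
        raise ValueError(f"TrainSpec {name!r} is already registered")
    _extra_train_specs[name] = spec


def get_train_spec(
    name: str,
    search_packages: tuple[str, ...] = ("torchtitan.models", "torchtitan.experiments"),
) -> TrainSpec:
    """Resolve a model name to a TrainSpec.

    Explicitly registered specs are checked first (an assumption of this
    sketch); otherwise the name is treated as a folder path under each
    search package.
    """
    if name in _extra_train_specs:
        return _extra_train_specs[name]
    for package in search_packages:
        try:
            module = importlib.import_module(f"{package}.{name}")
        except ModuleNotFoundError:
            continue  # not under this folder; try the next one
        # Assumed convention: the model package exposes get_train_spec().
        return module.get_train_spec()
    raise ValueError(f"No TrainSpec found for model name {name!r}")
```

With this shape, a dotted name such as simple_fsdp.llama3 maps naturally onto a subfolder of the experiments package, while an out-of-tree model is found only after an explicit `register_train_spec("my_model", spec)` call.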

@meta-cla meta-cla bot added the CLA Signed label Oct 10, 2025
@wwwjn
Contributor

wwwjn commented Oct 10, 2025

The FLUX CI failure appears to be real. It does not seem related to this PR; it was introduced into titan a long time ago and went unnoticed because FLUX CI is only triggered when something under flux/ changes. I will create a PR to fix it (as I'm the PoC).

@tianyu-l
Contributor Author

@wwwjn thanks! Maybe you could stamp so I can merge?

@wwwjn wwwjn merged commit aa000a3 into main Oct 10, 2025
9 of 10 checks passed
@tianyu-l tianyu-l deleted the train_spec branch October 10, 2025 02:29
ruisizhang123 pushed a commit that referenced this pull request Oct 11, 2025
#1850 removed the `name` field from `TrainSpec`. The experiments in simple_fsdp must also be updated; otherwise they won't run.

#1776 added the `use_flex_attn` parameter to `apply_non_moe_tp()`, which is missing in the simple_fsdp experiments.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name simple_fsdp.llama3 --compile.enable
```

```
NGPU=8 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name simple_fsdp.deepseek_v3 --compile.enable
```
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 15, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 16, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 29, 2025