Update happy path tests to use default configs #290
lee-goodfire wants to merge 46 commits into main
Conversation
The test was loading the resid_mlp2 config, which has n_features=100 in the IdentityCIError metric, but creating a model with only 5 features. This caused a validation error when the metric tried to verify the CI array shape. Added test overrides to update eval_metric_configs.IdentityCIError to match the test model's n_features=5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Load default configs from registry for all 4 happy path tests
- Override only test-specific parameters (C, steps, batch_size, etc.); see the sketch below
- Fix ResidMLP test metric config to match 5-feature test model
- Remove print statements and obvious comments per style guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
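To make the override pattern concrete, here is a minimal sketch of what these tests now do, assuming the helper names visible in this PR's diff; the experiment name "tms_5-2", the override keys, and the import paths are illustrative, not taken from the repository.

```python
# Minimal sketch of the registry-based test setup; import paths are assumed.
from spd.registry import get_experiment_config_file_contents  # path assumed
from spd.utils import apply_nested_updates  # path assumed
from spd.config import Config  # path assumed

base_config_dict = get_experiment_config_file_contents("tms_5-2")  # name assumed
test_overrides = {
    "C": 2,          # small component count for a fast test
    "steps": 5,      # only a few optimization steps
    "batch_size": 4,
}
config_dict = apply_nested_updates(base_config_dict, test_overrides)
config = Config(**config_dict)
```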
…tric_configs

The old loss coefficient fields (faithfulness_coeff, ci_recon_coeff, etc.) were deprecated and removed during config validation, leaving an empty loss_metric_configs list. This caused the total loss to be a plain tensor with no gradient graph, resulting in a RuntimeError during the backward pass. Updated to use the new loss_metric_configs format (sketched below) with:
- ImportanceMinimalityLoss (coeff: 1e-2, pnorm: 0.1)
- CIMaskedReconLoss (coeff: 1.0)
- StochasticReconLoss (coeff: 1.0)
- StochasticReconLayerwiseLoss (coeff: 1.0)

This matches the original coefficient values and follows the pattern used in other config files (tms_5-2_config.yaml, resid_mlp1_config.yaml). The test now passes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
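For illustration, that block might look roughly like the following when expressed as part of the test's config dict; the key names ("classname", "coeff", "pnorm") are assumptions rather than the project's actual schema.

```python
# Illustrative only: the new-style loss_metric_configs with the coefficients
# listed above. Key names are assumed, not taken from the repository.
loss_metric_configs = [
    {"classname": "ImportanceMinimalityLoss", "coeff": 1e-2, "pnorm": 0.1},
    {"classname": "CIMaskedReconLoss", "coeff": 1.0},
    {"classname": "StochasticReconLoss", "coeff": 1.0},
    {"classname": "StochasticReconLayerwiseLoss", "coeff": 1.0},
]
```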
Fixed multiple bugs in the GPT-2 test to work with the new registry-based config:
- Changed the model class check from a PreTrainedModel subclass check to a hasattr check for from_pretrained (see the sketch below)
- Added special handling for simple_stories_train models using from_run_info
- Fixed the tokenizer path to use config.tokenizer_name instead of config.pretrained_model_name
- Fixed module patterns to match the actual GPT2Simple structure (h.*.attn.q_proj instead of transformer.h.*.attn.c_attn)
- Disabled eval metrics that reference layers not in target_module_patterns

Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
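A hedged sketch of the looser model-class check from the first bullet; the variable names are hypothetical and the real test may structure this differently.

```python
# Accept any class exposing from_pretrained rather than requiring a
# transformers.PreTrainedModel subclass. Names here are illustrative.
if hasattr(model_class, "from_pretrained"):
    target_model = model_class.from_pretrained(config.pretrained_model_name)
else:
    raise TypeError(f"{model_class!r} does not provide from_pretrained")
```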
```python
)

# The test passes if optimize runs without errors
print("TMS SPD optimization completed successfully")
```
Removed (along with others) due to "tests should pass silently" principle.
Okay, I'm pretty happy with this one. This was almost completely one-shotted by the new claude plugin I've put together (which goodfire people can find here). I've checked each line and to my eyes it seems good. This would pass my review as is, but I appreciate that e.g. Dan's eye will be better and identify more issues. But I'm happy to pass this off for further review.
It's easiest to understand the pattern for tms and resid mlp2. The other models, ih especially, are slightly messier, but I think it seems warranted.
danbraunai-goodfire
left a comment
tyty. Good to merge if I'm very unlikely to question any of the changes made to address the comments.
PR description comments
- I think it's long and not information-dense enough. Could probably get away with just a few-line description and a one-line Motivation.
- You should put "Closes #X" to close the issue automatically when the PR closes
Currently, we don't have a saved ih_transformer target model, unlike the other models. Ideally we'd make one if we want to support this experiment. I guess doing what this test does, creating a new randomly initialized model and running SPD on that, seems reasonable. I'd make a note that it's doing that though, even just a "TODO: Use a real pretrained_model_path in the config instead of randomly initializing one"
tests/test_gpt2.py
Outdated
| "eval_metric_configs": [], # Disable eval metrics to avoid layer matching issues | ||
| } | ||
| config_dict = apply_nested_updates(base_config, test_overrides) | ||
| config = Config.model_validate(config_dict) |
We've been using Config(**config_dict) elsewhere in the codebase. They both do the same thing. I'm pretty indifferent but would prefer consistency here.
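For reference, the two equivalent forms under discussion (config_dict stands in for the dict built in the snippet above):

```python
# Both lines build and validate the same Config; the review prefers the
# keyword-unpacking form for consistency with the rest of the codebase.
config = Config(**config_dict)
config = Config.model_validate(config_dict)
```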
tests/test_gpt2.py
Outdated
| "train_log_freq": 50, | ||
| "n_examples_until_dead": 200, # train_log_freq * batch_size |
nit: I'd probably just put 999 or something, since this is never going to be used anyway. Don't think we did this previously but we should.
tests/test_gpt2.py
Outdated
| "eval_batch_size": 1, | ||
| "train_log_freq": 50, | ||
| "n_examples_until_dead": 200, # train_log_freq * batch_size | ||
| "task_config.max_seq_len": 16, |
nit: I'd go smaller cause why not
tests/test_gpt2.py
Outdated
```python
        eval_data_split="test[100:200]",
    ),
)
base_config = get_experiment_config_file_contents("ss_gpt2_simple")
```
I'd prefer either base_config_dict or base_config: dict[str, Any]. Otherwise people will confuse it for a Config object.
Before the change this actually tested ss_gpt2_config, which uses the transformers library gpt2 architecture and our custom model that was uploaded to huggingface.
Now it tests ss_gpt2_simple, which uses the simplestories gpt2_simple architecture and our custom model that was uploaded to wandb.
I think it's fine to keep this test, but I'd rename it to tests/test_ss_gpt2_simple.py.
But I do want to keep one test that has a pretrained_model_class from the transformers library, like the old one had. You may be able to test both configurations in this one test file with mark.parametrize or just a for loop. If doing that, the name of the file might be test_gpt2_configurations.py or something like that.
It would be nice to test the raw gpt too, i.e. the one in the gpt2_config.yaml. But you should test the runtime of that. Not worth it if it adds 5+ seconds to the slow tests.
Renamed test_gpt2.py → test_gpt2_configs.py and added a parametrized test covering both (sketched below):
- ss_gpt2_simple (simple_stories_train GPT2Simple, wandb-hosted)
- ss_gpt2 (transformers.GPT2LMHeadModel from HuggingFace)
So we now test both the new config and a transformers library model like the old test had.
Re: the raw gpt2 config - I tested the runtime and it times out downloading the openwebtext dataset (8.8M examples), so I didn't include it.
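A rough sketch of what the parametrized test might look like, assuming the helpers shown earlier; the override values and test body are illustrative, not the actual file contents.

```python
import pytest

# Parametrize one test over the two configurations discussed above.
@pytest.mark.parametrize("experiment_name", ["ss_gpt2_simple", "ss_gpt2"])
def test_gpt2_configs(experiment_name: str) -> None:
    base_config_dict = get_experiment_config_file_contents(experiment_name)
    test_overrides = {"steps": 2, "task_config.max_seq_len": 8}  # illustrative values
    config_dict = apply_nested_updates(base_config_dict, test_overrides)
    config = Config(**config_dict)
    # ...run the short optimization; any exception fails the test
```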
- Rename base_config -> base_config_dict to avoid confusion with Config objects
- Use Config(**config_dict) instead of Config.model_validate() for consistency
- Change n_examples_until_dead to 999 (never used in tests anyway)
- Reduce max_seq_len in GPT2 test for faster execution
- Rename test_gpt2.py -> test_gpt2_configs.py
- Add parametrized test for both ss_gpt2_simple and ss_gpt2 configs
- Add TODO comment in test_ih_transformer.py about needing pretrained model

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
It's been a little while since you said
So I'm requesting review again, just in case. I think you're unlikely to question any of the changes made to address the comments.
@lee-goodfire nice, lgtm. Could you comment on the open threads to say you've completed them, and resolve them? Typically as the author I'll comment "done" after I've implemented something and leave it for the reviewer to resolve them all. Not sure what the normal thing to do here is, but I guess we want something that doesn't require the author to manually check the things they commented on.
Description
Updates all 4 happy path decomposition tests to load default configs from the experiment registry instead of manually defining them.
Related Issue
Closes #92
Motivation and Context
Tests were manually defining configs, which could diverge from the default configs. This change ensures the tests exercise the actual default configs from the experiment registry, so they cannot silently drift out of sync with them.
How Has This Been Tested?
I have run the tests; they complete without errors.
Does this PR introduce a breaking change?
No breaking changes. Tests maintain the same behavior, just load configs differently. The IH config fix ensures it works with current code (deprecated fields were already non-functional).