Update happy path tests to use default configs #290
lee-goodfire wants to merge 46 commits into main
Conversation
The test was loading the resid_mlp2 config, which has n_features=100 in the IdentityCIError metric, but creating a model with only 5 features. This caused a validation error when the metric tried to verify the CI array shape. Added test overrides to update eval_metric_configs.IdentityCIError to match the test model's n_features=5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Load default configs from registry for all 4 happy path tests
- Override only test-specific parameters (C, steps, batch_size, etc.); see the sketch below
- Fix ResidMLP test metric config to match 5-feature test model
- Remove print statements and obvious comments per style guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
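To make the override pattern concrete, here is a minimal sketch of what these tests now do, assuming the helper names visible in this PR's diff; the experiment name "tms_5-2", the override keys, and the import paths are illustrative, not taken from the repository.

```python
# Minimal sketch of the registry-based test setup; import paths are assumed.
from spd.registry import get_experiment_config_file_contents  # path assumed
from spd.utils import apply_nested_updates  # path assumed
from spd.config import Config  # path assumed

base_config_dict = get_experiment_config_file_contents("tms_5-2")  # name assumed
test_overrides = {
    "C": 2,          # small component count for a fast test
    "steps": 5,      # only a few optimization steps
    "batch_size": 4,
}
config_dict = apply_nested_updates(base_config_dict, test_overrides)
config = Config(**config_dict)
```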
…tric_configs

The old loss coefficient fields (faithfulness_coeff, ci_recon_coeff, etc.) were deprecated and removed during config validation, leaving an empty loss_metric_configs list. This caused the total loss to be a plain tensor with no gradient graph, resulting in a RuntimeError during the backward pass. Updated to use the new loss_metric_configs format (sketched below) with:
- ImportanceMinimalityLoss (coeff: 1e-2, pnorm: 0.1)
- CIMaskedReconLoss (coeff: 1.0)
- StochasticReconLoss (coeff: 1.0)
- StochasticReconLayerwiseLoss (coeff: 1.0)

This matches the original coefficient values and follows the pattern used in other config files (tms_5-2_config.yaml, resid_mlp1_config.yaml). The test now passes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
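For illustration, that block might look roughly like the following when expressed as part of the test's config dict; the key names ("classname", "coeff", "pnorm") are assumptions rather than the project's actual schema.

```python
# Illustrative only: the new-style loss_metric_configs with the coefficients
# listed above. Key names are assumed, not taken from the repository.
loss_metric_configs = [
    {"classname": "ImportanceMinimalityLoss", "coeff": 1e-2, "pnorm": 0.1},
    {"classname": "CIMaskedReconLoss", "coeff": 1.0},
    {"classname": "StochasticReconLoss", "coeff": 1.0},
    {"classname": "StochasticReconLayerwiseLoss", "coeff": 1.0},
]
```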
Fixed multiple bugs in the GPT-2 test to work with the new registry-based config:
- Changed the model class check from a PreTrainedModel subclass check to a hasattr check for from_pretrained (see the sketch below)
- Added special handling for simple_stories_train models using from_run_info
- Fixed the tokenizer path to use config.tokenizer_name instead of config.pretrained_model_name
- Fixed module patterns to match the actual GPT2Simple structure (h.*.attn.q_proj instead of transformer.h.*.attn.c_attn)
- Disabled eval metrics that reference layers not in target_module_patterns

Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
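A hedged sketch of the looser model-class check from the first bullet; the variable names are hypothetical and the real test may structure this differently.

```python
# Accept any class exposing from_pretrained rather than requiring a
# transformers.PreTrainedModel subclass. Names here are illustrative.
if hasattr(model_class, "from_pretrained"):
    target_model = model_class.from_pretrained(config.pretrained_model_name)
else:
    raise TypeError(f"{model_class!r} does not provide from_pretrained")
```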
```python
)

# The test passes if optimize runs without errors
print("TMS SPD optimization completed successfully")
```
Removed (along with others) due to "tests should pass silently" principle.
Okay, I'm pretty happy with this one. This was almost completely one-shotted by the new claude plugin I've put together (which goodfire people can find here). I've checked each line and to my eyes it seems good. This would pass my review as is, but I appreciate that e.g. Dan's eye will be better and identify more issues. But I'm happy to pass this off for further review.
It's easiest to understand the pattern for tms and resid mlp2. The other models, ih especially, are slightly messier, but I think it seems warranted.
danbraunai-goodfire
left a comment
tyty. Good to merge if I'm very unlikely to question any of the changes made to address the comments.
PR description comments
- I think it's long and not information-dense enough. Could probably get away with just a few-line description and a one-line Motivation.
- You should put "Closes #X" to close the issue automatically when the PR closes
Currently, we don't have a saved ih_transformer target model, unlike the other models. Ideally we'd make one if we want to support this experiment. I guess doing what this test does, creating a new randomly initialized model and running SPD on that, seems reasonable. I'd make a note that it's doing that though, even just a "TODO: Use a real pretrained_model_path in the config instead of randomly initializing one"
tests/test_gpt2.py
Outdated
| "eval_metric_configs": [], # Disable eval metrics to avoid layer matching issues | ||
| } | ||
| config_dict = apply_nested_updates(base_config, test_overrides) | ||
| config = Config.model_validate(config_dict) |
We've been using Config(**config_dict) elsewhere in the codebase. They both do the same thing. I'm pretty indifferent but would prefer consistency here.
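For reference, the two equivalent forms under discussion (config_dict stands in for the dict built in the snippet above):

```python
# Both lines build and validate the same Config; the review prefers the
# keyword-unpacking form for consistency with the rest of the codebase.
config = Config(**config_dict)
config = Config.model_validate(config_dict)
```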
tests/test_gpt2.py
Outdated
| "train_log_freq": 50, | ||
| "n_examples_until_dead": 200, # train_log_freq * batch_size |
nit: I'd probably just put 999 or something, since this is never going to be used anyway. Don't think we did this previously but we should.
tests/test_gpt2.py
Outdated
| "eval_batch_size": 1, | ||
| "train_log_freq": 50, | ||
| "n_examples_until_dead": 200, # train_log_freq * batch_size | ||
| "task_config.max_seq_len": 16, |
nit: I'd go smaller cause why not
tests/test_gpt2.py
Outdated
```python
        eval_data_split="test[100:200]",
    ),
)
base_config = get_experiment_config_file_contents("ss_gpt2_simple")
```
I'd prefer either base_config_dict or base_config: dict[str, Any]. Otherwise people will confuse it for a Config object.
Before the change this actually tested ss_gpt2_config, which uses the transformers library gpt2 architecture and our custom model that was uploaded to huggingface.
Now it tests ss_gpt2_simple, which uses the simplestories gpt2_simple architecture and our custom model that was uploaded to wandb.
I think it's fine to keep this test, but I'd rename it to tests/test_ss_gpt2_simple.py.
But I do want to keep one test that has a pretrained_model_class from the transformers library, like the old one had. You may be able to test both configurations in this one test file with mark.parametrize or just a for loop. If doing that, the name of the file might be test_gpt2_configurations.py or something like that.
It would be nice to test the raw gpt too, i.e. the one in the gpt2_config.yaml. But you should test the runtime of that. Not worth it if it adds 5+ seconds to the slow tests.
Renamed test_gpt2.py → test_gpt2_configs.py and added a parametrized test covering both (sketched below):
- ss_gpt2_simple (simple_stories_train GPT2Simple, wandb-hosted)
- ss_gpt2 (transformers.GPT2LMHeadModel from HuggingFace)
So we now test both the new config and a transformers library model like the old test had.
Re: the raw gpt2 config - I tested the runtime and it times out downloading the openwebtext dataset (8.8M examples), so I didn't include it.
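A rough sketch of what the parametrized test might look like, assuming the helpers shown earlier; the override values and test body are illustrative, not the actual file contents.

```python
import pytest

# Parametrize one test over the two configurations discussed above.
@pytest.mark.parametrize("experiment_name", ["ss_gpt2_simple", "ss_gpt2"])
def test_gpt2_configs(experiment_name: str) -> None:
    base_config_dict = get_experiment_config_file_contents(experiment_name)
    test_overrides = {"steps": 2, "task_config.max_seq_len": 8}  # illustrative values
    config_dict = apply_nested_updates(base_config_dict, test_overrides)
    config = Config(**config_dict)
    # ...run the short optimization; any exception fails the test
```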
- Rename base_config -> base_config_dict to avoid confusion with Config objects
- Use Config(**config_dict) instead of Config.model_validate() for consistency
- Change n_examples_until_dead to 999 (never used in tests anyway)
- Reduce max_seq_len in GPT2 test for faster execution
- Rename test_gpt2.py -> test_gpt2_configs.py
- Add parametrized test for both ss_gpt2_simple and ss_gpt2 configs
- Add TODO comment in test_ih_transformer.py about needing pretrained model

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
It's been a little while since you said
So I'm requesting review again, just in case. I think you're unlikely to question any of the changes made to address the comments.
@lee-goodfire nice, lgtm. Could you comment on the open threads to say you've completed them, and resolve them? Typically as the author I'll comment "done" after I've implemented something and leave it for the reviewer to resolve them all. Not sure what the normal thing to do here is, but I guess we want something that doesn't require the author to manually check the things they commented on.
Description
Updates all 4 happy path decomposition tests to load default configs from the experiment registry instead of manually defining them.
Related Issue
Closes #92
Motivation and Context
Tests were manually defining configs, which could diverge from the default configs. This change ensures the tests exercise the actual default configs from the experiment registry, so they cannot silently drift out of sync with them.
How Has This Been Tested?
I have run the tests; they complete without errors.
Does this PR introduce a breaking change?
No breaking changes. Tests maintain the same behavior, just load configs differently. The IH config fix ensures it works with current code (deprecated fields were already non-functional).