
Conversation

bigximik (Contributor) commented May 14, 2025

✨ Description

Creates an Evaluator abstraction so that additional evaluators beyond Loss can be added.

Adds an evaluate command that accepts the same training config and enables evaluation on the last checkpoint.

Includes some fixes.

Example: specifying multiple LossEvaluators

training:
  evaluators:
    the_stack:
      interval: 50
      evaluator:
        type: loss
        iterations: 25
        dataset_name: the_stack
    fineweb:
      interval: 100
      evaluator:
        type: loss
        iterations: 15
        dataset_name: fineweb
data:
  datasets:
    the_stack:
      type: file
      path: path/to/validation_the_stack_dataset.yaml
    fineweb:
      type: file
      path: path/to/validation_fineweb_dataset.yaml
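
Reading the example: each entry under training.evaluators is tied, via dataset_name, to the dataset of the same name under data.datasets; interval is how often (in training steps) the evaluator runs, and iterations is how many evaluation batches each run consumes.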

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.

Base automatically changed from denis/generate_final to main May 20, 2025 14:50
@bigximik changed the title from "[work in progress] Refactoring of Evaluation and adding of evaluate command" to "Refactoring of Evaluation and adding of evaluate command" on May 30, 2025
@bigximik requested a review from jlamypoirier on May 30, 2025 16:31
@bigximik marked this pull request as ready for review on June 2, 2025 07:01
hint=FieldHint.feature,
valid=skip_valid_if_none(check_field(Assert.gt, 0)),
)
class TrainingEvaluatorConfig(EvaluatorConfigBase):
Collaborator
It should still inherit from IntervalConfig so it's simpler and we don't have the redundant run_interval.

Contributor Author

I prefer composition over multiple inheritance here, as IntervalConfig is a property of the evaluator wrapper (which TrainingEvaluatorConfig is), and not of the new entity derived from EvaluatorConfigBase. In my opinion, this makes the code much more readable.

However, multiple inheritance is already used in many places. So if you still prefer that approach after my explanation, I’m happy to refactor accordingly.
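
For illustration, a simplified and self-contained sketch of the two designs being debated; the class names mirror the PR's, but these are plain dataclasses rather than Fast-LLM's actual config machinery:

from dataclasses import dataclass, field


@dataclass
class IntervalConfig:
    # How often (in training steps) the evaluator runs.
    interval: int = 1


@dataclass
class EvaluatorConfig:
    # How many batches each evaluation run consumes.
    iterations: int = 10


# Composition: the wrapper *has* an interval config, which surfaces in
# YAML as an extra nested level (the "redundant run_interval").
@dataclass
class TrainingEvaluatorConfigComposed:
    run_interval: IntervalConfig = field(default_factory=IntervalConfig)
    evaluator: EvaluatorConfig = field(default_factory=EvaluatorConfig)


# Multiple inheritance: the wrapper *is* an IntervalConfig, so interval
# sits flat next to evaluator in YAML, with no extra nesting.
@dataclass
class TrainingEvaluatorConfigInherited(IntervalConfig):
    evaluator: EvaluatorConfig = field(default_factory=EvaluatorConfig)

The inherited variant is what keeps the YAML in the description flat, with interval directly beside evaluator under each named entry.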

Collaborator

Multiple inheritance is only needed because of the EvaluatorConfigBase mixin, which is arguably not even necessary. I'd rather prioritize better usage (configs) over marginally simpler code.

Contributor Author

OK, I will use multiple inheritance, but will retain EvaluatorConfigBase so I don't need to rewrite EvaluatorRunner. Then we will create a separate config for the evaluate command, in addition to it accepting the training config. We need to discuss what it should look like, so I've created an issue for it: #285.

else 0
)

def get_evaluator(
Collaborator

Seems unnecessary; we can just call evaluator.get_evaluator().

bigximik (Contributor Author) commented Jun 4, 2025

I explained in detail in #222 (comment) and #222 (comment), but in short, it is to maintain proper encapsulation.

Collaborator

There is no encapsulation needed though; TrainingEvaluatorConfig is a fixed class with an evaluator: EvaluatorConfig field, which is dynamic but has a well-defined get_evaluator method.

Encapsulation would be needed if we allowed for a more generic scenario where evaluator.get_evaluator doesn't exist or has a different signature, e.g. if we allowed for a more generic evaluator, or a generalized TrainingEvaluatorConfig that doesn't have an evaluator. I don't really see this happening anytime soon...

Contributor Author

Another thing I can't really work around here is that I need to return a TrainingEvaluator and not evaluator.get_evaluator(). This is because TrainingEvaluator is responsible for handling whether evaluators should or should not run during training. Neither the concrete evaluators nor the EvaluatorRunner are aware of this; they simply execute.
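
To illustrate the responsibility split being described, a rough sketch with invented names and signatures (not Fast-LLM's actual interfaces):

class Evaluator:
    # A concrete evaluator only knows *how* to run, not *when*.
    def run(self, iteration: int) -> dict[str, float]:
        raise NotImplementedError


class TrainingEvaluator(Evaluator):
    # Wraps a concrete evaluator and decides *whether* it should run at a
    # given training iteration; neither the wrapped evaluator nor the
    # EvaluatorRunner is aware of this scheduling.
    def __init__(self, inner: Evaluator, interval: int):
        self.inner = inner
        self.interval = interval

    def run(self, iteration: int) -> dict[str, float]:
        if iteration % self.interval != 0:
            return {}  # not scheduled at this training step
        return self.inner.run(iteration)

This is why the wrapper config's get_evaluator returns a TrainingEvaluator rather than simply delegating to evaluator.get_evaluator().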



@config_class()
class EvaluatorConfigBase(Config):
Collaborator

I don't think that's necessary

bigximik (Contributor Author) commented Jun 4, 2025

I explained in detail in #222 (comment) and #222 (comment), but in short, it is to maintain proper encapsulation.

bigximik (Contributor Author) commented Jun 4, 2025

> I was also thinking we should move the dataset definitions once we have evaluators, feel free to do it here or in a follow-up PR.

I will move it in #282, so we can close this one faster.


# @pytest.mark.extra_slow
@requires_cuda
def test_loss_validation_vs_inference(model_and_tokenizer):
Collaborator

I don't think this test is worth it; it's kind of trivial.

@bigximik merged commit 4119854 into main on Jun 19, 2025
4 checks passed
@bigximik deleted the denis/evaluate branch on June 19, 2025 15:44