fix(MBT): update auto gen mbt evals #3530

galzilber · 2025-12-31T08:58:22Z

I have added tests that cover my changes.
If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
(If applicable) I have updated the documentation accordingly.

Important

Refactor evaluator validation and configuration logic to use consolidated input fields and improve error handling.

Evaluator Validation:
- Updated _validate_evaluator_input() in evaluator.py to use a single consolidated input field for request models.
Configuration Extraction:
- Enhanced _get_config_fields() in evaluators_made_by_traceloop.py to derive defaults from nested config models.
Error Handling:
- Improved error handling for missing fields and invalid type annotations in evaluator.py and evaluators_made_by_traceloop.py.
Model Updates:
- Updated request models in request.py to use consolidated input fields.
- Updated response models in response.py to align with new request model structure.

^{This description was created by}^{for f25cc27. You can customize this summary. It will automatically update as commits are pushed.}

Summary by CodeRabbit

Release Notes

Refactor
- Updated evaluator validation logic to handle input models with a single consolidated field instead of multiple fields.
- Improved configuration field extraction to derive defaults from nested config models.
- Enhanced error handling for missing fields and invalid type annotations.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-31T08:58:34Z

📝 Walkthrough

Walkthrough

The evaluator module refactors input validation and configuration handling to use nested model structures. Instead of unpacking input fields directly, the validation now passes input as a single field to the request model, while configuration extraction adapts to resolve and use nested config models rather than top-level fields.

Changes

Cohort / File(s)	Summary
Input Validation & Config Handling Refactoring `packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py`, `packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py`	Modified `_validate_evaluator_input` to pass input as a single `input=input` keyword argument instead of unpacking the input dict. Updated `_get_required_fields_` to inspect nested "input" field annotation for required fields. Rewrote `_get_config_fields_` to resolve and extract nested "config" model fields with proper guards for missing or invalid structures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A hop, skip, and validation refound,
Nested models now deeply bound,
Config fields dance in tidy rows,
Where input's structure elegantly flows,
Our evaluators leap to new heights! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title mentions 'update auto gen mbt evals' which relates to the changes in evaluator files, but it is vague and generic, not clearly describing the main structural change (nested input/config fields).	Make the title more specific about the main change, e.g., 'fix(MBT): refactor evaluators to use nested input and config fields' or similar.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 72ad88e in 2 minutes and 11 seconds. Click for details.

Reviewed 749 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 12 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:27

Draft comment:
Docstring now details the nested 'input' and optional 'config' structure. Ensure callers pass a flat dict as intended.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is asking the PR author to ensure that callers pass a flat dictionary as intended. It seems to be a request for confirmation of intended behavior, which violates the rule against asking the author to confirm their intention. Therefore, this comment should be removed.

2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:39

Draft comment:
Changed to 'request_model(input=input)' to enforce the nested payload structure. Verify that legacy clients are updated accordingly.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to verify that legacy clients are updated, which is against the rules. It does not provide a specific code suggestion or ask for a specific test to be written. Therefore, it should be removed.

3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:31

Draft comment:
List comprehension correctly extracts required fields from the nested 'input' model.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that something is done correctly, which is not necessary in a code review context.

4. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:53

Draft comment:
Handling Optional config types using origin is well implemented to extract the inner model.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the implementation without offering any specific guidance or questions.

5. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py:132

Draft comment:
The all export now includes updated evaluator models. Ensure auto-generation remains consistent with the API spec.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

6. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:15

Draft comment:
AgentEfficiencyInput now includes a 'trajectory_prompts' field. Confirm that producers supply this required value.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

7. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:87

Draft comment:
New field 'tool_output' added in AgentToolErrorDetectorInput; ensure that input providers include this field.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the author to ensure that input providers include a new field. This falls under the category of asking the author to ensure something is done, which is against the rules. The comment does not provide a specific suggestion or ask for a specific test to be written, so it should be removed.

8. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:210

Draft comment:
InstructionAdherenceInput now requires a 'response' field. Validate that tasks provide this new field accordingly.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is asking the PR author to validate that tasks provide a new required field. It is not making a specific code suggestion or asking for a test to be written. It is more of a general request for validation, which violates the rule against asking the author to ensure behavior is intended or tested.

9. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py:88

Draft comment:
Using 'pass_' alias in IntentChangeResponse is a smart workaround to avoid reserved keyword conflicts.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

10. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py:115

Draft comment:
In PromptInjectionResponse, consider using a boolean for 'has_injection' if it represents a binary state.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

11. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:151

Draft comment:
Typo suggestion: The example value 'gpt-4o' in ContextRelevanceConfigRequest might be a typo. Did you mean 'gpt-4'?
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% 'gpt-4o' is actually the correct name for OpenAI's GPT-4 Omni model, which was released in 2024. This is not a typo - it's a legitimate model identifier. The comment is factually incorrect because it assumes 'gpt-4o' is a mistake when it's actually a valid model name. The automated tool that generated this comment appears to be unaware of this model's existence. This comment should be deleted as it's incorrect and would mislead the PR author. I might be wrong if this codebase was written before GPT-4o was released, or if the codebase has specific conventions about which models to support. However, given that this is generated code (as indicated by the comment at the top of the file), it's likely pulling from a schema that intentionally includes 'gpt-4o'. Even if there were historical context, the file header shows this is auto-generated code from a JSON schema file. The example value 'gpt-4o' was intentionally included in the source schema, and 'gpt-4o' is a real, valid OpenAI model name. The comment is factually incorrect. This comment should be deleted because 'gpt-4o' is not a typo - it's the correct name for OpenAI's GPT-4 Omni model. The comment is factually incorrect and would mislead the PR author into making an unnecessary change.

12. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:194

Draft comment:
The 'context' field in FaithfulnessInput appears to be incomplete. Please ensure the Field is finished with the intended parameters, such as closing parenthesis and examples if needed.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_ZBsLzqhcr9lMb5C7

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

nina-kollman · 2025-12-31T09:01:16Z

packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py


-class AgentGoalAccuracyRequest(BaseModel):
+class AgentFlowQualityRequest(BaseModel):
+    config: Optional[AgentFlowQualityConfigRequest] = None


This is not optional

nina-kollman · 2025-12-31T09:01:35Z

packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py


-class AgentToolErrorDetectorRequest(BaseModel):
+class AgentGoalCompletenessRequest(BaseModel):
+    config: Optional[AgentGoalCompletenessConfigRequest] = None


Not optional

ellipsis-dev

Important

Looks good to me! 👍

Reviewed f0587f1 in 2 minutes and 0 seconds. Click for details.

Reviewed 81 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 3 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:18

Draft comment:
The detailed docstring explaining the nested input structure was removed. Consider adding a brief note that the flat input is wrapped in an {'input': ...} dict for clarity.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This comment is asking the PR author to add back documentation that was intentionally removed. The rules state "Do NOT comment unless there is clearly a code change required" and "Do NOT make comments that are obvious or unimportant." This is a documentation preference comment, not a code issue. The PR author clearly decided to simplify the docstring, removing the detailed explanation. The comment is essentially asking them to reconsider that decision or add some of it back. This falls under making comments about things that aren't clear code issues. The functionality hasn't changed, only the documentation level of detail. Perhaps the detailed documentation was actually important for understanding how the function works, especially if the nested structure is non-obvious. The comment is trying to preserve useful information that helps future maintainers understand the code. While documentation can be helpful, the rules explicitly state not to comment unless there's a clear code change required. This is a documentation style preference, not a functional issue. The PR author intentionally simplified the docstring, and if they felt the detail was unnecessary, that's their judgment call. The code itself is clear enough at line 23 showing request_model(input=input). This comment should be deleted. It's asking the PR author to reconsider a documentation change they intentionally made, which is a style preference rather than a code issue. The rules state comments should only be made when there's clearly a code change required, not for documentation preferences.

2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:14

Draft comment:
The docstring for _get_required_fields was simplified. While brevity is good, consider mentioning that it extracts required fields from the nested 'input' model to aid maintainers.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, suggesting a documentation update. It doesn't provide a specific code suggestion or identify a potential issue with the code itself. According to the rules, purely informative comments should be removed.

3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:33

Draft comment:
The docstring and inline comments in _get_config_fields have been reduced. A brief note on handling Optional[ConfigModel] might be useful for clarity.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and suggests adding a note for clarity. It doesn't provide a specific code suggestion or point out a potential issue with the code. It violates the rule against making purely informative comments.

Workflow ID: wflow_mtYj9VqQPP2VJhE0

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

nina-kollman · 2025-12-31T09:02:39Z

packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py

Verify all the places the config is optional - if it is really true.
As far as I know only model config is optional

ellipsis-dev

Important

Looks good to me! 👍

Reviewed f25cc27 in 1 minute and 32 seconds. Click for details.

Reviewed 31 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 3 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:1

Draft comment:
Header filename updated to 'tmpp6g09qxu.json'. Verify that this change is intentional and consistent with configuration.
Reason this comment was not posted:
Confidence changes required: 30% <= threshold 50% None

2. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:45

Draft comment:
Changed 'config' field from optional to required in AgentFlowQualityRequest. Confirm client code and docs reflect this potentially breaking change.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The comment explicitly asks the author to "Confirm" that client code and docs reflect the change. This directly violates the rule: "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended." The comment doesn't suggest a specific code change - it's purely asking for verification. Additionally, since this is generated code (indicated by the header), the author likely didn't manually make this change but rather regenerated the file from a schema. The comment doesn't provide actionable feedback on what to change in the code itself. However, this is a legitimate breaking change that could cause runtime errors for existing clients. Perhaps the comment is valuable as a warning about the impact, even if it's phrased as a request for confirmation. The comment does identify a real issue that could break existing code. While the breaking change is real, the comment doesn't suggest what code change should be made. It only asks for confirmation about external factors (client code and docs). According to the rules, comments should clearly indicate a code change that's required. If the comment suggested reverting the change or adding a migration path, it would be actionable. As written, it's just asking the author to verify things outside the PR. The comment should be deleted because it asks the author to "confirm" something rather than suggesting a specific code change. It violates the rule against asking authors to verify or ensure things. While it identifies a breaking change, it doesn't provide actionable guidance on what to do about it in the code.

3. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py:1

Draft comment:
Header filename updated to 'tmpp6g09qxu.json' in response file. Ensure consistency with request file.
Reason this comment was not posted:
Confidence changes required: 30% <= threshold 50% None

Workflow ID: wflow_2q7lE5BOudvV7AZu

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

fix(MBT): update swaggr auto gen and algin the classes

72ad88e

fx

f0587f1

ellipsis-dev bot reviewed Dec 31, 2025

View reviewed changes

nina-kollman reviewed Dec 31, 2025

View reviewed changes

ellipsis-dev bot reviewed Dec 31, 2025

View reviewed changes

nina-kollman reviewed Dec 31, 2025

View reviewed changes

fx

f25cc27

ellipsis-dev bot reviewed Dec 31, 2025

View reviewed changes

nina-kollman approved these changes Dec 31, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(MBT): update auto gen mbt evals #3530

fix(MBT): update auto gen mbt evals #3530

Uh oh!

galzilber commented Dec 31, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

coderabbitai bot commented Dec 31, 2025 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

nina-kollman Dec 31, 2025

Uh oh!

nina-kollman Dec 31, 2025

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

nina-kollman Dec 31, 2025

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix(MBT): update auto gen mbt evals #3530

Are you sure you want to change the base?

fix(MBT): update auto gen mbt evals #3530

Uh oh!

Conversation

galzilber commented Dec 31, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

nina-kollman Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

nina-kollman Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

nina-kollman Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

galzilber commented Dec 31, 2025 •

edited by ellipsis-dev bot

Loading

coderabbitai bot commented Dec 31, 2025 •

edited

Loading