
[Fix][CI] Address generator CI test fails when model stop reason is length #269

Merged
CharlieFRuan merged 6 commits into main from 0908-ci-generator-fix
Sep 9, 2025

Conversation

@CharlieFRuan (Collaborator) commented Sep 9, 2025

Our unit test checks whether, for turns=3, the final conversation generates 3 EOS tokens. This will not be the case when the model's generation stops due to length. To address this, we do the following (a sketch of the final guard follows the list):

  • In the dummy environment, change the observation "turn {i}" to "give me another solution {i}", which might make more sense to the model
  • Increase the max generate length from 1000 to 3000
  • As a final guard, when the stop reason is not "stop", skip the check on the number of EOS token IDs
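A minimal sketch of that guard, assuming a pytest-style test; the names (`check_eos_count`, `output_ids`, `eos_token_id`) are illustrative, not the actual test code:

```python
import warnings

# Hypothetical sketch of the stop-reason guard; names are illustrative.
def check_eos_count(output_ids, stop_reason, eos_token_id, num_turns=3):
    if stop_reason != "stop":
        # Generation was cut off (e.g. stop reason "length"), so the EOS
        # count is not meaningful; warn instead of asserting.
        warnings.warn(f"stop reason was {stop_reason!r}; skipping EOS check")
        return
    eos_count = sum(1 for tok in output_ids if tok == eos_token_id)
    assert eos_count == num_turns, (
        f"expected {num_turns} EOS tokens, got {eos_count}"
    )
```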


@gemini-code-assist (bot) left a comment


Code Review

This pull request aims to fix a flaky CI test by using larger models and more descriptive prompts, and by adding a check for the model's stop reason before asserting on the output. The changes are generally well aligned with the goal. I've identified a minor opportunity for code refactoring to improve maintainability and a potential bug in another test that could lead to similar flakiness. My review includes suggestions to address these points.

Comment on lines 59 to 62
```python
MODEL_TO_GENERATION_PROMPT = {
    "Qwen/Qwen2.5-1.5B-Instruct": "<|im_start|>assistant\n",
    "unsloth/Llama-3.2-1B-Instruct": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "Qwen/Qwen3-0.6B": "<|im_start|>assistant\n",
    "Qwen/Qwen2.5-3B-Instruct": "<|im_start|>assistant\n",
    "unsloth/Llama-3.2-3B-Instruct": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "Qwen/Qwen3-1.7B": "<|im_start|>assistant\n",
}
```
Member

Hmm, this doesn't seem like the best fix. Ideally we keep models as small as possible.

Is this motivated by small models repeating themselves and hitting the length limit?

Intuitively, the agent loop exits based on three criteria:

  1. step limit: max_turns in our case
  2. budget: cost for OpenAI models; in our case it's just max length
  3. model-initiated: the model wants to exit.

It looks like we should just add 2. and this is fixed?
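A minimal sketch of a loop with all three exit criteria, assuming hypothetical `generate` and `env` interfaces rather than the actual SkyRL generator:

```python
# Hypothetical agent loop; `generate` and `env` are illustrative interfaces.
def run_agent_loop(env, generate, max_turns: int, max_length: int) -> str:
    observation = env.reset()
    tokens_used = 0
    for _ in range(max_turns):                 # 1. step limit: max_turns
        response_ids, stop_reason = generate(observation)
        tokens_used += len(response_ids)
        if stop_reason == "length" or tokens_used >= max_length:
            return "length"                    # 2. budget: max generate length
        observation, done = env.step(response_ids)
        if done:
            return "stop"                      # 3. model-initiated exit
    return "max_turns"
```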

Collaborator Author

Yeah, I included solution 2, but in that case not much can be checked, which makes the test a bit lenient.

But since this doesn't seem to happen frequently, we can do it with a warning.

Collaborator Author

I will revert the model size change but keep the max generate length increase.

Member

Got it.

I also feel this test itself is not great, given that we are just checking the total number of EOS tokens and thus need to deal with these edge cases. We should revisit the test later.

@CharlieFRuan merged commit 45b6a98 into main Sep 9, 2025
3 checks passed
@CharlieFRuan deleted the 0908-ci-generator-fix branch September 9, 2025 03:53
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026: …ength (NovaSky-AI#269)