[Fix][CI] Address generator CI test fails when model stop reason is length #269
CharlieFRuan merged 6 commits into main from
Conversation
Code Review
This pull request aims to fix a flaky CI test by using larger models, more descriptive prompts, and adding a check for the model's stop reason before asserting on the output. The changes are generally well-aligned with the goal. I've identified a minor opportunity for code refactoring to improve maintainability and a potential bug in another test that could lead to similar flakiness. My review includes suggestions to address these points.
MODEL_TO_GENERATION_PROMPT = {
    "Qwen/Qwen2.5-1.5B-Instruct": "<|im_start|>assistant\n",
    "unsloth/Llama-3.2-1B-Instruct": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "Qwen/Qwen3-0.6B": "<|im_start|>assistant\n",
    "Qwen/Qwen2.5-3B-Instruct": "<|im_start|>assistant\n",
    "unsloth/Llama-3.2-3B-Instruct": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "Qwen/Qwen3-1.7B": "<|im_start|>assistant\n",
Hmm, this doesn't seem like the best fix. Ideally we keep the models as small as possible.
Is this motivated by small models repeating themselves and hitting the length limit?
Intuitively the agent loop exits based on three criteria:
- step limit: `max_turns` in our case
- budget: cost for OpenAI models; in our case it's just max length
- model-initiated: the model wants to exit.

It looks like we should just add 2. and this is fixed?
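For illustration only, a hypothetical sketch of an agent loop with these three exit criteria; all names (`generate`, `max_turns`, `max_response_length`, the env methods) are made up, not the repo's actual API:

```python
def run_agent_loop(env, generate, max_turns: int, max_response_length: int):
    messages, tokens_used = env.reset(), 0
    for _ in range(max_turns):                          # 1. step limit
        response, n_tokens = generate(messages)
        tokens_used += n_tokens
        messages.append({"role": "assistant", "content": response})
        if tokens_used >= max_response_length:          # 2. budget (max length)
            return messages, "length"
        if env.is_done(response):                       # 3. model-initiated exit
            return messages, "stop"
        messages.append({"role": "user", "content": env.step(response)})
    return messages, "max_turns"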
Yeah, I included solution 2, but in that case not much can be checked, which makes the test a bit lenient.
But since this doesn't seem to happen frequently, we can do it with a warning.
I will revert this model size change, but keep the max generate length increase
Got it.
I also feel this test itself is not great, given that we are just checking the total number of EOS tokens and thus need to deal with these edge cases. We should revisit the test later.
[Fix][CI] Address generator CI test fails when model stop reason is length (NovaSky-AI#269)

Our unit test checks whether, for turns=3, the final conversation generates 3 EOS tokens. This will not be the case when the model's generation stops due to length. To address this, we do the following:
- In the dummy environment, change the observation `"turn {i}"` to `"give me another solution {i}"`, which might make more sense for the model
- Increase the max generate length from 1000 to 3000
- As a final guard, when the stop reason is not `"stop"`, we don't check the number of EOS token IDs