## Description of the bug
### Summary
Following the official `math_multiturn.md` documentation results in a 0% completion rate and zero rewards. The documentation is missing a critical parameter: `observation_template: '{observation}'`.
### What Should Happen
Training should achieve ~70% mean score and 100% completion rate (as shown in official training logs from 2025-12-20).
### What Actually Happens
- Completion rate: 0% (all 128 samples incomplete)
- Mean reward: 0.0 (no rewards computed)
- Turns per episode: 11.0 (hits the maximum turn limit)
- Response length: 5248 tokens on average (extremely long; the model rambles)
### Root Cause
The documentation shows:
```yaml
interaction:
  config:
    env_port: <env_port>
    env_host: <client_endpoint>
```
But the working configuration requires:
```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    observation_template: '{observation}'  # ⚠️ MISSING FROM DOCS
```
The `observation_template` parameter is not mentioned anywhere in the documentation (verified by reviewing the complete docs).
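For context on why this matters: the `'{observation}'` placeholder suggests the template is rendered with Python `str.format`-style substitution to turn the environment's observation into the next user turn. Below is a minimal sketch of that presumed behavior; the function name and exact semantics are assumptions, not OpenTinker's actual API.

```python
# Illustrative sketch only: the function name and exact semantics are
# assumptions inferred from the '{observation}' placeholder, not OpenTinker's API.
def render_user_turn(observation: str, observation_template: str) -> str:
    """Render the environment's observation into the next user message."""
    return observation_template.format(observation=observation)

# With the working config, the environment feedback is passed through verbatim:
print(render_user_turn("Wrong answer, please try again.", "{observation}"))
# -> Wrong answer, please try again.
```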
### Impact
This blocks all users following the official guide from successfully training math agents.
### Steps To Reproduce
1. Follow `docs/math_multiturn.md` exactly as written
2. Start the math environment server:
   ```bash
   python opentinker/environment/math/math_tool_server.py --port 5000
   ```
3. Start training with the documented configuration:
   ```bash
   python opentinker/client/math_tool_rl.py \
       tokenizer_path=Qwen/Qwen2.5-3B-Instruct \
       batch_size=16 \
       interaction.config.env_port=5000 \
       interaction.config.env_host=<your_host>
   ```
4. Observe validation results showing:
   ```
   val/completion_rate: 0.0
   val/mean_reward: 0.0
   val/incomplete_samples: 64/64 (100%)
   ```

**Expected:** ~70% mean score, 100% completion rate
**Actual:** 0% completion rate, zero rewards
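As a debugging aid (not part of the documented steps), it can help to inspect the interaction config the client actually generates; the working log below prints its location as `interaction_config_path`. A rough sketch, assuming PyYAML is installed and substituting the path from your own run:

```python
# Sketch: check whether observation_template made it into the generated
# interaction config. The path below is a placeholder; use the
# interaction_config_path printed by your own run.
import yaml

CONFIG_PATH = "/tmp/math_code_interpreter_interaction_config_XXXXXXXX.yaml"  # placeholder

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

for entry in cfg.get("interaction", []):
    params = entry.get("config", {})
    print(entry.get("class_name"), "->", params)
    if "observation_template" not in params:
        print("WARNING: observation_template is missing from this interaction config")
```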
---
## Proposed Fix
Update `docs/math_multiturn.md` to include:
```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    job_id: <job_id>
    max_steps: 5
    observation_template: '{observation}'  # Add this line with explanation
```
Also add a note explaining that this parameter is required for the model to correctly parse environment feedback.
## Additional Information
## Evidence: Documentation vs. Working Configuration
### Metrics Comparison
| Metric | Following Docs | Official Logs | Status |
|--------|---------------|---------------|--------|
| `completion_rate` | 0.0% | 100% | ❌ |
| `mean_reward` | 0.0 | 0.617 | ❌ |
| `num_turns/mean` | 11.0 | 3.375 | ❌ 3x more |
| `response_length/mean` | 5248 | 594 | ❌ 8.8x longer |
| `incomplete_samples` | 100% | 0% | ❌ |
| `mean_game_steps` | 0.0 | 1.19 | ❌ |
---
### Our Training Log (Following Documentation - NOT Working)
<details>
<summary>Click to expand full log</summary>

```
[2026-01-19 22:32:20,211][utils.http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2026-01-19 22:32:20,213][utils.http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Training configuration:
  Algorithm: agent_loop
  Epochs: 5
  Batch size: 16
  Max turns: 5

[2026-01-19 22:32:20,246][utils.http_training_client][INFO] - Training for 5 epochs (2343 steps per epoch, 11715 total steps)
[2026-01-19 22:34:37,354][utils.http_training_client][INFO] - Workers initialized successfully
[2026-01-19 22:34:37,355][utils.http_training_client][INFO] - Running validation before training...
[2026-01-19 22:39:55,982][utils.http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2026-01-19 22:39:55,983][utils.http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.0, 'val/std_score': 0.0, 'val/max_score': 0.0, 'val/min_score': 0.0, 'val_game/total_samples': 64, 'val_game/games_in_step': 0, 'val_game/incomplete_samples': 64, 'val_game/completion_rate': 0.0, 'val_game/mean_final_reward': 0.0, 'val_game/mean_sum_reward': 0.0, 'val_game/mean_sum_reward_all': 0.0, 'val_game/mean_avg_reward': 0.0, 'val_game/mean_game_steps': 0.0, 'val_game/mean_reward': 0.0, 'val_game/total_interactions': 320 }

step:1 - global_seqlen/mean:5245950976.0 - actor/entropy:11.931066513061523 - training_step_reward:0.0 - actor/kl_loss:0.0 - actor/pg_loss:0.0 - critic/score/mean:0.0 - critic/score/max:0.0 - critic/score/min:0.0 - critic/rewards/mean:0.0 - critic/rewards/max:0.0 - critic/rewards/min:0.0 - response_length/mean:5248.328125 - response_length/max:5290.0 - response_length/min:4354.0 - num_turns/min:11.0 - num_turns/max:11.0 - num_turns/mean:11.0 - timing_s/agent_loop/generate_sequences/mean:419.48593199058087 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:0 - game/incomplete_samples:128 - game/completion_rate:0.0 - game/mean_final_reward:0.0 - game/mean_sum_reward:0.0 - game/mean_avg_reward:0.0 - game/mean_game_steps:0.0 - game/mean_reward:0.0 - game/total_interactions:640
```
**Key Issues:**
- ❌ `completion_rate: 0.0` - No tasks completed
- ❌ `incomplete_samples: 128/128` - All samples failed
- ❌ `mean_reward: 0.0` - No rewards computed
- ❌ `num_turns/mean: 11.0` - Hit maximum turn limit
- ❌ `response_length/mean: 5248` - Extremely long responses
- ❌ `tool_calls/mean: 0.0` - No tool usage
- ❌ `mean_game_steps: 0.0` - Environment not progressing
</details>
---
### Official Training Log (With `observation_template` - Working)
<details>
<summary>Click to expand full log</summary>

```
[2025-12-20 22:28:36,413][http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2025-12-20 22:28:36,417][http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Environment config: {'actor_rollout_ref': {'rollout': {'multi_turn': {'interaction_config_path': '/tmp/math_code_interpreter_interaction_config_tkywyksc.yaml', 'interaction_config_content': "interaction:\n- class_name: opentinker.environment.gym_environment_interaction.GymEnvironmentInteraction\n config:\n env_endpoint: http://172.22.224.251/:8088\n job_id: 0d46716b\n max_steps: 5\n observation_template: '{observation}'\n name: math_code_interpreter\n"}}}}

Training configuration:
  Algorithm: agent_loop
  Epochs: 1
  Batch size: 16
  Max turns: 5

[2025-12-20 22:28:36,474][http_training_client][INFO] - Training for 1 epochs (468 steps per epoch, 468 total steps)
[2025-12-20 22:30:39,653][http_training_client][INFO] - Workers initialized successfully
[2025-12-20 22:30:39,654][http_training_client][INFO] - Running validation before training...
[2025-12-20 22:31:11,794][http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2025-12-20 22:31:11,795][http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.7, 'val/std_score': 0.45825756949558394, 'val/max_score': 1.0, 'val/min_score': 0.0, 'val_game/total_samples': 104, 'val_game/games_in_step': 104, 'val_game/incomplete_samples': 0, 'val_game/completion_rate': 1.0, 'val_game/mean_final_reward': 0.7019230769230769, 'val_game/mean_sum_reward': 0.7019230769230769, 'val_game/mean_sum_reward_all': 0.7019230769230769, 'val_game/mean_avg_reward': 0.6538461538461539, 'val_game/mean_game_steps': 1.1442307692307692, 'val_game/mean_reward': 0.7019230769230769, 'val_game/total_interactions': 119 }

step:1 - global_seqlen/mean:664253632.0 - actor/entropy:0.29363444447517395 - training_step_reward:0.6171875 - actor/pg_loss:0.02784726768732071 - critic/score/mean:0.6171875 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.6171875 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - response_length/mean:594.03125 - response_length/max:1944.0 - response_length/min:179.0 - num_turns/min:3.0 - num_turns/max:7.0 - num_turns/mean:3.375 - timing_s/agent_loop/generate_sequences/mean:7.187697933011805 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:128 - game/incomplete_samples:0 - game/completion_rate:1.0 - game/mean_final_reward:0.6171875 - game/mean_sum_reward:0.6171875 - game/mean_avg_reward:0.5807291666666666 - game/mean_game_steps:1.1875 - game/mean_reward:0.6171875 - game/total_interactions:152
```
**Key Success Indicators:**
- ✅ `completion_rate: 1.0` - All tasks completed
- ✅ `incomplete_samples: 0/128` - No failures
- ✅ `mean_reward: 0.617` - Rewards computed correctly
- ✅ `num_turns/mean: 3.375` - Efficient completion
- ✅ `response_length/mean: 594` - Concise responses
- ✅ `mean_game_steps: 1.19` - Environment working
</details>
---
## Configuration Difference
The **only** difference between our broken setup and the working setup is the presence of:
```yaml
observation_template: '{observation}'
```
This parameter is not mentioned anywhere in the official documentation.
## Environment
- **Repository**: https://github.com/open-tinker/OpenTinker (latest main branch, as of 2026-01-20)
- **Model**: Qwen/Qwen2.5-3B-Instruct
- **GPU**: 4x NVIDIA H800 (80GB each)
- **CUDA**: 12.8.1
- **Python**: 3.12
- **Documentation reviewed**: Complete `math_multiturn.md` and `README.md` (verified via browser inspection)