
[BUG]: [Documentation Bug] Missing observation_template parameter in math_multiturn.md causes 0% completion rate #26

@tinycrown

Description of the bug

Summary

Following the official `math_multiturn.md` documentation results in a 0% completion rate and zero rewards. The documentation is missing a critical parameter: `observation_template: '{observation}'`.

What Should Happen

Training should achieve ~70% mean score and 100% completion rate (as shown in official training logs from 2025-12-20).

What Actually Happens

  • Completion rate: 0% (all 128 samples incomplete)
  • Mean reward: 0.0 (no rewards computed)
  • Turns per episode: 11.0 (hits maximum limit)
  • Response length: 5248 tokens (extremely long, model rambling)

Root Cause

The documentation shows:

```yaml
interaction:
  config:
    env_port: <env_port>
    env_host: <client_endpoint>
```

But the working configuration requires:

```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    observation_template: '{observation}'  # ⚠️ MISSING FROM DOCS
```

The `observation_template` parameter is not mentioned anywhere in the documentation (verified by reviewing the complete docs).
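
For context, here is a minimal sketch of how a template like this is typically consumed (generic illustration only; the function name is hypothetical and this is not OpenTinker's actual code). `'{observation}'` is an ordinary `str.format` placeholder, so without it the environment's feedback presumably never makes it into the next turn, which matches the symptoms above (zero game steps, rambling until the turn limit):

```python
# Generic illustration only -- NOT OpenTinker's implementation; names are hypothetical.
# '{observation}' is a plain str.format placeholder: the raw environment feedback
# is substituted in before being appended to the conversation as the next turn.
def render_observation(template: str, observation: str) -> str:
    return template.format(observation=observation)

# With the working setting, feedback passes through verbatim:
print(render_observation("{observation}", "Execution result: 42"))
# -> Execution result: 42
```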

Impact
This blocks all users following the official guide from successfully training math agents.
  

### Steps To Reproduce


1. Follow `docs/math_multiturn.md` exactly as written.
2. Start the math environment server:
   ```bash
   python opentinker/environment/math/math_tool_server.py --port 5000
   ```
3. Start training with the documented configuration:
   ```bash
   python opentinker/client/math_tool_rl.py \
       tokenizer_path=Qwen/Qwen2.5-3B-Instruct \
       batch_size=16 \
       interaction.config.env_port=5000 \
       interaction.config.env_host=<your_host>
   ```
4. Observe the validation results:
   ```
   val/completion_rate: 0.0
   val/mean_reward: 0.0
   val/incomplete_samples: 64/64 (100%)
   ```

Expected: ~70% mean score, 100% completion rate
Actual: 0% completion rate, zero rewards
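
As an aside while reproducing: a quick way to rule out a connectivity problem (as opposed to the missing parameter) is to confirm the environment server is actually listening before launching training. The snippet below is a plain socket probe and makes no assumptions about the server's API; adjust host and port to your setup:

```python
# Connectivity check only: verifies something is listening on the port passed
# to math_tool_server.py. It does not exercise the server's HTTP API.
import socket

host, port = "127.0.0.1", 5000  # adjust to your setup
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"math environment server is listening on {host}:{port}")
except OSError as exc:
    print(f"cannot reach {host}:{port}: {exc}")
```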

---


## Proposed Fix

Update `docs/math_multiturn.md` to include:

```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    job_id: <job_id>
    max_steps: 5
    observation_template: '{observation}'  # Add this line with explanation
```

Also add a note explaining that this parameter is required for the model to correctly parse environment feedback.
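
For that note, it may help to show what the placeholder expands to. The only form verified in this report is the verbatim `'{observation}'`; assuming the value behaves as an ordinary format string, the explanation could be illustrated like this (hypothetical, generic Python, not taken from the repo):

```python
# Hypothetical illustration for the docs, assuming observation_template is an
# ordinary format string; only the plain '{observation}' form is verified above.
feedback = "Execution result: 42"

print("{observation}".format(observation=feedback))                # verbatim pass-through
print("Tool output:\n{observation}".format(observation=feedback))  # hypothetical labelled variant
```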

### Additional Information

## Evidence: Documentation vs. Working Configuration

### Metrics Comparison

| Metric | Following Docs | Official Logs | Status |
|--------|---------------|---------------|--------|
| `completion_rate` | 0.0% | 100% | ❌ |
| `mean_reward` | 0.0 | 0.617 | ❌ |
| `num_turns/mean` | 11.0 | 3.375 | ❌ 3x more |
| `response_length/mean` | 5248 | 594 | ❌ 8.8x longer |
| `incomplete_samples` | 100% | 0% | ❌ |
| `mean_game_steps` | 0.0 | 1.19 | ❌ |

---

### Our Training Log (Following Documentation - NOT Working)

<details>
<summary>Click to expand full log</summary>
[2026-01-19 22:32:20,211][utils.http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2026-01-19 22:32:20,213][utils.http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Training configuration:

Algorithm: agent_loop
Epochs: 5
Batch size: 16
Max turns: 5
[2026-01-19 22:32:20,246][utils.http_training_client][INFO] - Training for 5 epochs (2343 steps per epoch, 11715 total steps)
[2026-01-19 22:34:37,354][utils.http_training_client][INFO] - Workers initialized successfully
[2026-01-19 22:34:37,355][utils.http_training_client][INFO] - Running validation before training...

[2026-01-19 22:39:55,982][utils.http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2026-01-19 22:39:55,983][utils.http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.0, 'val/std_score': 0.0, 'val/max_score': 0.0, 'val/min_score': 0.0, 'val_game/total_samples': 64, 'val_game/games_in_step': 0, 'val_game/incomplete_samples': 64, 'val_game/completion_rate': 0.0, 'val_game/mean_final_reward': 0.0, 'val_game/mean_sum_reward': 0.0, 'val_game/mean_sum_reward_all': 0.0, 'val_game/mean_avg_reward': 0.0, 'val_game/mean_game_steps': 0.0, 'val_game/mean_reward': 0.0, 'val_game/total_interactions': 320 }

step:1 - global_seqlen/mean:5245950976.0 - actor/entropy:11.931066513061523 - training_step_reward:0.0 - actor/kl_loss:0.0 - actor/pg_loss:0.0 - critic/score/mean:0.0 - critic/score/max:0.0 - critic/score/min:0.0 - critic/rewards/mean:0.0 - critic/rewards/max:0.0 - critic/rewards/min:0.0 - response_length/mean:5248.328125 - response_length/max:5290.0 - response_length/min:4354.0 - num_turns/min:11.0 - num_turns/max:11.0 - num_turns/mean:11.0 - timing_s/agent_loop/generate_sequences/mean:419.48593199058087 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:0 - game/incomplete_samples:128 - game/completion_rate:0.0 - game/mean_final_reward:0.0 - game/mean_sum_reward:0.0 - game/mean_avg_reward:0.0 - game/mean_game_steps:0.0 - game/mean_reward:0.0 - game/total_interactions:640


**Key Issues:**
- ❌ `completion_rate: 0.0` - No tasks completed
- ❌ `incomplete_samples: 128/128` - All samples failed
- ❌ `mean_reward: 0.0` - No rewards computed
- ❌ `num_turns/mean: 11.0` - Hit maximum turn limit
- ❌ `response_length/mean: 5248` - Extremely long responses
- ❌ `tool_calls/mean: 0.0` - No tool usage
- ❌ `mean_game_steps: 0.0` - Environment not progressing

</details>

---

### Official Training Log (With `observation_template` - Working)

<details>
<summary>Click to expand full log</summary>
[2025-12-20 22:28:36,413][http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2025-12-20 22:28:36,417][http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Environment config: {'actor_rollout_ref': {'rollout': {'multi_turn': {'interaction_config_path': '/tmp/math_code_interpreter_interaction_config_tkywyksc.yaml', 'interaction_config_content': "interaction:\n- class_name: opentinker.environment.gym_environment_interaction.GymEnvironmentInteraction\n config:\n env_endpoint: http://172.22.224.251/:8088\n job_id: 0d46716b\n max_steps: 5\n observation_template: '{observation}'\n name: math_code_interpreter\n"}}}}

Training configuration:

Algorithm: agent_loop
Epochs: 1
Batch size: 16
Max turns: 5
[2025-12-20 22:28:36,474][http_training_client][INFO] - Training for 1 epochs (468 steps per epoch, 468 total steps)
[2025-12-20 22:30:39,653][http_training_client][INFO] - Workers initialized successfully
[2025-12-20 22:30:39,654][http_training_client][INFO] - Running validation before training...

[2025-12-20 22:31:11,794][http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2025-12-20 22:31:11,795][http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.7, 'val/std_score': 0.45825756949558394, 'val/max_score': 1.0, 'val/min_score': 0.0, 'val_game/total_samples': 104, 'val_game/games_in_step': 104, 'val_game/incomplete_samples': 0, 'val_game/completion_rate': 1.0, 'val_game/mean_final_reward': 0.7019230769230769, 'val_game/mean_sum_reward': 0.7019230769230769, 'val_game/mean_sum_reward_all': 0.7019230769230769, 'val_game/mean_avg_reward': 0.6538461538461539, 'val_game/mean_game_steps': 1.1442307692307692, 'val_game/mean_reward': 0.7019230769230769, 'val_game/total_interactions': 119 }

step:1 - global_seqlen/mean:664253632.0 - actor/entropy:0.29363444447517395 - training_step_reward:0.6171875 - actor/pg_loss:0.02784726768732071 - critic/score/mean:0.6171875 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.6171875 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - response_length/mean:594.03125 - response_length/max:1944.0 - response_length/min:179.0 - num_turns/min:3.0 - num_turns/max:7.0 - num_turns/mean:3.375 - timing_s/agent_loop/generate_sequences/mean:7.187697933011805 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:128 - game/incomplete_samples:0 - game/completion_rate:1.0 - game/mean_final_reward:0.6171875 - game/mean_sum_reward:0.6171875 - game/mean_avg_reward:0.5807291666666666 - game/mean_game_steps:1.1875 - game/mean_reward:0.6171875 - game/total_interactions:152

**Key Success Indicators:**
- ✅ `completion_rate: 1.0` - All tasks completed
- ✅ `incomplete_samples: 0/128` - No failures
- ✅ `mean_reward: 0.617` - Rewards computed correctly
- ✅ `num_turns/mean: 3.375` - Efficient completion
- ✅ `response_length/mean: 594` - Concise responses
- ✅ `mean_game_steps: 1.19` - Environment working

</details>

---

## Configuration Difference

The **only** difference between our broken setup and the working setup is the presence of:

```yaml
observation_template: '{observation}'
```

This parameter is not mentioned anywhere in the official documentation.
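
One way to confirm which interaction config actually reached the server is to inspect the file the client generates and logs as `interaction_config_path` (visible in the "Environment config" line of the working log above). A small sketch, assuming PyYAML is installed and using the temp-file pattern from that log; substitute the path printed by your own run:

```python
# Check whether observation_template made it into the generated interaction config.
# The glob matches the temp-file pattern shown in the working log above; substitute
# the interaction_config_path printed by your own client if it differs.
import glob
import yaml

for path in glob.glob("/tmp/math_code_interpreter_interaction_config_*.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for entry in cfg.get("interaction", []):
        present = "observation_template" in entry.get("config", {})
        print(f"{path}: observation_template present: {present}")
```
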
## Environment

- **Repository**: https://github.com/open-tinker/OpenTinker (latest main branch, as of 2026-01-20)
- **Model**: Qwen/Qwen2.5-3B-Instruct
- **GPU**: 4x NVIDIA H800 (80GB each)
- **CUDA**: 12.8.1
- **Python**: 3.12
- **Documentation reviewed**: Complete `math_multiturn.md` and `README.md` (verified via browser inspection)
