[ci] CI coverage tracking #777

@zhuzilin

As slime grows, we need more thorough CI coverage for an increasing number of backends, deployment patterns, and training features.
This issue serves as a living index of our existing CI tests so that both users and contributors can quickly see:

  • which backends are currently exercised by CI
  • which parallelism / optimizer / algorithmic features are being tested
  • where coverage gaps still exist

Over time this document should evolve into a compact “CI coverage map” of slime.

Megatron Backend CI

All Megatron-backend CI tests currently check at least the following invariants:

  • kl_loss == 0 on step 0
  • ppo_kl == 0 on step 0 of each rollout

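These serve as sanity checks that the initial policy/value setup and the reference-policy wiring are correct: before any parameter update, the policy should match both the reference model and the policy that generated the rollout, so both KL terms should be zero.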
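A minimal sketch of why these two values are expected to be zero, in plain Python. The estimator formulas (a k3-style estimator for `kl_loss` and a mean log-prob difference for `ppo_kl`) and the helper names are assumptions for illustration; the actual assertions in slime's tests may be computed differently.

```python
import math


def mean_k3_kl(logp, logp_ref):
    """Mean k3-style KL estimate, exp(d) - d - 1 with d = logp_ref - logp.

    Hypothetical helper: stands in for whatever produces `kl_loss`.
    """
    total = 0.0
    for lp, lr in zip(logp, logp_ref):
        d = lr - lp
        total += math.exp(d) - d - 1.0
    return total / len(logp)


def mean_ppo_kl(old_logp, new_logp):
    """Mean of old_logp - new_logp, a common approximate-KL diagnostic.

    Hypothetical helper: stands in for whatever produces `ppo_kl`.
    """
    return sum(o - n for o, n in zip(old_logp, new_logp)) / len(old_logp)


def check_step0_invariants(kl_loss, ppo_kl, atol=1e-8):
    """Raise AssertionError if either step-0 invariant is violated."""
    assert abs(kl_loss) <= atol, f"kl_loss should be 0 on step 0, got {kl_loss}"
    assert abs(ppo_kl) <= atol, f"ppo_kl should be 0 on step 0, got {ppo_kl}"


# Before any update, the policy, the reference model, and the rollout policy
# all assign the same log-probs to the sampled tokens.
logp = [-1.2, -0.3, -2.5]
check_step0_invariants(mean_k3_kl(logp, logp), mean_ppo_kl(logp, logp))
```

If the reference model were loaded from the wrong checkpoint, or the rollout engine held stale weights, the log-probs would differ and one of the two assertions would fail.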

tests/test_quick_start_glm4_9B.py

  • tp2, cp2
  • 3 steps, rollout 8 × 8, gbs=32
  • disaggregated
  • algo: GRPO
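The `rollout 8 × 8, gbs=32` shorthand used throughout this list can be read as a small arithmetic check. The sketch below assumes it means 8 prompts per rollout, 8 samples per prompt, and a global batch size of 32 samples per optimizer step; that reading, and the function name, are assumptions rather than something this issue states.

```python
def rollout_plan(num_prompts, samples_per_prompt, global_batch_size):
    """Return (samples per rollout, optimizer steps per rollout).

    Hypothetical helper for interpreting the shorthand; not part of slime.
    """
    total_samples = num_prompts * samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError("rollout size must be a multiple of the global batch size")
    return total_samples, total_samples // global_batch_size


total_samples, steps_per_rollout = rollout_plan(8, 8, 32)
print(total_samples, steps_per_rollout)  # prints "64 2"
```

Under that assumption, each rollout yields 64 samples and two optimizer steps, so the second step of a rollout is the first one where `ppo_kl` may be nonzero.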

tests/test_qwen3_30B_A3B.py

  • tp4, pp, cp, ep, cpu adam
  • 3 steps, rollout 8 × 8, gbs=32
  • colocated
  • algo:
    • GSPO
    • Routing replay
    • TIS

tests/test_moonlight_16B_A3B.py

  • mla, tp4, pp, cp, ep, cpu adam
  • 3 steps, rollout 8 × 8, gbs=32
  • colocated

tests/test_qwen3_4B_ppo.py

  • tp4, pp, cp, ep, cpu adam
  • 3 steps, rollout 8 × 8, gbs=32
  • colocated
  • algo:
    • PPO

tests/test_qwen2.5_0.5B_gsm8k.py

  • no parallelism
  • algo: GRPO

tests/test_qwen2.5_0.5B_gsm8k_async.py

  • no parallelism
  • algo:
    • GSPO
    • true on-policy

FSDP Backend CI

All FSDP-backend CI tests share a common set of checks, such as basic training sanity and a decreasing loss.
More detailed invariants will be documented here once they are standardized across tests.

TODO: enumerate and unify the exact common assertions for FSDP tests.

tests/test_qwen3_4B_fsdp_true_on_policy.py

  • algo: GRPO

tests/test_qwen3_0.6B_fsdp_distributed.py

tests/test_qwen3_0.6B_fsdp_colocated_2xGPU.py


TODO

  • PPO
  • rollout routing replay
  • deterministic training
  • distributed post
  • partial rollout
  • fault tolerance
  • mtp training
