77 changes: 77 additions & 0 deletions dev/llama-factory/.journal
@@ -0,0 +1,77 @@
We need to add support for Llama Factory training.

We have 2 H200s at our disposal.

We need to successfully run a LoRA SFT job with Qwen/Qwen3-30B-A3B-Instruct-2507.

User insists as much as possible be done via pyproject.toml dependencies.

Can create new extra, `llama-factory`.

Anything that cannot be done with `uv sync --extra llama-factory` must be documented.

We will provide a reproducible script to run the training job with 2 H200s.

What if the model, activations, optimizer state, etc. don't fit? It should fit with LoRA, but we can try activation offloading and other tricks if needed.

If the LoRA format is not HF/vLLM compatible, we will need to document how to convert to and from llama-factory format.

We will keep working until it works.

User says to ignore dev/llama-factory/config.yaml, we don't need to update it or follow any of its patterns. We're starting fresh.

We need to keep this journal up-to-date so work can be resumed later.

2025-09-16 — Initial setup

- Added optional extra `llama-factory` in root `pyproject.toml` with dependencies: `llamafactory>=0.9.2`, `deepspeed>=0.15.3`, and `datasets>=2.19.0`. This enables `uv sync --extra llama-factory` to install core training tools.
- Next: create a reproducible 2xH200 LoRA SFT script targeting `Qwen/Qwen3-30B-A3B-Instruct-2507`, plus a ZeRO‑3 config for safety.

Open questions

1) Confirm the exact HF repo id for the model: is it `Qwen/Qwen3-30B-A3B-Instruct-2507`? If gated, ensure `HF_TOKEN` is available.
2) Preferred dataset for SFT? If none provided, we will wire a minimal built‑in dataset alias for smoke tests and leave a placeholder for user dataset paths.
3) Output format: prefer HF‑compatible LoRA adapters. If LLaMA‑Factory saves internal format, document `export`/merge steps to HF PEFT/vLLM.

Plan (next steps)

- Added `scripts/llama_factory/qwen3_30b_2xH200_sft.sh` (executable) to launch with 2 GPUs.
- Added `dev/llama-factory/configs/qwen3_30b_lora.yaml` (LoRA SFT config).
- Added `dev/llama-factory/configs/deepspeed_zero3.json` (ZeRO‑3, CPU offload safety).

Run steps

1) Install deps: `uv sync --extra llama-factory`
2) (If model is gated) export `HF_TOKEN=...`
3) Launch 2xH200 SFT: `scripts/llama_factory/qwen3_30b_2xH200_sft.sh`
4) Artifacts will be in `outputs/llamafactory/qwen3_30b_lora_sft`

Fallbacks

- If OOM: lower `lora_rank`, increase `gradient_accumulation_steps`, or reduce `cutoff_len`.
- If still failing: switch `deepspeed_zero3` JSON to more aggressive offloading, or set `flash_attn: fa2` explicitly.

Next

- Document LoRA export/merge to HF PEFT and vLLM runtime.

2025-09-16 — Progress update

- Created isolated env at `dev/llama-factory/.venv` with `llamafactory==0.9.3` and compatible deps.
- Removed DeepSpeed (nvcc not present) and ran pure Accelerate/DDP.
- Fixed YAML keys to LLaMA-Factory schema; set `template: chatml` for Qwen3; switched to ONLINE dataset (`tatsu-lab/alpaca`) for smoke test; `max_steps: 20` and `save_steps: 10`.
- Resolved model loading by upgrading `transformers` in venv to `4.52.4` to support `qwen3_moe`.
- Installed `hf_transfer` to satisfy HF_HUB fast download env.
- Training launched on 2x H200; model weights are loading and dataset preprocessing completed; training loop initializing (`max_steps` acknowledged). Will monitor until first checkpoint appears under `outputs/llamafactory/qwen3_30b_lora_sft`.

2025-09-16 — Debug + relaunch (single GPU)

- Fixed failing keys by removing `evaluation_strategy` and set `template: qwen3` (confirmed available in LLaMA‑Factory `TEMPLATES`).
- Switched dataset config to list form with `dataset_dir: ONLINE` and `dataset: [tatsu-lab/alpaca]` to satisfy parser.
- Relaunched on 1 GPU with `HF_HUB_ENABLE_HF_TRANSFER=0` to avoid hf_transfer dependency errors.
- Status: process alive, GPU0 ~41.9 GiB allocated, model loaded, tokenizer/dataset preprocessed, trainer initializing. Awaiting first step/loss log and save at `save_steps: 10`.

Next

- Verify first checkpoint save; then run `llamafactory-cli export peft` to `exports/qwen3_30b_lora_peft` and a quick text-gen sanity check.
- If OOM/throughput issues appear, reduce `cutoff_len` and/or increase `grad_accum`.
43 changes: 43 additions & 0 deletions dev/llama-factory/EXPORT.md
@@ -0,0 +1,43 @@
LLaMA‑Factory LoRA Export and Merge

Goal: produce Hugging Face PEFT‑compatible adapters and merged full model weights suitable for vLLM.

1) Export PEFT LoRA adapter (safe)

Use LLaMA‑Factory CLI export to write a PEFT adapter folder from a training output dir:

```bash
llamafactory-cli export peft \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter outputs/llamafactory/qwen3_30b_lora_sft \
--export_dir exports/qwen3_30b_lora_peft
```

This yields a HF‑style adapter directory usable with `peft` and `transformers`.

2) Merge LoRA into base weights (for vLLM)

If you need merged weights for inference engines that prefer full weights:

```bash
llamafactory-cli export merge \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter outputs/llamafactory/qwen3_30b_lora_sft \
--export_dir exports/qwen3_30b_merged
```
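
Depending on the installed LLaMA‑Factory version, export may instead be driven by a YAML config passed as `llamafactory-cli export <config.yaml>`. A minimal sketch, with key names assumed from LLaMA‑Factory's `merge_lora` examples (verify them against your installed version):

```yaml
# merge_lora.yaml — hypothetical file name; key names assumed from LLaMA-Factory's merge_lora examples
model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
adapter_name_or_path: outputs/llamafactory/qwen3_30b_lora_sft
template: qwen3
finetuning_type: lora
export_dir: exports/qwen3_30b_merged
export_size: 5              # max shard size (GB)
export_device: cpu          # merge on CPU to avoid GPU OOM
export_legacy_format: false
```

If your version supports this form, run it with `llamafactory-cli export merge_lora.yaml`; merged weights land in `export_dir`.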

3) Load in vLLM (example)

Point vLLM to the merged model directory:

```bash
python -m vllm.entrypoints.api_server \
--model exports/qwen3_30b_merged \
--tensor-parallel-size 2
```

Notes

- Ensure `HF_TOKEN` is set if the base model is gated.
- For very large models, merging requires substantial CPU RAM and disk space.

146 changes: 146 additions & 0 deletions dev/llama-factory/README.md
@@ -0,0 +1,146 @@
LLaMA‑Factory LoRA SFT for Qwen3‑30B (2×GPU, full precision, reproducible)

This folder contains a reproducible setup to fine‑tune `Qwen/Qwen3-30B-A3B-Instruct-2507` with LoRA using LLaMA‑Factory. It supports 1‑GPU debug runs and 2‑GPU data‑parallel (torchrun) runs without quantization (full‑precision bf16). A small ONLINE dataset is wired for smoke tests.

What you get

- Isolated env under `dev/llama-factory/.venv` with pinned deps (Transformers 4.52.4, PEFT, etc.)
- Training config: `configs/qwen3_30b_lora.yaml` (template=qwen3, bf16, LoRA, ONLINE dataset)
- 2‑GPU FP run verified (both GPUs utilized)
- Artifacts in `outputs/llamafactory/<run_name>` (HF PEFT adapter, tokenizer files, checkpoints)
- Simple inference script snippet (base + adapter)

Prerequisites

- Linux + CUDA GPUs (tested on H200/Hopper). bf16 support recommended.
- Python via `uv` (https://docs.astral.sh/uv/) installed on host.
- Disk: ~40–60 GB HF cache + ~2 GB for adapter/checkpoints per short run.

Setup

```bash
cd dev/llama-factory
uv sync
. .venv/bin/activate
```

Useful env vars (optional):

```bash
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# If hf_transfer is not installed, disable fast transfer:
export HF_HUB_ENABLE_HF_TRANSFER=0
```

Config overview (`configs/qwen3_30b_lora.yaml`)

- `model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507`
- `template: qwen3` (Qwen3 chat template)
- LoRA: rank=8, alpha=32, dropout=0.05, targets: `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `bf16: true`, `optim: adamw_torch`, `learning_rate: 0.0002`
- Dataset: `dataset_dir: ONLINE`, `dataset: [tatsu-lab/alpaca]` (swap with your dataset)
- Output: `output_dir: outputs/llamafactory/qwen3_30b_lora_sft_fp2g` (change per run to avoid auto‑resume)
- Quantization lines are present but commented out; leave them commented for full‑precision training. (A minimal sketch of the full file follows.)
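
For reference, a minimal sketch of what `configs/qwen3_30b_lora.yaml` looks like, assembled from the bullets above — batch size, accumulation, cutoff length, and step counts below are illustrative; the checked‑in file is authoritative:

```yaml
### model
model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
trust_remote_code: true
template: qwen3

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj

### dataset
dataset_dir: ONLINE
dataset: [tatsu-lab/alpaca]
cutoff_len: 1024                 # illustrative

### training
bf16: true
optim: adamw_torch
learning_rate: 0.0002
per_device_train_batch_size: 1   # illustrative
gradient_accumulation_steps: 8   # illustrative
logging_steps: 5
save_steps: 10

### output
output_dir: outputs/llamafactory/qwen3_30b_lora_sft_fp2g
```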

Run training

- 2 GPUs (recommended):

```bash
cd dev/llama-factory
. .venv/bin/activate
mkdir -p ../../logs   # make sure the log directory exists before tee writes to it
CUDA_VISIBLE_DEVICES=0,1 \
HF_HUB_ENABLE_HF_TRANSFER=${HF_HUB_ENABLE_HF_TRANSFER:-0} \
llamafactory-cli train configs/qwen3_30b_lora.yaml \
2>&1 | tee ../../logs/llf_qwen3_30b_2g_fp16_fresh.log
```

- 1 GPU (debug):

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/qwen3_30b_lora.yaml
```

Notes

- Auto‑resume: LLaMA‑Factory/Transformers will resume if `output_dir` already contains checkpoints. To force a fresh run, set a new `output_dir` in the YAML.
- GPU utilization: verify two trainer ranks are running and memory is allocated on both GPUs:
- `ps -ef | grep -E "torchrun|llamafactory/launcher.py"`
- `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`

Outputs and artifacts

- Example fresh run dir: `outputs/llamafactory/qwen3_30b_lora_sft_fp2g/`
- `adapter_model.safetensors`, `adapter_config.json` (HF PEFT adapter)
- `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja`
- `checkpoint-*` subfolders (if `save_steps` is set)

Inference sanity check (base + adapter)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
adapter = "dev/llama-factory/outputs/llamafactory/qwen3_30b_lora_sft_fp2g"

model = AutoModelForCausalLM.from_pretrained(
base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter, is_trainable=False)
model.eval()

tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
prompt = "You are a helpful assistant.\n\nUser: Tell me a haiku about GPUs.\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=64, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

Switching datasets

- To use ONLINE HF datasets, set `dataset_dir: ONLINE` and replace the list under `dataset:` with your dataset name(s).
- For local JSON/JSONL/Parquet, point `dataset_dir` to your data folder and set `dataset:` accordingly (see the sketch below); the LLaMA‑Factory docs describe the expected schema/columns.
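
A sketch of the two variants (names and paths below are placeholders; local datasets additionally need a matching entry in `dataset_info.json` inside `dataset_dir` — see the LLaMA‑Factory docs for the schema):

```yaml
# ONLINE Hugging Face Hub dataset (hub id is a placeholder)
dataset_dir: ONLINE
dataset: [tatsu-lab/alpaca]

# Local JSON/JSONL/Parquet (placeholder paths; register "my_sft_data" in data/dataset_info.json)
# dataset_dir: data
# dataset: [my_sft_data]
```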

Qwen3-235B (8×H200, ZeRO-3 sharded)

- Use `configs/qwen3_235b_lora_zero3.yaml` for LoRA SFT on `Qwen/Qwen3-235B-A22B-Instruct-2507`.
- DeepSpeed Stage-3 config lives at `configs/deepspeed_zero3_235b.json`; model shards across all 8 GPUs instead of replicating.
- Example launch (adjust dataset + logging paths):
```bash
cd dev/llama-factory
. .venv/bin/activate # or your env
  mkdir -p ../../logs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nproc_per_node=8 --standalone --master_port=29500 \
$(pwd)/.venv/bin/llamafactory-cli train configs/qwen3_235b_lora_zero3.yaml \
2>&1 | tee ../../logs/llf_qwen3_235b_8g_zero3.log
```
- The config restricts LoRA targets to attention/router weights to avoid instantiating adapters for every MoE expert (see the sketch after this list).
  Increase `max_steps` and `max_samples` and swap in your dataset before real runs.
- Expect peak GPU memory ~70–80 GiB per H200 for bf16 + ZeRO-3; disable CPU offload or tune JSON buckets if you see stalls.
- Ensure `nvcc` is available (or set `CUDA_HOME` accordingly) so DeepSpeed can load its prebuilt CUDA ops before launch.
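
A minimal sketch of how `configs/qwen3_235b_lora_zero3.yaml` is expected to differ from the 30B config (same LLaMA‑Factory schema; the checked‑in YAML and DeepSpeed JSON are authoritative):

```yaml
model_name_or_path: Qwen/Qwen3-235B-A22B-Instruct-2507
template: qwen3
stage: sft
finetuning_type: lora
# adapters on attention projections only, so MoE experts get no per-expert adapters;
# the router target name depends on the model's module naming
lora_target: q_proj,k_proj,v_proj,o_proj
bf16: true
deepspeed: configs/deepspeed_zero3_235b.json   # Stage-3: shards params/optimizer across the 8 GPUs
output_dir: outputs/llamafactory/qwen3_235b_lora_sft
```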

Quantization (optional)

- The YAML contains commented QLoRA lines (`quantization_method: bnb`, etc.). To enable 4‑bit QLoRA (sketched after this list):
- Uncomment the quantization block.
- Consider using `optim: adamw_8bit` in YAML.
- Keep `learning_rate` explicit decimal (e.g., `0.0002`) to avoid LR parsing issues with some optimizers.
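
A sketch of the uncommented block (key names as quoted above; `quantization_bit` is assumed as the usual companion key — check the commented lines in the YAML itself):

```yaml
# 4-bit QLoRA knobs for configs/qwen3_30b_lora.yaml (uncomment in the checked-in file)
quantization_method: bnb
quantization_bit: 4                # assumption: 4-bit quantization via bitsandbytes
bnb_4bit_compute_dtype: bfloat16   # match the bf16 training dtype
optim: adamw_8bit                  # optional: 8-bit optimizer to save memory
learning_rate: 0.0002              # keep as an explicit decimal
```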

Troubleshooting

- Fast‑transfer error: if you see `HF_HUB_ENABLE_HF_TRANSFER=1 but hf_transfer not available`, either install `hf_transfer` or set `HF_HUB_ENABLE_HF_TRANSFER=0`.
- Unsupported keys: remove `evaluation_strategy` from YAML (not used by this CLI path).
- Wrong LoRA targets: use explicit Qwen3 modules (`q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`).
- Resume unexpectedly: change `output_dir` in YAML for a fresh run.

References

- LLaMA‑Factory docs: https://github.com/hiyouga/LLaMA-Factory
- Model card: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507


94 changes: 94 additions & 0 deletions dev/llama-factory/config.yaml
@@ -0,0 +1,94 @@
# file: skypilot_llamafactory_gptoss20b.yaml
name: sft-llf-gptoss20b

resources:
accelerators: {H200: 1}
# Public CUDA image to avoid NGC auth hurdles
image_id: docker:pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel

envs:
HF_HUB_ENABLE_HF_TRANSFER: "1"
TOKENIZERS_PARALLELISM: "false"
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:128"
# Optional: set HF_TOKEN if you use gated models; not needed for gpt-oss-20b
# HF_TOKEN: "****"

setup: |
set -euxo pipefail
apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install -y git build-essential ninja-build python3-dev

python -m pip install -U pip wheel setuptools

# Core training/runtime libs
python -m pip install \
"transformers>=4.55.0" \
"accelerate>=0.33.0" \
"datasets>=2.19.0" \
"peft>=0.12.0" \
"bitsandbytes>=0.43.1" \
"trl>=0.9.6" \
"xformers>=0.0.27" \
"flash-attn>=2.6.1" \
"tiktoken" \
"vllm>=0.5.5" \
"llamafactory>=0.9.2"

# Minimal workspace
mkdir -p /workspace/configs /workspace/outputs /workspace/exports

# LLaMA‑Factory training config (QLoRA on GPT‑OSS-20B MoE)
cat > /workspace/configs/llf_gptoss20b_qlora.yaml <<'YAML'
### model
model_name_or_path: openai/gpt-oss-20b
trust_remote_code: true
template: gpt
torch_dtype: bfloat16
flash_attn: fa2

### method
stage: sft
finetuning_type: lora
# QLoRA knobs
use_qlora: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16
gradient_checkpointing: true

# Target common linear modules (attention + MLP). LLaMA‑Factory auto‑maps for GPT‑OSS.
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target: all-linear

### data (tiny sanity‑check run)
dataset: alpaca_gpt4_en # Built‑in small instruction dataset alias
cutoff_len: 1024
packing: true
max_samples: 512 # keep tiny for quick validation

### training
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2e-4
lr_scheduler_type: cosine
optim: adamw_8bit
report_to: none
logging_steps: 5

### output
save_steps: 50
save_total_limit: 1
output_dir: saves/gpt-oss-20b/lora/sft
YAML

run: |
set -euxo pipefail
nvidia-smi
echo "Starting LLaMA‑Factory QLoRA SFT on openai/gpt-oss-20b ..."
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train /workspace/configs/llf_gptoss20b_qlora.yaml

echo "Done. LoRA adapter should be at: saves/gpt-oss-20b/lora/sft"
ls -lah saves/gpt-oss-20b/lora/sft || true