77 changes: 77 additions & 0 deletions dev/llama-factory/.journal
@@ -0,0 +1,77 @@
We need to add support for Llama Factory training.

We have 2 H200s at our disposal.

We need to successfully run a LoRA SFT job with Qwen/Qwen3-30B-A3B-Instruct-2507.

User insists as much as possible be done via pyproject.toml dependencies.

Can create new extra, `llama-factory`.

Anything that cannot be done with `uv sync --extra llama-factory` must be documented.

We will provide a reproducible script to run the training job with 2 H200s.

What if the model, activations, optimizer state, etc. don't fit? It should fit with LoRA, but we can try activation offloading and other tricks if needed.

If the LoRA format is not HF/vLLM compatible, we will need to document how to convert to and from llama-factory format.

We will keep working until it works.

User says to ignore dev/llama-factory/config.yaml, we don't need to update it or follow any of its patterns. We're starting fresh.

We need to keep this journal up-to-date so work can be resumed later.

2025-09-16 — Initial setup

- Added optional extra `llama-factory` in root `pyproject.toml` with dependencies: `llamafactory>=0.9.2`, `deepspeed>=0.15.3`, and `datasets>=2.19.0`. This enables `uv sync --extra llama-factory` to install core training tools.
- Next: create a reproducible 2xH200 LoRA SFT script targeting `Qwen/Qwen3-30B-A3B-Instruct-2507`, plus a ZeRO‑3 config for safety.

Open questions

1) Confirm the exact HF repo id for the model: is it `Qwen/Qwen3-30B-A3B-Instruct-2507`? If gated, ensure `HF_TOKEN` is available.
2) Preferred dataset for SFT? If none provided, we will wire a minimal built‑in dataset alias for smoke tests and leave a placeholder for user dataset paths.
3) Output format: prefer HF‑compatible LoRA adapters. If LLaMA‑Factory saves internal format, document `export`/merge steps to HF PEFT/vLLM.

Plan (next steps)

- Added `scripts/llama_factory/qwen3_30b_2xH200_sft.sh` (executable) to launch with 2 GPUs.
- Added `dev/llama-factory/configs/qwen3_30b_lora.yaml` (LoRA SFT config).
- Added `dev/llama-factory/configs/deepspeed_zero3.json` (ZeRO‑3, CPU offload safety).

Run steps

1) Install deps: `uv sync --extra llama-factory`
2) (If model is gated) export `HF_TOKEN=...`
3) Launch 2xH200 SFT: `scripts/llama_factory/qwen3_30b_2xH200_sft.sh`
4) Artifacts will be in `outputs/llamafactory/qwen3_30b_lora_sft`

Fallbacks

- If OOM: lower `lora_rank`, increase `gradient_accumulation_steps`, or reduce `cutoff_len`.
- If still failing: switch `deepspeed_zero3` JSON to more aggressive offloading, or set `flash_attn: fa2` explicitly.

Next

- Document LoRA export/merge to HF PEFT and vLLM runtime.

2025-09-16 — Progress update

- Created isolated env at `dev/llama-factory/.venv` with `llamafactory==0.9.3` and compatible deps.
- Removed DeepSpeed (nvcc not present) and ran pure Accelerate/DDP.
- Fixed YAML keys to LLaMA-Factory schema; set `template: chatml` for Qwen3; switched to ONLINE dataset (`tatsu-lab/alpaca`) for smoke test; `max_steps: 20` and `save_steps: 10`.
- Resolved model loading by upgrading `transformers` in venv to `4.52.4` to support `qwen3_moe`.
- Installed `hf_transfer` to satisfy HF_HUB fast download env.
- Training launched on 2x H200; model weights are loading and dataset preprocessing completed; training loop initializing (`max_steps` acknowledged). Will monitor until first checkpoint appears under `outputs/llamafactory/qwen3_30b_lora_sft`.

2025-09-16 — Debug + relaunch (single GPU)

- Fixed failing keys by removing `evaluation_strategy` and set `template: qwen3` (confirmed available in LLaMA‑Factory `TEMPLATES`).
- Switched dataset config to list form with `dataset_dir: ONLINE` and `dataset: [tatsu-lab/alpaca]` to satisfy parser.
- Relaunched on 1 GPU with `HF_HUB_ENABLE_HF_TRANSFER=0` to avoid hf_transfer dependency errors.
- Status: process alive, GPU0 ~41.9 GiB allocated, model loaded, tokenizer/dataset preprocessed, trainer initializing. Awaiting first step/loss log and save at `save_steps: 10`.

Next

- Verify first checkpoint save; then run `llamafactory-cli export peft` to `exports/qwen3_30b_lora_peft` and a quick text-gen sanity check.
- If OOM/throughput issues appear, reduce `cutoff_len` and/or increase `grad_accum`.
43 changes: 43 additions & 0 deletions dev/llama-factory/EXPORT.md
@@ -0,0 +1,43 @@
LLaMA‑Factory LoRA Export and Merge

Goal: produce Hugging Face PEFT‑compatible adapters and merged full model weights suitable for vLLM.

1) Export PEFT LoRA adapter (safe)

Use LLaMA‑Factory CLI export to write a PEFT adapter folder from a training output dir:

```bash
llamafactory-cli export peft \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter outputs/llamafactory/qwen3_30b_lora_sft \
--export_dir exports/qwen3_30b_lora_peft
```

This yields a HF‑style adapter directory usable with `peft` and `transformers`.

2) Merge LoRA into base weights (for vLLM)

If you need merged weights for inference engines that prefer full weights:

```bash
llamafactory-cli export merge \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter outputs/llamafactory/qwen3_30b_lora_sft \
--export_dir exports/qwen3_30b_merged
```
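
Depending on the installed LLaMA‑Factory version, export may instead be driven by a YAML config passed as `llamafactory-cli export <config.yaml>`. A minimal sketch, with key names assumed from LLaMA‑Factory's `merge_lora` examples (verify them against your installed version):

```yaml
# merge_lora.yaml — hypothetical file name; key names assumed from LLaMA-Factory's merge_lora examples
model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
adapter_name_or_path: outputs/llamafactory/qwen3_30b_lora_sft
template: qwen3
finetuning_type: lora
export_dir: exports/qwen3_30b_merged
export_size: 5              # max shard size (GB)
export_device: cpu          # merge on CPU to avoid GPU OOM
export_legacy_format: false
```

If your version supports this form, run it with `llamafactory-cli export merge_lora.yaml`; merged weights land in `export_dir`.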

3) Load in vLLM (example)

Point vLLM to the merged model directory:

```bash
python -m vllm.entrypoints.api_server \
--model exports/qwen3_30b_merged \
--tensor-parallel-size 2
```

Notes

- Ensure `HF_TOKEN` is set if the base model is gated.
- For very large models, merging requires substantial CPU RAM and disk space.

146 changes: 146 additions & 0 deletions dev/llama-factory/README.md
@@ -0,0 +1,146 @@
LLaMA‑Factory LoRA SFT for Qwen3‑30B (2×GPU, full precision, reproducible)

This folder contains a reproducible setup to fine‑tune `Qwen/Qwen3-30B-A3B-Instruct-2507` with LoRA using LLaMA‑Factory. It supports 1‑GPU debug runs and 2‑GPU data‑parallel (torchrun) runs without quantization (full‑precision bf16). A small ONLINE dataset is wired for smoke tests.

What you get

- Isolated env under `dev/llama-factory/.venv` with pinned deps (Transformers 4.52.4, PEFT, etc.)
- Training config: `configs/qwen3_30b_lora.yaml` (template=qwen3, bf16, LoRA, ONLINE dataset)
- 2‑GPU FP run verified (both GPUs utilized)
- Artifacts in `outputs/llamafactory/<run_name>` (HF PEFT adapter, tokenizer files, checkpoints)
- Simple inference script snippet (base + adapter)

Prerequisites

- Linux + CUDA GPUs (tested on H200/Hopper). bf16 support recommended.
- Python via `uv` (https://docs.astral.sh/uv/) installed on host.
- Disk: ~40–60 GB HF cache + ~2 GB for adapter/checkpoints per short run.

Setup

```bash
cd dev/llama-factory
uv sync
. .venv/bin/activate
```

Useful env vars (optional):

```bash
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# If hf_transfer is not installed, disable fast transfer:
export HF_HUB_ENABLE_HF_TRANSFER=0
```

Config overview (`configs/qwen3_30b_lora.yaml`)

- `model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507`
- `template: qwen3` (Qwen3 chat template)
- LoRA: rank=8, alpha=32, dropout=0.05, targets: `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `bf16: true`, `optim: adamw_torch`, `learning_rate: 0.0002`
- Dataset: `dataset_dir: ONLINE`, `dataset: [tatsu-lab/alpaca]` (swap with your dataset)
- Output: `output_dir: outputs/llamafactory/qwen3_30b_lora_sft_fp2g` (change per run to avoid auto‑resume)
- Quantization lines are present but commented out; leave them commented for full‑precision training. (A minimal sketch of the full file follows.)
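
For reference, a minimal sketch of what `configs/qwen3_30b_lora.yaml` looks like, assembled from the bullets above — batch size, accumulation, cutoff length, and step counts below are illustrative; the checked‑in file is authoritative:

```yaml
### model
model_name_or_path: Qwen/Qwen3-30B-A3B-Instruct-2507
trust_remote_code: true
template: qwen3

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj

### dataset
dataset_dir: ONLINE
dataset: [tatsu-lab/alpaca]
cutoff_len: 1024                 # illustrative

### training
bf16: true
optim: adamw_torch
learning_rate: 0.0002
per_device_train_batch_size: 1   # illustrative
gradient_accumulation_steps: 8   # illustrative
logging_steps: 5
save_steps: 10

### output
output_dir: outputs/llamafactory/qwen3_30b_lora_sft_fp2g
```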

Run training

- 2 GPUs (recommended):

```bash
cd dev/llama-factory
. .venv/bin/activate
mkdir -p ../../logs   # make sure the log directory exists before tee writes to it
CUDA_VISIBLE_DEVICES=0,1 \
HF_HUB_ENABLE_HF_TRANSFER=${HF_HUB_ENABLE_HF_TRANSFER:-0} \
llamafactory-cli train configs/qwen3_30b_lora.yaml \
2>&1 | tee ../../logs/llf_qwen3_30b_2g_fp16_fresh.log
```

- 1 GPU (debug):

```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/qwen3_30b_lora.yaml
```

Notes

- Auto‑resume: LLaMA‑Factory/Transformers will resume if `output_dir` already contains checkpoints. To force a fresh run, set a new `output_dir` in the YAML.
- GPU utilization: verify two trainer ranks are running and memory is allocated on both GPUs:
- `ps -ef | grep -E "torchrun|llamafactory/launcher.py"`
- `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`

Outputs and artifacts

- Example fresh run dir: `outputs/llamafactory/qwen3_30b_lora_sft_fp2g/`
- `adapter_model.safetensors`, `adapter_config.json` (HF PEFT adapter)
- `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja`
- `checkpoint-*` subfolders (if `save_steps` is set)

Inference sanity check (base + adapter)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-30B-A3B-Instruct-2507"
adapter = "dev/llama-factory/outputs/llamafactory/qwen3_30b_lora_sft_fp2g"

model = AutoModelForCausalLM.from_pretrained(
base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter, is_trainable=False)
model.eval()

tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
prompt = "You are a helpful assistant.\n\nUser: Tell me a haiku about GPUs.\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(next(model.parameters()).device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=64, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

Switching datasets

- To use ONLINE HF datasets, set `dataset_dir: ONLINE` and replace the list under `dataset:` with your dataset name(s).
- For local JSON/JSONL/Parquet, point `dataset_dir` to your data folder and set `dataset:` accordingly (see the sketch below); the LLaMA‑Factory docs describe the expected schema/columns.
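
A sketch of the two variants (names and paths below are placeholders; local datasets additionally need a matching entry in `dataset_info.json` inside `dataset_dir` — see the LLaMA‑Factory docs for the schema):

```yaml
# ONLINE Hugging Face Hub dataset (hub id is a placeholder)
dataset_dir: ONLINE
dataset: [tatsu-lab/alpaca]

# Local JSON/JSONL/Parquet (placeholder paths; register "my_sft_data" in data/dataset_info.json)
# dataset_dir: data
# dataset: [my_sft_data]
```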

Qwen3-235B (8×H200, ZeRO-3 sharded)

- Use `configs/qwen3_235b_lora_zero3.yaml` for LoRA SFT on `Qwen/Qwen3-235B-A22B-Instruct-2507`.
- DeepSpeed Stage-3 config lives at `configs/deepspeed_zero3_235b.json`; model shards across all 8 GPUs instead of replicating.
- Example launch (adjust dataset + logging paths):
```bash
cd dev/llama-factory
. .venv/bin/activate # or your env
  mkdir -p ../../logs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nproc_per_node=8 --standalone --master_port=29500 \
$(pwd)/.venv/bin/llamafactory-cli train configs/qwen3_235b_lora_zero3.yaml \
2>&1 | tee ../../logs/llf_qwen3_235b_8g_zero3.log
```
- The config restricts LoRA targets to attention/router weights to avoid instantiating adapters for every MoE expert (see the sketch after this list).
  Increase `max_steps` and `max_samples` and swap in your dataset before real runs.
- Expect peak GPU memory ~70–80 GiB per H200 for bf16 + ZeRO-3; disable CPU offload or tune JSON buckets if you see stalls.
- Ensure `nvcc` is available (or set `CUDA_HOME` accordingly) so DeepSpeed can load its prebuilt CUDA ops before launch.
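
A minimal sketch of how `configs/qwen3_235b_lora_zero3.yaml` is expected to differ from the 30B config (same LLaMA‑Factory schema; the checked‑in YAML and DeepSpeed JSON are authoritative):

```yaml
model_name_or_path: Qwen/Qwen3-235B-A22B-Instruct-2507
template: qwen3
stage: sft
finetuning_type: lora
# adapters on attention projections only, so MoE experts get no per-expert adapters;
# the router target name depends on the model's module naming
lora_target: q_proj,k_proj,v_proj,o_proj
bf16: true
deepspeed: configs/deepspeed_zero3_235b.json   # Stage-3: shards params/optimizer across the 8 GPUs
output_dir: outputs/llamafactory/qwen3_235b_lora_sft
```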

Quantization (optional)

- The YAML contains commented QLoRA lines (`quantization_method: bnb`, etc.). To enable 4‑bit QLoRA (sketched after this list):
- Uncomment the quantization block.
- Consider using `optim: adamw_8bit` in YAML.
- Keep `learning_rate` explicit decimal (e.g., `0.0002`) to avoid LR parsing issues with some optimizers.
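
A sketch of the uncommented block (key names as quoted above; `quantization_bit` is assumed as the usual companion key — check the commented lines in the YAML itself):

```yaml
# 4-bit QLoRA knobs for configs/qwen3_30b_lora.yaml (uncomment in the checked-in file)
quantization_method: bnb
quantization_bit: 4                # assumption: 4-bit quantization via bitsandbytes
bnb_4bit_compute_dtype: bfloat16   # match the bf16 training dtype
optim: adamw_8bit                  # optional: 8-bit optimizer to save memory
learning_rate: 0.0002              # keep as an explicit decimal
```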

Troubleshooting

- Fast‑transfer error: if you see `HF_HUB_ENABLE_HF_TRANSFER=1 but hf_transfer not available`, either install `hf_transfer` or set `HF_HUB_ENABLE_HF_TRANSFER=0`.
- Unsupported keys: remove `evaluation_strategy` from YAML (not used by this CLI path).
- Wrong LoRA targets: use explicit Qwen3 modules (`q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`).
- Resume unexpectedly: change `output_dir` in YAML for a fresh run.

References

- LLaMA‑Factory docs: https://github.com/hiyouga/LLaMA-Factory
- Model card: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507


94 changes: 94 additions & 0 deletions dev/llama-factory/config.yaml
@@ -0,0 +1,94 @@
# file: skypilot_llamafactory_gptoss20b.yaml
name: sft-llf-gptoss20b

resources:
accelerators: {H200: 1}
# Public CUDA image to avoid NGC auth hurdles
image_id: docker:pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel

envs:
HF_HUB_ENABLE_HF_TRANSFER: "1"
TOKENIZERS_PARALLELISM: "false"
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:128"
# Optional: set HF_TOKEN if you use gated models; not needed for gpt-oss-20b
# HF_TOKEN: "****"

setup: |
set -euxo pipefail
apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install -y git build-essential ninja-build python3-dev

python -m pip install -U pip wheel setuptools

# Core training/runtime libs
python -m pip install \
"transformers>=4.55.0" \
"accelerate>=0.33.0" \
"datasets>=2.19.0" \
"peft>=0.12.0" \
"bitsandbytes>=0.43.1" \
"trl>=0.9.6" \
"xformers>=0.0.27" \
"flash-attn>=2.6.1" \
"tiktoken" \
"vllm>=0.5.5" \
"llamafactory>=0.9.2"

# Minimal workspace
mkdir -p /workspace/configs /workspace/outputs /workspace/exports

# LLaMA‑Factory training config (QLoRA on GPT‑OSS-20B MoE)
cat > /workspace/configs/llf_gptoss20b_qlora.yaml <<'YAML'
### model
model_name_or_path: openai/gpt-oss-20b
trust_remote_code: true
template: gpt
torch_dtype: bfloat16
flash_attn: fa2

### method
stage: sft
finetuning_type: lora
# QLoRA knobs
use_qlora: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16
gradient_checkpointing: true

# Target common linear modules (attention + MLP). LLaMA‑Factory auto‑maps for GPT‑OSS.
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target: all-linear

### data (tiny sanity‑check run)
dataset: alpaca_gpt4_en # Built‑in small instruction dataset alias
cutoff_len: 1024
packing: true
max_samples: 512 # keep tiny for quick validation

### training
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2e-4
lr_scheduler_type: cosine
optim: adamw_8bit
report_to: none
logging_steps: 5

### output
save_steps: 50
save_total_limit: 1
output_dir: saves/gpt-oss-20b/lora/sft
YAML

run: |
set -euxo pipefail
nvidia-smi
echo "Starting LLaMA‑Factory QLoRA SFT on openai/gpt-oss-20b ..."
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train /workspace/configs/llf_gptoss20b_qlora.yaml

echo "Done. LoRA adapter should be at: saves/gpt-oss-20b/lora/sft"
ls -lah saves/gpt-oss-20b/lora/sft || true