autopack makes your Hugging Face models easy to run, share, and ship. It quantizes once and exports to multiple runtimes, with sensible defaults and an automatic flow that produces a readable summary. It supports HF, ONNX, and GGUF (llama.cpp) formats and can publish to the Hugging Face Hub in one shot.
About · Requirements · Setup · Building Instructions · Running · Detailed Usage · Q&A
autopack is a CLI that helps you quantize and package Hugging Face models into multiple useful formats in a single pass, with an option to publish artifacts to the Hub.
You have a 120B LLM and want to optimize it so that people (not corporations with clusters of B200s) can use it on their 8GB 2060? All you need to do is run:
autopack sentence-transformers/all-MiniLM-L6-v2

- Fast: generate multiple variants in one command.
- Practical: built on Transformers, bitsandbytes, ONNX, and llama.cpp.
- Portable: CPU- and GPU-friendly artifacts, good defaults.
- Python 3.9+
- PyTorch, Transformers, Hugging Face Hub
- Optional: bitsandbytes (4/8-bit), optimum[onnxruntime] (ONNX), llama.cpp (GGUF tools)
- GGUF export requires a built llama.cpp and `llama-quantize` in PATH.
- Set `HUGGINGFACE_HUB_TOKEN` to publish, or pass `--token`.
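For example, to make the token available for later `autopack publish` runs (the value shown is a placeholder):

export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxxxxx   # or pass --token explicitly on the command line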
pip install autopack-grn

# ONNX export support
pip install 'autopack-grn[onnx]'
# GGUF export helpers (converter deps)
pip install 'autopack-grn[gguf]'
# llama.cpp runtime bindings (llama-cpp-python)
pip install 'autopack-grn[llama]'
# Everything for llama.cpp functionality (GGUF export + runtime)
pip install 'autopack-grn[gguf,llama]'

Note: for GGUF and llama.cpp functionality you also need the llama.cpp tools (`llama-quantize`, `llama-cli`) available on your PATH. You can build the vendored copy and export PATH as shown in Vendored llama.cpp quick build.
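For example, after building the vendored copy, something like this puts the tools on PATH (the relative path assumes you are in the repository root):

export PATH="$PWD/third_party/llama.cpp/build/bin:$PATH"
command -v llama-quantize llama-cli   # both should now resolve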
pip install -e .
# Optional extras while developing
pip install -e '.[onnx]'
pip install -e '.[gguf]'
pip install -e '.[llama]'
pip install -e '.[gguf,llama]'

python -m build

autopack meta-llama/Llama-3-8B --output-format hf

Add ONNX and GGUF:
autopack meta-llama/Llama-3-8B --output-format hf onnx gguf --summary-json --skip-existing

GGUF only (with default presets Q4_K_M, Q5_K_M, Q8_0):
autopack meta-llama/Llama-3-8B --output-format gguf --skip-existing

Publish to Hub:
autopack publish out/llama3-4bit your-username/llama3-4bit --private \
  --commit-message "Add 4-bit quantized weights"

Inspect a model id or local folder and print metadata (config, sizes, quantization hints) with suggestions for next steps. Now includes human-readable sizes, file summary (config/tokenizer presence), weight file counts, and top-5 largest files.
autopack scan <model_id_or_path> \
[--revision <rev>] [--trust-remote-code] [--local-files-only] \
  [--resolve-cache] [--json] [--show-files] [--limit-files 50]

Examples:
# Remote model (lightweight, no weight download). Prints human-readable summary
autopack scan meta-llama/Llama-3-8B
# JSON output suitable for scripting
autopack scan meta-llama/Llama-3-8B --json
# Resolve a local snapshot to list files and sizes
autopack scan meta-llama/Llama-3-8B --resolve-cache --show-files --limit-files 20
# Scan a local folder
autopack scan ./tiny-gpt2

Run common HF quantization variants and optional ONNX/GGUF exports in one go, with a summary table and generated README in the output folder.
autopack [auto] <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--eval-dataset <dataset>[::<config>]] \
[--revision <rev>] [--trust-remote-code] [--device auto|cpu|cuda] \
[--no-bench] [--bench-prompt "..."] [--bench-max-new-tokens 16] \
[--bench-warmup 0] [--bench-runs 1] \
[--hf-variant bnb-4bit|bnb-8bit|int8-dynamic|bf16] \
[--hf-variants bnb-4bit bnb-8bit int8-dynamic bf16] \
  [--plan] [--resume] [--force-step hf:<variant> gguf:<quant> ...]

Key points:
- Default HF variants: bnb-4bit, bnb-8bit, int8-dynamic, bf16
- Add ONNX and/or GGUF via `--output-format`.
- If `-o/--output-dir` is omitted, the output folder defaults to the last path segment of the model id/path (e.g., `user/model` -> `model`).
- Benchmarking is enabled by default in auto; use `--no-bench` to disable.
- If `--eval-dataset` is provided, perplexity is computed for each HF variant (see the example after this list).
- If benchmarking is enabled, autopack measures actual Tokens/s per backend and replaces heuristic speedups with real Tokens/s and speedup vs bf16 in the summary and the generated README.
- For very large models, use `--hf-variant bf16` (single) or `--hf-variants bf16 int8-dynamic` (subset) to reduce loads.
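A sketch combining these options; the dataset/config pair passed to `--eval-dataset` is an assumption, substitute your own:

autopack meta-llama/Llama-3-8B --output-format hf \
  --hf-variants bf16 bnb-4bit \
  --eval-dataset wikitext::wikitext-2-raw-v1 \
  --bench-warmup 1 --bench-runs 3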
Produce specific formats with a chosen quantization strategy.
autopack quantize <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--quantization bnb-4bit|bnb-8bit|int8-dynamic|none] \
[--dtype auto|float16|bfloat16|float32] \
[--device-map auto|cpu] [--prune <0..0.95>] \
[--revision <rev>] [--trust-remote-code] \
  [--plan] [--resume] [--force-step hf gguf:<quant>]

Upload an exported model folder to the Hugging Face Hub.
autopack publish <folder> <user_or_org/repo> \
[--private] [--token $HUGGINGFACE_HUB_TOKEN] \
  [--branch <rev>] [--commit-message "..."] [--no-create]

Run standalone benchmarks on existing models/artifacts.
autopack bench <target> \
--backend hf [onnx] [gguf] \
[--prompt "Hello"] [--max-new-tokens 64] \
[--device auto] [--num-warmup 1] [--num-runs 3] \
  [--trust-remote-code] [--llama-cli /path/to/llama-cli]

Notes:
- For HF, `target` can be a Hub id or local folder. For ONNX, pass the exported folder. For GGUF, pass a `.gguf` file or a folder containing one.
- ONNX benchmarking requires `optimum[onnxruntime]`. GGUF benchmarking requires `llama-cli`.
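For instance, to benchmark a GGUF artifact produced by an earlier run (paths are illustrative):

autopack bench tiny-gpt2/gguf/model-Q4_K_M.gguf --backend gguf \
  --prompt "Hello" --max-new-tokens 32 --num-runs 3 \
  --llama-cli ./third_party/llama.cpp/build/bin/llama-cli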
- `--trust-remote-code`: enable loading custom modeling code from Hub repos
- `--revision`: branch/tag/commit to load
- `--device-map`: set to `cpu` to force CPU; defaults to `auto`
- `--dtype`: compute dtype for non-INT8 layers (applies to HF exports)
- `--prune`: global magnitude pruning ratio across Linear layers (0..0.95)
- `--plan`: prints a dry-run plan estimating downloads, RAM/VRAM, temp disk, and time per step, then exits without doing work.
- `--resume`: resumes from a previous run using `out/.autopack_state.json`, skipping steps already marked completed.
- `--force-step`: force re-run specific steps even under `--resume`. Step IDs:
  - HF variants: `hf:<variant>` (e.g., `hf:bnb-4bit`, `hf:bf16`)
  - ONNX export: `onnx`
  - GGUF export: `gguf:<quant>` (e.g., `gguf:Q4_K_M`)
State is stored in <output_dir>/.autopack_state.json with statuses: pending, running, completed, failed, skipped.
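To check progress mid-run, the state file can be pretty-printed directly (the output directory name is illustrative):

python -m json.tool llama3-out/.autopack_state.json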
- `hf`: Transformers checkpoint with tokenizer and config
- `onnx`: ONNX export using `optimum[onnxruntime]` for CausalLM
- `gguf`: llama.cpp GGUF via `convert_hf_to_gguf.py` and `llama-quantize`
- Converter resolution order:
  1. `--gguf-converter` if provided
  2. `$LLAMA_CPP_CONVERT` env var
  3. Vendored script: `third_party/llama.cpp/convert_hf_to_gguf.py`
  4. `~/llama.cpp/convert_hf_to_gguf.py` or `~/src/llama.cpp/convert_hf_to_gguf.py`
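For example, either of these points autopack at an explicit converter script (paths are illustrative):

export LLAMA_CPP_CONVERT=~/src/llama.cpp/convert_hf_to_gguf.py
autopack meta-llama/Llama-3-8B --output-format gguf
# or, equivalently, per invocation:
autopack meta-llama/Llama-3-8B --output-format gguf --gguf-converter ~/src/llama.cpp/convert_hf_to_gguf.py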
- Quant presets: uppercase (e.g., `Q4_K_M`). If omitted, autopack generates `Q4_K_M`, `Q5_K_M`, `Q8_0` by default.
- Isolation: by default, conversion runs in an isolated `.venv` inside the output dir. Disable with `--gguf-no-isolation`.
- Architecture checks: pass `--gguf-force` to bypass the basic architecture guard.
- Ensure `llama-quantize` is in `PATH` (typically in `third_party/llama.cpp/build/bin`).
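A sketch using those escape hatches together; the model id is illustrative, and bypassing the guard is only advisable if you know llama.cpp supports the architecture:

autopack your-org/custom-arch-model --output-format gguf \
  --gguf-no-isolation --gguf-force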
- Requires: `pip install 'optimum[onnxruntime]'`
- Uses `ORTModelForCausalLM`; non-CausalLM models may not be supported in this version.
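A rough end-to-end sketch, assuming the ONNX export lands in an onnx/ subfolder of the output directory and includes the tokenizer files (adjust paths to what autopack actually wrote):

pip install 'optimum[onnxruntime]'
autopack sshleifer/tiny-gpt2 --output-format hf onnx
python - <<'PY'
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
# 'tiny-gpt2/onnx' is an assumed layout, not confirmed by the docs above
tok = AutoTokenizer.from_pretrained('tiny-gpt2/onnx')
m = ORTModelForCausalLM.from_pretrained('tiny-gpt2/onnx')
ids = tok('Hello world', return_tensors='pt').input_ids
print(tok.decode(m.generate(ids, max_new_tokens=8)[0]))
PY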
- `--eval-dataset` accepts `dataset` or `dataset:config` (e.g., `wikitext-2-raw-v1`)
- `--eval-text-key` controls which dataset column is used for text (default: `text`)
- Device selection is automatic (`cuda` if available, else `cpu`)
- Only CausalLM architectures are supported for perplexity computation
- Uses a bounded sample count and expects a `text` field in the dataset
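For instance, to evaluate against a dataset whose text lives in a non-default column (dataset id and column name are illustrative):

autopack meta-llama/Llama-3-8B --output-format hf --hf-variant bf16 \
  --eval-dataset my-org/my-corpus --eval-text-key content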
Single-variant run (bf16 only):
autopack meta-llama/Llama-3-8B --output-format hf --hf-variant bf16

Subset of variants:
autopack meta-llama/Llama-3-8B --output-format hf --hf-variants bf16 int8-dynamic

CPU-friendly int8 dynamic with pruning:
autopack quantize meta-llama/Llama-3-8B \
  --output-format hf --quantization int8-dynamic --prune 0.2 --device-map cpu

BF16 only (no quantization):
autopack quantize meta-llama/Llama-3-8B \
  --output-format hf --quantization none --dtype bfloat16

Override GGUF presets:
autopack meta-llama/Llama-3-8B \
  --output-format gguf --gguf-quant Q5_K_M Q8_0

Auto with benchmarking (reports Tokens/s and real speedup vs bf16):
autopack sshleifer/tiny-gpt2 --output-format hf

Planning and resume:
# Dry-run plan: estimates downloads, RAM/VRAM, temp disk, and time per step
autopack meta-llama/Llama-3-8B --output-format hf gguf --plan
# Resume a previous run, skipping completed steps; force rerun a specific step
autopack auto meta-llama/Llama-3-8B -o llama3-out --output-format hf gguf \
  --resume --force-step gguf:Q4_K_M

autopack maintains a simple pipeline state at .autopack_state.json in the output directory, marking steps as pending, running, completed, failed, or skipped. --resume uses this to avoid redoing completed steps; --force-step overrides it for selected steps.
Hello World (Transformers on CPU):
pip install autopack-grn
autopack sshleifer/tiny-gpt2 --output-format hf
python - <<'PY'
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained('tiny-gpt2/bf16')
m = AutoModelForCausalLM.from_pretrained('tiny-gpt2/bf16', device_map='cpu')
ids = tok('Hello world', return_tensors='pt').input_ids
out = m.generate(ids, max_new_tokens=8)
print(tok.decode(out[0]))
PY

Hello World (GGUF with llama.cpp):
autopack sshleifer/tiny-gpt2 --output-format gguf
./third_party/llama.cpp/build/bin/llama-cli -m tiny-gpt2/gguf/model-Q4_K_M.gguf -p "Hello world" -n 16

Vendored llama.cpp quick build:

cd third_party/llama.cpp
cmake -S . -B build -DGGML_NATIVE=ON
cmake --build build -j

- `llama-quantize` not found: build llama.cpp and ensure `build/bin` is in `PATH`.
- BitsAndBytes on Windows: currently not installed by default; prefer CPU/int8-dynamic flows.
- Custom code prompt: pass `--trust-remote-code` to avoid the interactive confirmation.
- `HUGGINGFACE_HUB_TOKEN`: token to publish to the Hub
- `LLAMA_CPP_CONVERT`: path to `convert_hf_to_gguf.py`
- `PATH`: should include the directory with `llama-quantize`
Generates HF variants (4-bit, 8-bit, int8-dynamic, bf16) and prints a summary; GGUF/ONNX are opt-in.
autopack will create multiple useful presets by default (Q4_K_M, Q5_K_M, Q8_0).
For very large models (tens of GBs), prefer a minimal, resumable flow:
- Single variant: use `--hf-variant bf16` (or another) to avoid multiple loads
- Avoid extra runs: add `--no-bench`
- Resume-friendly: keep `--skip-existing`
- CPU-safe: add `--device cpu` to skip GPU-only paths
Examples:
# BF16 only, no benchmarking, resume if partial outputs exist
autopack user/model-giant --output-format hf \
--hf-variant bf16 --no-bench --skip-existing
# CPU-focused subset
autopack user/model-giant --output-format hf \
  --hf-variants bf16 int8-dynamic --device cpu --no-bench --skip-existing

License: Apache-2.0