Bug: broken llama-imatrix arg parser #9365

Closed
legraphista opened this issue Sep 8, 2024 · 0 comments · Fixed by #9366
Labels: bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning but still usable features)

legraphista (Contributor) commented Sep 8, 2024

What happened?

Release b3683 (https://github.com/ggerganov/llama.cpp/releases/tag/b3683, #9308) refactored the arg parser.
This broke the llama-imatrix-specific arguments:

$ ./llama-imatrix -m $MODEL_PATH -f $RAW_TEXT_PATH -o imatrix.data
error: invalid argument: -o
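
For context on how this class of regression can happen: the refactor moves option definitions into a shared registry in which each option declares which examples it applies to. The sketch below is hypothetical (the names and structure are assumptions, not the actual common/arg.cpp code); it only illustrates how leaving one example off one option turns into an "invalid argument" error for that tool alone.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hypothetical example-gated option registry: each option lists the
// tools it is visible to, and a tool rejects any flag that omits it.
enum class Example { MAIN, SERVER, IMATRIX };

struct Opt {
    std::string flag;              // e.g. "-o"
    std::set<Example> visible_to;  // tools that accept this flag
};

static const std::vector<Opt> REGISTRY = {
    // If the migration forgets to list IMATRIX here, "-o" becomes
    // "invalid argument" for llama-imatrix while other tools still work:
    {"-o", {Example::MAIN, Example::SERVER}},
    {"-m", {Example::MAIN, Example::SERVER, Example::IMATRIX}},
};

static bool accepts(Example ex, const std::string &flag) {
    for (const auto &opt : REGISTRY) {
        if (opt.flag == flag) {
            return opt.visible_to.count(ex) > 0;
        }
    }
    return false;  // unknown flag
}

int main() {
    std::cout << accepts(Example::IMATRIX, "-m") << " "    // 1: accepted
              << accepts(Example::IMATRIX, "-o") << "\n";  // 0: rejected
}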

Name and Version

$ ./llama-imatrix --version
version: 3683 (1b9ae518)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

$ ./llama-imatrix -h
----- common options -----



----- example-specific options -----

-h     , --help, --usage                print usage and exit
--version                               show version and build info
-v     , --verbose                      print verbose information
--verbosity N                           set specific verbosity level (default: 1)
-s     , --seed SEED                    RNG seed (default: -1, use random seed for < 0)
-t     , --threads N                    number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb    , --threads-batch N              number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C     , --cpu-mask M                   CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr    , --cpu-range lo-hi              range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb    , --cpu-mask-batch M             CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb   , --cpu-range-batch lo-hi        ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-lcs   , --lookup-cache-static FNAME    path to static lookup cache to use for lookup decoding (not updated by
                                        generation)
-lcd   , --lookup-cache-dynamic FNAME   path to dynamic lookup cache to use for lookup decoding (updated by
                                        generation)
-c     , --ctx-size N                   size of the prompt context (default: 512, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n     , --predict, --n-predict N       number of tokens to predict (default: -1, -1 = infinity, -2 = until
                                        context filled)
                                        (env: LLAMA_ARG_N_PREDICT)
-b     , --batch-size N                 logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub    , --ubatch-size N                physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--chunks N                              max number of chunks to process (default: -1, -1 = all)
-fa    , --flash-attn                   enable Flash Attention (default: disabled)
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p     , --prompt PROMPT                prompt to start generation with
-f     , --file FNAME                   a file containing the prompt (default: none)
--in-file FNAME                         an input file (repeat to specify multiple files)
-bf    , --binary-file FNAME            binary file containing the prompt (default: none)
-e     , --escape                       process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape                             do not process escape sequences
--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default: top_k;tfs_z;typ_p;top_p;min_p;temperature)
--sampling-seq SEQUENCE                 simplified sequence for samplers that will be used (default: kfypmt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--penalize-nl                           penalize newline tokens (default: false)
--temp N                                temperature (default: 0.8)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
--top-p N                               top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.1, 0.0 = disabled)
--tfs N                                 tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
--typical N                             locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dynatemp-range N                      dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.0)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if
                                        used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.0)
-l     , --logit-bias TOKEN_ID(+/-)BIAS
                                        modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir) (default: '')
--grammar-file FNAME                    file to read grammar from
-j     , --json-schema SCHEMA           JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
                                        interpolation)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: 1.0)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: 32.0)
-gan   , --grp-attn-n N                 group-attention factor (default: 1)
-gaw   , --grp-attn-w N                 group-attention width (default: 512.0)
-dkvc  , --dump-kv-cache                verbose print of the KV cache
-nkvo  , --no-kv-offload                disable KV offload
-ctk   , --cache-type-k TYPE            KV cache data type for K (default: f16)
-ctv   , --cache-type-v TYPE            KV cache data type for V (default: f16)
-dt    , --defrag-thold N               KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np    , --parallel N                   number of parallel sequences to decode (default: 1)
-ns    , --sequences N                  number of sequences to decode (default: 1)
-cb    , --cont-batching                enable continuous batching (a.k.a dynamic batching) (default: enabled)
                                        (env: LLAMA_ARG_CONT_BATCHING)
-nocb  , --no-cont-batching             disable continuous batching
                                        (env: LLAMA_ARG_NO_CONT_BATCHING)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
--no-mmap                               do not memory-map model (slower load but may reduce pageouts if not
                                        using mlock)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggerganov/llama.cpp/issues/1437
-ngl   , --gpu-layers N                 number of layers to store in VRAM
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm    , --split-mode {none,layer,row}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs
                                        - row: split rows across GPUs
-ts    , --tensor-split N0,N1,N2,...    fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
-mg    , --main-gpu INDEX               the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE            advanced option to override model metadata by key. may be specified
                                        multiple times.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false
--lora FNAME                            path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE               path to LoRA adapter with user defined scaling (can be repeated to use
                                        multiple adapters)
--control-vector FNAME                  add a control vector
                                        note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE     add a control vector with user defined scaling SCALE
                                        note: this argument can be repeated to add multiple scaled control
                                        vectors
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m     , --model FNAME                  model path (default: `models/$filename` with filename from `--hf-file`
                                        or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
                                        (env: LLAMA_ARG_MODEL)
-mu    , --model-url MODEL_URL          model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-hfr   , --hf-repo REPO                 Hugging Face model repository (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hff   , --hf-file FILE                 Hugging Face model file (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hft   , --hf-token TOKEN               Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
-ld    , --logdir LOGDIR                path under which to save YAML logs (no logging if unset)
--log-test                              Log test
--log-disable                           Log disable
--log-enable                            Log enable
--log-new                               Log new
--log-append                            Log append
--log-file FNAME                        Log file

example usage:

    ./llama-imatrix \
       -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] [--verbosity 1] \
       [--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \
       [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...]
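
Note that the example usage above still advertises -o and --process-output (among others) even though neither appears in the option list, which points at the flags being dropped from the parser rather than removed on purpose. A hypothetical regression check for this mismatch could look like the following sketch; the parser_accepts stub is an assumption that stands in for whatever entry point the real parser exposes, and its accepted set mirrors the broken help output above.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Stub standing in for the post-refactor imatrix parser (assumption:
// the real entry point differs); the set mirrors the help output above.
static bool parser_accepts(const std::string &flag) {
    static const std::set<std::string> known = {
        "-m", "-f", "--in-file", "--chunks", "--verbosity",
    };
    return known.count(flag) > 0;
}

int main() {
    // Flags advertised by the example usage string; each should parse.
    const std::vector<std::string> documented = {
        "-m", "-f", "-o", "--process-output", "--no-ppl",
        "--output-frequency", "--save-frequency", "--in-file",
    };
    int failures = 0;
    for (const auto &flag : documented) {
        if (!parser_accepts(flag)) {
            std::cout << "usage advertises " << flag
                      << " but the parser rejects it\n";
            ++failures;
        }
    }
    return failures == 0 ? 0 : 1;
}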
legraphista added the bug-unconfirmed and medium severity labels on Sep 8, 2024