common : refactor arg parser #9308

Merged · 23 commits · Sep 7, 2024

Conversation

ngxson
Collaborator

@ngxson ngxson commented Sep 4, 2024

TL;DR for breaking changes

This PR has only a few small breaking changes to the environment variable system introduced in #9105:

  • To disable continuous batching, add LLAMA_ARG_NO_CONT_BATCHING=1 (instead of LLAMA_ARG_CONT_BATCHING=0)
  • To disable slots endpoint, add LLAMA_ARG_NO_ENDPOINT_SLOTS=1 (instead of LLAMA_ARG_ENDPOINT_SLOTS=0)
  • If both a command line argument and an environment variable are set for the same param, the argument takes precedence over the env var (see the sketch after this list).
    In that case, you will also see a warning, for example: warn: LLAMA_ARG_CTX_SIZE environment variable is set, but will be overwritten by command line argument -c
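
A minimal sketch of that precedence rule, written as a hypothetical C++ helper (this is not the actual llama.cpp implementation; the function name and parameters are made up for illustration):

    // Hypothetical helper: the command line argument wins, and a warning is
    // printed if the corresponding env var is also set.
    #include <cstdio>
    #include <cstdlib>

    static int resolve_ctx_size(bool cli_set, int cli_value, int default_value) {
        const char * env = std::getenv("LLAMA_ARG_CTX_SIZE");
        if (cli_set) {
            if (env != nullptr) {
                std::fprintf(stderr,
                    "warn: LLAMA_ARG_CTX_SIZE environment variable is set, "
                    "but will be overwritten by command line argument -c\n");
            }
            return cli_value;                        // argument takes precedence
        }
        return env ? std::atoi(env) : default_value; // otherwise fall back to the env var
    }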

In this PR

The goals of this PR are:

  • Refactor & make the arg parser code more intuitive, tailored to llama.cpp's usage
  • Ability to auto-generate documentation (markdown content) from code
  • Better support for multiple examples (plus, one arg used by multiple examples or different purposes)
  • Unifying env variable & arguments logic into one place

To generate markdown, run: make llama-export-docs
Output files are named autogen-{EXAMPLE_NAME}.md
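
For reference, here is what one argument looks like in the new format, excerpted from the -c / --ctx-size definition that appears in the diff later in this thread (as originally written in this PR, i.e. before the follow-up suggestion to replace the capturing lambdas with function pointers):

    add_opt(llama_arg(
        {"-c", "--ctx-size"}, "N",
        format("size of the prompt context (default: %d, 0 = loaded from model)", params.n_ctx),
        [&params](int value) {
            params.n_ctx = value;
        }
    ).set_env("LLAMA_ARG_CTX_SIZE"));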

TODO:

  • migrate all args to this new format
  • migrate env variables
  • add some ctests
  • add binary target to export markdown
  • update existing markdown docs

List of removed args (this is not a breaking change, since these args are not handled anywhere in the code base):

--priority
--priority-batch
--priority-draft
-Cbd
--priority-batch-draft

@ngxson
Collaborator Author

ngxson commented Sep 4, 2024

@ggerganov Before proceeding further, I would like to ask for your opinion on this subject. Do you think this is a good way to have code-as-documentation? And if so, do you have any ideas to add? Thank you.

Owner

@ggerganov ggerganov left a comment


Seems ok to me 👍

@github-actions github-actions bot added the server label Sep 5, 2024
@github-actions github-actions bot added the testing label (Everything test related) Sep 5, 2024
@ngxson ngxson marked this pull request as ready for review September 5, 2024 18:23
@ngxson ngxson changed the title from "common : refactor arg parser (WIP)" to "common : refactor arg parser" Sep 5, 2024
@ngxson ngxson added the breaking change label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) Sep 5, 2024
@ngxson
Collaborator Author

ngxson commented Sep 5, 2024

@ggerganov Thank you for the initial review. This PR is now ready.

Here is a quick recap of what I've done:

  • All CLI args & env vars are migrated to the new format
  • tests/test-arg-parser is added to test this new system
  • A llama-export-docs target is added, which exports the list of arguments to a markdown table (instead of the markdown list in my initial demo). Here is an example for the server docs

@ngxson ngxson requested a review from ggerganov September 5, 2024 19:30
@ggerganov
Owner

ggerganov commented Sep 7, 2024

Functionality-wise, this is great. However, the build time of libcommon increases on my machine from ~3s to ~12s:

ccache -C && touch ../common/common.cpp && time make -j common

Should we try to reduce it in some way? I suppose the culprit is in the lambda handlers in gpt_params_parser_init.

@slaren
Collaborator

slaren commented Sep 7, 2024

Yeah, this (compile time) is pretty bad: 18 seconds to compile on a 13900K.

I suspect that the reason is the std::function, but it is not easy to test. Considering that all the handlers capture only params/sparams, these could be passed as arguments and the std::function replaced with function pointers.
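
Roughly, the handler members go from capturing std::function objects to plain function pointers that take params/sparams explicitly, as in this sketch (member names follow the diff below; the exact signatures and struct layout are assumptions):

    #include <functional>
    #include <string>

    struct gpt_params;             // real definitions live in llama.cpp's common code
    struct llama_sampling_params;

    // current: capturing lambdas stored in std::function (one template
    // instantiation plus type-erasure machinery per handler)
    struct handlers_with_std_function {
        std::function<void()>                    handler_void;
        std::function<void(int)>                 handler_int;
        std::function<void(const std::string &)> handler_string;
    };

    // proposed: capture-less lambdas decay to plain function pointers,
    // with params/sparams passed explicitly at the call site
    struct handlers_with_fn_ptr {
        void (*handler_void)  (gpt_params &, llama_sampling_params &);
        void (*handler_int)   (gpt_params &, llama_sampling_params &, int);
        void (*handler_string)(gpt_params &, llama_sampling_params &, const std::string &);
    };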

This reduces the build time substantially, but it is still quite slow (though for me only 2s slower than master):

diff --git a/common/common.cpp b/common/common.cpp
index 3694c127..012dd1ad 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -362,13 +362,13 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params, std::vecto
         if (opt.get_value_from_env(value)) {
             try {
                 if (opt.handler_void && (value == "1" || value == "true")) {
-                    opt.handler_void();
+                    opt.handler_void(params, sparams);
                 }
                 if (opt.handler_int) {
-                    opt.handler_int(std::stoi(value));
+                    opt.handler_int(params, sparams, std::stoi(value));
                 }
                 if (opt.handler_string) {
-                    opt.handler_string(value);
+                    opt.handler_string(params, sparams, value);
                     continue;
                 }
             } catch (std::exception & e) {
@@ -399,7 +399,7 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params, std::vecto
         }
         try {
             if (opt.handler_void) {
-                opt.handler_void();
+                opt.handler_void(params, sparams);
                 continue;
             }
 
@@ -407,11 +407,11 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params, std::vecto
             check_arg(i);
             std::string val = argv[++i];
             if (opt.handler_int) {
-                opt.handler_int(std::stoi(val));
+                opt.handler_int(params, sparams, std::stoi(val));
                 continue;
             }
             if (opt.handler_string) {
-                opt.handler_string(val);
+                opt.handler_string(params, sparams, val);
                 continue;
             }
 
@@ -419,7 +419,7 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params, std::vecto
             check_arg(i);
             std::string val2 = argv[++i];
             if (opt.handler_str_str) {
-                opt.handler_str_str(val, val2);
+                opt.handler_str_str(params, sparams, val, val2);
                 continue;
             }
         } catch (std::exception & e) {
@@ -687,14 +687,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-h", "--help", "--usage"},
         "print usage and exit",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.usage = true;
         }
     ));
     add_opt(llama_arg(
         {"--version"},
         "show version and build info",
-        []() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             fprintf(stderr, "version: %d (%s)\n", LLAMA_BUILD_NUMBER, LLAMA_COMMIT);
             fprintf(stderr, "built with %s for %s\n", LLAMA_COMPILER, LLAMA_BUILD_TARGET);
             exit(0);
@@ -703,42 +703,42 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-v", "--verbose"},
         "print verbose information",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.verbosity = 1;
         }
     ));
     add_opt(llama_arg(
         {"--verbosity"}, "N",
         format("set specific verbosity level (default: %d)", params.verbosity),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.verbosity = value;
         }
     ));
     add_opt(llama_arg(
         {"--verbose-prompt"},
         format("print a verbose prompt before generation (default: %s)", params.verbose_prompt ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.verbose_prompt = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--no-display-prompt"},
         format("don't print prompt at generation (default: %s)", !params.display_prompt ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.display_prompt = false;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"-co", "--color"},
         format("colorise output to distinguish prompt and user input from generations (default: %s)", params.use_color ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.use_color = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"-s", "--seed"}, "SEED",
         format("RNG seed (default: %d, use random seed for < 0)", params.seed),
-        [&sparams, &params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             // TODO: this is temporary, in the future the sampling state will be moved fully to llama_sampling_context.
             params.seed = std::stoul(value);
             sparams.seed = std::stoul(value);
@@ -747,7 +747,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-t", "--threads"}, "N",
         format("number of threads to use during generation (default: %d)", params.cpuparams.n_threads),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.cpuparams.n_threads = value;
             if (params.cpuparams.n_threads <= 0) {
                 params.cpuparams.n_threads = std::thread::hardware_concurrency();
@@ -757,7 +757,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-tb", "--threads-batch"}, "N",
         "number of threads to use during batch and prompt processing (default: same as --threads)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.cpuparams_batch.n_threads = value;
             if (params.cpuparams_batch.n_threads <= 0) {
                 params.cpuparams_batch.n_threads = std::thread::hardware_concurrency();
@@ -767,7 +767,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-td", "--threads-draft"}, "N",
         "number of threads to use during generation (default: same as --threads)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams.n_threads = value;
             if (params.draft_cpuparams.n_threads <= 0) {
                 params.draft_cpuparams.n_threads = std::thread::hardware_concurrency();
@@ -777,7 +777,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-tbd", "--threads-batch-draft"}, "N",
         "number of threads to use during batch and prompt processing (default: same as --threads-draft)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams_batch.n_threads = value;
             if (params.draft_cpuparams_batch.n_threads <= 0) {
                 params.draft_cpuparams_batch.n_threads = std::thread::hardware_concurrency();
@@ -787,7 +787,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-C", "--cpu-mask"}, "M",
         "CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: \"\")",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string mask = value;
             params.cpuparams.mask_valid = true;
             if (!parse_cpu_mask(mask, params.cpuparams.cpumask)) {
@@ -798,7 +798,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-Cr", "--cpu-range"}, "lo-hi",
         "range of CPUs for affinity. Complements --cpu-mask",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string range = value;
             params.cpuparams.mask_valid = true;
             if (!parse_cpu_range(range, params.cpuparams.cpumask)) {
@@ -809,21 +809,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cpu-strict"}, "<0|1>",
         format("use strict CPU placement (default: %u)\n", (unsigned) params.cpuparams.strict_cpu),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.cpuparams.strict_cpu = std::stoul(value);
         }
     ));
     add_opt(llama_arg(
         {"--poll"}, "<0...100>",
         format("use polling level to wait for work (0 - no polling, default: %u)\n", (unsigned) params.cpuparams.poll),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.cpuparams.poll = std::stoul(value);
         }
     ));
     add_opt(llama_arg(
         {"-Cb", "--cpu-mask-batch"}, "M",
         "CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string mask = value;
             params.cpuparams_batch.mask_valid = true;
             if (!parse_cpu_mask(mask, params.cpuparams_batch.cpumask)) {
@@ -834,7 +834,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-Crb", "--cpu-range-batch"}, "lo-hi",
         "ranges of CPUs for affinity. Complements --cpu-mask-batch",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string range = value;
             params.cpuparams_batch.mask_valid = true;
             if (!parse_cpu_range(range, params.cpuparams_batch.cpumask)) {
@@ -845,21 +845,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cpu-strict-batch"}, "<0|1>",
         "use strict CPU placement (default: same as --cpu-strict)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.cpuparams_batch.strict_cpu = value;
         }
     ));
     add_opt(llama_arg(
         {"--poll-batch"}, "<0|1>",
         "use polling to wait for work (default: same as --poll)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.cpuparams_batch.poll = value;
         }
     ));
     add_opt(llama_arg(
         {"-Cd", "--cpu-mask-draft"}, "M",
         "Draft model CPU affinity mask. Complements cpu-range-draft (default: same as --cpu-mask)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string mask = value;
             params.draft_cpuparams.mask_valid = true;
             if (!parse_cpu_mask(mask, params.draft_cpuparams.cpumask)) {
@@ -870,7 +870,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-Crd", "--cpu-range-draft"}, "lo-hi",
         "Ranges of CPUs for affinity. Complements --cpu-mask-draft",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string range = value;
             params.draft_cpuparams.mask_valid = true;
             if (!parse_cpu_range(range, params.draft_cpuparams.cpumask)) {
@@ -881,21 +881,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cpu-strict-draft"}, "<0|1>",
         "Use strict CPU placement for draft model (default: same as --cpu-strict)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams.strict_cpu = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"--poll-draft"}, "<0|1>",
         "Use polling to wait for draft model work (default: same as --poll])",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams.poll = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"-Crbd", "--cpu-range-batch-draft"}, "lo-hi",
         "Ranges of CPUs for affinity. Complements --cpu-mask-draft-batch)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string range = value;
             params.draft_cpuparams_batch.mask_valid = true;
             if (!parse_cpu_range(range, params.draft_cpuparams_batch.cpumask)) {
@@ -906,91 +906,91 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cpu-strict-batch-draft"}, "<0|1>",
         "Use strict CPU placement for draft model (default: --cpu-strict-draft)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams_batch.strict_cpu = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"--poll-batch-draft"}, "<0|1>",
         "Use polling to wait for draft model work (default: --poll-draft)",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.draft_cpuparams_batch.poll = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"--draft"}, "N",
         format("number of tokens to draft for speculative decoding (default: %d)", params.n_draft),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_draft = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"-ps", "--p-split"}, "N",
         format("speculative decoding split probability (default: %.1f)", (double)params.p_split),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.p_split = std::stof(value);
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"-lcs", "--lookup-cache-static"}, "FNAME",
         "path to static lookup cache to use for lookup decoding (not updated by generation)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.lookup_cache_static = value;
         }
     ));
     add_opt(llama_arg(
         {"-lcd", "--lookup-cache-dynamic"}, "FNAME",
         "path to dynamic lookup cache to use for lookup decoding (updated by generation)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.lookup_cache_dynamic = value;
         }
     ));
     add_opt(llama_arg(
         {"-c", "--ctx-size"}, "N",
         format("size of the prompt context (default: %d, 0 = loaded from model)", params.n_ctx),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_ctx = value;
         }
     ).set_env("LLAMA_ARG_CTX_SIZE"));
     add_opt(llama_arg(
         {"-n", "--predict", "--n-predict"}, "N",
         format("number of tokens to predict (default: %d, -1 = infinity, -2 = until context filled)", params.n_predict),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_predict = value;
         }
     ).set_env("LLAMA_ARG_N_PREDICT"));
     add_opt(llama_arg(
         {"-b", "--batch-size"}, "N",
         format("logical maximum batch size (default: %d)", params.n_batch),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_batch = value;
         }
     ).set_env("LLAMA_ARG_BATCH"));
     add_opt(llama_arg(
         {"-ub", "--ubatch-size"}, "N",
         format("physical maximum batch size (default: %d)", params.n_ubatch),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_ubatch = value;
         }
     ).set_env("LLAMA_ARG_UBATCH"));
     add_opt(llama_arg(
         {"--keep"}, "N",
         format("number of tokens to keep from the initial prompt (default: %d, -1 = all)", params.n_keep),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_keep = value;
         }
     ));
     add_opt(llama_arg(
         {"--chunks"}, "N",
         format("max number of chunks to process (default: %d, -1 = all)", params.n_chunks),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_chunks = value;
         }
     ));
     add_opt(llama_arg(
         {"-fa", "--flash-attn"},
         format("enable Flash Attention (default: %s)", params.flash_attn ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.flash_attn = true;
         }
     ).set_env("LLAMA_ARG_FLASH_ATTN"));
@@ -999,14 +999,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         ex == LLAMA_EXAMPLE_MAIN
             ? "prompt to start generation with\nif -cnv is set, this will be used as system prompt"
             : "prompt to start generation with",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.prompt = value;
         }
     ));
     add_opt(llama_arg(
         {"-f", "--file"}, "FNAME",
         "a file containing the prompt (default: none)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1022,7 +1022,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--in-file"}, "FNAME",
         "an input file (repeat to specify multiple files)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1033,7 +1033,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-bf", "--binary-file"}, "FNAME",
         "binary file containing the prompt (default: none)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value, std::ios::binary);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1049,56 +1049,56 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-e", "--escape"},
         format("process escapes sequences (\\n, \\r, \\t, \\', \\\", \\\\) (default: %s)", params.escape ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.escape = true;
         }
     ));
     add_opt(llama_arg(
         {"--no-escape"},
         "do not process escape sequences",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.escape = false;
         }
     ));
     add_opt(llama_arg(
         {"-ptc", "--print-token-count"}, "N",
         format("print token count every N tokens (default: %d)", params.n_print),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_print = value;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--prompt-cache"}, "FNAME",
         "file to cache prompt state for faster startup (default: none)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.path_prompt_cache = value;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--prompt-cache-all"},
         "if specified, saves user input and generations to cache as well\n",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.prompt_cache_all = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--prompt-cache-ro"},
         "if specified, uses the prompt cache but does not update it",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.prompt_cache_ro = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"-r", "--reverse-prompt"}, "PROMPT",
         "halt generation at PROMPT, return control in interactive mode\n",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.antiprompt.emplace_back(value);
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"-sp", "--special"},
         format("special tokens output enabled (default: %s)", params.special ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.special = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
@@ -1111,35 +1111,35 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
             "(default: %s)",
             params.conversation ? "true" : "false"
         ),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.conversation = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"-i", "--interactive"},
         format("run in interactive mode (default: %s)", params.interactive ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.interactive = true;
         }
     ).set_examples({LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"-if", "--interactive-first"},
         format("run in interactive mode and wait for input right away (default: %s)", params.interactive_first ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.interactive_first = true;
         }
     ).set_examples({LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"-mli", "--multiline-input"},
         "allows you to write or paste multiple lines without ending each in '\\'",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.multiline_input = true;
         }
     ).set_examples({LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"--in-prefix-bos"},
         "prefix BOS to user inputs, preceding the `--in-prefix` string",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.input_prefix_bos = true;
             params.enable_chat_template = false;
         }
@@ -1147,7 +1147,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--in-prefix"}, "STRING",
         "string to prefix user inputs with (default: empty)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.input_prefix = value;
             params.enable_chat_template = false;
         }
@@ -1155,7 +1155,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--in-suffix"}, "STRING",
         "string to suffix after user inputs with (default: empty)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.input_suffix = value;
             params.enable_chat_template = false;
         }
@@ -1163,7 +1163,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--no-warmup"},
         "skip warming up the model with an empty run",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.warmup = false;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
@@ -1173,14 +1173,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
             "use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: %s)",
             params.spm_infill ? "enabled" : "disabled"
         ),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.spm_infill = true;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"--samplers"}, "SAMPLERS",
         format("samplers that will be used for generation in the order, separated by \';\'\n(default: %s)", sampler_type_names.c_str()),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             const auto sampler_names = string_split(value, ';');
             sparams.samplers_sequence = llama_sampling_types_from_names(sampler_names, true);
         }
@@ -1188,28 +1188,28 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--sampling-seq"}, "SEQUENCE",
         format("simplified sequence for samplers that will be used (default: %s)", sampler_type_chars.c_str()),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.samplers_sequence = llama_sampling_types_from_chars(value);
         }
     ));
     add_opt(llama_arg(
         {"--ignore-eos"},
         "ignore end of stream token and continue generating (implies --logit-bias EOS-inf)",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.ignore_eos = true;
         }
     ));
     add_opt(llama_arg(
         {"--penalize-nl"},
         format("penalize newline tokens (default: %s)", sparams.penalize_nl ? "true" : "false"),
-        [&sparams]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             sparams.penalize_nl = true;
         }
     ));
     add_opt(llama_arg(
         {"--temp"}, "N",
         format("temperature (default: %.1f)", (double)sparams.temp),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.temp = std::stof(value);
             sparams.temp = std::max(sparams.temp, 0.0f);
         }
@@ -1217,42 +1217,42 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--top-k"}, "N",
         format("top-k sampling (default: %d, 0 = disabled)", sparams.top_k),
-        [&sparams](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             sparams.top_k = value;
         }
     ));
     add_opt(llama_arg(
         {"--top-p"}, "N",
         format("top-p sampling (default: %.1f, 1.0 = disabled)", (double)sparams.top_p),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.top_p = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--min-p"}, "N",
         format("min-p sampling (default: %.1f, 0.0 = disabled)", (double)sparams.min_p),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.min_p = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--tfs"}, "N",
         format("tail free sampling, parameter z (default: %.1f, 1.0 = disabled)", (double)sparams.tfs_z),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.tfs_z = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--typical"}, "N",
         format("locally typical sampling, parameter p (default: %.1f, 1.0 = disabled)", (double)sparams.typical_p),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.typical_p = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--repeat-last-n"}, "N",
         format("last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)", sparams.penalty_last_n),
-        [&sparams](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             sparams.penalty_last_n = value;
             sparams.n_prev = std::max(sparams.n_prev, sparams.penalty_last_n);
         }
@@ -1260,35 +1260,35 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--repeat-penalty"}, "N",
         format("penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)", (double)sparams.penalty_repeat),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.penalty_repeat = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--presence-penalty"}, "N",
         format("repeat alpha presence penalty (default: %.1f, 0.0 = disabled)", (double)sparams.penalty_present),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.penalty_present = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--frequency-penalty"}, "N",
         format("repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)", (double)sparams.penalty_freq),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.penalty_freq = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--dynatemp-range"}, "N",
         format("dynamic temperature range (default: %.1f, 0.0 = disabled)", (double)sparams.dynatemp_range),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.dynatemp_range = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--dynatemp-exp"}, "N",
         format("dynamic temperature exponent (default: %.1f)", (double)sparams.dynatemp_exponent),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.dynatemp_exponent = std::stof(value);
         }
     ));
@@ -1296,21 +1296,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         {"--mirostat"}, "N",
         format("use Mirostat sampling.\nTop K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n"
         "(default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)", sparams.mirostat),
-        [&sparams](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             sparams.mirostat = value;
         }
     ));
     add_opt(llama_arg(
         {"--mirostat-lr"}, "N",
         format("Mirostat learning rate, parameter eta (default: %.1f)", (double)sparams.mirostat_eta),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.mirostat_eta = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--mirostat-ent"}, "N",
         format("Mirostat target entropy, parameter tau (default: %.1f)", (double)sparams.mirostat_tau),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.mirostat_tau = std::stof(value);
         }
     ));
@@ -1319,7 +1319,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         "modifies the likelihood of token appearing in the completion,\n"
         "i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',\n"
         "or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'",
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::stringstream ss(value);
             llama_token key;
             char sign;
@@ -1338,14 +1338,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cfg-negative-prompt"}, "PROMPT",
         format("negative prompt to use for guidance (default: '%s')", sparams.cfg_negative_prompt.c_str()),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.cfg_negative_prompt = value;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--cfg-negative-prompt-file"}, "FNAME",
         "negative prompt file to use for guidance",
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1359,21 +1359,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--cfg-scale"}, "N",
         format("strength of guidance (default: %.1f, 1.0 = disable)", (double)sparams.cfg_scale),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.cfg_scale = std::stof(value);
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(llama_arg(
         {"--grammar"}, "GRAMMAR",
         format("BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '%s')", sparams.grammar.c_str()),
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.grammar = value;
         }
     ));
     add_opt(llama_arg(
         {"--grammar-file"}, "FNAME",
         "file to read grammar from",
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1388,14 +1388,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-j", "--json-schema"}, "SCHEMA",
         "JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object\nFor schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead",
-        [&sparams](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             sparams.grammar = json_schema_to_grammar(json::parse(value));
         }
     ));
     add_opt(llama_arg(
         {"--pooling"}, "{none,mean,cls,last}",
         "pooling type for embeddings, use model default if unspecified",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "none") { params.pooling_type = LLAMA_POOLING_TYPE_NONE; }
             else if (value == "mean") { params.pooling_type = LLAMA_POOLING_TYPE_MEAN; }
             else if (value == "cls") { params.pooling_type = LLAMA_POOLING_TYPE_CLS; }
@@ -1406,7 +1406,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--attention"}, "{causal,non,causal}",
         "attention type for embeddings, use model default if unspecified",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "causal") { params.attention_type = LLAMA_ATTENTION_TYPE_CAUSAL; }
             else if (value == "non-causal") { params.attention_type = LLAMA_ATTENTION_TYPE_NON_CAUSAL; }
             else { throw std::invalid_argument("invalid value"); }
@@ -1415,7 +1415,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--rope-scaling"}, "{none,linear,yarn}",
         "RoPE frequency scaling method, defaults to linear unless specified by the model",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "none") { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_NONE; }
             else if (value == "linear") { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_LINEAR; }
             else if (value == "yarn") { params.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN; }
@@ -1425,91 +1425,91 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--rope-scale"}, "N",
         "RoPE context scaling factor, expands context by a factor of N",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.rope_freq_scale = 1.0f / std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--rope-freq-base"}, "N",
         "RoPE base frequency, used by NTK-aware scaling (default: loaded from model)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.rope_freq_base = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--rope-freq-scale"}, "N",
         "RoPE frequency scaling factor, expands context by a factor of 1/N",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.rope_freq_scale = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--yarn-orig-ctx"}, "N",
         format("YaRN: original context size of model (default: %d = model training context size)", params.yarn_orig_ctx),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.yarn_orig_ctx = value;
         }
     ));
     add_opt(llama_arg(
         {"--yarn-ext-factor"}, "N",
         format("YaRN: extrapolation mix factor (default: %.1f, 0.0 = full interpolation)", (double)params.yarn_ext_factor),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.yarn_ext_factor = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--yarn-attn-factor"}, "N",
         format("YaRN: scale sqrt(t) or attention magnitude (default: %.1f)", (double)params.yarn_attn_factor),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.yarn_attn_factor = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--yarn-beta-slow"}, "N",
         format("YaRN: high correction dim or alpha (default: %.1f)", (double)params.yarn_beta_slow),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.yarn_beta_slow = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"--yarn-beta-fast"}, "N",
         format("YaRN: low correction dim or beta (default: %.1f)", (double)params.yarn_beta_fast),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.yarn_beta_fast = std::stof(value);
         }
     ));
     add_opt(llama_arg(
         {"-gan", "--grp-attn-n"}, "N",
         format("group-attention factor (default: %d)", params.grp_attn_n),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.grp_attn_n = value;
         }
     ));
     add_opt(llama_arg(
         {"-gaw", "--grp-attn-w"}, "N",
         format("group-attention width (default: %.1f)", (double)params.grp_attn_w),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.grp_attn_w = value;
         }
     ));
     add_opt(llama_arg(
         {"-dkvc", "--dump-kv-cache"},
         "verbose print of the KV cache",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.dump_kv_cache = true;
         }
     ));
     add_opt(llama_arg(
         {"-nkvo", "--no-kv-offload"},
         "disable KV offload",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.no_kv_offload = true;
         }
     ));
     add_opt(llama_arg(
         {"-ctk", "--cache-type-k"}, "TYPE",
         format("KV cache data type for K (default: %s)", params.cache_type_k.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             // TODO: get the type right here
             params.cache_type_k = value;
         }
@@ -1517,7 +1517,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ctv", "--cache-type-v"}, "TYPE",
         format("KV cache data type for V (default: %s)", params.cache_type_v.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             // TODO: get the type right here
             params.cache_type_v = value;
         }
@@ -1525,119 +1525,119 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--all-logits"},
         format("return logits for all tokens in the batch (default: %s)", params.logits_all ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.logits_all = true;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--hellaswag"},
         "compute HellaSwag score over random tasks from datafile supplied with -f",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.hellaswag = true;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--hellaswag-tasks"}, "N",
         format("number of tasks to use when computing the HellaSwag score (default: %zu)", params.hellaswag_tasks),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.hellaswag_tasks = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--winogrande"},
         "compute Winogrande score over random tasks from datafile supplied with -f",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.winogrande = true;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--winogrande-tasks"}, "N",
         format("number of tasks to use when computing the Winogrande score (default: %zu)", params.winogrande_tasks),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.winogrande_tasks = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--multiple-choice"},
         "compute multiple choice score over random tasks from datafile supplied with -f",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.multiple_choice = true;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--multiple-choice-tasks"}, "N",
         format("number of tasks to use when computing the multiple choice score (default: %zu)", params.multiple_choice_tasks),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.multiple_choice_tasks = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--kl-divergence"},
         "computes KL-divergence to logits provided via --kl-divergence-base",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.kl_divergence = true;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--ppl-stride"}, "N",
         format("stride for perplexity calculation (default: %d)", params.ppl_stride),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.ppl_stride = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"--ppl-output-type"}, "<0|1>",
         format("output type for perplexity calculation (default: %d)", params.ppl_output_type),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.ppl_output_type = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
     add_opt(llama_arg(
         {"-dt", "--defrag-thold"}, "N",
         format("KV cache defragmentation threshold (default: %.1f, < 0 - disabled)", (double)params.defrag_thold),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.defrag_thold = std::stof(value);
         }
     ).set_env("LLAMA_ARG_DEFRAG_THOLD"));
     add_opt(llama_arg(
         {"-np", "--parallel"}, "N",
         format("number of parallel sequences to decode (default: %d)", params.n_parallel),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_parallel = value;
         }
     ));
     add_opt(llama_arg(
         {"-ns", "--sequences"}, "N",
         format("number of sequences to decode (default: %d)", params.n_sequences),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_sequences = value;
         }
     ));
     add_opt(llama_arg(
         {"-cb", "--cont-batching"},
         format("enable continuous batching (a.k.a dynamic batching) (default: %s)", params.cont_batching ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.cont_batching = true;
         }
     ).set_env("LLAMA_ARG_CONT_BATCHING"));
     add_opt(llama_arg(
         {"-nocb", "--no-cont-batching"},
         "disable continuous batching",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.cont_batching = false;
         }
     ).set_env("LLAMA_ARG_NO_CONT_BATCHING"));
     add_opt(llama_arg(
         {"--mmproj"}, "FILE",
         "path to a multimodal projector file for LLaVA. see examples/llava/README.md",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.mmproj = value;
         }
     ).set_examples({LLAMA_EXAMPLE_LLAVA}));
     add_opt(llama_arg(
         {"--image"}, "FILE",
         "path to an image file. use with multimodal models. Specify multiple times for batching",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.image.emplace_back(value);
         }
     ).set_examples({LLAMA_EXAMPLE_LLAVA}));
@@ -1645,7 +1645,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--rpc"}, "SERVERS",
         "comma separated list of RPC servers",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.rpc_servers = value;
         }
     ));
@@ -1653,14 +1653,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--mlock"},
         "force system to keep model in RAM rather than swapping or compressing",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.use_mlock = true;
         }
     ));
     add_opt(llama_arg(
         {"--no-mmap"},
         "do not memory-map model (slower load but may reduce pageouts if not using mlock)",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.use_mmap = false;
         }
     ));
@@ -1672,7 +1672,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         "- numactl: use the CPU map provided by numactl\n"
         "if run without this previously, it is recommended to drop the system page cache before using this\n"
         "see https://github.com/ggerganov/llama.cpp/issues/1437",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "distribute" || value == "") { params.numa = GGML_NUMA_STRATEGY_DISTRIBUTE; }
             else if (value == "isolate") { params.numa = GGML_NUMA_STRATEGY_ISOLATE; }
             else if (value == "numactl") { params.numa = GGML_NUMA_STRATEGY_NUMACTL; }
@@ -1682,7 +1682,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ngl", "--gpu-layers"}, "N",
         "number of layers to store in VRAM",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_gpu_layers = value;
             if (!llama_supports_gpu_offload()) {
                 fprintf(stderr, "warning: not compiled with GPU offload support, --gpu-layers option will be ignored\n");
@@ -1693,7 +1693,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ngld", "--gpu-layers-draft"}, "N",
         "number of layers to store in VRAM for the draft model",
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_gpu_layers_draft = value;
             if (!llama_supports_gpu_offload()) {
                 fprintf(stderr, "warning: not compiled with GPU offload support, --gpu-layers-draft option will be ignored\n");
@@ -1707,7 +1707,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         "- none: use one GPU only\n"
         "- layer (default): split layers and KV across GPUs\n"
         "- row: split rows across GPUs",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string arg_next = value;
             if (arg_next == "none") {
                 params.split_mode = LLAMA_SPLIT_MODE_NONE;
@@ -1732,7 +1732,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ts", "--tensor-split"}, "N0,N1,N2,...",
         "fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::string arg_next = value;
 
             // split string by , and /
@@ -1759,7 +1759,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-mg", "--main-gpu"}, "INDEX",
         format("the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: %d)", params.main_gpu),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.main_gpu = value;
 #ifndef GGML_USE_CUDA_SYCL_VULKAN
             fprintf(stderr, "warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting the main GPU has no effect.\n");
@@ -1769,7 +1769,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--check-tensors"},
         format("check model tensor data for invalid values (default: %s)", params.check_tensors ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.check_tensors = true;
         }
     ));
@@ -1777,7 +1777,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         {"--override-kv"}, "KEY=TYPE:VALUE",
         "advanced option to override model metadata by key. may be specified multiple times.\n"
         "types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             if (!string_parse_kv_override(value.c_str(), params.kv_overrides)) {
                 throw std::runtime_error(format("error: Invalid type for KV override: %s\n", value.c_str()));
             }
@@ -1786,21 +1786,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--lora"}, "FNAME",
         "path to LoRA adapter (can be repeated to use multiple adapters)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.lora_adapters.push_back({ std::string(value), 1.0 });
         }
     ).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}));
     add_opt(llama_arg(
         {"--lora-scaled"}, "FNAME", "SCALE",
         "path to LoRA adapter with user defined scaling (can be repeated to use multiple adapters)",
-        [&params](std::string fname, std::string scale) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & fname, const std::string & scale) {
             params.lora_adapters.push_back({ fname, std::stof(scale) });
         }
     ).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}));
     add_opt(llama_arg(
         {"--control-vector"}, "FNAME",
         "add a control vector\nnote: this argument can be repeated to add multiple control vectors",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.control_vectors.push_back({ 1.0f, value, });
         }
     ));
@@ -1808,14 +1808,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         {"--control-vector-scaled"}, "FNAME", "SCALE",
         "add a control vector with user defined scaling SCALE\n"
         "note: this argument can be repeated to add multiple scaled control vectors",
-        [&params](std::string fname, std::string scale) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & fname, const std::string & scale) {
             params.control_vectors.push_back({ std::stof(scale), fname });
         }
     ));
     add_opt(llama_arg(
         {"--control-vector-layer-range"}, "START", "END",
         "layer range to apply the control vector(s) to, start and end inclusive",
-        [&params](std::string start, std::string end) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & start, const std::string & end) {
             params.control_vector_layer_start = std::stoi(start);
             params.control_vector_layer_end = std::stoi(end);
         }
@@ -1823,7 +1823,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-a", "--alias"}, "STRING",
         "set alias for model name (to be used by REST API)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.model_alias = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODEL"));
@@ -1835,49 +1835,49 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
                 "model path (default: `models/$filename` with filename from `--hf-file` "
                 "or `--model-url` if set, otherwise %s)", DEFAULT_MODEL_PATH
             ),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.model = value;
         }
     ).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}).set_env("LLAMA_ARG_MODEL"));
     add_opt(llama_arg(
         {"-md", "--model-draft"}, "FNAME",
         "draft model for speculative decoding (default: unused)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.model_draft = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE}));
     add_opt(llama_arg(
         {"-mu", "--model-url"}, "MODEL_URL",
         "model download url (default: unused)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.model_url = value;
         }
     ).set_env("LLAMA_ARG_MODEL_URL"));
     add_opt(llama_arg(
         {"-hfr", "--hf-repo"}, "REPO",
         "Hugging Face model repository (default: unused)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.hf_repo = value;
         }
     ).set_env("LLAMA_ARG_HF_REPO"));
     add_opt(llama_arg(
         {"-hff", "--hf-file"}, "FILE",
         "Hugging Face model file (default: unused)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.hf_file = value;
         }
     ).set_env("LLAMA_ARG_HF_FILE"));
     add_opt(llama_arg(
         {"-hft", "--hf-token"}, "TOKEN",
         "Hugging Face access token (default: value from HF_TOKEN environment variable)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.hf_token = value;
         }
     ).set_env("HF_TOKEN"));
     add_opt(llama_arg(
         {"--context-file"}, "FNAME",
         "file to load context from (repeat to specify multiple files)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value, std::ios::binary);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -1888,28 +1888,28 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--chunk-size"}, "N",
         format("minimum length of embedded text chunks (default: %d)", params.chunk_size),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.chunk_size = value;
         }
     ).set_examples({LLAMA_EXAMPLE_RETRIEVAL}));
     add_opt(llama_arg(
         {"--chunk-separator"}, "STRING",
         format("separator between chunks (default: '%s')", params.chunk_separator.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.chunk_separator = value;
         }
     ).set_examples({LLAMA_EXAMPLE_RETRIEVAL}));
     add_opt(llama_arg(
         {"--junk"}, "N",
         format("number of times to repeat the junk text (default: %d)", params.n_junk),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_junk = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PASSKEY}));
     add_opt(llama_arg(
         {"--pos"}, "N",
         format("position of the passkey in the junk text (default: %d)", params.i_pos),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.i_pos = value;
         }
     ).set_examples({LLAMA_EXAMPLE_PASSKEY}));
@@ -1921,7 +1921,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
                 : ex == LLAMA_EXAMPLE_CVECTOR_GENERATOR
                     ? params.cvector_outfile.c_str()
                     : params.out_file.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.out_file = value;
             params.cvector_outfile = value;
             params.lora_outfile = value;
@@ -1930,49 +1930,49 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ofreq", "--output-frequency"}, "N",
         format("output the imatrix every N iterations (default: %d)", params.n_out_freq),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_out_freq = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(llama_arg(
         {"--save-frequency"}, "N",
         format("save an imatrix copy every N iterations (default: %d)", params.n_save_freq),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_save_freq = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(llama_arg(
         {"--process-output"},
         format("collect data for the output tensor (default: %s)", params.process_output ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.process_output = true;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(llama_arg(
         {"--no-ppl"},
         format("do not compute perplexity (default: %s)", params.compute_ppl ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.compute_ppl = false;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(llama_arg(
         {"--chunk"}, "N",
         format("start processing the input from chunk N (default: %d)", params.i_chunk),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.i_chunk = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(llama_arg(
         {"-pps"},
         format("is the prompt shared across parallel sequences (default: %s)", params.is_pp_shared ? "true" : "false"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.is_pp_shared = true;
         }
     ).set_examples({LLAMA_EXAMPLE_BENCH}));
     add_opt(llama_arg(
         {"-npp"}, "n0,n1,...",
         "number of prompt tokens",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             auto p = string_split<int>(value, ',');
             params.n_pp.insert(params.n_pp.end(), p.begin(), p.end());
         }
@@ -1980,7 +1980,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-ntg"}, "n0,n1,...",
         "number of text generation tokens",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             auto p = string_split<int>(value, ',');
             params.n_tg.insert(params.n_tg.end(), p.begin(), p.end());
         }
@@ -1988,7 +1988,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-npl"}, "n0,n1,...",
         "number of parallel prompts",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             auto p = string_split<int>(value, ',');
             params.n_pl.insert(params.n_pl.end(), p.begin(), p.end());
         }
@@ -1996,63 +1996,63 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--embd-normalize"}, "N",
         format("normalisation for embendings (default: %d) (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)", params.embd_normalize),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.embd_normalize = value;
         }
     ).set_examples({LLAMA_EXAMPLE_EMBEDDING}));
     add_opt(llama_arg(
         {"--embd-output-format"}, "FORMAT",
         "empty = default, \"array\" = [[],[]...], \"json\" = openai style, \"json+\" = same \"json\" + cosine similarity matrix",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.embd_out = value;
         }
     ).set_examples({LLAMA_EXAMPLE_EMBEDDING}));
     add_opt(llama_arg(
         {"--embd-separator"}, "STRING",
         "separator of embendings (default \\n) for example \"<#sep#>\"",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.embd_sep = value;
         }
     ).set_examples({LLAMA_EXAMPLE_EMBEDDING}));
     add_opt(llama_arg(
         {"--host"}, "HOST",
         format("ip address to listen (default: %s)", params.hostname.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.hostname = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_HOST"));
     add_opt(llama_arg(
         {"--port"}, "PORT",
         format("port to listen (default: %d)", params.port),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.port = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_PORT"));
     add_opt(llama_arg(
         {"--path"}, "PATH",
         format("path to serve static files from (default: %s)", params.public_path.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.public_path = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(llama_arg(
         {"--embedding", "--embeddings"},
         format("restrict to only support embedding use case; use only with dedicated embedding models (default: %s)", params.embedding ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.embedding = true;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_EMBEDDINGS"));
     add_opt(llama_arg(
         {"--api-key"}, "KEY",
         "API key to use for authentication (default: none)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.api_keys.push_back(value);
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_API_KEY"));
     add_opt(llama_arg(
         {"--api-key-file"}, "FNAME",
         "path to file containing API keys (default: none)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream key_file(value);
             if (!key_file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -2069,21 +2069,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--ssl-key-file"}, "FNAME",
         "path to file a PEM-encoded SSL private key",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.ssl_file_key = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(llama_arg(
         {"--ssl-cert-file"}, "FNAME",
         "path to file a PEM-encoded SSL certificate",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.ssl_file_cert = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(llama_arg(
         {"--timeout"}, "N",
         format("server read/write timeout in seconds (default: %d)", params.timeout_read),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.timeout_read  = value;
             params.timeout_write = value;
         }
@@ -2091,14 +2091,14 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--threads-http"}, "N",
         format("number of threads used to process HTTP requests (default: %d)", params.n_threads_http),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_threads_http = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_THREADS_HTTP"));
     add_opt(llama_arg(
         {"-spf", "--system-prompt-file"}, "FNAME",
         "set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             std::ifstream file(value);
             if (!file) {
                 throw std::runtime_error(format("error: failed to open file '%s'\n", value.c_str()));
@@ -2115,7 +2115,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--log-format"}, "{text, json}",
         "log output format: json or text (default: json)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             if (value == "json") {
                 params.log_json = true;
             } else if (value == "text") {
@@ -2128,21 +2128,21 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--metrics"},
         format("enable prometheus compatible metrics endpoint (default: %s)", params.endpoint_metrics ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.endpoint_metrics = true;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_METRICS"));
     add_opt(llama_arg(
         {"--no-slots"},
         format("disables slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.endpoint_slots = false;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_ENDPOINT_SLOTS"));
     add_opt(llama_arg(
         {"--slot-save-path"}, "PATH",
         "path to save slot kv cache (default: disabled)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.slot_save_path = value;
             // if doesn't end with DIRECTORY_SEPARATOR, add it
             if (!params.slot_save_path.empty() && params.slot_save_path[params.slot_save_path.size() - 1] != DIRECTORY_SEPARATOR) {
@@ -2155,7 +2155,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
         "set custom jinja chat template (default: template taken from model's metadata)\n"
         "if suffix/prefix are specified, template will be disabled\n"
         "only commonly used templates are accepted:\nhttps://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             if (!llama_chat_verify_template(value)) {
                 throw std::runtime_error(format(
                     "error: the supplied chat template is not supported: %s\n"
@@ -2169,28 +2169,28 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"-sps", "--slot-prompt-similarity"}, "SIMILARITY",
         format("how much the prompt of a request must match the prompt of a slot in order to use that slot (default: %.2f, 0.0 = disabled)\n", params.slot_prompt_similarity),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.slot_prompt_similarity = std::stof(value);
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(llama_arg(
         {"--lora-init-without-apply"},
         format("load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: %s)", params.lora_init_without_apply ? "enabled" : "disabled"),
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.lora_init_without_apply = true;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(llama_arg(
         {"--simple-io"},
         "use basic IO for better compatibility in subprocesses and limited consoles",
-        [&params]() {
+        [](gpt_params & params, llama_sampling_params & sparams) {
             params.simple_io = true;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_INFILL}));
     add_opt(llama_arg(
         {"-ld", "--logdir"}, "LOGDIR",
         "path under which to save YAML logs (no logging if unset)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.logdir = value;
 
             if (params.logdir.back() != DIRECTORY_SEPARATOR) {
@@ -2201,35 +2201,35 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--positive-file"}, "FNAME",
         format("positive prompts file, one prompt per line (default: '%s')", params.cvector_positive_file.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.cvector_positive_file = value;
         }
     ).set_examples({LLAMA_EXAMPLE_CVECTOR_GENERATOR}));
     add_opt(llama_arg(
         {"--negative-file"}, "FNAME",
         format("negative prompts file, one prompt per line (default: '%s')", params.cvector_negative_file.c_str()),
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             params.cvector_negative_file = value;
         }
     ).set_examples({LLAMA_EXAMPLE_CVECTOR_GENERATOR}));
     add_opt(llama_arg(
         {"--pca-batch"}, "N",
         format("batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d)", params.n_pca_batch),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_pca_batch = value;
         }
     ).set_examples({LLAMA_EXAMPLE_CVECTOR_GENERATOR}));
     add_opt(llama_arg(
         {"--pca-iter"}, "N",
         format("number of iterations used for PCA (default: %d)", params.n_pca_iterations),
-        [&params](int value) {
+        [](gpt_params & params, llama_sampling_params & sparams, int value) {
             params.n_pca_iterations = value;
         }
     ).set_examples({LLAMA_EXAMPLE_CVECTOR_GENERATOR}));
     add_opt(llama_arg(
         {"--method"}, "{pca, mean}",
         "dimensionality reduction method to be used (default: pca)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "pca") { params.cvector_dimre_method = DIMRE_METHOD_PCA; }
             else if (value == "mean") { params.cvector_dimre_method = DIMRE_METHOD_MEAN; }
             else { throw std::invalid_argument("invalid value"); }
@@ -2238,7 +2238,7 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--output-format"}, "{md,jsonl}",
         "output format for batched-bench results (default: md)",
-        [&params](std::string value) {
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) {
             /**/ if (value == "jsonl") { params.batched_bench_output_jsonl = true; }
             else if (value == "md") { params.batched_bench_output_jsonl = false; }
             else { std::invalid_argument("invalid value"); }
@@ -2249,32 +2249,32 @@ std::vector<llama_arg> gpt_params_parser_init(gpt_params & params, llama_example
     add_opt(llama_arg(
         {"--log-test"},
         "Log test",
-        []() { log_param_single_parse("--log-test"); }
+        [](gpt_params & params, llama_sampling_params & sparams) { log_param_single_parse("--log-test"); }
     ));
     add_opt(llama_arg(
         {"--log-disable"},
         "Log disable",
-        []() { log_param_single_parse("--log-disable"); }
+        [](gpt_params & params, llama_sampling_params & sparams) { log_param_single_parse("--log-disable"); }
     ));
     add_opt(llama_arg(
         {"--log-enable"},
         "Log enable",
-        []() { log_param_single_parse("--log-enable"); }
+        [](gpt_params & params, llama_sampling_params & sparams) { log_param_single_parse("--log-enable"); }
     ));
     add_opt(llama_arg(
         {"--log-new"},
         "Log new",
-        []() { log_param_single_parse("--log-new"); }
+        [](gpt_params & params, llama_sampling_params & sparams) { log_param_single_parse("--log-new"); }
     ));
     add_opt(llama_arg(
         {"--log-append"},
         "Log append",
-        []() { log_param_single_parse("--log-append"); }
+        [](gpt_params & params, llama_sampling_params & sparams) { log_param_single_parse("--log-append"); }
     ));
     add_opt(llama_arg(
         {"--log-file"}, "FNAME",
         "Log file",
-        [](std::string value) { log_param_pair_parse(false, "--log-file", value); }
+        [](gpt_params & params, llama_sampling_params & sparams, const std::string & value) { log_param_pair_parse(false, "--log-file", value); }
     ));
 #endif // LOG_DISABLE_LOGS
 
diff --git a/common/common.h b/common/common.h
index e8dd040e..60b55340 100644
--- a/common/common.h
+++ b/common/common.h
@@ -310,20 +310,28 @@ struct llama_arg {
     std::string value_hint_2; // for second arg value
     std::string env;
     std::string help;
-    std::function<void(void)>                     handler_void    = nullptr;
-    std::function<void(std::string)>              handler_string  = nullptr;
-    std::function<void(std::string, std::string)> handler_str_str = nullptr;
-    std::function<void(int)>                      handler_int     = nullptr;
+    //std::function<void(void)>                     handler_void    = nullptr;
+    //std::function<void(std::string)>              handler_string  = nullptr;
+    //std::function<void(std::string, std::string)> handler_str_str = nullptr;
+    //std::function<void(int)>                      handler_int     = nullptr;
+    void (*handler_void)   (gpt_params & params, llama_sampling_params & sparams) = nullptr;
+    void (*handler_string) (gpt_params & params, llama_sampling_params & sparams, const std::string &) = nullptr;
+    void (*handler_str_str)(gpt_params & params, llama_sampling_params & sparams, const std::string &, const std::string &) = nullptr;
+    void (*handler_int)    (gpt_params & params, llama_sampling_params & sparams, int) = nullptr;
 
-    llama_arg(std::vector<std::string> args, std::string value_hint, std::string help, std::function<void(std::string)> handler) : args(args), value_hint(value_hint), help(help), handler_string(handler) {}
+    //llama_arg(std::vector<std::string> args, std::string value_hint, std::string help, std::function<void(std::string)> handler) : args(args), value_hint(value_hint), help(help), handler_string(handler) {}
+    llama_arg(const std::vector<std::string> & args, const std::string & value_hint, const std::string & help, void (*handler)(gpt_params & params, llama_sampling_params & sparams, const std::string &)) : args(args), value_hint(value_hint), help(help), handler_string(handler) {}
 
-    llama_arg(std::vector<std::string> args, std::string value_hint, std::string help, std::function<void(int)> handler) : args(args), value_hint(value_hint), help(help), handler_int(handler) {}
+    //llama_arg(std::vector<std::string> args, std::string value_hint, std::string help, std::function<void(int)> handler) : args(args), value_hint(value_hint), help(help), handler_int(handler) {}
+    llama_arg(const std::vector<std::string> & args, const std::string & value_hint, const std::string & help, void (*handler)(gpt_params & params, llama_sampling_params & sparams, int)) : args(args), value_hint(value_hint), help(help), handler_int(handler) {}
 
-    llama_arg(std::vector<std::string> args, std::string help, std::function<void(void)> handler) : args(args), help(help), handler_void(handler) {}
+    //llama_arg(std::vector<std::string> args, std::string help, std::function<void(void)> handler) : args(args), help(help), handler_void(handler) {}
+    llama_arg(const std::vector<std::string> & args, const std::string & help, void (*handler)(gpt_params & params, llama_sampling_params & sparams)) : args(args), help(help), handler_void(handler) {}
 
     // support 2 values for arg
     // note: env variable is not yet support for 2 values
-    llama_arg(std::vector<std::string> args, std::string value_hint, std::string value_hint_2, std::string help, std::function<void(std::string, std::string)> handler) : args(args), value_hint(value_hint), value_hint_2(value_hint_2), help(help), handler_str_str(handler) {}
+    //llama_arg(std::vector<std::string> args, std::string value_hint, std::string value_hint_2, std::string help, std::function<void(std::string, std::string)> handler) : args(args), value_hint(value_hint), value_hint_2(value_hint_2), help(help), handler_str_str(handler) {}
+    llama_arg(const std::vector<std::string> & args, const std::string & value_hint, const std::string & value_hint_2, const std::string & help, void (*handler)(gpt_params & params, llama_sampling_params & sparams, const std::string &, const std::string &)) : args(args), value_hint(value_hint), value_hint_2(value_hint_2), help(help), handler_str_str(handler) {}
 
     llama_arg & set_examples(std::set<enum llama_example> examples) {
         this->examples = std::move(examples);
@@ -340,7 +348,7 @@ struct llama_arg {
         return examples.find(ex) != examples.end();
     }
 
-    bool get_value_from_env(std::string & output) {
+    bool get_value_from_env(std::string & output) const {
         if (env.empty()) return false;
         char * value = std::getenv(env.c_str());
         if (value) {
@@ -350,7 +358,7 @@ struct llama_arg {
         return false;
     }
 
-    bool has_value_from_env() {
+    bool has_value_from_env() const {
         return std::getenv(env.c_str());
     }
 

@ngxson
Copy link
Collaborator Author

ngxson commented Sep 7, 2024

Thanks for testing that. Yes, I can confirm that the build time is now ~9.6s compared to ~5.8s on master (using a MacBook M3 Max).

Applying the patch by @slaren brings it down to 7.8s, which is exactly 2s slower than master.

Testing a bit further, I changed all the std::vector<std::string> & args to std::string & args and it brought the build time back to 5.2s

So at this point I'm a bit doubtful whether I can somehow take advantage of this to reduce the build time without compromising runtime performance. In the worst case, what is an acceptable increase in build time?

@slaren
Copy link
Collaborator

slaren commented Sep 7, 2024

Testing a bit further, I changed all the std::vector<std::string> & args to std::string & args and it brought the build time back to 5.2s

I am a bit confused by this: do you mean the args member in llama_arg? Are there other vectors like this?

@slaren
Copy link
Collaborator

slaren commented Sep 7, 2024

Ok, I see. Replacing the vectors in the llama_arg constructors with std::initializer_list<const char*> should improve the build time a bit without too many changes.
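For reference, a minimal sketch of the kind of constructor shape this suggests, using a simplified stand-in type (arg_sketch is hypothetical, not the actual llama_arg): a brace-enclosed list of string literals binds to the std::initializer_list overload directly, so call sites no longer build a temporary std::vector<std::string>.

#include <initializer_list>
#include <string>
#include <vector>

// simplified stand-in for llama_arg, keeping only the fields needed here
struct arg_sketch {
    std::vector<std::string> args;
    std::string help;

    arg_sketch(std::initializer_list<const char *> args_, const std::string & help_) : help(help_) {
        for (const char * a : args_) {
            args.push_back(a); // each literal is converted to std::string once, here
        }
    }
};

// usage mirrors the add_opt calls in the diff above
static arg_sketch cont_batching_opt() {
    return arg_sketch({"-cb", "--cont-batching"}, "enable continuous batching");
}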

@ngxson
Copy link
Collaborator Author

ngxson commented Sep 7, 2024

I am a bit confused by this: do you mean the args member in llama_arg? Are there other vectors like this?

I mean the args in the constructor of llama_arg.

The way I tested was:

  1. Remove all code, leaving only one of the args (I took "--verbose" in my case)
  2. Repeat the arg 190 times
  3. Try compiling. Even though it's one arg repeated 190 times, it will still take the same time as having 190 different args (just to prove that nothing is being cached)
  4. Now change the first constructor parameter (args) from std::vector to a simple std::string
  5. Re-compile; now it should take much less time

In the end, the constructor becomes (for testing purposes, I save the string to env; the code won't work, but it's just to test the build time):

llama_arg(const std::string args, const std::string & value_hint, const std::string & help, void (*handler)(gpt_params & params, llama_sampling_params & sparams, int)) : env(args), value_hint(value_hint), help(help), handler_int(handler) {}

@ngxson
Copy link
Collaborator Author

ngxson commented Sep 7, 2024

Ok, I see. Replacing the vectors in the llama_arg constructors with std::initializer_list<const char*> should improve the build time a bit without too many changes.

Wow not just a bit, it's now back to 5.2s. Thanks for the hint about std::initializer_list, I didn't know about that 😄

@ngxson
Copy link
Collaborator Author

ngxson commented Sep 7, 2024

Quick question: since sparams is already included inside params, should we get rid of passing sparams as an argument to the handler functions (and replace it with params.sparams)?
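A tiny sketch of the simplification being asked about, using hypothetical stand-in structs (the real gpt_params and llama_sampling_params have many more fields): the handler would take only gpt_params and reach the sampling options through params.sparams.

// hypothetical stand-ins, only to show the handler shape
struct sampling_sketch { int top_k = 40; };
struct params_sketch   { sampling_sketch sparams; };

// handler without the separate sparams argument: the sampling parameters are
// reached through the params object itself
static void set_top_k(params_sketch & params, int value) {
    params.sparams.top_k = value;
}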

Co-authored-by: slaren@users.noreply.github.com
@ngxson
Copy link
Collaborator Author

ngxson commented Sep 7, 2024

Alright, I changed some std::string to const char * and it brought the build time down a little. Keep in mind that the 5.2s result above was with only one arg repeated 190 times.

The build time of the latest commit e625f5f:

$ make clean && time make -j common/common.o
make -j common/common.o  5.70s user 0.16s system 99% cpu 5.889 total

# versus master
# make -j common/common.o  5.48s user 0.16s system 99% cpu 5.674 total

So that's 0.2s slower compared to master. It could be reduced further if std::string help were changed to const char *, but it's currently quite tricky to have the format(...) function return a const char *, so I think we can consider that later.
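For context, a short standalone sketch of why that is tricky, with hypothetical names (format_help stands in for the format(...) calls used above): format builds the help text as a temporary std::string, so a struct that keeps only a const char * would end up pointing at freed memory.

#include <cstdio>
#include <string>

// stand-in for the format(...) calls that build help strings
static std::string format_help(int def) {
    return "size of the prompt context (default: " + std::to_string(def) + ")";
}

struct option_sketch {
    const char * help; // hypothetical raw-pointer field instead of std::string
};

int main() {
    option_sketch opt { format_help(4096).c_str() };
    // the temporary std::string returned by format_help is destroyed at the end
    // of the full expression above, so opt.help is already dangling here
    std::printf("%s\n", opt.help); // undefined behavior
    return 0;
}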

examples/export-docs/export-docs.cpp Outdated Show resolved Hide resolved
examples/export-docs/export-docs.cpp Outdated Show resolved Hide resolved
common/common.h Outdated Show resolved Hide resolved
@@ -276,13 +300,93 @@ struct gpt_params {
bool batched_bench_output_jsonl = false;
};

void gpt_params_parse_from_env(gpt_params & params);
void gpt_params_handle_model_default(gpt_params & params);
struct llama_arg {
Copy link
Owner


Should move the method implementations into the .cpp, to avoid building the same code in all examples.

Can also move all the llama_arg-related stuff into common/arg.h,.cpp. Can be done in a follow-up PR.

Copy link
Collaborator Author


Currently, doing this would break gen-docs, since it reads the data directly from the class members.

But yes, I will do a follow-up to split it into common/arg.h,.cpp
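As a rough sketch of what that split could look like (illustrative only; the actual follow-up may organize things differently), the method bodies visible in the diff above would move out of the header so that examples only compile declarations:

// common/arg.h -- declarations only, so examples that include it stay cheap to build
#include <string>

struct llama_arg {
    std::string env;
    // ... other fields as in the diff above ...
    bool get_value_from_env(std::string & output) const; // declared here, defined in arg.cpp
};

// common/arg.cpp -- compiled once into the common library
#include <cstdlib>

bool llama_arg::get_value_from_env(std::string & output) const {
    if (env.empty()) {
        return false;
    }
    if (const char * value = std::getenv(env.c_str())) {
        output = value;
        return true;
    }
    return false;
}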

@ngxson ngxson merged commit 1b9ae51 into ggerganov:master Sep 7, 2024
52 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* (wip) argparser v3

* migrated

* add test

* handle env

* fix linux build

* add export-docs example

* fix build (2)

* skip build test-arg-parser on windows

* update server docs

* bring back missing --alias

* bring back --n-predict

* clarify test-arg-parser

* small correction

* add comments

* fix args with 2 values

* refine example-specific args

* no more lamba capture

Co-authored-by: slaren@users.noreply.github.com

* params.sparams

* optimize more

* export-docs --> gen-docs
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024