QVAC-3697: Load GGUF File From Buffer #1

jesusmb1995 · 2025-07-30T18:23:07Z

This pull request changes Llama.cpp so that models can be loaded directly from memory. It is intended to be reviewed commit by commit; each commit carries a long text description below its header.

Tested to work properly from a bare Addon (LLM repo). See #1 (comment)

In particular, this PR exposes:

llama-cpp.h:llama_model_load_from_buffer(vector<uint8_t>&& data, ...) to load from a single buffer containing the contents of a .gguf file.
llama.h:llama_model_load_from_split_futures(char** paths, ...) and llama-cpp.h:llama_model_load_fulfill_split_future(char* path, ..., unique_ptr<basic_streambuf<uint8_t>>&& streambuf), which allow a model to be loaded asynchronously/incrementally, uploading its tensors to the backend storage while host memory is being released.
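
As a rough illustration of the single-buffer path, the sketch below reads a whole .gguf file into a `std::vector<uint8_t>` and checks the 4-byte `GGUF` magic before the buffer would be handed over. The helpers `read_file_to_buffer` and `looks_like_gguf` are hypothetical names and not part of this PR. The remaining parameters of `llama_model_load_from_buffer` are elided in the description above, so the call itself only appears as a comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read an entire file into host memory.
std::vector<uint8_t> read_file_to_buffer(const std::string & path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                                std::istreambuf_iterator<char>());
}

// Hypothetical helper: a GGUF file starts with the ASCII bytes "GGUF".
bool looks_like_gguf(const std::vector<uint8_t> & data) {
    static const uint8_t magic[4] = {'G', 'G', 'U', 'F'};
    return data.size() >= 4 && std::equal(magic, magic + 4, data.begin());
}

// The buffer would then be moved into the new entry point, for example:
//   llama_model_load_from_buffer(std::move(data), ...);
// (the remaining arguments are not listed in the PR description)
```

Moving the vector rather than copying it matters here, since the buffer can be as large as the model itself.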

How to run the code?

Build and prepare model

Build Llama.cpp (e.g. in release mode) including the examples, tests and tools:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_EXAMPLES=ON -DGGML_VULKAN=ON && cmake --build build

Generate a sharded model and its *.tensor.txt summary file:

./build/bin/llama-gguf-split --split --split-max-size 300M models/qwen3/Qwen3-0.6B-Q8_0.gguf Qwen3-0.6B-Q8_0 &&
 mv Qwen*.* models/qwen3

Automated tests

Run automated tests for a single gguf file:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory$ --verbose

Run automated tests for sharded model:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory-split$ --verbose

Or simply run all tests:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest

Should output:

...
30/41 Test #30: test-backend-ops ..................   Passed  104.24 sec                                                     
      Start 31: test-model-load-cancel                        
31/41 Test #31: test-model-load-cancel ............   Passed    0.34 sec                                                     
      Start 32: test-model-load-disk                          
32/41 Test #32: test-model-load-disk ..............   Passed    0.43 sec                                                     
      Start 33: test-model-load-memory                        
33/41 Test #33: test-model-load-memory ............   Passed    0.00 sec                                                     
      Start 34: test-model-load-memory-split                  
34/41 Test #34: test-model-load-memory-split ......   Passed    0.67 sec 
...
41/41 Test #41: test-eval-callback ................   Passed    0.84 sec

100% tests passed, 0 tests failed out of 41

Label Time Summary:
curl             =   0.84 sec*proc (1 test)
eval-callback    =   0.84 sec*proc (1 test)
main             = 136.15 sec*proc (35 tests)
model            =   1.79 sec*proc (5 tests)

Examples

Demo video: https://drive.google.com/file/d/1mjqecwJ1LFYUNofr4wIdPFK9IkUxbHZh/view?usp=sharing

Set up the environment:

# Do not export any variable to load from disk
# export LLAMA_EXAMPLE_MEMORY_BUFFER=1
export LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT=1

# Alternatively pass a single .gguf file and set _MEMORY_BUFFER=1
export GGUF_PATH="models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"

Run example with Qwen3:

/usr/bin/time -v ./build/bin/llama-simple -m "$GGUF_PATH"

Outputs:

...
print_backend_buffers_info: offloading 28 repeating layers to GPU
print_backend_buffers_info: offloading output layer to GPU
print_backend_buffers_info: offloaded 29/29 layers to GPU
print_backend_buffers_info:      Vulkan0 model buffer size =   199.11 MiB
print_backend_buffers_info:  Vulkan_Host model buffer size =   157.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    44.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.83 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    31.89 MiB
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 35
llama_context: n_ctx_per_seq = 35
llama_context: n_batch       = 64
llama_context: n_ubatch      = 64
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (35) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 64 (padded)
llama_kv_cache_unified: layer   0: dev = Vulkan0
llama_kv_cache_unified: layer   1: dev = Vulkan0
llama_kv_cache_unified: layer   2: dev = Vulkan0
llama_kv_cache_unified: layer   3: dev = Vulkan0
llama_kv_cache_unified: layer   4: dev = Vulkan0
llama_kv_cache_unified: layer   5: dev = Vulkan0
llama_kv_cache_unified: layer   6: dev = Vulkan0
llama_kv_cache_unified: layer   7: dev = Vulkan0
llama_kv_cache_unified: layer   8: dev = Vulkan0
llama_kv_cache_unified: layer   9: dev = Vulkan0
llama_kv_cache_unified: layer  10: dev = Vulkan0
llama_kv_cache_unified: layer  11: dev = Vulkan0
llama_kv_cache_unified: layer  12: dev = Vulkan0
llama_kv_cache_unified: layer  13: dev = Vulkan0
llama_kv_cache_unified: layer  14: dev = Vulkan0
llama_kv_cache_unified: layer  15: dev = Vulkan0
llama_kv_cache_unified: layer  16: dev = Vulkan0
llama_kv_cache_unified: layer  17: dev = Vulkan0
llama_kv_cache_unified: layer  18: dev = Vulkan0
llama_kv_cache_unified: layer  19: dev = Vulkan0
llama_kv_cache_unified: layer  20: dev = Vulkan0
llama_kv_cache_unified: layer  21: dev = Vulkan0
llama_kv_cache_unified: layer  22: dev = Vulkan0
llama_kv_cache_unified: layer  23: dev = Vulkan0
llama_kv_cache_unified: layer  24: dev = Vulkan0
llama_kv_cache_unified: layer  25: dev = Vulkan0
llama_kv_cache_unified: layer  26: dev = Vulkan0
llama_kv_cache_unified: layer  27: dev = Vulkan0
llama_kv_cache_unified:    Vulkan0 KV buffer size =     7.00 MiB
llama_kv_cache_unified: size =    7.00 MiB (    64 cells,  28 layers,  1 seqs), K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
llama_context:    Vulkan0 compute buffer size =    37.34 MiB
llama_context: Vulkan_Host compute buffer size =     0.27 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 2
Hello my name is Emily. I'm a student in the 10th grade. I'm interested in studying in the field of mathematics. I want to know how to study
main: decoded 32 tokens in 0.18 s, speed: 174.70 t/s

llama_perf_sampler_print:    sampling time =       2.62 ms /    32 runs   (    0.08 ms per token, 12195.12 tokens per second)
llama_perf_context_print:        load time =     402.14 ms
llama_perf_context_print: prompt eval time =      10.13 ms /     4 tokens (    2.53 ms per token,   394.91 tokens per second)
llama_perf_context_print:        eval time =     166.08 ms /    31 runs   (    5.36 ms per token,   186.65 tokens per second)
llama_perf_context_print:       total time =     575.19 ms /    35 tokens

	Command being timed: "./build/bin/llama-simple -m models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"
	User time (seconds): 0.37
	System time (seconds): 0.44
	Percent of CPU this job got: 88%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1101056
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 225849
	Voluntary context switches: 796
	Involuntary context switches: 15
	Swaps: 0
	File system inputs: 0
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Run example with GTE:

# GGUF_PATH points to gte-large.Q2_K-00001-of-00003.gguf, for example.
/usr/bin/time -v ./build/bin/llama-embedding --model "$GGUF_PATH" --ngl 999

Related PRs

Memory load in LLM addon repo: https://github.com/tetherto/qvac-lib-infer-llamacpp-llm/pull/195
Base inference JS class to allow shard loading: https://github.com/tetherto/qvac-lib-infer-base/pull/39
Fix crash in LLM addon repo: https://github.com/tetherto/qvac-lib-infer-llamacpp-llm/pull/194

Asana task: https://app.asana.com/1/45238840754660/project/1210873391319186/task/1210877463428607

The gguf-split utility now generates a `.txt` file listing all tensors. This is useful both for manual inspection/debugging and for incremental tensor loading, where it is not possible to know which tensors are present in other split files (this information is critical for handling optional tensors).

jesusmb1995 · 2025-07-30T18:51:14Z

I seem to lack permissions to add reviewers. It stays in draft until I test it on a bare Addon, but the review of the Llama.cpp C++ code can start: @olyasir @olek-tether @gianni-cor @chetasr @yuranich @jpgaribotti

jesusmb1995 · 2025-07-30T20:17:35Z

Updated tests to automatically skip based on the gguf filename (sharded or not) when running all tests at once.
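
The PR does not spell out how a test decides whether a file is sharded. One plausible check, sketched here under that assumption, looks for the `-NNNNN-of-NNNNN.gguf` suffix seen in the shard names above. The function name `is_split_gguf` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the file name ends in "-NNNNN-of-NNNNN.gguf",
// the suffix carried by the shards shown in this PR description.
bool is_split_gguf(const std::string & path) {
    static const std::regex split_suffix(R"(-\d{5}-of-\d{5}\.gguf$)");
    return std::regex_search(path, split_suffix);
}
```

A single-file test could return early when this is true for `LLAMACPP_TEST_MODELFILE`, and a split test when it is false.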

jesusmb1995 · 2025-08-14T15:12:15Z

Un-drafting since I was able to run the JS integration test for the qwen3 LLM Addon without problems. The test can now use any dataloader implementation and will incrementally load the Llama.cpp model. See the successful log below.

log_integration.txt

jpgaribotti · 2025-08-14T16:00:37Z

We should not merge to master, it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

yuranich · 2025-08-14T18:33:43Z

> We should not merge to master, it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

Can we do the following:

1. finish updating from upstream
2. create new branch, merge this fix there
3. try to contribute back to upstream

Is that something we can do?
I also saw there is a multimodal branch; is that something we can consider contributing back? @jpgaribotti

jesusmb1995 · 2025-08-18T07:54:01Z

Fine with me. Please create a tether branch to merge the changes into @yuranich

> 3. try to contribute back to upstream
>    is that something we can do?

I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

yuranich · 2025-08-19T06:20:27Z

> Fine with me. Please create a tether branch to merge the changes into @yuranich
>
> 3. try to contribute back to upstream
>    is that something we can do?
>
> I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

`temp-load-from-buffer` created @jesusmb1995

src/llama-model.cpp

- Ensures a char traits implementation for uint8 exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf for a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.
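
To make the idea of a stream buffer over a single vector concrete, here is a minimal sketch. It deliberately uses the plain `char`-based `std::streambuf` for portability, not the `uint8_t` specialization described above, and the names `vector_streambuf` and `read_all` are hypothetical.

```cpp
#include <cstdint>
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Read-only stream buffer that owns a byte vector and exposes it as the get area.
class vector_streambuf : public std::streambuf {
public:
    explicit vector_streambuf(std::vector<uint8_t> && data) : data_(std::move(data)) {
        char * begin = reinterpret_cast<char *>(data_.data());
        setg(begin, begin, begin + data_.size());
    }
    // Copying would leave the get area pointing into another object's vector.
    vector_streambuf(const vector_streambuf &) = delete;
    vector_streambuf & operator=(const vector_streambuf &) = delete;

private:
    std::vector<uint8_t> data_;
};

// Hypothetical usage: drain the buffer through a standard istream.
std::string read_all(std::vector<uint8_t> && data) {
    vector_streambuf buf(std::move(data));
    std::istream in(&buf);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

This sketch supports sequential reads only. Random access would additionally need `seekoff` and `seekpos` overrides.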

jesusmb1995 · 2025-08-21T11:38:48Z

Most CI pipelines pass now. Some target failures seem unrelated.

jesusmb1995 · 2025-08-21T14:23:02Z

> Most CI pipelines pass now. Some target failures seem unrelated.

@jpgaribotti @yuranich Can you suggest what to do with the remaining failing CI pipelines? They seem to be due to unrelated issues, for example:

Run ARTIFACTS_JSON=$(curl -s -L \
Finding latest macos-latest-Release artifact...
No suitable Dawn artifact found!

Is it okay to proceed with the review as it is? Currently even the sync to upstream is failing on CI #4

jesusmb1995 added 4 commits July 30, 2025 17:13

[common] Pure interface for files

f8942e7

Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...)
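
The shape of such an interface can be sketched as follows. This is only an illustration: the names `file_source`, `memory_file` and `read_bytes` are hypothetical, and the real `llama_file` interface in the PR is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical pure virtual interface; a disk-backed class would implement the same methods.
struct file_source {
    virtual ~file_source() = default;
    virtual std::size_t size() const = 0;
    virtual std::size_t tell() const = 0;
    virtual void seek(std::size_t offset) = 0;
    virtual void read_raw(void * dst, std::size_t len) = 0;
};

// Implementation backed by a single in-memory buffer.
class memory_file : public file_source {
public:
    explicit memory_file(std::vector<uint8_t> && data) : data_(std::move(data)) {}

    std::size_t size() const override { return data_.size(); }
    std::size_t tell() const override { return pos_; }

    void seek(std::size_t offset) override {
        if (offset > data_.size()) {
            throw std::out_of_range("seek past end of buffer");
        }
        pos_ = offset;
    }

    void read_raw(void * dst, std::size_t len) override {
        if (len > data_.size() - pos_) {
            throw std::out_of_range("read past end of buffer");
        }
        if (len == 0) {
            return;
        }
        std::memcpy(dst, data_.data() + pos_, len);
        pos_ += len;
    }

private:
    std::vector<uint8_t> data_;
    std::size_t pos_ = 0;
};

// Loader code written against the interface works with any implementation.
std::vector<uint8_t> read_bytes(file_source & f, std::size_t n) {
    std::vector<uint8_t> out(n);
    f.read_raw(out.data(), n);
    return out;
}
```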

[common] Compile time debug logs

ca481ed

Define a new macro LLAMA_LOG_CMAKE_DEBUG that results in a no-op when a release build is active. This provides good tracing and debugging capabilities that will be especially useful for the async loading of multiple model shards.
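
A macro of this kind can be sketched as below. How `LLAMA_LOG_CMAKE_DEBUG` is actually wired to the CMake build type is not shown in this description, so the sketch assumes the common convention of keying off `NDEBUG`, and `EXAMPLE_LOG_DEBUG` and `visit_shards` are hypothetical names.

```cpp
#include <cstdio>

// Assumption: NDEBUG marks a release build (CMake's default Release flags typically define it).
#ifdef NDEBUG
#    define EXAMPLE_LOG_DEBUG(...) ((void) 0)
#else
#    define EXAMPLE_LOG_DEBUG(...) std::fprintf(stderr, __VA_ARGS__)
#endif

// Hypothetical caller: logs once per shard in debug builds, compiles to nothing in release.
int visit_shards(int n_shards) {
    int visited = 0;
    for (int i = 0; i < n_shards; ++i) {
        EXAMPLE_LOG_DEBUG("loading shard %d of %d\n", i + 1, n_shards);
        ++visited;
    }
    return visited;
}
```

Because the release variant discards its arguments at preprocessing time, the logging calls should not contain side effects the program depends on.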

[aux] Test full load from disk

cd6f698

This change adds an additional automated test that loads from disk, to ensure the existing functionality does not break.

jesusmb1995 marked this pull request as draft July 30, 2025 18:24

jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 02227e3 to 0718c30 Compare July 30, 2025 20:16

jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 5df4e25 to 52ed642 Compare July 30, 2025 20:49

jesusmb1995 self-assigned this Aug 14, 2025

jesusmb1995 marked this pull request as ready for review August 14, 2025 15:08

jesusmb1995 requested review from gianni-cor, jpgaribotti, olek-tether, olyasir and yuranich August 14, 2025 15:08

jesusmb1995 requested a review from chetasr August 14, 2025 15:14

jesusmb1995 changed the title ~~Load GGUF File From Buffer~~ QVAC-3697: Load GGUF File From Buffer Aug 18, 2025

olek-tether approved these changes Aug 18, 2025

olek-tether self-requested a review August 18, 2025 20:26

jesusmb1995 changed the base branch from master to temp-load-from-buffer August 19, 2025 07:30

jesusmb1995 force-pushed the jmb/memory_load_pr branch from 52ed642 to 85405d9 Compare August 19, 2025 15:13

jesusmb1995 commented Aug 19, 2025

src/llama-model.cpp (outdated, resolved)

jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 4277f06 to 4d263be Compare August 21, 2025 10:44

jesusmb1995 added 18 commits August 21, 2025 13:24

[mbuffer] Llama file buffer implementation

b6f825d

Override the pure virtual interface with a class that can operate on a single memory buffer.

[refactor] C splits into C++

86da48c

Auxiliary function to convert a list of C strings to a vector of C++ strings.
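
Such a conversion might look like the sketch below. The name `to_string_vector` and the explicit count parameter are assumptions, since the PR description does not show the actual signature.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy n_paths C strings into owned C++ strings.
// Null entries become empty strings instead of triggering undefined behavior.
std::vector<std::string> to_string_vector(const char * const * paths, std::size_t n_paths) {
    std::vector<std::string> out;
    out.reserve(n_paths);
    for (std::size_t i = 0; i < n_paths; ++i) {
        out.emplace_back(paths[i] != nullptr ? paths[i] : "");
    }
    return out;
}
```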

[common] GGUF reader from memory

cba0254

Add new GGUF reader implementation that can read metadata from a memory buffer.

[refactor][mbuffer] File load from variant

610d73e

- Add code to be able to load a gguf file from a variant (memory or disk).
- Some structs simplify how to load a file and keep track of the pointers (which are now in the same struct).
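
The memory-or-disk choice can be modeled with `std::variant`, as in this sketch. The alias `model_source` and the function `load_bytes` are hypothetical and only illustrate the dispatch; the structs used in the PR are not shown here.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical source type: either a path on disk or bytes already in memory.
using model_source = std::variant<std::string, std::vector<uint8_t>>;

// Return the file contents regardless of where they come from.
std::vector<uint8_t> load_bytes(model_source src) {
    if (auto * mem = std::get_if<std::vector<uint8_t>>(&src)) {
        return std::move(*mem);  // memory case: hand the buffer over without copying
    }
    const std::string & path = std::get<std::string>(src);
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                                std::istreambuf_iterator<char>());
}
```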

[refactor] Process file method

762c968

Move the loader code that processes a file after it has been loaded into memory, and populates its own attributes, into a reusable method.

[mbuffer] Expose single-buffer loading to Llama interface

be62aaa

Add new C++ function to Llama main header to load from a single memory buffer, and propagate changes to internal calls/constructors.

[fbuffers] Future file buffer implementation

3a0855d

A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
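
The fulfill/extract behavior described above can be sketched with a mutex and a condition variable. The class name `future_file_store` and its method signatures are hypothetical, not the PR's implementation.

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical store: producers fulfill files by key, consumers block in extract().
class future_file_store {
public:
    // Provide the contents for a key and wake up any waiting consumer.
    void fulfill(const std::string & key, std::vector<uint8_t> && data) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            files_[key] = std::move(data);
        }
        cv_.notify_all();
    }

    // Wait until the key has been fulfilled, then take ownership of its contents.
    std::vector<uint8_t> extract(const std::string & key) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return files_.count(key) != 0; });
        auto it = files_.find(key);
        std::vector<uint8_t> data = std::move(it->second);
        files_.erase(it);
        return data;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::string, std::vector<uint8_t>> files_;
};
```

In a typical use, a loader thread would call `extract` for each shard in order while another thread calls `fulfill` as downloads complete, in any order. Erasing the entry on extraction lets the host copy be released as soon as the consumer is done with it.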

[fbuffers] Incremental loading of future files

85c4d3b

Handles the logic for incrementally loading files and tensors in model shards.

[refactor] Create backend buffers

bd60c89

Refactor backend buffer creation (for model loading) into functions.

[refactor] Load all data

0561525

- The function now takes size_data instead of the member attribute.
- Sanity checks of file pointer handles.

These two changes will be useful when calling `load_all_data` multiple times during incremental shard load.

[fbuffers] Incremental model load

77cef5b

Adapt the loader and model load to incrementally load files and upload tensors.

[fbuffers] Expose async interface

ff882fe

Add functions to Llama.cpp public headers to asynchronously load shards.

[refactor] Increase common loading granularity

dab6554

Split some common loading functionality. This will help in the memory loading tests.

[aux] Common test

f5902e8

Add a submodule with re-usable code for tests.

[aux] Memory example (embedding)

425e192

Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.

[aux] Memory example (simple)

f0e7125

Adapt simple example to showcase how to load from memory. Can be configured with environment variables. Qwen3, for example, can be used with the simple example.

[aux] Auto. memory loading tests

cd1b485

Add some automatic tests that load from memory (single buffer or multiple async splits)

jesusmb1995 force-pushed the jmb/memory_load_pr branch from 4d263be to cd1b485 Compare August 21, 2025 11:25

olek-tether approved these changes Aug 22, 2025

vigan-abd approved these changes Aug 23, 2025

yuranich approved these changes Aug 25, 2025

yuranich merged commit bfa84c3 into tetherto:temp-load-from-buffer Aug 25, 2025
39 of 46 checks passed

jesusmb1995 mentioned this pull request Aug 28, 2025

Rebase temp-load-from-buffer and merge into master #7

Merged
