
Conversation


@jesusmb1995 jesusmb1995 commented Jul 30, 2025

This pull request changes Llama.cpp so that models can be loaded directly from memory. It is intended to be reviewed commit by commit; each commit contains a longer text description below its header.

Tested that it works properly from a bare Addon (LLM repo). See #1 (comment)

In particular, this PR exposes:

  • llama-cpp.h:llama_model_load_from_buffer(vector<uint8_t>&& data, ...) to load from a single buffer containing the contents of a .gguf file (see the sketch below).
  • llama.h:llama_model_load_from_split_futures(char** paths, ...) and llama-cpp.h:llama_model_load_fulfill_split_future(char* path, ..., unique_ptr<basic_streambuf<uint8_t>>&& streambuf), which allow a model to be loaded asynchronously/incrementally, uploading its tensors to backend storage while host memory is being released.
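
For illustration, a minimal sketch of how the single-buffer entry point could be used from application code. The parameter lists above are abbreviated, so everything beyond the buffer argument (the llama_model_params value, the headers, and the helper name) is an assumption based on the existing llama.cpp API rather than the exact signature added by this PR:

// Sketch only: assumes llama_model_load_from_buffer(std::vector<uint8_t>&&, llama_model_params)
// roughly mirrors llama_model_load_from_file(); the real signature may differ.
#include <cstdint>
#include <fstream>
#include <vector>

#include "llama.h"
#include "llama-cpp.h"

static llama_model * load_model_from_memory(const char * path) {
    // Read the whole .gguf file into one contiguous buffer.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    std::vector<uint8_t> data(static_cast<size_t>(file.tellg()));
    file.seekg(0);
    file.read(reinterpret_cast<char *>(data.data()), static_cast<std::streamsize>(data.size()));

    // Hand the buffer to llama.cpp; ownership of `data` moves into the loader.
    llama_model_params params = llama_model_default_params();
    return llama_model_load_from_buffer(std::move(data), params);
}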

How to run the code?

Build and prepare model

Build Llama.cpp (e.g. in release mode) including the examples, tests and tools:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_EXAMPLES=ON -DGGML_VULKAN=ON && cmake --build build

Generate a sharded model and its *.tensor.txt summary file:

./build/bin/llama-gguf-split --split --split-max-size 300M models/qwen3/Qwen3-0.6B-Q8_0.gguf Qwen3-0.6B-Q8_0 &&
 mv Qwen*.* models/qwen3

Automated tests

Run automated tests for a single gguf file:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory$ --verbose

Run automated tests for a sharded model:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory-split$ --verbose

Or simply run all tests:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest

Should output:

...
30/41 Test #30: test-backend-ops ..................   Passed  104.24 sec                                                     
      Start 31: test-model-load-cancel                        
31/41 Test #31: test-model-load-cancel ............   Passed    0.34 sec                                                     
      Start 32: test-model-load-disk                          
32/41 Test #32: test-model-load-disk ..............   Passed    0.43 sec                                                     
      Start 33: test-model-load-memory                        
33/41 Test #33: test-model-load-memory ............   Passed    0.00 sec                                                     
      Start 34: test-model-load-memory-split                  
34/41 Test #34: test-model-load-memory-split ......   Passed    0.67 sec 
...
41/41 Test #41: test-eval-callback ................   Passed    0.84 sec

100% tests passed, 0 tests failed out of 41

Label Time Summary:
curl             =   0.84 sec*proc (1 test)
eval-callback    =   0.84 sec*proc (1 test)
main             = 136.15 sec*proc (35 tests)
model            =   1.79 sec*proc (5 tests)

Examples

Demo video: https://drive.google.com/file/d/1mjqecwJ1LFYUNofr4wIdPFK9IkUxbHZh/view?usp=sharing

Set up the environment:

# Do not export any variable to load from disk
# export LLAMA_EXAMPLE_MEMORY_BUFFER=1
export LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT=1

# Alternatively pass a single .gguf file and set _MEMORY_BUFFER=1
export GGUF_PATH="models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"
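
For reference, the adapted examples pick the loading path from these variables at startup. A rough illustration of that kind of check (not the exact code in this PR):

// Illustration only: the adapted examples may structure this check differently.
#include <cstdlib>

const bool use_memory_buffer       = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER")       != nullptr;
const bool use_memory_buffer_split = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT") != nullptr;
// If neither variable is set, the examples fall back to the regular load-from-disk path.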

Run the example with Qwen3:

/usr/bin/time -v ./build/bin/llama-simple -m "$GGUF_PATH"

Outputs:

...
print_backend_buffers_info: offloading 28 repeating layers to GPU
print_backend_buffers_info: offloading output layer to GPU
print_backend_buffers_info: offloaded 29/29 layers to GPU
print_backend_buffers_info:      Vulkan0 model buffer size =   199.11 MiB
print_backend_buffers_info:  Vulkan_Host model buffer size =   157.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    44.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.83 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    31.89 MiB
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 35
llama_context: n_ctx_per_seq = 35
llama_context: n_batch       = 64
llama_context: n_ubatch      = 64
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (35) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 64 (padded)
llama_kv_cache_unified: layer   0: dev = Vulkan0
llama_kv_cache_unified: layer   1: dev = Vulkan0
llama_kv_cache_unified: layer   2: dev = Vulkan0
llama_kv_cache_unified: layer   3: dev = Vulkan0
llama_kv_cache_unified: layer   4: dev = Vulkan0
llama_kv_cache_unified: layer   5: dev = Vulkan0
llama_kv_cache_unified: layer   6: dev = Vulkan0
llama_kv_cache_unified: layer   7: dev = Vulkan0
llama_kv_cache_unified: layer   8: dev = Vulkan0
llama_kv_cache_unified: layer   9: dev = Vulkan0
llama_kv_cache_unified: layer  10: dev = Vulkan0
llama_kv_cache_unified: layer  11: dev = Vulkan0
llama_kv_cache_unified: layer  12: dev = Vulkan0
llama_kv_cache_unified: layer  13: dev = Vulkan0
llama_kv_cache_unified: layer  14: dev = Vulkan0
llama_kv_cache_unified: layer  15: dev = Vulkan0
llama_kv_cache_unified: layer  16: dev = Vulkan0
llama_kv_cache_unified: layer  17: dev = Vulkan0
llama_kv_cache_unified: layer  18: dev = Vulkan0
llama_kv_cache_unified: layer  19: dev = Vulkan0
llama_kv_cache_unified: layer  20: dev = Vulkan0
llama_kv_cache_unified: layer  21: dev = Vulkan0
llama_kv_cache_unified: layer  22: dev = Vulkan0
llama_kv_cache_unified: layer  23: dev = Vulkan0
llama_kv_cache_unified: layer  24: dev = Vulkan0
llama_kv_cache_unified: layer  25: dev = Vulkan0
llama_kv_cache_unified: layer  26: dev = Vulkan0
llama_kv_cache_unified: layer  27: dev = Vulkan0
llama_kv_cache_unified:    Vulkan0 KV buffer size =     7.00 MiB
llama_kv_cache_unified: size =    7.00 MiB (    64 cells,  28 layers,  1 seqs), K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
llama_context:    Vulkan0 compute buffer size =    37.34 MiB
llama_context: Vulkan_Host compute buffer size =     0.27 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 2
Hello my name is Emily. I'm a student in the 10th grade. I'm interested in studying in the field of mathematics. I want to kn
ow how to study
main: decoded 32 tokens in 0.18 s, speed: 174.70 t/s

llama_perf_sampler_print:    sampling time =       2.62 ms /    32 runs   (    0.08 ms per token, 12195.12 tokens per second)
llama_perf_context_print:        load time =     402.14 ms
llama_perf_context_print: prompt eval time =      10.13 ms /     4 tokens (    2.53 ms per token,   394.91 tokens per second)
llama_perf_context_print:        eval time =     166.08 ms /    31 runs   (    5.36 ms per token,   186.65 tokens per second)
llama_perf_context_print:       total time =     575.19 ms /    35 tokens

	Command being timed: "./build/bin/llama-simple -m models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"
	User time (seconds): 0.37
	System time (seconds): 0.44
	Percent of CPU this job got: 88%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1101056
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 225849
	Voluntary context switches: 796
	Involuntary context switches: 15
	Swaps: 0
	File system inputs: 0
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Run the example with GTE:

# GGUF_PATH points to gte-large.Q2_K-00001-of-00003.gguf, for example.
/usr/bin/time -v ./build/bin/llama-embedding --model "$GGUF_PATH" --ngl 999

Related PRs


Asana task: https://app.asana.com/1/45238840754660/project/1210873391319186/task/1210877463428607


Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...)
Define a new macro LLAMA_LOG_CMAKE_DEBUG that compiles to a no-op in release builds. This allows good tracing and debugging capabilities, which will be especially useful for the async loading of multiple model shards (see the sketch after these notes).
This change adds an additional automated test that loads from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` file listing all tensors. This is useful both for manual inspection/debugging and for incremental tensor loading, where it's not possible to know which tensors are present in other split files (information that is critical for handling optional tensors).
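
As a sketch of the LLAMA_LOG_CMAKE_DEBUG idea: a debug-only logging macro of this kind typically forwards to the regular debug logger when a debug flag is defined and compiles away otherwise. The guard name below is an assumption; the actual definition in this PR may differ:

// Pattern sketch only; LLAMA_CMAKE_DEBUG is an assumed guard, not necessarily the one used in this PR.
#ifdef LLAMA_CMAKE_DEBUG
    #define LLAMA_LOG_CMAKE_DEBUG(...) LLAMA_LOG_DEBUG(__VA_ARGS__)
#else
    #define LLAMA_LOG_CMAKE_DEBUG(...) ((void)0)   // no-op in release builds
#endif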
@jesusmb1995 jesusmb1995 marked this pull request as draft July 30, 2025 18:24
@jesusmb1995

I seem to lack permissions to add reviewers. It is in draft until I test it on a bare Addon, but the review of the Llama.cpp C++ code can start: @olyasir @olek-tether @gianni-cor @chetasr @yuranich @jpgaribotti

@jesusmb1995 jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 02227e3 to 0718c30 Compare July 30, 2025 20:16
@jesusmb1995

Updated tests to automatically skip based on the gguf filename (sharded or not) when running all tests at once.

@jesusmb1995 jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 5df4e25 to 52ed642 Compare July 30, 2025 20:49
@jesusmb1995 jesusmb1995 self-assigned this Aug 14, 2025
@jesusmb1995 jesusmb1995 marked this pull request as ready for review August 14, 2025 15:08
@jesusmb1995

jesusmb1995 commented Aug 14, 2025

Un-drafting since I was able to run the JS integration test for the qwen3 LLM Addon without problems. The test can now use any dataloader implementation and will incrementally load the Llama.cpp model. See the successful log below.

log_integration.txt

@jesusmb1995 jesusmb1995 requested a review from chetasr August 14, 2025 15:14
@jpgaribotti

We should not merge to master, it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

@yuranich

> We should not merge to master, it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

Can we do the following:

  1. finish updating from upstream
  2. create new branch, merge this fix there
  3. try to contribute back to upstream
    is that something we can do?
    I also saw there is a multimodal branch; is that something we can consider contributing back? @jpgaribotti

@jesusmb1995

jesusmb1995 commented Aug 18, 2025

Fine with me. Please create a tether branch to merge the changes into, @yuranich

> 3. try to contribute back to upstream
>    is that something we can do?

I have a task in the Asana project to do this, but I don't know how easy it will be with the amount of changes. Maybe we can merge some of the commits.

@jesusmb1995 jesusmb1995 changed the title Load GGUF File From Buffer QVAC-3697: Load GGUF File From Buffer Aug 18, 2025
@olek-tether olek-tether self-requested a review August 18, 2025 20:26
@yuranich

> Fine with me. Please create a tether branch to merge the changes into, @yuranich
>
> > 3. try to contribute back to upstream
> >    is that something we can do?
>
> I have a task in the Asana project to do this, but I don't know how easy it will be with the amount of changes. Maybe we can merge some of the commits.

temp-load-from-buffer created, @jesusmb1995

@jesusmb1995 jesusmb1995 changed the base branch from master to temp-load-from-buffer August 19, 2025 07:30
@jesusmb1995 jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 4277f06 to 4d263be Compare August 21, 2025 10:44
- Ensures a char traits implementation for uint8_t exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf backed by a single vector. It will be used by llama.cpp and the tests when loading from a single memory buffer.
Override the pure virtual interface with a class that can operate on a single memory buffer.
Auxiliary function to convert a list of C strings to a vector of C++ strings.
Add new GGUF reader implementation that can read metadata from a memory buffer.
- Add code to be able to load a gguf file from a variant (memory or disk).
- Some structs simplify how to load a file and keep track of the pointers (which are now in the same struct).
Move the loader code that processes a file after it has been loaded into memory and populates the loader's own attributes into a reusable method.
Add a new C++ function to the main Llama header to load from a single memory buffer, and propagate the changes to internal calls/constructors.
A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data instead of the member attribute.
- Sanity checks of file pointer handles

These two changes will be useful when calling `load_all_data` multiple times during incremental shard load.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to the Llama.cpp public headers to asynchronously load shards (see the usage sketch after these commit notes).
Split out some common loading functionality. This will help in the memory loading tests.
Add a submodule with reusable code for tests.
Adapt the embedding example to showcase how to load from memory. It can be configured through environment variables.
Adapt the simple example to showcase how to load from memory. It can be configured with environment variables.

Qwen3, for example, can be used with the simple example.
Add some automated tests that load from memory (single buffer or multiple async splits).
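
To make the asynchronous flow concrete, here is a rough usage sketch of the split-future API from the public headers. The signatures are abbreviated in this description, so the extra parameters and the fetch_shard helper below are assumptions for illustration, not the definitive API:

// Sketch only: parameter lists are assumed; see llama.h / llama-cpp.h in this PR for the real API.
#include <cstdint>
#include <memory>
#include <streambuf>
#include <thread>

#include "llama.h"
#include "llama-cpp.h"

// Hypothetical helper supplied by the host application (e.g. a network or decryption layer)
// that returns the contents of one shard as a streambuf.
std::unique_ptr<std::basic_streambuf<uint8_t>> fetch_shard(const char * path);

static void load_sharded_model_async(char ** split_paths, size_t n_splits) {
    llama_model_params params = llama_model_default_params();

    // Consumer thread: starts loading immediately and blocks on each shard until it is fulfilled.
    std::thread loader([&]() {
        llama_model_load_from_split_futures(split_paths, n_splits, params); // assumed arguments
    });

    // Producer: hands each shard over as soon as it is available, so the host memory for a shard
    // can be released once its tensors have been uploaded to the backend.
    for (size_t i = 0; i < n_splits; ++i) {
        llama_model_load_fulfill_split_future(split_paths[i], fetch_shard(split_paths[i])); // assumed arguments
    }

    loader.join();
}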
@jesusmb1995

Most CI pipelines pass now. Some target failures seem unrelated.

@jesusmb1995

jesusmb1995 commented Aug 21, 2025

> Most CI pipelines pass now. Some target failures seem unrelated.

@jpgaribotti @yuranich Can you suggest what to do with the remaining failing CI pipelines? They seem to be due to unrelated issues, for example:

Run ARTIFACTS_JSON=$(curl -s -L \
Finding latest macos-latest-Release artifact...
No suitable Dawn artifact found!

Is it okay to proceed with the review as it is? Currently even the sync to upstream is failing on CI #4

@yuranich yuranich merged commit bfa84c3 into tetherto:temp-load-from-buffer Aug 25, 2025
39 of 46 checks passed
jpgaribotti pushed a commit that referenced this pull request Sep 10, 2025
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
gianni-cor pushed a commit to gianni-cor/qvac-ext-lib-llama.cpp that referenced this pull request Sep 18, 2025
…gml-org#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads
as can be seen in the asan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    tetherto#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    tetherto#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    tetherto#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    tetherto#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    tetherto#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    tetherto#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    tetherto#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    tetherto#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    tetherto#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    tetherto#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    tetherto#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    tetherto#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    tetherto#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    tetherto#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    tetherto#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING