split: allow --split-max-size option #6343

ngxson · 2024-03-27T11:58:30Z

I ended up re-write the split_strategy class. How it works now:

In the constructor, it produces all the ctx_out that is used by the output file (one ctx_out per one split). These ctx_out are saved into a std::vector<struct gguf_context *> ctx_outs. This step only produce a split "plan", not the actual file.
split_strategy.print_info() can be used to print out the "plan" and details for each split (n_tensors, size)
Finally, when you're happy with the split "plan", call split_strategy.write() to actually write out to output files

phymbert

Great that you start looking into this. I did some attempts but resigned. Please reuse the test script I prepared:

https://github.com/phymbert/llama.cpp/blob/hp/split/max-size/examples/gguf-split/tests.sh

ngxson · 2024-03-27T14:12:31Z

@phymbert Thanks for the info, yeah I did managed to get it work, but didn't test it very carefully. I'll use the script that you provided.

Meanwhile, would you mind to test the version that I've pushed?

ngxson · 2024-03-27T14:45:07Z

I can confirm that my version works with your test script @phymbert (thanks again for that) - except for the test case 4 and 5 which requires --no-tensor-in-metadata that I don't have in this version, I manually skipped it, but that's nice to have all these test cases from beginning.

phymbert · 2024-03-27T14:52:06Z

Ah yeah, I forgot this params. Just remove it. It's not necessary for now, and it's not possible to have no tensor in gguf AFAIK.

examples/gguf-split/gguf-split.cpp

ngxson · 2024-03-27T14:58:21Z

@phymbert With this PR, it maybe possible to have a split that does not have any tensors. I encountered this case (as a bug) if inside should_split I don't have i_tensor > 0 condition. Though, I don't know if llama_model_loader can handle this case.

ngxson · 2024-03-27T15:04:49Z

examples/gguf-split/gguf-split.cpp

-        fprintf(stderr, "\033[3Ddone\n");
+    void copy_file_to_file(std::ifstream & f_in, std::ofstream & f_out, const size_t in_offset, const size_t len) {
+        // TODO: detect OS and use copy_file_range() here for better performance
+        if (read_buf.size() < len) {


@Artefact2 With this refactoring, it will be trivial to add copy_file_range() as you suggested. The only thing that I'm not sure is how to detect if I can use copy_file_range() or not (i.e. based on OS or something else?). Do you have any idea on that? Thanks.

phymbert · 2024-03-27T15:05:24Z

It triggers a malloc of 0 size in ggml, with a warning.

ngxson · 2024-03-27T15:09:30Z

It triggers a malloc of 0 size in ggml, with a warning.

Oh OK I see, that's because we allocate one device buffer per file, so file with no tensor will have 0 size buffer. A quick hack is to add a dummy tensor with size of 1 ~~byte~~ element, but that's not a good solution so let's consider that later on..

examples/gguf-split/gguf-split.cpp

phymbert · 2024-03-30T06:08:36Z

Was it urgent to merge ? unfortunately you did not add the test file, I think it is important to add it here and in the CI to show the complete process between gguf-split and llama.cpp as it is easy to break it we did here: 764c7af

ngxson · 2024-03-30T08:09:39Z

I merge it because it has been open for review for a 3 days. I agree that the test is needed, but since I initally have no intend to add test on this PR, so I though that we will do in another PR. Would you mind to open a new PR @phymbert ? Thanks.

phymbert · 2024-03-30T08:22:39Z

You always can request for review, especially here I did the gguf-split example module, I feel normal to review changes on the matter.

I think tests are part of the devlopment and must be done in the same PR, by the Author, this is the standard development best practice. I do not plan to test other people code BTW.

The shard feature is widely used now on HF and we need to carefully test it and document it, taking into account some previous issue we did:

Add grok-1 support #6204 (comment)

Is it complicated to add tests.sh ?

ngxson · 2024-03-30T08:44:50Z

Sorry I think there’s some misunderstanding here:

I added you as reviewer for this PR 3 days ago. Since we exchanged about the test file earlier, I though that you had a look on the code and have no further comments, so I just merge it - that was my fault to assume that. Next time I’ll explicitly ask you via comment if needed.
The PR does not have tests because I ran the test locally. I have no idea if the test file you gave me is the final version or not. I also don’t say in the description or the title that there will be a test, so it’s to do for another PR.
No blame here, we’re working hard and making errors (in term of communication) is something unavoidable. I hope that the next time we can be more clear on the scope of each PR.

I’ll open a new PR today when I finish my works in real-life. Will let you know when it’s done.

phymbert · 2024-03-30T09:01:14Z

Thanks for the clarification and your effort of course, I understand, merci. FYI I am uploading grok-1 to ggml-org in Q4_0 and I will open a demo discussion to explain the whole process, I will include --split-max-size you bring here.

phymbert · 2024-03-30T16:11:49Z

It would also be nice to update the README.md.

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

phymbert · 2024-04-02T18:29:40Z

@ngxson it looks split.no is not present in the first file. Take a look at the HF model visualizer: https://huggingface.co/ggml-org/models?show_tensors=grok-1%2Fgrok-1-q4_0-00001-of-00009.gguf

ngxson · 2024-04-02T21:40:07Z

I think there's a UI bug on huggingface. I downloaded the file and use gguf-py to inspect it, and I can confirm that split.no is there: https://colab.research.google.com/drive/1nvIVinbUTd8MeFSPz0ZvP-UhTW9MDPmb?usp=sharing

(My internet is quite laggy so some notebook cells ran twice, but it doesn't change the result anw)

Edit: gguf viewer on huggungface is quite weird. I thought that the metadata table is firstly decoded on backend and send to frontend. But turns out, they download the whole model into browser which caused my browser to crash.

phymbert · 2024-04-02T21:48:27Z

I think there's a UI bug on huggingface. I downloaded the file and use gguf-py to inspect it, and I can confirm that split.no

Ok, thanks for your double checks, the code is straightforward indeed:

llama.cpp/examples/gguf-split/gguf-split.cpp

Line 217 in f87f7b8

gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_NO, i_split);

Wait and see

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

ggerganov · 2024-04-03T12:47:52Z

But turns out, they download the whole model into browser which caused my browser to crash.

AFAIK the browser just fetches the GGUF headers - not the entire file

julien-c · 2024-04-03T14:46:56Z

I think there's a UI bug on huggingface. I downloaded the file and use gguf-py to inspect it, and I can confirm that split.no is there

can you take a look at this @mishig25?

ngxson · 2024-04-03T15:07:45Z

AFAIK the browser just fetches the GGUF headers - not the entire file

It seem to be the case of Firefox - it only downloads 2MB of the file. But firefox does not have the required API to decode the file: Your browser does not support ArrayBuffer.resize. Find more information here.

I tried Chrome, it can read the file but does not stop downloading:

@julien-c @mishig25 If you need more information for debugging, please let me know

julien-c · 2024-04-03T15:13:02Z

@ngxson could i trouble you to open an issue in https://github.com/huggingface/huggingface.js with your comment? 🙏

we'll check asap

ggerganov · 2024-04-03T15:54:22Z

@phymbert and @ngxson

I remember we discussed somewhere to be able to make a split where the first shard is very small and contains primarily the metadata so that it can be downloaded quickly and then start the download of the other shards without waiting for the first to finish. I can't find an issue for this - it might be a good idea to track this and might be related to: huggingface/huggingface.js#601 (comment)

We can add extra meta data in the first file that describes all tensors in the shards for example

phymbert · 2024-04-03T16:04:19Z

I remember we discussed somewhere to be able to make a split where the first shard is very small and contains primarily the metadata so that it can be downloaded quickly and then start the download of the other shards without waiting for the first to finish.

yes here:

common: llama_load_model_from_url split support #6192 (comment)

~~But I did not create an issue because of~~ ggml_malloc is complaining with WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_malloc! in this case.

EDIT: issue created:

gguf-split add a default option to not include tensors data in first shard #6463

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

split by max size

12b2554

phymbert reviewed Mar 27, 2024

View reviewed changes

ngxson added 2 commits March 27, 2024 13:39

clean up arg parse

843bc30

split: ok

583022c

add dry run option

8569ba3

ngxson marked this pull request as ready for review March 27, 2024 14:51

error on 0 tensors

b7cb3bb

ngxson requested review from slaren and phymbert March 27, 2024 14:52

ngxson commented Mar 27, 2024

View reviewed changes

examples/gguf-split/gguf-split.cpp Outdated Show resolved Hide resolved

ngxson commented Mar 27, 2024

View reviewed changes

be positive

83f944c

slaren reviewed Mar 27, 2024

View reviewed changes

examples/gguf-split/gguf-split.cpp Outdated Show resolved Hide resolved

remove next_metadata_size

38d0e1d

slaren approved these changes Mar 29, 2024

View reviewed changes

ngxson merged commit f7fc5f6 into ggerganov:master Mar 29, 2024
56 of 57 checks passed

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

split: allow --split-max-size option (ggerganov#6343)

fd23ce0

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024

split: allow --split-max-size option (ggerganov#6343)

1a1d3ca

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

This was referenced Apr 3, 2024

GGUF Sharded model metadata display might have a memory leak huggingface/huggingface.js#603

Closed

GGUF: missing split.no metadata huggingface/huggingface.js#604

Closed

phymbert mentioned this pull request Apr 3, 2024

gguf-split add a default option to not include tensors data in first shard #6463

Closed

phymbert mentioned this pull request Apr 13, 2024

Fix --split-max-size #6655

Merged

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024

split: allow --split-max-size option (ggerganov#6343)

583ef94

* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

split: allow --split-max-size option #6343

split: allow --split-max-size option #6343

ngxson commented Mar 27, 2024 •

edited

Loading

phymbert left a comment

ngxson commented Mar 27, 2024

ngxson commented Mar 27, 2024 •

edited

Loading

phymbert commented Mar 27, 2024

ngxson commented Mar 27, 2024 •

edited

Loading

ngxson Mar 27, 2024

phymbert commented Mar 27, 2024

ngxson commented Mar 27, 2024 •

edited

Loading

phymbert commented Mar 30, 2024

ngxson commented Mar 30, 2024

phymbert commented Mar 30, 2024

ngxson commented Mar 30, 2024

phymbert commented Mar 30, 2024

phymbert commented Mar 30, 2024

phymbert commented Apr 2, 2024 •

edited

Loading

ngxson commented Apr 2, 2024 •

edited

Loading

phymbert commented Apr 2, 2024

ggerganov commented Apr 3, 2024

julien-c commented Apr 3, 2024

ngxson commented Apr 3, 2024

julien-c commented Apr 3, 2024

ggerganov commented Apr 3, 2024

phymbert commented Apr 3, 2024 •

edited

Loading

split: allow --split-max-size option #6343

split: allow --split-max-size option #6343

Conversation

ngxson commented Mar 27, 2024 • edited Loading

phymbert left a comment

Choose a reason for hiding this comment

ngxson commented Mar 27, 2024

ngxson commented Mar 27, 2024 • edited Loading

phymbert commented Mar 27, 2024

ngxson commented Mar 27, 2024 • edited Loading

ngxson Mar 27, 2024

Choose a reason for hiding this comment

phymbert commented Mar 27, 2024

ngxson commented Mar 27, 2024 • edited Loading

phymbert commented Mar 30, 2024

ngxson commented Mar 30, 2024

phymbert commented Mar 30, 2024

ngxson commented Mar 30, 2024

phymbert commented Mar 30, 2024

phymbert commented Mar 30, 2024

phymbert commented Apr 2, 2024 • edited Loading

ngxson commented Apr 2, 2024 • edited Loading

phymbert commented Apr 2, 2024

ggerganov commented Apr 3, 2024

julien-c commented Apr 3, 2024

ngxson commented Apr 3, 2024

julien-c commented Apr 3, 2024

ggerganov commented Apr 3, 2024

phymbert commented Apr 3, 2024 • edited Loading

ngxson commented Mar 27, 2024 •

edited

Loading

ngxson commented Mar 27, 2024 •

edited

Loading

ngxson commented Mar 27, 2024 •

edited

Loading

ngxson commented Mar 27, 2024 •

edited

Loading

phymbert commented Apr 2, 2024 •

edited

Loading

ngxson commented Apr 2, 2024 •

edited

Loading

phymbert commented Apr 3, 2024 •

edited

Loading