
Efficient preloading for mmap() #869


Closed
wants to merge 4 commits into from

Conversation

cmp-nct
Contributor

@cmp-nct cmp-nct commented Apr 10, 2023

It's only psychologically nice to have the program jump straight to the inference part after the mmap() load.
The entire model is actually loaded during the first inference, which means our inference timings are influenced by SSD/disk latency.
There is currently no way to get comparable first-inference numbers because of that.
There was some discussion that the model is "sparse" so mmap would save memory; that's not correct, so we can preload it.

This function iterates through the pages right after mmap(); it reads the absolute minimum required to force the OS to fully cache the file.
It is tested on Windows and works.
It still needs a test on Linux.
It should be considered for default-on whenever mmap is used; I'm not sure there is a downside that outweighs the benefits.
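The approach can be sketched on the POSIX side roughly like this (illustrative only: the helper names are made up, and the actual PR also has a Windows path):

```cpp
#include <cstdlib>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Touch one byte per page so the OS faults the whole mapping into the
// page cache; returns the number of pages visited. Sketch only, not the
// PR's exact code.
static size_t prefault_mapping(const void * addr, size_t len) {
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    const volatile char * p = (const volatile char *) addr;
    volatile char sink = 0;
    size_t pages = 0;
    for (size_t off = 0; off < len; off += page, ++pages) {
        sink = p[off];  // one read per page is enough to trigger the fault
    }
    (void) sink;
    return pages;
}

// Self-contained demo: map a 1 MiB temporary file and prefault it.
static bool demo_prefault() {
    char path[] = "/tmp/prefault_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 1u << 20;
    bool ok = false;
    if (ftruncate(fd, (off_t) len) == 0) {
        void * addr = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            ok = prefault_mapping(addr, len) == len / (size_t) sysconf(_SC_PAGESIZE);
            munmap(addr, len);
        }
    }
    close(fd);
    unlink(path);
    return ok;
}
```

The volatile qualifier keeps the compiler from optimizing the otherwise-unused reads away.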

John added 2 commits April 10, 2023 05:14
…ng mmap(). This brings back consistency so benchmarking token inference does not depend on ssd/disk speed anymore.
@j-f1
Collaborator

j-f1 commented Apr 10, 2023

Would you be able to re-integrate support for reporting load progress to the API? That functionality was essentially broken by the mmap PR so it would be nice to have it back at least in this mode.

@prusnak
Collaborator

prusnak commented Apr 10, 2023

Can you provide some tests that prove the change actually improves the situation?

We already have these 3 mechanisms in master (after #801 merge), so I am wondering if and how the proposed patch improves things.

  • MAP_POPULATE on Linux

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L169-L171

  • madvise on Unix (Linux+macOS)

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L178-L182

  • PrefetchVirtualMemory on Windows

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L212-L219
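For reference, the Unix side of those hints can be sketched like this (illustrative helper, not the code linked above; the relevant property is that these are either hints or happen inside mmap() itself, with no separate "wait until cached" step afterwards):

```cpp
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Map a file read-only with the existing prefetch hints applied.
// MAP_POPULATE (Linux) pre-faults during mmap(); MADV_WILLNEED is an
// advisory, asynchronous readahead hint. Neither provides a blocking
// "everything is now cached" guarantee after this function returns.
static void * map_with_hints(int fd, size_t len) {
    int flags = MAP_SHARED;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;
#endif
    void * addr = mmap(nullptr, len, PROT_READ, flags, fd, 0);
    if (addr == MAP_FAILED) return nullptr;
#ifdef MADV_WILLNEED
    madvise(addr, len, MADV_WILLNEED);  // best-effort hint; result ignored
#endif
    return addr;
}

// Demo: apply the hints to a small temporary file.
static bool demo_hints() {
    char path[] = "/tmp/hints_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 64 * 1024;
    bool ok = false;
    if (ftruncate(fd, (off_t) len) == 0) {
        void * addr = map_with_hints(fd, len);
        if (addr != nullptr) {
            ok = true;
            munmap(addr, len);
        }
    }
    close(fd);
    unlink(path);
    return ok;
}
```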

@cmp-nct
Contributor Author

cmp-nct commented Apr 10, 2023

Can you provide some tests that proof the change actually does improve the situation?

  1. madvise is not likely to help with the problem; I don't know how much it affects the kernel, but it's not blocking.
  2. I believe MAP_POPULATE is the same: it won't turn mmap() into a blocking call, right?
  3. PrefetchVirtualMemory is non-blocking as well AFAIK, but I have no experience with it.

Please correct me if I'm wrong, but all 3 methods introduced/discussed are non-blocking, and we need a blocking call.
The only case where blocking is not wanted is when we do not have enough memory to cache the whole model, but that special case also needs special handling (mlock part of the model and swap the remainder; I'd still block-read the mlocked region as a preload).

My method is universal across all architectures, quite simple, and the only one that actually blocks until loading is finished, which is the main problem I wanted to address.

It was super late yesterday; the current version iterates through the memory in page-size steps. Thinking about it more, it would likely be beneficial to allocate a couple of MB of RAM and read in larger blocks instead of one byte per page.
The idea of accessing one byte per page was to get a sequential read from disk: the OS populates the pages fully, while the program only looks at one byte (fewer CPU cycles).

I'll test the speed further and try a variant that reads in larger memory blocks.
The benefit of that variant is that we could completely ignore the OS-specific calls (page size).
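Such a larger-block variant could look roughly like this (illustrative sketch: it warms the cache through the file descriptor with plain blocking read() calls, so no page-size queries are needed):

```cpp
#include <cstdlib>
#include <unistd.h>
#include <vector>

// Read the whole file in fixed-size chunks into a scratch buffer. The data
// is discarded; the useful side effect is that the OS page cache ends up
// holding the file. read() blocks, so this is a true blocking preload.
// Returns the number of bytes read. (Illustrative sketch.)
static size_t warm_file_cache(int fd, size_t chunk_bytes = 4u << 20) {
    std::vector<char> buf(chunk_bytes);
    if (lseek(fd, 0, SEEK_SET) < 0) return 0;
    size_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        total += (size_t) n;
    }
    return total;
}

// Demo: warm a 3 MiB temporary file (deliberately not a chunk multiple).
static bool demo_warm() {
    char path[] = "/tmp/warm_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 3u << 20;
    bool ok = ftruncate(fd, (off_t) len) == 0 && warm_file_cache(fd) == len;
    close(fd);
    unlink(path);
    return ok;
}
```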

Update 1 after extensive testing:
The current code is not suitable for use yet; it comes with a loading performance hit.
PrefetchVirtualMemory is blocking, but it's not sufficient on its own.
I managed to cut the loading performance hit in half, but it's not yet where it needs to be.

@comex
Contributor

comex commented Apr 10, 2023

Just to note, if you use mlock, it is a blocking call, but I recognize there are reasons not to use it.
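To illustrate the point: mlock() both faults every page in and pins it in RAM before returning, so it works as a blocking preload. The sketch below (POSIX, illustrative) also hints at one reason to avoid it: RLIMIT_MEMLOCK can make it fail on default configurations.

```cpp
#include <cerrno>
#include <sys/mman.h>

// mlock() is a blocking way to force-load a region: it does not return
// until every page is resident and pinned. Downsides: it counts against
// RLIMIT_MEMLOCK, and pinned pages cannot be reclaimed under pressure.
static bool demo_mlock() {
    const size_t len = 16 * 1024;  // small, to stay under typical lock limits
    void * addr = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return false;
    int rc = mlock(addr, len);
    if (rc == 0) munlock(addr, len);
    munmap(addr, len);
    // Treat resource-limit failures as an environment restriction rather
    // than a logic error, since RLIMIT_MEMLOCK varies between systems.
    return rc == 0 || errno == ENOMEM || errno == EPERM || errno == EAGAIN;
}
```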

John added 2 commits April 11, 2023 00:28
… available threads on the system

Tested on Windows - a small performance hit during loading is not avoidable, but this is the fastest method I found
On Linux - madvise needs a test to see if it's working; otherwise readahead() needs to be implemented in the TODO region
@cmp-nct
Contributor Author

cmp-nct commented Apr 10, 2023

Alright, I'm burned out on this one by now. It was intended to be a quick add-on; it turned into a nightmare.

The goal of this commit:
Inference timings with mmap are spiked by disk latency and unreliable, because loading was non-blocking and leaked into inference.
Loading needs to be blocking while not giving up the significant benefits of mmap.

I ran extensive benchmarks of a variety of methods and combinations, always alongside realtime graphs of system memory consumption.
I currently have a loading performance hit of 300 ms on the 65B 4-bit model, which is more than I hoped for.

The added commits are probably the best possible solution on Windows; if anyone can make it faster, I'd be amazed.
For Linux I added a TODO mark; I believe it will require readahead() at that point (maybe madvise will do it too, but that needs a test and benchmarks).
Main Findings:

  1. PrefetchVirtualMemory -> this alone does not work. It releases the memory and does not fetch everything, causing disk-latency chaos during first inference.
  2. mlock (on Windows, VirtualLock) is not beneficial to performance, so I commented it out as optional.
  3. The optimizer is quite some trouble, and Windows seems to discard memory touched by worker threads more readily than memory touched by a single thread. I originally wanted to run the threads in synchronized chunks of 100 pages each, but Windows discarded the memory in those cases (a mutex and IPC was too much work).
  4. There is a loading performance hit due to the per-page processing; the multithreaded approach reduced it to at most 3-4% of loading time on my system. Of course there is no hit on inference, which was the main goal; timings are stable and reproducible now.

Solution:
So what I ended up with is a multithreaded sparse pseudo-sequential read over the pages, following a PrefetchVirtualMemory (readahead) call.
It comes with a tiny performance impact on load, unavoidable given it has to access millions of memory pages once.
The inference times are clean now; they were a mess before.
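The multithreaded page-touch pass described here can be sketched in portable C++ roughly as follows (illustrative, not the PR's Windows code; each thread gets one contiguous page-aligned slice so the disk still sees mostly sequential access):

```cpp
#include <atomic>
#include <cstdlib>
#include <sys/mman.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Split the mapping into one contiguous range of pages per thread and
// touch one byte per page in each range. Returns the number of pages
// touched. (Sketch of the approach, not the PR's exact code.)
static size_t prefault_parallel(const char * base, size_t len, unsigned nthreads) {
    const size_t page   = (size_t) sysconf(_SC_PAGESIZE);
    const size_t npages = (len + page - 1) / page;
    if (nthreads == 0) nthreads = 1;
    const size_t per = (npages + nthreads - 1) / nthreads;
    std::atomic<size_t> touched{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        const size_t begin = (size_t) t * per;
        const size_t end   = begin + per < npages ? begin + per : npages;
        if (begin >= end) break;
        workers.emplace_back([=, &touched] {
            volatile char sink = 0;
            for (size_t i = begin; i < end; ++i) {
                sink = base[i * page];  // one fault per page
            }
            (void) sink;
            touched += end - begin;
        });
    }
    for (auto & w : workers) w.join();
    return touched.load();
}

// Demo: prefault a 1 MiB anonymous mapping with 4 threads.
static bool demo_parallel() {
    const size_t len = 1u << 20;
    void * addr = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return false;
    const size_t pages = prefault_parallel((const char *) addr, len, 4);
    munmap(addr, len);
    return pages == len / (size_t) sysconf(_SC_PAGESIZE);
}
```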

Benchmarks below
original mmap - unmodified (notice the wrong token time):
llama_print_timings: load time = 10066.06 ms
llama_print_timings: sample time = 3.78 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 6854.49 ms / 12 tokens ( 571.21 ms per token)
llama_print_timings: eval time = 10004.66 ms / 14 runs ( 714.62 ms per run)
llama_print_timings: total time = 22589.35 ms
00:00:26.3251858 (process start, end + inference)

Using Prefetch + multithreaded access into one byte per page (this commit)
Preloading file took 8442861 us (8442 ms) for 40808468096 bytes
llama_print_timings: load time = 10109.92 ms
llama_print_timings: sample time = 3.90 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3551.38 ms / 12 tokens ( 295.95 ms per token)
llama_print_timings: eval time = 10155.54 ms / 14 runs ( 725.40 ms per run)
llama_print_timings: total time = 22817.76 ms
00:00:26.6295628

Below are more test benchmarks; note the timings here differ by up to 900 ms, as the computer was in a different state and slightly faster in general.

The previous release:
using Prefetch + native access single thread
Preloading file took 14843566 us (14843 ms) for 40808468096 bytes
llama_print_timings: load time = 16422.49 ms
llama_print_timings: sample time = 4.01 ms / 16 runs ( 0.25 ms per run)
llama_print_timings: prompt eval time = 3631.95 ms / 12 tokens ( 302.66 ms per token)
llama_print_timings: eval time = 10180.20 ms / 14 runs ( 727.16 ms per run)
llama_print_timings: total time = 29116.35 ms
00:00:32.9779445

using PrefetchVirtualMemory + VirtualLock only (token timings are incorrect due to disk IO):
Preloading file took 5592977 us (5592 ms) for 40808468096 bytes
llama_print_timings: load time = 9489.53 ms
llama_print_timings: sample time = 3.83 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 5893.47 ms / 12 tokens ( 491.12 ms per token)
llama_print_timings: eval time = 9953.94 ms / 14 runs ( 711.00 ms per run)
llama_print_timings: total time = 21925.88 ms
00:00:25.6962610

using PrefetchVirtualMemory only (token timings are incorrect due to disk IO):
Preloading file took 5659007 us (5659 ms) for 40808468096 bytes
llama_print_timings: load time = 9005.27 ms
llama_print_timings: sample time = 3.97 ms / 16 runs ( 0.25 ms per run)
llama_print_timings: prompt eval time = 5359.45 ms / 12 tokens ( 446.62 ms per token)
llama_print_timings: eval time = 10033.04 ms / 14 runs ( 716.65 ms per run)
llama_print_timings: total time = 21538.04 ms
00:00:25.7903725

64MB chunk access sequential
Preloading file took 29471504 us (29471 ms) for 40808468096 bytes
llama_print_timings: load time = 30933.04 ms
llama_print_timings: sample time = 3.83 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3526.53 ms / 12 tokens ( 293.88 ms per token)
llama_print_timings: eval time = 10046.89 ms / 14 runs ( 717.64 ms per run)
llama_print_timings: total time = 43497.46 ms

No Prefetch and single thread page size method:
Preloading file took 24562987 us (24562 ms) for 40808468096 bytes
llama_print_timings: load time = 26195.29 ms
llama_print_timings: sample time = 3.87 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3750.56 ms / 12 tokens ( 312.55 ms per token)
llama_print_timings: eval time = 11225.91 ms / 14 runs ( 801.85 ms per run)
llama_print_timings: total time = 40006.16 ms

Todo:
Linux needs its own benchmarks. The code should work for Linux too, but it needs a test.
Maybe madvise() will trigger a prefetch; then the code should work 1:1.
If not, we need to use a readahead() call on the file descriptor.
If both fail, n_threads needs to be benchmarked: either the multithreaded approach works without prefetch, or n_threads needs to be fixed at 1.
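The readahead() route for the Linux TODO could be sketched like this (illustrative: readahead(2) is Linux-specific, posix_fadvise(POSIX_FADV_WILLNEED) is the portable fallback, and both are asynchronous hints, so the blocking page-touch pass is still needed afterwards):

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start populating the page cache for the whole file.
// readahead(2) is Linux-only (glibc declares it under _GNU_SOURCE);
// posix_fadvise() is the POSIX fallback. Both return without waiting,
// so they do not replace a blocking read pass, only front-run it.
static bool hint_readahead(int fd, size_t len) {
#ifdef __linux__
    if (readahead(fd, 0, len) == 0) return true;
#endif
    return posix_fadvise(fd, 0, (off_t) len, POSIX_FADV_WILLNEED) == 0;
}

// Demo: issue the hint on a small temporary file.
static bool demo_readahead() {
    char path[] = "/tmp/ra_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 64 * 1024;
    bool ok = ftruncate(fd, (off_t) len) == 0 && hint_readahead(fd, len);
    close(fd);
    unlink(path);
    return ok;
}
```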

@cmp-nct cmp-nct changed the title Efficient preloading for mmap() (disabled by default) Efficient preloading for mmap() Apr 10, 2023
@ivanstepanovftw
Collaborator

Why are you testing performance on Windows? There will always be overhead due to slow drivers.

@cmp-nct
Contributor Author

cmp-nct commented Apr 11, 2023

Why are you testing performance on Windows? There will be always overhead due to slow drivers.

Because I use it on Windows, and quite likely most users here use it on Windows.
Regarding the drivers, that sounds like a bold claim which needs an equally bold proof.
My system sustains 12 GB/s sequential from disk and 105 GB/s from RAM; I can't complain about any drivers. Windows generally has a reputation for providing performant drivers, so I'd call that its biggest bonus among all the other disadvantages.

Looks like it needs some work for Linux.
On Windows I also, rarely, see the preloaded memory being discarded.
I once saw the preloading apparently run single-threaded, but could not reproduce that.

During benchmarking/testing I always emptied the memory before running a test: I loaded another model until swapping started, then restarted the new test.
So I am not sure what causes this random behavior; maybe it's not possible to reliably preload with mmap().
I have the current code in use; it works, but it's not where I want it to be (100% reliability of preloading everything during load()).

@apaz-cli
Contributor

Another platform-independent solution is to just cast to a volatile pointer and prefault every page by looping over the pages and reading a byte from each.

@cmp-nct cmp-nct closed this by deleting the head repository Jun 13, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
* Update llama_chat_format.py

properly formal llama2 with first-message prompt embedded

* Update llama_chat_format.py
6 participants