
Efficient preloading for mmap() #869


Closed
wants to merge 4 commits into from

Conversation

cmp-nct
Contributor

@cmp-nct cmp-nct commented Apr 10, 2023

It's only psychologically nice to have the program jump straight to the inference part after the mmap() load.
The entire model is actually loaded during the first inference, which means our inference timings are influenced by SSD/disk latency.
There is currently no way to get comparable first-inference numbers because of that.
There was some discussion that the model is "sparse" so mmap would save memory; that's not correct, so we can preload it.

This function iterates through the pages right after mmap(); it reads the absolute minimum required to force the OS to fully cache the file.
It is tested on Windows and works.
It still needs a test on Linux.
It should be considered for default-on whenever mmap is used; I'm not sure there is a downside that outweighs the benefits.
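The approach can be sketched on the POSIX side roughly like this (illustrative only: the helper names are made up, and the actual PR also has a Windows path):

```cpp
#include <cstdlib>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Touch one byte per page so the OS faults the whole mapping into the
// page cache; returns the number of pages visited. Sketch only, not the
// PR's exact code.
static size_t prefault_mapping(const void * addr, size_t len) {
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    const volatile char * p = (const volatile char *) addr;
    volatile char sink = 0;
    size_t pages = 0;
    for (size_t off = 0; off < len; off += page, ++pages) {
        sink = p[off];  // one read per page is enough to trigger the fault
    }
    (void) sink;
    return pages;
}

// Self-contained demo: map a 1 MiB temporary file and prefault it.
static bool demo_prefault() {
    char path[] = "/tmp/prefault_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 1u << 20;
    bool ok = false;
    if (ftruncate(fd, (off_t) len) == 0) {
        void * addr = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            ok = prefault_mapping(addr, len) == len / (size_t) sysconf(_SC_PAGESIZE);
            munmap(addr, len);
        }
    }
    close(fd);
    unlink(path);
    return ok;
}
```

The volatile qualifier keeps the compiler from optimizing the otherwise-unused reads away.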

John added 2 commits April 10, 2023 05:14
…ng mmap(). This brings back consistency so benchmarking token inference does not depend on ssd/disk speed anymore.
@j-f1
Collaborator

j-f1 commented Apr 10, 2023

Would you be able to re-integrate support for reporting load progress to the API? That functionality was essentially broken by the mmap PR so it would be nice to have it back at least in this mode.

@prusnak
Collaborator

prusnak commented Apr 10, 2023

Can you provide some tests that prove the change actually improves the situation?

We already have these 3 mechanisms in master (after #801 merge), so I am wondering if and how the proposed patch improves things.

  • MAP_POPULATE on Linux

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L169-L171

  • madvise on Unix (Linux+macOS)

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L178-L182

  • PrefetchVirtualMemory on Windows

https://github.com/ggerganov/llama.cpp/blob/180b693a47b6b825288ef9f2c39d24b6eea4eea6/llama_util.h#L212-L219
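For reference, the Unix side of those hints can be sketched like this (illustrative helper, not the code linked above; the relevant property is that these are either hints or happen inside mmap() itself, with no separate "wait until cached" step afterwards):

```cpp
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Map a file read-only with the existing prefetch hints applied.
// MAP_POPULATE (Linux) pre-faults during mmap(); MADV_WILLNEED is an
// advisory, asynchronous readahead hint. Neither provides a blocking
// "everything is now cached" guarantee after this function returns.
static void * map_with_hints(int fd, size_t len) {
    int flags = MAP_SHARED;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;
#endif
    void * addr = mmap(nullptr, len, PROT_READ, flags, fd, 0);
    if (addr == MAP_FAILED) return nullptr;
#ifdef MADV_WILLNEED
    madvise(addr, len, MADV_WILLNEED);  // best-effort hint; result ignored
#endif
    return addr;
}

// Demo: apply the hints to a small temporary file.
static bool demo_hints() {
    char path[] = "/tmp/hints_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 64 * 1024;
    bool ok = false;
    if (ftruncate(fd, (off_t) len) == 0) {
        void * addr = map_with_hints(fd, len);
        if (addr != nullptr) {
            ok = true;
            munmap(addr, len);
        }
    }
    close(fd);
    unlink(path);
    return ok;
}
```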

@cmp-nct
Contributor Author

cmp-nct commented Apr 10, 2023

Can you provide some tests that proof the change actually does improve the situation?

  1. madvise is not likely to help with the problem; I don't know how much it affects the kernel, but it's not blocking.
  2. I believe MAP_POPULATE is the same: it won't turn mmap() into a blocking call, right?
  3. PrefetchVirtualMemory is non-blocking as well AFAIK, but I have no experience with it.

Please correct me if I'm wrong, but all 3 methods introduced/discussed are non-blocking, and we need a blocking call.
The only case where blocking is not wanted is when we do not have enough memory to cache the whole model, but that special case also needs special handling (mlock part of the model and swap the remainder; I'd still block-read the mlocked region as a preload).

My method is universal across all architectures, quite simple, and the only one that actually blocks until loading is finished, which is the main problem I wanted to address.

It was super late yesterday; the current version iterates through the memory in page-size steps. Thinking about it more, it would likely be beneficial to allocate a couple of MB of RAM and read in larger blocks instead of one byte per page.
The idea of accessing one byte per page was to get a sequential read from disk: the OS populates the pages fully, while the program only looks at one byte (fewer CPU cycles).

I'll test the speed further and try a variant that reads in larger memory blocks.
The benefit of that variant is that we could completely ignore the OS-specific calls (page size).
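Such a larger-block variant could look roughly like this (illustrative sketch: it warms the cache through the file descriptor with plain blocking read() calls, so no page-size queries are needed):

```cpp
#include <cstdlib>
#include <unistd.h>
#include <vector>

// Read the whole file in fixed-size chunks into a scratch buffer. The data
// is discarded; the useful side effect is that the OS page cache ends up
// holding the file. read() blocks, so this is a true blocking preload.
// Returns the number of bytes read. (Illustrative sketch.)
static size_t warm_file_cache(int fd, size_t chunk_bytes = 4u << 20) {
    std::vector<char> buf(chunk_bytes);
    if (lseek(fd, 0, SEEK_SET) < 0) return 0;
    size_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        total += (size_t) n;
    }
    return total;
}

// Demo: warm a 3 MiB temporary file (deliberately not a chunk multiple).
static bool demo_warm() {
    char path[] = "/tmp/warm_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 3u << 20;
    bool ok = ftruncate(fd, (off_t) len) == 0 && warm_file_cache(fd) == len;
    close(fd);
    unlink(path);
    return ok;
}
```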

Update 1 after extensive testing:
The current code is not suitable for use yet; it comes with a loading performance hit.
PrefetchVirtualMemory is blocking, but it's not sufficient on its own.
I managed to cut the loading performance hit in half, but it's not yet where it needs to be.

@comex
Contributor

comex commented Apr 10, 2023

Just to note, if you use mlock, it is a blocking call, but I recognize there are reasons not to use it.
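To illustrate the point: mlock() both faults every page in and pins it in RAM before returning, so it works as a blocking preload. The sketch below (POSIX, illustrative) also hints at one reason to avoid it: RLIMIT_MEMLOCK can make it fail on default configurations.

```cpp
#include <cerrno>
#include <sys/mman.h>

// mlock() is a blocking way to force-load a region: it does not return
// until every page is resident and pinned. Downsides: it counts against
// RLIMIT_MEMLOCK, and pinned pages cannot be reclaimed under pressure.
static bool demo_mlock() {
    const size_t len = 16 * 1024;  // small, to stay under typical lock limits
    void * addr = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return false;
    int rc = mlock(addr, len);
    if (rc == 0) munlock(addr, len);
    munmap(addr, len);
    // Treat resource-limit failures as an environment restriction rather
    // than a logic error, since RLIMIT_MEMLOCK varies between systems.
    return rc == 0 || errno == ENOMEM || errno == EPERM || errno == EAGAIN;
}
```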

John added 2 commits April 11, 2023 00:28
… available threads on the system

Tested on Windows - a small performance hit during loading is not avoidable, but this is the fastest method I found
On Linux - madvise needs a test to see if it's working; otherwise readahead() needs to be implemented in the TODO region
@cmp-nct
Contributor Author

cmp-nct commented Apr 10, 2023

Alright, I'm burned out on this one by now. It was intended to be a quick add-on; it turned into a nightmare.

The goal of this commit:
Inference timings with mmap are spiked by disk latency and unreliable, because loading was non-blocking and leaked into inference.
Loading needs to be blocking while not giving up the significant benefits of mmap.

I ran extensive benchmarks of a variety of methods and combinations, always alongside realtime graphs of system memory consumption.
I currently have a loading performance hit of 300 ms on the 65B 4-bit model, which is more than I hoped for.

The added commits are probably the best possible solution on Windows; if anyone can make it faster, I'd be amazed.
For Linux I added a TODO mark; I believe it will require readahead() at that point (maybe madvise will do it too, but that needs a test and benchmarks).
Main Findings:

  1. PrefetchVirtualMemory -> this alone does not work. It releases the memory and does not fetch everything, causing disk-latency chaos during first inference.
  2. mlock (on Windows, VirtualLock) is not beneficial to performance, so I commented it out as optional.
  3. The optimizer is quite some trouble, and Windows seems to discard memory touched by worker threads more readily than memory touched by a single thread. I originally wanted to run the threads in synchronized chunks of 100 pages each, but Windows discarded the memory in those cases (a mutex and IPC was too much work).
  4. There is a loading performance hit due to the per-page processing; the multithreaded approach reduced it to at most 3-4% of loading time on my system. Of course there is no hit on inference, which was the main goal; timings are stable and reproducible now.

Solution:
So what I ended up with is a multithreaded sparse pseudo-sequential read over the pages, following a PrefetchVirtualMemory (readahead) call.
It comes with a tiny performance impact on load, unavoidable given it has to access millions of memory pages once.
The inference times are clean now; they were a mess before.
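The multithreaded page-touch pass described here can be sketched in portable C++ roughly as follows (illustrative, not the PR's Windows code; each thread gets one contiguous page-aligned slice so the disk still sees mostly sequential access):

```cpp
#include <atomic>
#include <cstdlib>
#include <sys/mman.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Split the mapping into one contiguous range of pages per thread and
// touch one byte per page in each range. Returns the number of pages
// touched. (Sketch of the approach, not the PR's exact code.)
static size_t prefault_parallel(const char * base, size_t len, unsigned nthreads) {
    const size_t page   = (size_t) sysconf(_SC_PAGESIZE);
    const size_t npages = (len + page - 1) / page;
    if (nthreads == 0) nthreads = 1;
    const size_t per = (npages + nthreads - 1) / nthreads;
    std::atomic<size_t> touched{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        const size_t begin = (size_t) t * per;
        const size_t end   = begin + per < npages ? begin + per : npages;
        if (begin >= end) break;
        workers.emplace_back([=, &touched] {
            volatile char sink = 0;
            for (size_t i = begin; i < end; ++i) {
                sink = base[i * page];  // one fault per page
            }
            (void) sink;
            touched += end - begin;
        });
    }
    for (auto & w : workers) w.join();
    return touched.load();
}

// Demo: prefault a 1 MiB anonymous mapping with 4 threads.
static bool demo_parallel() {
    const size_t len = 1u << 20;
    void * addr = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return false;
    const size_t pages = prefault_parallel((const char *) addr, len, 4);
    munmap(addr, len);
    return pages == len / (size_t) sysconf(_SC_PAGESIZE);
}
```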

Benchmarks below
original mmap - unmodified (notice the wrong token time):
llama_print_timings: load time = 10066.06 ms
llama_print_timings: sample time = 3.78 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 6854.49 ms / 12 tokens ( 571.21 ms per token)
llama_print_timings: eval time = 10004.66 ms / 14 runs ( 714.62 ms per run)
llama_print_timings: total time = 22589.35 ms
00:00:26.3251858 (process start, end + inference)

Using Prefetch + multithreaded access into one byte per page (this commit)
Preloading file took 8442861 us (8442 ms) for 40808468096 bytes
llama_print_timings: load time = 10109.92 ms
llama_print_timings: sample time = 3.90 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3551.38 ms / 12 tokens ( 295.95 ms per token)
llama_print_timings: eval time = 10155.54 ms / 14 runs ( 725.40 ms per run)
llama_print_timings: total time = 22817.76 ms
00:00:26.6295628

Below are more test benchmarks; note the timings here differ by up to 900 ms, as the computer was in a different state and slightly faster in general.

The previous release:
using Prefetch + native access single thread
Preloading file took 14843566 us (14843 ms) for 40808468096 bytes
llama_print_timings: load time = 16422.49 ms
llama_print_timings: sample time = 4.01 ms / 16 runs ( 0.25 ms per run)
llama_print_timings: prompt eval time = 3631.95 ms / 12 tokens ( 302.66 ms per token)
llama_print_timings: eval time = 10180.20 ms / 14 runs ( 727.16 ms per run)
llama_print_timings: total time = 29116.35 ms
00:00:32.9779445

using PrefetchVirtualMemory + VirtualLock only (token timings are incorrect due to disk IO):
Preloading file took 5592977 us (5592 ms) for 40808468096 bytes
llama_print_timings: load time = 9489.53 ms
llama_print_timings: sample time = 3.83 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 5893.47 ms / 12 tokens ( 491.12 ms per token)
llama_print_timings: eval time = 9953.94 ms / 14 runs ( 711.00 ms per run)
llama_print_timings: total time = 21925.88 ms
00:00:25.6962610

using PrefetchVirtualMemory only (token timings are incorrect due to disk IO):
Preloading file took 5659007 us (5659 ms) for 40808468096 bytes
llama_print_timings: load time = 9005.27 ms
llama_print_timings: sample time = 3.97 ms / 16 runs ( 0.25 ms per run)
llama_print_timings: prompt eval time = 5359.45 ms / 12 tokens ( 446.62 ms per token)
llama_print_timings: eval time = 10033.04 ms / 14 runs ( 716.65 ms per run)
llama_print_timings: total time = 21538.04 ms
00:00:25.7903725

64MB chunk access sequential
Preloading file took 29471504 us (29471 ms) for 40808468096 bytes
llama_print_timings: load time = 30933.04 ms
llama_print_timings: sample time = 3.83 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3526.53 ms / 12 tokens ( 293.88 ms per token)
llama_print_timings: eval time = 10046.89 ms / 14 runs ( 717.64 ms per run)
llama_print_timings: total time = 43497.46 ms

No Prefetch and single thread page size method:
Preloading file took 24562987 us (24562 ms) for 40808468096 bytes
llama_print_timings: load time = 26195.29 ms
llama_print_timings: sample time = 3.87 ms / 16 runs ( 0.24 ms per run)
llama_print_timings: prompt eval time = 3750.56 ms / 12 tokens ( 312.55 ms per token)
llama_print_timings: eval time = 11225.91 ms / 14 runs ( 801.85 ms per run)
llama_print_timings: total time = 40006.16 ms

Todo:
Linux needs its own benchmarks. The code should work for Linux too, but it needs a test.
Maybe madvise() will trigger a prefetch; then the code should work 1:1.
If not, we need to use a readahead() call on the file descriptor.
If both fail, n_threads needs to be benchmarked: either the multithreaded approach works without prefetch, or n_threads needs to be fixed at 1.
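The readahead() route for the Linux TODO could be sketched like this (illustrative: readahead(2) is Linux-specific, posix_fadvise(POSIX_FADV_WILLNEED) is the portable fallback, and both are asynchronous hints, so the blocking page-touch pass is still needed afterwards):

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start populating the page cache for the whole file.
// readahead(2) is Linux-only (glibc declares it under _GNU_SOURCE);
// posix_fadvise() is the POSIX fallback. Both return without waiting,
// so they do not replace a blocking read pass, only front-run it.
static bool hint_readahead(int fd, size_t len) {
#ifdef __linux__
    if (readahead(fd, 0, len) == 0) return true;
#endif
    return posix_fadvise(fd, 0, (off_t) len, POSIX_FADV_WILLNEED) == 0;
}

// Demo: issue the hint on a small temporary file.
static bool demo_readahead() {
    char path[] = "/tmp/ra_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    const size_t len = 64 * 1024;
    bool ok = ftruncate(fd, (off_t) len) == 0 && hint_readahead(fd, len);
    close(fd);
    unlink(path);
    return ok;
}
```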

@cmp-nct cmp-nct changed the title Efficient preloading for mmap() (disabled by default) Efficient preloading for mmap() Apr 10, 2023
@ivanstepanovftw
Collaborator

Why are you testing performance on Windows? There will always be overhead due to slow drivers.

@cmp-nct
Contributor Author

cmp-nct commented Apr 11, 2023

Why are you testing performance on Windows? There will be always overhead due to slow drivers.

Because I use it on Windows, and quite likely most users here use it on Windows.
Regarding the drivers, that sounds like a bold claim which needs an equally bold proof.
My system sustains 12 GB/s sequential from disk and 105 GB/s from RAM; I can't complain about any drivers. Windows generally has a reputation for providing performant drivers, so I'd call that its biggest bonus among all the other disadvantages.

Looks like it needs some work for Linux.
On Windows I also, rarely, see the preloaded memory being discarded.
I once saw the preloading apparently run single-threaded, but could not reproduce that.

During benchmarking/testing I always emptied the memory before running a test: I loaded another model until swapping started, then restarted the new test.
So I am not sure what causes this random behavior; maybe it's not possible to reliably preload with mmap().
I have the current code in use; it works, but it's not where I want it to be (100% reliability of preloading everything during load()).

@apaz-cli
Contributor

Another platform-independent solution is to just cast to a volatile pointer and prefault every page by looping over the pages and reading a byte from each.

@cmp-nct cmp-nct closed this by deleting the head repository Jun 13, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
* Update llama_chat_format.py

properly formal llama2 with first-message prompt embedded

* Update llama_chat_format.py
6 participants