Efficient preloading for mmap() #869
Conversation
…ng mmap(). This brings back consistency, so benchmarking token inference no longer depends on SSD/disk speed.
Would you be able to re-integrate support for reporting load progress to the API? That functionality was essentially broken by the mmap PR, so it would be nice to have it back, at least in this mode.
Can you provide some tests that prove the change actually does improve the situation? We already have these 3 mechanisms in master (after the #801 merge), so I am wondering if and how the proposed patch improves things.
Please correct me if I'm wrong, but all 3 methods introduced/discussed are non-blocking, while we need a blocking call. My method is universal across all architectures and quite simple, and it's the only one that really blocks until loading is finished, which is the main problem I wanted to address. It was super late yesterday; the current version iterates through the memory in memory-page-size steps. Thinking about it more, it would likely be beneficial to allocate a couple of MB of RAM and read the mapping in larger blocks instead of one byte at a time. I'll test it more in terms of speed and try a variant that reads larger memory blocks. Update 1 after extensive testing:
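The "read in larger blocks" idea above could be sketched roughly like this: instead of touching one byte per page, copy the mapping into a small scratch buffer in large chunks, so the OS faults whole ranges in at once. This is a minimal illustration, not the PR's code; the names `prefault_chunks` and `scratch` are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: prefault a mapped region by copying it into a
 * reusable scratch buffer, chunk by chunk. Each memcpy forces the OS to
 * page the source range in, and the call does not return until it has. */
void prefault_chunks(const uint8_t *map, size_t len,
                     uint8_t *scratch, size_t chunk)
{
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        memcpy(scratch, map + off, n); /* blocking reads of the mapping */
    }
}
```

In the real use case `chunk` would be a few MB; the scratch buffer is deliberately small and reused so the copy itself does not double the memory footprint.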
Just to note: if you use mlock, it is a blocking call, but I recognize there are reasons not to use it.
… available threads on the system
Tested on Windows: a small performance hit during loading is not avoidable, but this is the fastest method I found.
On Linux: madvise needs a test to confirm it's working; otherwise readahead() needs to be implemented in the TODO region.
Alright, I'm burned out on this one by now. It was intended to be a quick add-on; it turned into a nightmare. The goal of this commit: I ran extensive benchmarks of a variety of different methods and combinations, always alongside realtime graphs of system memory consumption. The added commits are likely the best solution possible on Windows; if anyone can make it faster, I'd be amazed.
Solution: benchmarks below.
- Prefetch + multithreaded access of one byte per page (this commit)
Below are more test benchmarks; note the timings differ by up to 900 ms because the computer was in a different state and slightly faster in general.
- The previous release: PrefetchVirtualMemory + VirtualLock only (token timings are incorrect due to disk I/O)
- PrefetchVirtualMemory only (token timings are incorrect due to disk I/O)
- 64 MB chunk sequential access
- No prefetch, single-threaded page-size method
Todo:
Why are you testing performance on Windows? There will always be overhead due to slow drivers.
Because I use it on Windows, and quite likely most users here use it on Windows too. Looks like it needs some work for Linux. During benchmarking/testing I always emptied the memory before running a test: I used another model for that and loaded it until swapping started, then restarted the new test.
Another platform-independent solution is to just cast to a volatile pointer and prefault every page by looping over the pages and reading a byte from each. |
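The volatile-pointer suggestion above could be sketched like this; `volatile` keeps the compiler from deleting the otherwise-dead loads. The function name and XOR accumulator are just for illustration (the XOR makes the work observable):

```c
#include <stddef.h>

/* Platform-independent prefault: one volatile read per page. Each read
 * triggers a blocking page fault if the page is not yet resident. */
unsigned char touch_pages(const unsigned char *base, size_t len, size_t page)
{
    const volatile unsigned char *p = base;
    unsigned char acc = 0;
    for (size_t off = 0; off < len; off += page)
        acc ^= p[off];
    return acc;
}
```

An alternative to `volatile` is accumulating into a value that is returned or stored, which likewise prevents the loop from being optimized away.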
It's only psychologically nice to have the program jump to the inference part directly after the mmap load.
The entire model is actually loaded during the first inference, which means our inference timings are influenced by SSD/disk latency.
There is currently no way to compare first-inference timings because of that.
There was some discussion that the model is "sparse" so mmap would save memory - that's not correct, so we can preload the whole file.
This function iterates through the pages right after mmap(); it reads the absolute minimum required to force the OS to fully cache the file.
It is tested on Windows and works; it still needs a test on Linux.
It should be considered for default-on whenever mmap is used; I'm not sure there is a downside that outweighs the benefits.
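Putting the pieces described above together, an end-to-end preload on a POSIX system might look like the sketch below: map the file read-only, then read one byte per page so the whole file lands in the page cache before the first inference. `preload_file` is a hypothetical helper for illustration, not the llama.cpp API.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical end-to-end preload: mmap the file and fault every page in
 * with one blocking read per page. Returns 0 on success, -1 on error. */
int preload_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    size_t len = (size_t)st.st_size;

    unsigned char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the fd is closed */
    if (map == MAP_FAILED) return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char sink = 0;
    for (size_t off = 0; off < len; off += page)
        sink ^= map[off]; /* blocking page fault per page */
    (void)sink;

    munmap(map, len);
    return 0;
}
```

In a real loader the mapping would of course be kept (not unmapped) and the touch loop could be split across threads, as the Windows benchmarks above suggest.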