Copy v_transposed like llama.cpp #68
Conversation
I'm a bit confused about this change. Does it increase quality? From what you're reporting, it seems to increase memory use and inference time. It probably needs some more scientific testing 😄, at least comparing the mean and stdev of 5 executions on both versions.
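As a side note, the kind of comparison being suggested is easy to script. A minimal sketch, assuming the per-token timings are collected however the CLI reports them; the numbers below are made up:

```rust
// Tiny helper for the suggested methodology: run each build a few times,
// collect ms/token, and compare mean and standard deviation.
fn mean_and_stdev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

fn main() {
    // Hypothetical ms/token measurements from 5 runs of one build.
    let runs = [62.1, 63.4, 61.8, 64.0, 62.7];
    let (mean, stdev) = mean_and_stdev(&runs);
    println!("mean = {mean:.2} ms/token, stdev = {stdev:.2} ms/token");
}
```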
@setzer22 See the comments and referenced issues in the llama.cpp pull: ggerganov/llama.cpp#439. I basically just made the same change; I can't say I really understand it. According to that pull and the linked issue, llama.cpp previously produced nondeterministic results depending on the thread count and batch size, and the pull was intended to fix that so llama.cpp could produce deterministic results. I assume the same applies to llama-rs. Right now llama-rs doesn't have a way to specify the RNG seed, so it's a less visible problem, but that's probably something that would be good to add for reproducible testing.
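For context, here is a rough sketch of the idea behind that llama.cpp change, written against the shape of llama-rs's ggml wrapper. The `op_*` method names, tensor type, and shapes are assumptions for illustration, not the literal diff in this PR:

```rust
// Sketch only: `ctx` stands in for llama-rs's ggml context wrapper, and the
// op_* names, tensor type, and shapes are assumptions, not the exact diff.
fn attention_v_matmul(
    ctx: &ggml::Context,
    v_cache_view: &ggml::Tensor, // (head_dim, n_head, n_past + n) view into the V cache
    kq_softmaxed: &ggml::Tensor, // softmax(K^T Q) for the current batch
    head_dim: usize,
    n_head: usize,
    n_past: usize,
    n: usize,
) -> ggml::Tensor {
    // Transposed, non-contiguous view of the cached V values.
    let v_trans = ctx.op_permute(v_cache_view, 1, 2, 0, 3);

    // The change: copy that view into a fresh contiguous tensor before the
    // matmul, so the matmul always sees the same memory layout and its
    // reduction order no longer depends on thread count or batch size.
    let v_trans = ctx.op_cpy(
        &v_trans,
        &ctx.new_tensor_3d(ggml::Type::F32, n_past + n, head_dim, n_head),
    );

    ctx.op_mul_mat(&v_trans, kq_softmaxed)
}
```

The copy costs some extra memory and a bit of work, which is what the measurements discussed in this thread try to quantify, but it keeps the matmul's input layout (and hence its reduction order) independent of how ggml partitions the work.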
Actually, I'm blind, llama-rs does have a way to set the seed. With this change, the results are exactly identical with the same seed regardless of batch size or thread count; I tested 4-5 times with various batch and thread counts. One thing we could potentially do is make this a CLI option. It would mean duplicating that one line of code, which probably isn't a big deal. Maybe something like a dedicated flag for the extra determinism?
Hm, something like that could work.
Sure, that's certainly no problem, or whatever name you prefer. Would you want it as a command-line flag also? Also, should it be automatically enabled if `--seed` is specified?
Yup, I'll leave the exact name to you. I think it's reasonable to turn it on with the seed setting; I can't think of many circumstances in which you'd want to set a seed without the additional determinism, and I don't think determinism matters if you don't have a seed set.
Yup, we're just being a bit careful with promising determinism overall, since we haven't done any proper testing. When fixing a seed, things should be reproducible on the same hardware, but we haven't done any testing to ensure this is reproducible across machines (other than some very anecdotal testing on my desktop + laptop, for which I can report results are indeed identical). There were some caveats reported in #27 about floating point math determinism in general which may or may not affect us (since we're on CPU, and CPU manufacturers take floating point determinism more seriously).
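The floating-point caveat is easy to demonstrate in isolation: addition isn't associative, so combining the same partial sums in a different order (for example, after a different thread split of a matmul) can change the low bits of the result. A tiny standalone example, not taken from the codebase:

```rust
fn main() {
    // Floating-point addition is not associative, so summing the same
    // values in a different order can give a (slightly) different result.
    let a = 0.1_f64;
    let b = 0.2_f64;
    let c = 0.3_f64;

    let left = (a + b) + c;  // 0.6000000000000001
    let right = a + (b + c); // 0.6

    assert_ne!(left, right);
    println!("{left:.17} vs {right:.17}");
}
```

This is why evaluating the same model with a different thread count can produce bit-different logits, and why copying V into a fixed, contiguous layout helps keep the reduction order stable.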
I would say yes: it's best to have this under a command-line flag (disabled by default) if it makes inference speed measurably lower. And I think automatically enabling it when `--seed` is specified is also OK; sounds like good ergonomics 👍
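A minimal sketch of the ergonomics being proposed at this point in the thread, using hypothetical flag and field names rather than whatever the PR actually calls them:

```rust
// Hypothetical CLI wiring: the deterministic path is off by default, can be
// forced on or off explicitly, and turns on automatically when a seed is set.
struct CliArgs {
    seed: Option<u64>,
    increased_determinism: Option<bool>, // explicit user override, if given
}

fn resolve_increased_determinism(args: &CliArgs) -> bool {
    // An explicit flag always wins; otherwise, passing a seed opts in.
    args.increased_determinism.unwrap_or(args.seed.is_some())
}

fn main() {
    let args = CliArgs { seed: Some(1337), increased_determinism: None };
    assert!(resolve_increased_determinism(&args));
}
```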
I've been doing some tests, but it's hard to measure whether inference speed has gotten slower with the code change, because different prompts can make inference speed vary by up to 30 ms/token on my machine, both with and without the change. It's hard to draw conclusions about the timings because I'm getting different completions for the same seed, and timings are affected by differences in the generated output. What I can report positively on is that without this change I do get different results for the same seed when the number of threads changes.
Would it be too confusing to call it something along those lines? It's probably silly to worry about, but I just hate making it so there's no way to do something that someone might want to do. (Probably my biggest pet peeve with software is when it just says "No, I can't let you do that, you might hurt yourself!") edit: And yeah, I honestly couldn't tell if it made a noticeable difference. It seemed pretty small if it existed at all.
How about this approach? It adds a flag for the new behavior; I set it up with the defaults discussed above.
I did some profiling and the difference between the charts seems very small, so the observed differences probably aren't anything more than noise. (It looks like copying the data may account for 1-2% of the total time.) I guess it's not really surprising, but this also shows that the code that executes in llama-rs itself is basically insignificant in performance terms. (Obviously the way it sets up the GGML stuff is very important, though.) Attached SVGs: Old FC, New FC, Old FG, New FG.
Interesting. If the cost is that unnoticeable, I'm inclined to merge it as-is. Thoughts, @setzer22?
I should add that this is the first time I've tried profiling Rust code or reading flamegraphs, so my opinion isn't too authoritative. There doesn't seem to be any significant visual difference, though. By the way, you may have already noticed, but if you click/open those SVGs they're interactive: you can focus elements or search by name within them.
I couldn't notice any performance differences in my tests either, so I'd say we can merge as is. No need to put it behind a flag.
Yup. I don't think there's anything we can do on the Rust side (other than, as you say, changing the tensor operations) that would have a significant impact on performance. All the heavy lifting is done by ggml.
Branch updated from 44bf179 to 34ba664.
Thanks for updating the branch! Looks like setzer and I agree that the flag's not necessary given the unnoticeable performance hit - can you remove it and I'll merge the PR?
@setzer22 How about this? I left it as a parameter in the library (defaulting to enabled) but removed the CLI stuff. The rationale is that it doesn't really add much complexity, and there's at least a chance we'll want to revert it, or maybe someone will have a use case where they care.
@philpax Were you talking about the CLI option, or about making it 100% not configurable at all? I can do that if you really want, but personally I'd give it a while enabled by default and see whether there turns out to be some reason for people to want to disable it. Right now it shouldn't really even break library users, since the struct has a `Default` implementation.
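The compatibility argument here is the usual Rust one: if the parameters struct implements `Default`, callers that use struct-update syntax keep compiling when a field is added. An illustrative sketch with made-up names and default values, not the actual llama-rs API:

```rust
// Made-up names and defaults, purely to illustrate the compatibility point.
#[derive(Clone, Debug)]
pub struct InferenceParameters {
    pub n_threads: usize,
    pub n_batch: usize,
    /// The newly added knob, on by default.
    pub increased_determinism: bool,
}

impl Default for InferenceParameters {
    fn default() -> Self {
        Self {
            n_threads: 4,
            n_batch: 8,
            increased_determinism: true,
        }
    }
}

fn main() {
    // Callers that fill in only what they care about keep compiling unchanged
    // when a new field with a default value is added to the struct.
    let params = InferenceParameters {
        n_threads: 8,
        ..Default::default()
    };
    println!("{params:?}");
}
```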
Yeah, I'm fine with your rationale. Seems reasonable enough.
Same here 👍 What I would do is always enable this by default, and just have a flag to disable it.
Original PR description:

See ggerganov/llama.cpp#439
Closes #67
I'm not necessarily proposing to merge this, just putting it here in case it's useful.
From my very, very, very unscientific testing, it seems like this does very slightly increase memory usage and also increases token generation time a little bit.
These notes are super unscientific, but included just in case they're useful. Also note these tests were run on a machine running other applications like VS Code, web browsers, etc. The tests with the 30B model are close to my machine's memory limit (32 GB), so they may have caused some swapping as well.
The differences are definitely within the margin of error, simply because the tests weren't very controlled. (It also seemed like it made more of a difference with 12 threads vs. 6; my CPU only has 6 physical cores.)