-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Survey] Supported Hardwares and Speed #15
Comments
@junrushao how can we find tokens/sec? I'd say 'quite fast' fastest LLM I've run on this 2020 MacBook Pro M1 8G. 10x faster than your WebGPU demo running with less overall memory usage. All reports out is the text? |
We just added a new updates #14 which should ship to conda by now, you can type |
Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for 2020 MacBook Pro M1 8G with the default vicuna 6b. For reference my decode on the WebGPU demo is like, 0.5/sec. |
OOM on gtx 1650. Load the model fine, but OOM when generate the first message |
@nRuaif 4GB memory wouldn't be enough. A 6GB one should work |
On iPhone 13, crashes after a few seconds of |
@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly |
On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1. |
|
|
On my M1 Max Mac Studio with 64GB of RAM:
|
On my MBP 2020 13-inch[intel CPU, 32G Ram, RX6800 16G VRAM], Ventura 13.3.1 encode: 46.4 tok/s |
No sure if this is useful or if this is the right thread to post this in but I encountered this error on an old Laptop with a discrete very old Nvidia GPU (GT 920m) with the 470.182.03 driver which should include Vulcan:
|
@zifken looks like |
I see so only GPUs with more than 4go or vRAM are supported because of the size of the model (it makes sense) . |
@zifken there are some reports saying 4GB might work, but 6GB is recommended atm |
It's confusing, On my Win10: [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G Ram, RTX 2080 Ti Modified to 22G VRAM], the stats is below:
|
iPad Pro 11 A12Z encode: 5.1 tok/s, decode: 4.1 tok/s |
Linux RTX 3090
|
mlc sampleslaptop on Fedora (bat):
laptop on Windows (bat):
desktop:
|
On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1
I am seeing encoding performance b/w 45-60 and decoding b/w 20-29. |
Update: I opened up a repo (https://github.com/junrushao/llm-perf-bench) of dockerfiles to help reproduce cuda performance numbers. The takeaway is: MLC LLM is around 30% faster than Exllama. It’s a bit sloppy now as a weekend night project, but we could iterate in the future weeks. |
I'm seeing different results for MLC vs ExLlama performance (where MLC is significantly slower). I've added llama.cpp results as well for comparison: 3090 (360W)
4090 (400W)
@junrushao any ideas of what might be going on, this is a huge difference from your test, so I'd assume that there's something I'm missing. With MLC is there any way to control batching? Also, not sure why the "prefill" time is so low compared to the others, but maybe they're all referring to different measurements. Every tool uses their own terminology for output numbers. Test NotesI am running these on an Arch Linux install w/ CUDA 12.2:
The 3090 set to PL 360W, the 4090 is set to PL 400W (both slightly undervolted, but tested to have ~97% for the stock power limits. MLC was run like so:
For ExLlama I am using the most accurate 32g desc act order GPTQ (128g or no grouping could be faster):
For llama.cpp I am using the q4_K_M GGMLv3 (q4_0 could be faster):
|
@lhl likely you are using the vulkan backend, which is more portable but much slower. Need to build for cuda backends here for best perf in nvidia platform |
@lhl The number you used is likely Vulkan. Vulkan is usually 30%-80% slower than CUDA. We haven’t released a prebuilt for CUDA yet, but you may directly run it via the Dockerfile I provided. MLC uses end-to-end decoding time which includes sampling and text generation, and thus will underestimate performance the most. Will improve over the incoming weeks. |
@junrushao hello, where can I get the latest performance changes on CUDA? thank you ! |
@sleepwalker2017 would you like to check out the dockerfile if it works? https://github.com/junrushao/llm-perf-bench I’m eager to have some feedbacks on specifically on the usability issue |
I'm closing this issue as we are pursuing more systematic and accurate performance benchmarking, preferrably based on Dockerfile for maximized reproducibility. See also: https://github.com/junrushao/llm-perf-bench |
BTW, I was able to get the Docker running, but it took quite a long time to build. I don't think the make flags are being passed properly (I have a 16C system but they weren't being used). |
@lhl the make flag is passed properly. this is (unfortunately) expected behavior because there is one particular compilation unit, which uses cutlass, is extremely slow, which on my end took 10min to build. Cutlass is known as “slow to build” anyways… The good news is that we have included it in our nightly wheel so that you don’t have to build it yourself in most of the time. Will update the wheel by the end of the month to use prebuilt wheel instead. |
@junrushao Ah, ok, thanks for the clarification. A CUDA prebuilt would be great. BTW for you (or others interested), here are my results (just ran on HEAD of every project). Using the main mlc-llm branch, the CUDA performance is almost exactly the same as ExLlama's. Using your benchmark branch (using the docker image, also works the same exporting the dists), it looks like it's 5-15% faster than llama.cpp. Performance looks good!
Notes:
Those interested in a few more command line specifics for the table I posted btw can view this shared Worksheet: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831 |
Ah thanks @lhl for the detailed feedbacks, and this is extremely valuable to us! Well, it's super informative and may be worth for a separate thread! Disclaimer: our CUDA effort is pretty new, and we are planning to do quite a lot of UX enhancements improve documentation, usability and performance (still huge room for perf squeeze!). Some of our efforts include:
This is great to learn the configuration. Meanwhile we wanted to make llm-perf-bench a more general reproducible benchmarking infra so that it could test different frameworks under different settings, such as long context prefilling, short conversation, batched inference, distributed inference, etc. @sunggg could probably share more performance comparison under other settings.
Good point! I just realized that we didn’t have any documentation for those formats yet To briefly explain what Regarding perplexity, we use group quantization natively, which is identical to GGML’s format, meaning
Yes this is something we have been working towards in the short term (1 week or so). Basically it is possible to build a fatbin that includes different CUDA architectures, so that we don’t have to switch over.
This is one of our short-term goal (2-3 weeks). We will keep the community updated on CUDA documentation!
Thanks for sharing, and happy to advocate for your blog post! To share some update: you don’t have to suffer from compiling TVM yourself any more by the end of this week - the prebuilt will be available by then in http://mlc.ai/package/ (well, the package name is mlc-ai which is weird). BTW, don’t use cuBLAS or cuDNN as they are relatively slow in our particular case. Regarding the separate “benchmark” branch, this is my embarrassing quick weekend night hack to get at least something functioning. In fact, all its pieces have been upstreamed as of today. I’m going to deprecate this branch this weekend after the latest TVM wheel is released. I don’t really know much the glibc issue tbh. It occurred at times when I forgot to install some dependency - in our case, it’s likely that LLVM depends on a different version of glibc, which we may not have on archlinux. Would you mind sharing the detailed error message? |
Hi, @lhl and thank you for sharing your experience with the detailed explanation! Let me share our latest numbers on llama-2 in our dev branch. (They will be upstreamed soon.) For measurement, we used 128 tokens for each prompt and generation.
Once we finish the upstream, we would be able to share more exciting results :) |
Sorry for the late response, I'll try this repo and see if it works. |
On quantization - so is that comparable to GGML q4_0 or q4_1 (there's a big perplexity difference, the sweet spot for GGML's perplexity/perf seems to be q4_K_M these days - details on k-quants here: ggerganov/llama.cpp#1684). I dove into exllama's perplexity code a couple months ago (https://github.com/turboderp/exllama/blob/master/perplexity.py) and if I get a chance, will try to see something similar can be implemented for MLC LLM so we can run comparisons on the same model w/ different formats: https://github.com/turboderp/exllama/blob/master/perplexity.py , especially since there's so many new optimizations being published (AWQ, SpQR, SqueezeLLM, etc)
No script, but here's my step by step setup: https://llm-tracker.info/books/howto-guides/page/nvidia-gpus#bkmrk-mlc-llm It sounds like there's a lot in motion and I'm traveling this week anyway, so happy to just wait for things to settle. If the Also, I do have an old (Radeon VII) ROCm card and I saw the recent checkin so I may give it a spin when I revisit. For actual usage, I'll keep tabs on the Python API improvements - I think the most useful general thing would probably be an OpenAI API drop-in (chat and completions). I've been using adhoc scripts w/ various engines to do that, although I saw there are some all-in-one bindings like https://github.com/go-skynet/LocalAI as well. While I'm enjoying poking around, as I'm moving some local LLM stuff closer to production I'll probably be looking to do some testing similar to https://hamel.dev/notes/llm/03_inference.html for q4 models w/ different batching, and for handling simultaneous queries (I suppose Apache Bench against a web API would be a good way to test)? For production, I'll be in cloud, so those benchmarks will either be against A100s or L40s most likely. Will drop by the Discord as well. |
@lhl Thanks for the discussion! Regarding quantization, this information is extremely helpful to us! In the short term (~1 month), we will likely sleep on the existing quantization algorithms, and tend not to invent new ones on our own (which is quite handy to implement), instead, we wanted to make our compiler framework general enough to integrate quantization techniques from latest research such as the ones you mentioned.
The good news is that most of the optimizations just got in last week! The dockerfile is updated accordingly: https://github.com/mlc-ai/llm-perf-bench.
Another good news: you don't have to compile TVM from scratch now to get most of the CUDA performance! All is included in the prebuilt including cutlass: https://github.com/mlc-ai/llm-perf-bench/blob/main/Dockerfile.cu121.mlc#L23.
We have been building ROCm on our nightly TVM wheel since tonight, which is based on the latest ROCm. ROCm is still an almost unknown territory to me, and I'm not sure if it's going to work for older cards (overheard some compatibility thing but didn't validate myself).
We have the initial prototype of REST API ready designed with OpenAI-style APIs:
They are quite rough (but at least working) at the moment, and we are actively working on revamping the design: #650
Both A100 and A10g are interesting in production, and two direction we are heading towards are distributed inference (my top priority at the moment) and batching (@MasterJH5574 is on it). |
Results for AMD RX6800XT + 5950X. Kernel 6.4. Debian 13. Model: Llama-2-7b-chat-hf-q4f16_1 vulkan: rocm: ./mlc_chat_cli --local-id GOAT-7B-Community-q4f16_1 --device rocm
Use MLC config: "/home/user/src/mlc/dist/prebuilt/mlc-chat-GOAT-7B-Community-q4f16_1/mlc-chat-config.json"
Use model weights: "/home/user/src/mlc/dist/prebuilt/mlc-chat-GOAT-7B-Community-q4f16_1/ndarray-cache.json"
Use model library: "/home/user/src/mlc/dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-rocm.so"
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat
/reload [local_id] reload model `local_id` from disk, or reload the current model if `local_id` is not specified
Loading model...
Loading finished
Running system prompts...
[19:36:12] /home/user/src/mlc/mlc-llm/3rdparty/tvm/src/runtime/library_module.cc:87: TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: shared object initialization failed
Stack trace:
File "/home/user/src/mlc/mlc-llm/3rdparty/tvm/src/runtime/rocm/rocm_module.cc", line 105
[bt] (0) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x13) [0x7fbbe8d12b83]
[bt] (1) ./mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x24) [0x55f580793ae4]
[bt] (2) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x216cb4) [0x7fbbe8e16cb4]
[bt] (3) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::ROCMModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x13e) [0x7fbbe8e199be]
[bt] (4) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x216e36) [0x7fbbe8e16e36]
[bt] (5) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::detail::PackFuncPackedArg_<4, tvm::runtime::ROCMWrappedFunc>(tvm::runtime::ROCMWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x6a) [0x7fbbe8e19a5a]
[bt] (6) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(TVMFuncCall+0x46) [0x7fbbe8cdf156]
Stack trace:
[bt] (0) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x13) [0x7fbbe8d12b83]
[bt] (1) ./mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x24) [0x55f580793ae4]
[bt] (2) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x10f404) [0x7fbbe8d0f404]
[bt] (3) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x10f5a0) [0x7fbbe8d0f5a0]
[bt] (4) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x8c0) [0x7fbbe8d8ff30]
[bt] (5) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x2c7) [0x7fbbe8d8cbd7]
[bt] (6) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x24d) [0x7fbbe8d8d06d]
[bt] (7) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x18d455) [0x7fbbe8d8d455]
[bt] (8) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x277) [0x7fbbe8d8b787] |
HP Intel Desktop PC: What do you mean with 6 GB memory? Thank you. |
Psyched that I got MLC to build/run for ARM64 + CUDA! Results for Jetson AGX Orin 64GB: * llama-2-7b-chat 36.4 tokens/sec
* llama-2-13b-chat 20.4 tokens/sec
* llama-1-30b 8.3 tokens/sec
* llama-2-70b 3.8 tokens/sec Results for Jetson Orin Nano 8GB: * llama-2-7b-chat 10.2 tokens/sec These are all with q4f16_1 quantization, CUTLASS, and CUDA graphs enabled. A MLC container that builds wheels from source for JetPack-L4T can be found here: https://github.com/dusty-nv/jetson-containers/tree/dev/packages/llm/mlc |
This is very cool! Thanks @dusty-nv for sharing! |
Performance I got;
All other models were OOM upon loading Nvidia MX150: AMD R7 240: AMD RX 580 2048SP: Nvidia GTX 1060 (3GB): |
hello are p40 cards supported? |
Seeing the Pascal GTX 10 series cards are supported then the P40 should work too I think. I have a couple of them and will test this out. |
UPDATE (08/09/2023):
We have done major performance overhaul in the past few months, and now I'm happy to share the latest results:
============================================================
Hi everyone,
We are looking to gather data points on running MLC-LLM on different hardwares and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!
NOTE: for benchmarking, we highly recommended a device of at least 6GB memory, because the model itself takes 2.9G already. For this reason, it is known that the iOS app will crash on a 4GB iPhone.
AMD GPUs
Macbook
Intel GPUs
NVIDIA GPUs
iOS
Android
The text was updated successfully, but these errors were encountered: