[Survey] Supported Hardwares and Speed #15

Closed
junrushao opened this issue Apr 30, 2023 · 118 comments

@junrushao
Member

junrushao commented Apr 30, 2023

UPDATE (08/09/2023):

We have done a major performance overhaul in the past few months, and now I'm happy to share the latest results:

============================================================

Hi everyone,

We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!

NOTE: for benchmarking, we highly recommend a device with at least 6GB of memory, because the model itself already takes 2.9G. For this reason, the iOS app is known to crash on a 4GB iPhone.

AMD GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| RX 6600XT (8G) | N/A | 28.3 | GitHub | |
| RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
| RX 6700XT | Windows 11 | 33.7 | GitHub | |
| APU 5800H | Windows 11 | 8.5 | GitHub | |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
| Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
| AMD GPU on Steam Deck | Steam Deck's Linux | TBD | Reddit | |
| RX6800 16G VRAM | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
| Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | Reddit | |
| RX 7900 xtx | | | Reddit | |

Macbook

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| 2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
| 2021 MacBook Pro M1Pro (16G) | macOS Ventura | 17.1 | GitHub | |
| M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
| 2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
| MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
| 2021 MacBook M1Pro (32G) | macOS Ventura | 19.3 | GitHub | |

Intel GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| Arc A770 | N/A | 3.1 - 118.6 | GitHub | perf issues in decoding needs investigation |
| UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
| UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
| Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
| Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |

NVIDIA GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| GTX 1650 ti (4GB) | Fedora | 15.6 | GitHub | |
| GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
| RTX 3080 | Windows 11 | 26.0 | GitHub | |
| RTX 3060 | Debian bookworm | 21.3 | GitHub | |
| RTX 2080Ti | Windows 10 | 24.5 | GitHub | |
| RTX 3090 | N/A | 25.7 | GitHub | |
| GTX 1660ti | N/A | 23.9 | GitHub | |
| RTX 3070 | N/A | 23.3 | GitHub | |

iOS

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
| iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
| iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
| iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |

Android

| Hardware/GPU | OS | Tokens/sec | Link | Notes |
|---|---|---|---|---|

@maxtheman

maxtheman commented Apr 30, 2023

@junrushao how can we find tokens/sec? I'd say 'quite fast': it is the fastest LLM I've run on this 2020 MacBook Pro M1 8G, and 10x faster than your WebGPU demo while using less overall memory.

Is the text all that it reports?


@tqchen
Contributor

tqchen commented Apr 30, 2023

We just added a new update (#14), which should have shipped to conda by now. You can type /stats after a conversation to get the measured speed.
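
For anyone collecting numbers from several machines, the /stats line can be turned into structured data. This is a minimal sketch that assumes the output looks like the lines quoted in this thread (`encode: … tok/s, decode: … tok/s`, with `prefill` appearing in place of `encode` in later reports); `parse_stats` is a hypothetical helper, not part of MLC-LLM.

```python
import re

# Matches fragments such as "encode: 31.9 tok/s" or "decode: 11.4 tok/s".
# "prefill" is accepted too, since later reports in this thread use that label.
STATS_RE = re.compile(
    r"(?P<phase>encode|prefill|decode):\s*(?P<rate>\d+(?:\.\d+)?)\s*tok/s"
)


def parse_stats(line: str) -> dict[str, float]:
    """Return a mapping such as {"encode": 31.9, "decode": 11.4} from one /stats line."""
    return {m["phase"]: float(m["rate"]) for m in STATS_RE.finditer(line)}


if __name__ == "__main__":
    print(parse_stats("encode: 31.9 tok/s, decode: 11.4 tok/s"))
```

The decode figure is the one most reports here quote as "Tokens/sec".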

@maxtheman

Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for 2020 MacBook Pro M1 8G with the default vicuna 7b. For reference, my decode on the WebGPU demo is around 0.5 tok/s.

@Kimiko-AI

OOM on GTX 1650. The model loads fine, but it runs out of memory when generating the first message.

@junrushao
Member Author

@nRuaif 4GB memory wouldn't be enough. A 6GB one should work

@junrushao junrushao changed the title Data points running MLC-LLM on hardwares/platforms Runnable Hardwares and Speed May 1, 2023
@junrushao junrushao pinned this issue May 1, 2023
@y-lee

y-lee commented May 1, 2023

On iPhone 13, crashes after a few seconds of [System] Initialize.... Phone has 4GB of RAM, which I presume is the cause.

@junrushao
Member Author

@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly

@jolonf

jolonf commented May 1, 2023

On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.

@Hzfengsy
Member

Hzfengsy commented May 1, 2023

encode: 39.5 tok/s, decode: 26.0 tok/s on Windows 11 with RTX 3080
encode: 32.5 tok/s, decode: 17.1 tok/s on MacBook Pro with M1 Pro (16 GPU cores) and macOS Ventura 13.3.1

@junrushao junrushao changed the title Runnable Hardwares and Speed Supported Hardwares and Speed May 1, 2023
@juodumas

juodumas commented May 1, 2023

| Hardware/GPU | OS | Tokens/sec | Source | Model | Notes |
|---|---|---|---|---|---|
| RTX 3060 (12GB) | Debian bookworm | 21 | | vicuna-v1-7b | 3644MiB GPU memory used |

  • /stats after /reset: encode: 72.2 tok/s, decode: 23.2 tok/s
  • /stats for 2nd and later messages: encode: 39.3 tok/s, decode: 21.3 tok/s
>>nvidia-smi --query-gpu=memory.used --format=csv     
memory.used [MiB]
3644 MiB
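
The same query can be scripted so that memory usage is recorded alongside the speed numbers. This is a sketch that assumes `nvidia-smi` is on the PATH and prints the CSV format shown above (a header line followed by one `<number> MiB` line per GPU); the helper names are made up for illustration.

```python
import subprocess


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output into MiB per GPU."""
    lines = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    # The first line is the CSV header ("memory.used [MiB]"); the rest look like "3644 MiB".
    return [int(ln.split()[0]) for ln in lines[1:]]


def query_memory_used() -> list[int]:
    """Run nvidia-smi and return the used memory in MiB for each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the parser on a machine without an NVIDIA GPU.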

@jefflewis

On my M1 Max Mac Studio with 64GB of RAM:

encode: 53.7 tok/s, decode: 18.6 tok/s

@FreeBlues

On my MBP 2020 13-inch[intel CPU, 32G Ram, RX6800 16G VRAM], Ventura 13.3.1

encode: 46.4 tok/s
decode: 22.5 tok/s

@junrushao junrushao changed the title Supported Hardwares and Speed [Survey] Supported Hardwares and Speed May 1, 2023
@zifken

zifken commented May 1, 2023

Not sure if this is useful or if this is the right thread to post this in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) with the 470.182.03 driver, which should include Vulkan:

MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0                                                              

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
  [bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
  [bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
  [bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
  [bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
  [bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
  [bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
  [bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
  [bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
  [bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]

@junrushao
Member Author

@zifken looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that GT 920M only has 2GB RAM, but the default model is 2.9G in size :/

@zifken

zifken commented May 1, 2023

I see, so only GPUs with more than 4GB of VRAM are supported because of the size of the model (it makes sense).
I will try on another GPU model shortly.
Thank you for the feedback

@junrushao
Member Author

@zifken there are some reports saying 4GB might work, but 6GB is recommended atm
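
The memory guidance given so far in this thread can be summarized as a small check. This sketch uses only the two figures quoted here (a 2.9G default model and a 6GB recommendation); the function name and the wording of the verdicts are illustrative, and real behavior also depends on the driver, the OS, and what else is using the device.

```python
MODEL_SIZE_GB = 2.9    # size of the default model, as stated in this thread
RECOMMENDED_GB = 6.0   # recommended device memory, as stated in this thread


def memory_verdict(device_gb: float) -> str:
    """Classify a device by memory, using only the thresholds quoted in this thread."""
    if device_gb < MODEL_SIZE_GB:
        return "too small: the weights alone do not fit"
    if device_gb < RECOMMENDED_GB:
        return "marginal: may load but can run out of memory while generating"
    return "recommended"


if __name__ == "__main__":
    for gb in (2, 4, 6, 8):
        print(f"{gb} GB -> {memory_verdict(gb)}")
```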

@FreeBlues

On my MBP 2020 13-inch[intel CPU, 32G Ram, RX6800 16G VRAM], Ventura 13.3.1

encode: 46.4 tok/s decode: 22.5 tok/s

It's confusing: on my Win10 machine [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:

encode: 24.0 tok/s, decode: 24.5 tok/s

@colakang

colakang commented May 1, 2023

iPad Pro 11 A12Z

encode: 5.1 tok/s, decode: 4.1 tok/s

@ganler

ganler commented May 1, 2023

Linux RTX 3090

  • encode: 179.0 tok/s, decode: 25.7 tok/s (prompted with "Implement binary search in Python please! Also try to use some type annotations!")
  • mem usage: 3.6GB

@swittk

swittk commented May 1, 2023

2021 MacBook Pro M1 Max [32 cores], 32 GB RAM, 1 TB SSD
Mac OS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage 3.64 GB

(Edit) Just saw that this score is higher than M1 Max 64 GB so I repeated the prompts and I'm still getting an average of > 20 tokens/second.

@facesthe

facesthe commented May 2, 2023

| machine | OS | CPU | GPU | result | remarks |
|---|---|---|---|---|---|
| Yoga Slim 7 pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery |
| Yoga Slim 7 pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC |
| Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery |
| Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC |
| desktop | Windows 10 22H2 | 5900x (32GB) | 1660ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | - |

mlc samples

laptop on Fedora (bat):

USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:

1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats 
encode: 11.0 tok/s, decode: 3.8 tok/s

laptop on Windows (bat):

USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s

desktop:

USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:

* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.

In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s

@anmoljagetia

On 14" MacBook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macOS Ventura 13.3.1

encode: 59.2 tok/s, decode: 22.5 tok/s

I am seeing encoding performance between 45 and 60 tok/s and decoding between 20 and 29 tok/s.

@durmazt durmazt mentioned this issue May 2, 2023
@junrushao
Member Author

Update: I opened up a repo (https://github.com/junrushao/llm-perf-bench) of Dockerfiles to help reproduce CUDA performance numbers. The takeaway is: MLC LLM is around 30% faster than ExLlama.

It’s a bit sloppy now as a weekend night project, but we can iterate on it in the coming weeks.

@lhl

lhl commented Jul 30, 2023

I'm seeing different results for MLC vs ExLlama performance (where MLC is significantly slower). I've added llama.cpp results as well for comparison:

3090 (360W)

| | Prefill t/s | Decode t/s |
|---|---|---|
| MLC | 258.7 | 44.3 |
| ExLlama | 6473.65 | 73.58 |
| llama.cpp | 2219.04 | 105.18 |

4090 (400W)

| | Prefill t/s | Decode t/s |
|---|---|---|
| MLC | 461.4 | 75.7 |
| ExLlama | 12248.12 | 107.54 |
| llama.cpp | 2269.50 | 132.56 |

@junrushao any idea what might be going on? This is a huge difference from your test, so I assume there's something I'm missing. With MLC, is there any way to control batching? Also, I'm not sure why the "prefill" speed is so low compared to the others, but maybe they're all referring to different measurements; every tool uses its own terminology for output numbers.

Test Notes

I am running these on an Arch Linux install w/ CUDA 12.2:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

The 3090 is set to PL 360W and the 4090 to PL 400W (both slightly undervolted, but tested to reach ~97% of the performance at the stock power limits).

MLC was run like so:

$ mlc_chat_cli --local-id Llama-2-7b-chat-hf-q4f16_1 --device_id 0
[INST]: Write me a 1000 word essay on the social implications of generative AI
...
[INST]: /stats
prefill: 258.7 tok/s, decode: 44.3 tok/s
  • I don't know how to control the output length in MLC the way ExLlama or llama.cpp can, so MLC gets an advantage over the others for inferencing (since decoding slows down with longer context); see my previous query on how to actually do apples-to-apples comparisons
  • This is using the prebuilt CLI llama2 model from https://mlc.ai/mlc-llm/docs/prebuilt_models.html, which the docs say is the most optimized version?

For ExLlama I am using the most accurate 32g desc act order GPTQ (128g or no grouping could be faster):

$ python test_benchmark_inference.py -d /models/llm/llama2/TheBloke_Llama-2-7B-GPTQ/ -p
...
 ** Time, Inference: 0.30 seconds
 ** Speed: 6473.65 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 73.58 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 96.28 tokens/second
  • I use ExLlama's initial "Inference" time as equivalent to MLC's Prefill
  • I use the 128/1920 t/s, which is the "worst case" for inference

For llama.cpp I am using the q4_K_M GGMLv3 (q4_0 could be faster):

$ ./main -m /models/llm/llama2/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin -n 2048 --ignore-eos -ngl 99
...
llama_print_timings:        load time =   507.30 ms
llama_print_timings:      sample time =   922.92 ms /  2048 runs   (    0.45 ms per token,  2219.04 tokens per second)
llama_print_timings: prompt eval time =  2180.57 ms /  1801 tokens (    1.21 ms per token,   825.93 tokens per second)
llama_print_timings:        eval time = 19395.81 ms /  2040 runs   (    9.51 ms per token,   105.18 tokens per second)
llama_print_timings:       total time = 22814.60 ms
  • For llama.cpp I use the "sample" time as the Prefill and then the "eval" time as the Decode
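
To make the mapping between tools easier to audit, the timing lines above can be parsed into a dictionary. This is a sketch that assumes the `llama_print_timings:` format exactly as printed in this comment; `parse_timings` is a hypothetical helper. Note that in llama.cpp's log the "prompt eval" line, rather than "sample", is the one reported for prompt processing, so it may be the closer counterpart to MLC's prefill.

```python
import re

# One timing line looks like:
# "llama_print_timings: prompt eval time =  2180.57 ms /  1801 tokens (..., 825.93 tokens per second)"
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:.*?(?P<tps>[\d.]+) tokens per second)?"
)


def parse_timings(log: str) -> dict[str, float]:
    """Map each phase name ("sample", "prompt eval", "eval") to its tokens per second."""
    rates: dict[str, float] = {}
    for line in log.splitlines():
        m = TIMING_RE.search(line)
        # Lines such as "load time" and "total time" carry no rate and are skipped.
        if m and m["tps"] is not None:
            rates[m["name"]] = float(m["tps"])
    return rates
```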

@tqchen
Contributor

tqchen commented Jul 30, 2023

@lhl likely you are using the Vulkan backend, which is more portable but much slower. You need to build for the CUDA backend to get the best perf on NVIDIA platforms.

@junrushao
Member Author

@lhl The number you used is likely Vulkan. Vulkan is usually 30%-80% slower than CUDA. We haven’t released a prebuilt for CUDA yet, but you may directly run it via the Dockerfile I provided.

MLC uses end-to-end decoding time, which includes sampling and text generation, and thus underestimates its performance the most among these tools. We will improve this over the coming weeks.
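
To see why counting sampling time lowers the reported figure, here is a small arithmetic sketch. It reuses the eval and sample timings from the llama.cpp log posted earlier in this thread purely as example inputs (the two counters cover 2040 and 2048 runs respectively, so the result is approximate); it is not how MLC computes its /stats.

```python
def tokens_per_second(n_tokens: int, *phase_ms: float) -> float:
    """Throughput when all of the listed phase durations (in ms) are counted."""
    return n_tokens / (sum(phase_ms) / 1000.0)


# Figures copied from the llama.cpp log earlier in this thread.
EVAL_MS = 19395.81
SAMPLE_MS = 922.92
N_TOKENS = 2040

eval_only = tokens_per_second(N_TOKENS, EVAL_MS)
end_to_end = tokens_per_second(N_TOKENS, EVAL_MS, SAMPLE_MS)

if __name__ == "__main__":
    print(f"eval only: {eval_only:.2f} tok/s, eval + sampling: {end_to_end:.2f} tok/s")
```

With these inputs the end-to-end figure comes out a few percent lower than the eval-only one, which is the direction of the bias described above.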

@sleepwalker2017

@junrushao hello, where can I get the latest performance changes on CUDA? thank you !

@junrushao
Member Author

@sleepwalker2017 would you like to check out the Dockerfile and see if it works? https://github.com/junrushao/llm-perf-bench

I’m eager to get some feedback, specifically on usability issues

@junrushao
Member Author

I'm closing this issue as we are pursuing more systematic and accurate performance benchmarking, preferably based on Dockerfiles for maximum reproducibility. See also: https://github.com/junrushao/llm-perf-bench

@junrushao junrushao unpinned this issue Aug 3, 2023
@lhl

lhl commented Aug 3, 2023

> @sleepwalker2017 would you like to check out the Dockerfile and see if it works? https://github.com/junrushao/llm-perf-bench
>
> I’m eager to get some feedback, specifically on usability issues

BTW, I was able to get the Docker image running, but it took quite a long time to build. I don't think the make flags are being passed properly (I have a 16-core system, but the cores weren't being used).

@junrushao
Member Author

@lhl the make flag is passed properly. This is (unfortunately) expected behavior, because one particular compilation unit, which uses CUTLASS, is extremely slow to compile; it took 10 min to build on my end. CUTLASS is known to be “slow to build” anyway…

The good news is that we have included it in our nightly wheel, so that you don’t have to build it yourself most of the time. We will update the Dockerfile by the end of the month to use the prebuilt wheel instead.

@lhl

lhl commented Aug 3, 2023

@junrushao Ah, ok, thanks for the clarification. A CUDA prebuilt would be great.

BTW for you (or others interested), here are my results (just ran on HEAD of every project). Using the main mlc-llm branch, the CUDA performance is almost exactly the same as ExLlama's. Using your benchmark branch (using the docker image, also works the same exporting the dists), it looks like it's 5-15% faster than llama.cpp. Performance looks good!

| Package | Commit | Model | Quant | Memory Usage | 4090 @ 400PL | 3090 @ 360PL |
|---|---|---|---|---|---|---|
| MLC LLM CUDA | 3c53eeb | llama2-7b-chat | q4f16_1 | 5932 | 115.87 | 83.63 |
| MLC LLM Perf | c40be6a | llama2-7b-chat | q4f16_1 | 5244 | 165.57 | 131.73 |
| llama.cpp | 8183159 | llama2-7b-chat | q4_0 | 5226 | 146.79 | 125.54 |
| llama.cpp | 8183159 | llama2-7b | q4_K_M | 5480 | 138.83 | 114.66 |
| ExLlama | 91b9b12 | llama2-7b-chat | q4_128gs | 5466 | 115.92 | 81.91 |
| ExLlama | 91b9b12 | llama2-7b | q4_32gs_act | 5672 | 107.21 | 73.54 |
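
The relative differences can be recomputed from the table. This sketch assumes the two rightmost columns are tokens per second and copies three llama2-7b-chat rows by hand; the dictionary keys are shorthand labels, not package names.

```python
# (4090 @ 400PL, 3090 @ 360PL) figures copied from the table above.
RESULTS = {
    "MLC LLM Perf": (165.57, 131.73),
    "llama.cpp q4_0": (146.79, 125.54),
    "ExLlama q4_128gs": (115.92, 81.91),
}


def speedup_pct(ours: float, baseline: float) -> float:
    """Percentage by which `ours` exceeds `baseline`."""
    return (ours / baseline - 1.0) * 100.0


if __name__ == "__main__":
    for name in ("llama.cpp q4_0", "ExLlama q4_128gs"):
        gains = [
            speedup_pct(a, b)
            for a, b in zip(RESULTS["MLC LLM Perf"], RESULTS[name])
        ]
        print(name, [f"{g:+.1f}%" for g in gains])
```

Against the llama.cpp q4_0 row this gives roughly 13% on the 4090 and 5% on the 3090, in line with the 5-15% range mentioned above.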

Notes:

  • These numbers are all using prompt 128/inference 1920 tokens (using the new --evaluate options for mlc_chat_cli and the appropriate options for the others) so this is as close to 1:1 as possible
  • I'm not familiar with q4f16_1's perplexity (is that compared anywhere to the other popular quant formats?) - is there documentation on what quantization q4f16_1 actually is? It would be nice to get that into a comparison like https://oobabooga.github.io/blog/posts/perplexities/
  • I need to set CUDA_VISIBLE_DEVICES to control which architecture build.py builds to - when building to 4090 (sm_89), models don't run on my 3090 (sm_86) although if I just build to sm_86, there's no perf difference. I had to use symlinks to switch dist folders - it'd be nice if there were a more elegant way of handling different CUDA kernel versions (eg, being able to generate multiple targets and automatically pick the appropriate one) if that were possible?
  • It was really hard to find instructions on how to compile the CUDA version as someone not familiar w/ relax/tvm/mlc-llm (they are not in any docs, found only in a closed issue from a few months ago) - there are docs for other ways of getting started, so hopefully CUDA will be easier in the future.
  • For anyone looking for some more setup help, I documented my process for getting things working here: https://llm-tracker.info/books/howto-guides/page/nvidia-gpus#bkmrk-mlc-llm - this is for Arch w/ Conda/Mamba, although I ran into some GLIBCXX version issues w/ the "benchmark" branch (but not main).

Those interested in a few more command line specifics for the table I posted btw can view this shared Worksheet: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831

@junrushao
Member Author

junrushao commented Aug 3, 2023

Ah, thanks @lhl for the detailed feedback; this is extremely valuable to us! It's super informative and may be worth a separate thread!

Disclaimer: our CUDA effort is pretty new, and we are planning quite a lot of UX enhancements to improve documentation, usability and performance (there is still huge room for perf squeeze!). Some of our efforts include:

> These numbers are all using prompt 128/inference 1920 tokens (using the new --evaluate options for mlc_chat_cli and the appropriate options for the others) so this is as close to 1:1 as possible

It's great to learn the configuration. Meanwhile, we want to make llm-perf-bench a more general, reproducible benchmarking infra, so that it can test different frameworks under different settings, such as long context prefilling, short conversation, batched inference, distributed inference, etc. @sunggg could probably share more performance comparisons under other settings.

> I'm not familiar with q4f16_1's perplexity (is that compared anywhere to the other popular quant formats?) - is there documentation on what quantization q4f16_1 actually is? It would be nice to get that into a comparison like https://oobabooga.github.io/blog/posts/perplexities/

Good point! I just realized that we didn’t have any documentation for those formats yet

To briefly explain what q4f16_1 is: q4 means 4bit quantization, f16 means fp16 compute, and _1 is an ordinal number that doesn’t have particular meaning. In our case, _1 is always preferred to _0, because they are of the same numeric precision while _1 is faster.
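
The naming scheme described here can be decoded mechanically. This sketch covers only the `q<bits>f<bits>_<n>` pattern explained in this comment; `QuantCode` and `parse_quant_code` are hypothetical names, and other quantization codes may not follow the same pattern.

```python
import re
from typing import NamedTuple


class QuantCode(NamedTuple):
    weight_bits: int   # "q4"  -> 4-bit quantized weights
    compute_bits: int  # "f16" -> fp16 compute
    variant: int       # "_1"  -> ordinal with no particular meaning


def parse_quant_code(code: str) -> QuantCode:
    """Split a code such as "q4f16_1" into its three numeric parts."""
    m = re.fullmatch(r"q(\d+)f(\d+)_(\d+)", code)
    if m is None:
        raise ValueError(f"unrecognized quantization code: {code!r}")
    return QuantCode(*map(int, m.groups()))


if __name__ == "__main__":
    print(parse_quant_code("q4f16_1"))
```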

Regarding perplexity, we use group quantization natively, which is identical to GGML’s format, meaning q4f16_1 should have similar perplexity to GGML’s 4-bit quantization.
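
For readers unfamiliar with the term, group quantization stores one scale per small group of weights instead of one per tensor. The following is a generic, symmetric 4-bit illustration in plain Python; it is not MLC's or GGML's actual kernel or storage layout, and the group size of 32 is only an assumed default.

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Quantize one group to signed integers that share a single scale."""
    qmax = 2 ** (bits - 1) - 1                           # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0   # fall back to 1.0 for an all-zero group
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the integers and their shared scale."""
    return [v * scale for v in q]


def quantize(weights: list[float], group_size: int = 32, bits: int = 4):
    """Split weights into consecutive groups and quantize each one independently."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Because each group has its own scale, one outlier only degrades the precision of its own group, which is why group size is a main knob when comparing perplexity across formats.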

> I need to set CUDA_VISIBLE_DEVICES to control which architecture build.py builds to - when building to 4090 (sm_89), models don't run on my 3090 (sm_86) although if I just build to sm_86, there's no perf difference. I had to use symlinks to switch dist folders - it'd be nice if there were a more elegant way of handling different CUDA kernel versions (eg, being able to generate multiple targets and automatically pick the appropriate one) if that were possible?

Yes this is something we have been working towards in the short term (1 week or so). Basically it is possible to build a fatbin that includes different CUDA architectures, so that we don’t have to switch over.
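
Until a multi-architecture build is available, the symlink workaround can be replaced by choosing a build folder per GPU. This is a sketch under stated assumptions: the compute capability string (for example "8.9" for the 4090 and "8.6" for the 3090, matching the sm_89 and sm_86 targets mentioned above) is obtained elsewhere, and the `dist-<tag>` directory layout is hypothetical, not something build.py produces.

```python
from pathlib import Path


def arch_tag(compute_capability: str) -> str:
    """Turn a compute capability such as "8.9" into a tag such as "sm_89"."""
    major, minor = compute_capability.strip().split(".")
    return f"sm_{major}{minor}"


def pick_dist_dir(root: Path, compute_capability: str) -> Path:
    """Choose a per-architecture build folder (hypothetical "dist-<tag>" layout)."""
    return root / f"dist-{arch_tag(compute_capability)}"


if __name__ == "__main__":
    print(pick_dist_dir(Path("."), "8.6"))
```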

> It was really hard to find instructions on how to compile the CUDA version as someone not familiar w/ relax/tvm/mlc-llm (they are not in any docs, [found only in a closed issue](https://github.com/mlc-ai/mlc-llm/issues/229#issuecomment-1564139277) from a few months ago) - there are docs for other ways of getting started, so hopefully CUDA will be easier in the future.

This is one of our short-term goals (2-3 weeks). We will keep the community updated on CUDA documentation!

> For anyone looking for some more setup help, I documented my process for getting things working here: https://llm-tracker.info/books/howto-guides/page/nvidia-gpus#bkmrk-mlc-llm - this is for Arch w/ Conda/Mamba, although I ran into some GLIBCXX version issues w/ the "benchmark" branch (but not main).

Thanks for sharing, and happy to advocate for your blog post! To share an update: by the end of this week you will no longer have to suffer through compiling TVM yourself, as the prebuilt will be available by then at http://mlc.ai/package/ (well, the package name is mlc-ai, which is weird). BTW, don’t use cuBLAS or cuDNN, as they are relatively slow in our particular case.

Regarding the separate “benchmark” branch, this is my embarrassing quick weekend night hack to get at least something functioning. In fact, all its pieces have been upstreamed as of today. I’m going to deprecate this branch this weekend after the latest TVM wheel is released.

I don’t really know much about the glibc issue tbh. It occurred at times when I forgot to install some dependency - in our case, it’s likely that LLVM depends on a different version of glibc, which we may not have on archlinux. Would you mind sharing the detailed error message?

@sunggg
Contributor

sunggg commented Aug 4, 2023

Hi, @lhl and thank you for sharing your experience with the detailed explanation!

Let me share our latest numbers on llama-2 in our dev branch. (They will be upstreamed soon.)
We put our efforts into optimizing kernel performance, applying more fusion, reducing memory footprint, etc., and also integrated external libraries such as FasterTransformer and CUTLASS.

For measurement, we used 128 tokens each for the prompt and for generation.

| Package | Commit | Model | Quant | A10G |
|---|---|---|---|---|
| ExLlama | e8a544f | llama2-7b-chat | q4_128gs | 79.97 |
| MLC LLM CUDA | dev | llama2-7b-chat | q4f16_1 | 102.19 |
| ExLlama | e8a544f | llama2-13b-chat | q4_128gs | 47.97 |
| MLC LLM CUDA | dev | llama2-13b-chat | q4f16_1 | 57.00 |

Once we finish upstreaming, we will be able to share more exciting results :)
By the way, did you release any reproducible scripts or instructions by chance? We would be happy to try out on our end.

@sleepwalker2017

> @sleepwalker2017 would you like to check out the Dockerfile and see if it works? https://github.com/junrushao/llm-perf-bench
>
> I’m eager to get some feedback, specifically on usability issues

Sorry for the late response, I'll try this repo and see if it works.

@lhl

lhl commented Aug 4, 2023

> Regarding perplexity, we use group quantization natively, which is identical to GGML’s format, meaning q4f16_1 should have similar perplexity to GGML’s 4-bit quantization.

On quantization: is that comparable to GGML q4_0 or q4_1? There's a big perplexity difference between them, and the sweet spot for GGML's perplexity/perf seems to be q4_K_M these days (details on k-quants here: ggerganov/llama.cpp#1684).

I dove into exllama's perplexity code a couple of months ago (https://github.com/turboderp/exllama/blob/master/perplexity.py) and, if I get a chance, will try to see whether something similar can be implemented for MLC LLM so we can run comparisons on the same model w/ different formats, especially since there are so many new optimizations being published (AWQ, SpQR, SqueezeLLM, etc.).

> I don’t really know much about the glibc issue tbh. It occurred at times when I forgot to install some dependency - in our case, it’s likely that LLVM depends on a different version of glibc, which we may not have on archlinux. Would you mind sharing the detailed error message?
>
> By the way, did you release any reproducible scripts or instructions by chance? We would be happy to try out on our end.

No script, but here's my step by step setup: https://llm-tracker.info/books/howto-guides/page/nvidia-gpus#bkmrk-mlc-llm It sounds like there's a lot in motion and I'm traveling this week anyway, so happy to just wait for things to settle. If the benchmark branch is getting rolled into main maybe it doesn't matter, since the former was fine and the latter I got working past that error w/ mamba install cmake (Basically I believe cmake was getting libs confused between the arch symbols (I had libstdc++5 installed which should cover what it wasn't finding) and conda symbols - I suppose since only the benchmark branch was using LLVM that would explain why main was fine). I also had a build problem with Cutlass on my Arch system - what I ended up doing, since the builds were taking forever and failing was just using the Docker image to run build.py on the models and then exported them out to run on my base system. Once things are upstreamed maybe I'll give it another try and open an issue if unable to get things working, or maybe it'll be unnecessary if there are prebuilts.

Also, I do have an old (Radeon VII) ROCm card and I saw the recent checkin so I may give it a spin when I revisit.

For actual usage, I'll keep tabs on the Python API improvements - I think the most useful general thing would probably be an OpenAI API drop-in (chat and completions). I've been using adhoc scripts w/ various engines to do that, although I saw there are some all-in-one bindings like https://github.com/go-skynet/LocalAI as well. While I'm enjoying poking around, as I'm moving some local LLM stuff closer to production I'll probably be looking to do some testing similar to https://hamel.dev/notes/llm/03_inference.html for q4 models w/ different batching, and for handling simultaneous queries (I suppose Apache Bench against a web API would be a good way to test)? For production, I'll be in cloud, so those benchmarks will either be against A100s or L40s most likely. Will drop by the Discord as well.
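
A simultaneous-query test of the kind described above can be sketched without committing to any particular server. In this sketch `send` is a caller-supplied function that submits one prompt to whatever endpoint is under test and returns the number of generated tokens; the stand-in below just sleeps, so no real API or URL is assumed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(
    send: Callable[[str], int], prompts: list[str], workers: int
) -> float:
    """Run `send` over all prompts with `workers` threads; return overall generated tokens/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_tokens = sum(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


if __name__ == "__main__":
    def fake_send(prompt: str) -> int:
        """Stand-in for a real HTTP request: waits briefly and reports 16 tokens."""
        time.sleep(0.05)
        return 16

    for workers in (1, 4):
        rate = measure_throughput(fake_send, ["hi"] * 8, workers)
        print(f"{workers} worker(s): {rate:.0f} tok/s")
```

Sweeping `workers` shows how aggregate throughput changes with concurrency, which is the number that matters once batching is in play.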

@junrushao
Member Author

junrushao commented Aug 7, 2023

@lhl Thanks for the discussion!

Regarding quantization, this information is extremely helpful to us! In the short term (~1 month), we will likely stay with the existing quantization algorithms rather than invent new ones of our own (even though they are quite easy to implement). Instead, we want to make our compiler framework general enough to integrate quantization techniques from the latest research, such as the ones you mentioned.

> It sounds like there's a lot in motion and I'm traveling this week anyway, so happy to just wait for things to settle

The good news is that most of the optimizations landed last week! The Dockerfile is updated accordingly: https://github.com/mlc-ai/llm-perf-bench.

> I also had a build problem with Cutlass on my Arch system - what I ended up doing, since the builds were taking forever

More good news: you no longer have to compile TVM from scratch to get most of the CUDA performance! Everything, including CUTLASS, is included in the prebuilt package: https://github.com/mlc-ai/llm-perf-bench/blob/main/Dockerfile.cu121.mlc#L23.

> Also, I do have an old (Radeon VII) ROCm card and I saw the recent checkin so I may give it a spin when I revisit.

Starting tonight, our nightly TVM wheel is built with ROCm support, based on the latest ROCm release. ROCm is still almost unknown territory to me, and I'm not sure whether it will work on older cards (I have heard of some compatibility issues but haven't validated them myself).

> I'll keep tabs on the Python API improvements - I think the most useful general thing would probably be an OpenAI API drop-in (chat and completions)

We have an initial prototype of a REST API ready, designed with OpenAI-style APIs:

They are quite rough (but at least working) at the moment, and we are actively working on revamping the design: #650
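As a rough illustration of what an OpenAI-style chat request looks like from the client side, here is a minimal Python sketch that builds (but does not send) such a request. The payload shape follows the public OpenAI chat completions format; the host, port, and the assumption that the prototype serves the `/v1/chat/completions` path are hypothetical and should be checked against the actual REST API.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return urllib.request.Request(
        # Assumption: the server mirrors the OpenAI endpoint path.
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values for illustration only: host and port are hypothetical.
req = build_chat_request(
    "http://127.0.0.1:8000", "Llama-2-7b-chat-hf-q4f16_1", "Hello!"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it once a server is listening at that address.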

> For production, I'll be in cloud, so those benchmarks will either be against A100s or L40s most likely

Both A100 and A10G are interesting for production, and the two directions we are heading in are distributed inference (my top priority at the moment) and batching (@MasterJH5574 is on it).

@robertswiecki

robertswiecki commented Aug 11, 2023

Results for AMD RX6800XT + 5950X. Kernel 6.4. Debian 13. Model: Llama-2-7b-chat-hf-q4f16_1

vulkan:
Statistics: prefill: 48.7 tok/s, decode: 52.8 tok/s

rocm:
doesn't work

./mlc_chat_cli --local-id GOAT-7B-Community-q4f16_1 --device rocm
Use MLC config: "/home/user/src/mlc/dist/prebuilt/mlc-chat-GOAT-7B-Community-q4f16_1/mlc-chat-config.json"
Use model weights: "/home/user/src/mlc/dist/prebuilt/mlc-chat-GOAT-7B-Community-q4f16_1/ndarray-cache.json"
Use model library: "/home/user/src/mlc/dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-rocm.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [local_id]  reload model `local_id` from disk, or reload the current model if `local_id` is not specified

Loading model...
Loading finished
Running system prompts...
[19:36:12] /home/user/src/mlc/mlc-llm/3rdparty/tvm/src/runtime/library_module.cc:87: TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: shared object initialization failed
Stack trace:
  File "/home/user/src/mlc/mlc-llm/3rdparty/tvm/src/runtime/rocm/rocm_module.cc", line 105
  [bt] (0) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x13) [0x7fbbe8d12b83]
  [bt] (1) ./mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x24) [0x55f580793ae4]
  [bt] (2) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x216cb4) [0x7fbbe8e16cb4]
  [bt] (3) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::ROCMModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x13e) [0x7fbbe8e199be]
  [bt] (4) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x216e36) [0x7fbbe8e16e36]
  [bt] (5) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::detail::PackFuncPackedArg_<4, tvm::runtime::ROCMWrappedFunc>(tvm::runtime::ROCMWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x6a) [0x7fbbe8e19a5a]
  [bt] (6) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(TVMFuncCall+0x46) [0x7fbbe8cdf156]

Stack trace:
  [bt] (0) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x13) [0x7fbbe8d12b83]
  [bt] (1) ./mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x24) [0x55f580793ae4]
  [bt] (2) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x10f404) [0x7fbbe8d0f404]
  [bt] (3) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x10f5a0) [0x7fbbe8d0f5a0]
  [bt] (4) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x8c0) [0x7fbbe8d8ff30]
  [bt] (5) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x2c7) [0x7fbbe8d8cbd7]
  [bt] (6) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x24d) [0x7fbbe8d8d06d]
  [bt] (7) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(+0x18d455) [0x7fbbe8d8d455]
  [bt] (8) /home/user/src/mlc/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x277) [0x7fbbe8d8b787]

@Golddouble

Golddouble commented Aug 27, 2023

> Hi everyone,
>
> We are looking to gather data points on running MLC-LLM on different hardwares and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!
>
> NOTE: for benchmarking, we highly recommended a device of at least 6GB memory, because the model itself takes 2.9G already. For this reason, it is known that the iOS app will crash on a 4GB iPhone.

HP Intel Desktop PC: What do you mean by 6 GB of memory?
Which of these is correct?
A) This is the RAM on your motherboard
B) This is the RAM that is part of your graphics card (VRAM)
C) This is the total RAM of your graphics card and motherboard together
D) It depends.

Thank you.

@dusty-nv

Psyched that I got MLC to build/run for ARM64 + CUDA!

Results for Jetson AGX Orin 64GB:

* llama-2-7b-chat    36.4 tokens/sec
* llama-2-13b-chat   20.4 tokens/sec
* llama-1-30b         8.3 tokens/sec
* llama-2-70b         3.8 tokens/sec

Results for Jetson Orin Nano 8GB:

* llama-2-7b-chat    10.2 tokens/sec

These are all with q4f16_1 quantization, CUTLASS, and CUDA graphs enabled.

An MLC container that builds wheels from source for JetPack-L4T can be found here: https://github.com/dusty-nv/jetson-containers/tree/dev/packages/llm/mlc
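As a back-of-the-envelope check on why these models fit on the listed devices, weight memory for 4-bit quantized weights can be estimated as parameter count × 4 / 8 bytes. The sketch below is only an approximation: it assumes nominal parameter counts read off the model names, and it ignores quantization scales, any tensors kept at higher precision, the KV cache, and activations, so real usage will be higher.

```python
def weight_gib(num_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB (weights only)."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts taken from the model names; real counts differ slightly.
models = {
    "llama-2-7b-chat": 7e9,
    "llama-2-13b-chat": 13e9,
    "llama-1-30b": 30e9,
    "llama-2-70b": 70e9,
}

for name, params in models.items():
    print(f"{name:18s} ~{weight_gib(params, 4):5.1f} GiB of 4-bit weights")
```

Under these assumptions the 7B model needs a little over 3 GiB for weights, and the 70B model a bit under 33 GiB.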

@junrushao
Member Author

This is very cool! Thanks @dusty-nv for sharing!

@x330930520

x330930520 commented Sep 15, 2023

Performance I got:

| OS | RAM | CPU | GPU | Result | Model | VRAM (used / total) |
|---|---|---|---|---|---|---|
| Windows 11 | 16G | Intel Pentium Gold G5400 | Nvidia MX150 (2GB) | encode: 7.4 tok/s, decode: 8.6 tok/s | rwkv-raven-1b5-q8f16_0 | 1616MiB / 2048MiB |
| Windows 10 | 14G | AMD Athlon™ X4 860K | AMD R7 240 (2GB) | encode: 2.0 tok/s, decode: 2.2 tok/s | rwkv-raven-1b5-q8f16_0 | 1616MiB / 2048MiB |
| Windows 10 | 14G | AMD Athlon™ X4 860K | AMD RX 580 2048SP (8GB) | encode: 9.7 tok/s, decode: 3.6 tok/s | Llama-2-7b-chat-hf-q4f16_1 | 6285MiB / 8096MiB |
| Windows 10 | 14G | AMD Athlon™ X4 860K | AMD RX 580 2048SP (8GB) | encode: 6.2 tok/s, decode: 8.9 tok/s | vicuna-v1-7b-q3f16_0 | 6284MiB / 8096MiB |
| Windows 11 | 32G | Intel i3 8100 | Nvidia GTX 1060 (3GB) | encode: 18.0 tok/s, decode: 17.4 tok/s | rwkv-raven-1b5-q8f16_0 | 2069MiB / 3072MiB |
All other models were OOM upon loading
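For anyone collecting the numbers from this thread into a sheet, results posted as `<phase>: <number> tok/s` (for example "prefill: 48.7 tok/s" or "encode: 7.4 tok/s") can be pulled out with a small regular expression. This is a minimal sketch that assumes the text follows that pattern; it is not part of MLC-LLM.

```python
import re

# Matches pairs such as "prefill: 48.7 tok/s" or "decode: 8.6 tok/s".
_RATE = re.compile(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)\s*tok/s")


def parse_rates(line):
    """Return a dict mapping phase name to tokens per second found in `line`."""
    return {phase.lower(): float(value) for phase, value in _RATE.findall(line)}


print(parse_rates("Statistics: prefill: 48.7 tok/s, decode: 52.8 tok/s"))
print(parse_rates("encode: 7.4 tok/s, decode: 8.6 tok/s"))
```

Lines without any `tok/s` figure simply yield an empty dict, so the function can be mapped over a whole log.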

Screenshots (images not preserved): Nvidia MX150; AMD R7 240; AMD RX 580 2048SP; Nvidia GTX 1060 (3GB).

@Fuckingnameless

hello are p40 cards supported?
what about mali610/rk3588?

@Nero10578

> hello are p40 cards supported? what about mali610/rk3588?

Seeing that the Pascal GTX 10-series cards are supported, the P40 should work too, I think. I have a couple of them and will test this out.
