
[User] Inference time GPU and CPU #1727

Closed
realcarlos opened this issue Jun 7, 2023 · 2 comments

Comments

@realcarlos

LLAMA_METAL=1 make -j && ./main -m ./models/guanaco-7B.ggmlv3.q4_0.bin -p "I love fish" --ignore-eos -n 1024 -ngl 1

llama_print_timings: load time = 7918.69 ms
llama_print_timings: sample time = 1013.54 ms / 1024 runs ( 0.99 ms per token)
llama_print_timings: prompt eval time = 14705.49 ms / 775 tokens ( 18.97 ms per token)
llama_print_timings: eval time = 46435.82 ms / 1020 runs ( 45.53 ms per token)
llama_print_timings: total time = 69981.58 ms

My question is: the eval time seems to be about the same as on the CPU. Is that normal?

MacBook Pro M1, 32 GB
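As a quick sanity check on the log above, the per-token figures llama.cpp reports are just the total times divided by the token counts. A minimal sketch reproducing them from the numbers posted in this issue:

```python
# Back-of-the-envelope throughput from the llama_print_timings output above.
# The raw numbers are copied from the log in this issue.
prompt_eval_ms, prompt_tokens = 14705.49, 775
eval_ms, eval_runs = 46435.82, 1020

prompt_ms_per_tok = prompt_eval_ms / prompt_tokens   # prompt processing
gen_ms_per_tok = eval_ms / eval_runs                 # token generation

print(f"prompt eval: {prompt_ms_per_tok:.2f} ms/token")
print(f"generation:  {gen_ms_per_tok:.2f} ms/token "
      f"({1000.0 / gen_ms_per_tok:.1f} tokens/s)")
```

This recovers the 18.97 ms and 45.53 ms per-token values printed in the log, i.e. roughly 22 tokens/s of generation.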

@ggerganov
Member

Yup, on M1 Pro I also get a similar time for 8-thread CPU compared to GPU: ~45 ms/tok.
My explanation is that the CPU and the GPU each get about 100 GB/s of the M1 Pro's total 200 GB/s memory bandwidth, so parity is expected on this machine.
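The bandwidth argument can be sanity-checked with a rough lower bound: generating one token streams essentially all of the model weights through memory once, so latency is at least model size divided by available bandwidth. The sizes here are assumptions, not measured values: a 7B q4_0 GGML model is roughly 3.8 GB, and ~100 GB/s is the bandwidth attributed above to either the CPU or the GPU alone.

```python
# Bandwidth-bound lower bound on per-token generation latency.
# Both figures below are assumptions for illustration:
model_bytes = 3.8e9   # approx. size of a 7B q4_0 GGML model
bandwidth = 100e9     # bytes/s reachable by CPU or GPU alone on M1 Pro

ms_per_token = model_bytes / bandwidth * 1000
print(f"bandwidth-bound lower bound: {ms_per_token:.0f} ms/token")
```

This gives ~38 ms/token, in the same ballpark as the ~45 ms/token measured for both backends, which is consistent with both the CPU and the GPU being memory-bandwidth limited rather than compute limited.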

@realcarlos
Author

> Yup, on M1 Pro I also get similar time for 8 thread CPU compared to GPU - ~45 ms / tok. My explanation is that the CPU and GPU share 100 GB/s bandwidth each from the total 200 GB/s of M1 Pro so parity is expected for this machine

Got it, sir!
