
feedback: mmap for keeping Model in VRAM when Flash Attention is used #1717

Open
dan-homebrew opened this issue Nov 23, 2024 · 1 comment

dan-homebrew (Contributor) commented Nov 23, 2024

Goal

Feedback from WiseFarAI:

I am also wondering whether you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed (their current setting) or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation and the generation speed drops from 6 tokens per second to 2.
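
For reference, these are load-time options in llama.cpp-based engines. Below is a minimal sketch of how they look through the llama-cpp-python bindings; the model path is a placeholder and cortex.cpp may expose the equivalent settings under different names (e.g. in its model config), so treat this as illustrative only:

```python
from llama_cpp import Llama

# Illustrative only: shows the load-time options the feedback refers to,
# via the llama-cpp-python bindings. cortex.cpp may name these differently.
llm = Llama(
    model_path="models/example-model-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers so the weights stay in VRAM
    use_mmap=True,     # memory-map the GGUF file instead of copying it into RAM
    use_mlock=True,    # ask the OS not to page the mapped weights out
    flash_attn=True,   # enable Flash Attention kernels where supported
)

print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```

If settings like these were surfaced in the UI, or at least logged at model load, it would be easier to confirm whether a mid-conversation reload (rather than the attention kernel) is what causes the drop from 6 to 2 tokens per second.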

dan-homebrew converted this from a draft issue Nov 23, 2024
dan-homebrew changed the title from "feedback: mmap for keeping Model in VRAM" to "feedback: mmap for keeping Model in VRAM when Flash Attention is used" Nov 23, 2024
gabrielle-ong (Contributor) commented

@vansangpfiev assigning to you to investigate, cc @dan-homebrew - will have a call with Sang next week

Projects
Status: Investigating
Development

No branches or pull requests

3 participants