It looks like your codebase has its own multi-head attention implementation (the MHA module), its own KV-cache implementation, and a generation function that differs from Hugging Face's `generate`.
However, when you load HF models you rely on the HF implementations. Could this introduce discrepancies in the benchmarks?
Is it possible to build a Transformer model using only your codebase, i.e. relying on the local KV-cache and MHA implementations? I sketch below roughly what I have in mind.
`mamba/benchmarks/benchmark_generation_mamba_simple.py`, line 41 at commit `442fab4`
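For concreteness, here is an untested sketch of what I have in mind, assuming `MambaConfig` still exposes the `attn_layer_idx` / `attn_cfg` fields used for the hybrid models and that `MambaLMHeadModel` picks up the repo's own generation loop rather than HF's (the hyperparameters below are arbitrary):

```python
# Untested sketch: an attention-only model built from the repo's local MHA,
# KV cache, and generation code. Assumes MambaConfig exposes
# attn_layer_idx / attn_cfg and that MambaLMHeadModel uses the repo's own
# GenerationMixin rather than HF's generate().
import torch

from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

n_layer = 24
config = MambaConfig(
    d_model=1024,
    n_layer=n_layer,
    vocab_size=50277,
    # Put the local MHA module in every layer so no SSM blocks are used.
    attn_layer_idx=list(range(n_layer)),
    attn_cfg={"num_heads": 16, "causal": True},
)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.bfloat16)

input_ids = torch.randint(0, config.vocab_size, (1, 64), device="cuda")
# This should go through the repo's own decoding loop and its
# InferenceParams-based KV cache, not HF's generation utilities.
out = model.generate(input_ids, max_length=128, temperature=1.0, top_k=1)
```

If something along these lines is supported, it would allow benchmarking a pure-attention baseline against the local KV cache and generation loop instead of HF's.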