@michaelfeil AWQ/GEMM kernels can work for any linear layer. However, applying AWQ to BERT models is challenging because some of the scaling methods it needs are missing there. For example, we would usually apply scaling from a layernorm to the linear layer that follows it.
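To illustrate what scaling from a layernorm to a linear layer means, here is a minimal pure-Python sketch of the general idea: per-channel scales are multiplied into the linear layer's input columns and divided out of the preceding layernorm's weight and bias, so the combined output is unchanged. The helper names (`layernorm`, `linear`, `fold_scales`) and all values are hypothetical and chosen only for illustration; this is not the AWQ library's actual code, and it ignores quantization entirely.

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    # Normalize one vector, then apply the per-channel affine transform.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def linear(x, weight):
    # weight has shape [out_features][in_features]; no bias, for brevity.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def fold_scales(gamma, beta, weight, scales):
    # Hypothetical helper: divide the layernorm's affine parameters by the
    # per-channel scales and multiply the matching weight columns by them.
    new_gamma = [g / s for g, s in zip(gamma, scales)]
    new_beta = [b / s for b, s in zip(beta, scales)]
    new_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    return new_gamma, new_beta, new_weight

# Made-up example values (4 input channels, 2 output features).
x = [0.5, -1.0, 2.0, 0.25]
gamma = [1.0, 0.8, 1.2, 0.9]
beta = [0.1, 0.0, -0.2, 0.05]
weight = [[0.3, -0.7, 0.2, 1.1], [-0.4, 0.6, 0.9, -0.1]]
scales = [2.0, 0.5, 4.0, 1.5]

reference = linear(layernorm(x, gamma, beta), weight)
g2, b2, w2 = fold_scales(gamma, beta, weight, scales)
folded = linear(layernorm(x, g2, b2), w2)

# The two paths should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(reference, folded))
```

The fold is only exact for consumers of the layernorm output that receive the matching scaled weights, which is why the layout of layernorms and linear layers in a given architecture matters.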
If we can speed up BERT models, we will significantly increase throughput in many use cases. A good first step would be to experiment with SentenceTransformers.