blasfeo_ddot: suggestion for improvement #91
Comments
Yes, what you propose would indeed reduce the latency by 1 clock cycle: from 5 for hadd to 4 = 1 (unpackhi) + 3 (add). In general, level 1 BLAS routines are not so important in what we do and gain much less from optimization than level 2 and especially level 3 routines, so they have received less attention. For me, the most important reason to implement your improvement would be to get rid of the dependency on SSE3 when targeting machines with capabilities only up to SSE2. I don't know if this is the case for you. The choice to target SSE3 (i.e. the Core microarchitecture) was meant as a reasonable trade-off between handiness and availability of ISAs, also on embedded devices, which usually lag a bit behind.
Sure, if you want to make the changes and open a PR, I would be happy to merge it. Otherwise I would leave it as it is for now; other stuff has higher priority from my side. Thanks anyway for the suggestion :)
In the 'reduce' step of `blasfeo_ddot`, a horizontal add (`_mm_hadd_pd`) is computed. Instead, one could replace it with an unpackhi followed by an add, effectively trading a packed double operation for a scalar one.
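To make the suggestion concrete, here is a minimal sketch of such a reduce step using only SSE2 intrinsics. It assumes the partial sums sit in a single `__m128d` accumulator; `reduce_sse2` and `ddot_sketch` are hypothetical names for illustration and are not BLASFEO's actual kernel code. A plain scalar loop is used when SSE2 is not available at compile time.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 only; pmmintrin.h (SSE3) is not needed */

/* Hypothetical helper: sums the two lanes of a packed-double accumulator
 * without _mm_hadd_pd. */
static double reduce_sse2(__m128d acc)
{
    __m128d hi  = _mm_unpackhi_pd(acc, acc); /* hi  = { acc[1], acc[1] } */
    __m128d sum = _mm_add_sd(acc, hi);       /* low lane = acc[0] + acc[1] */
    return _mm_cvtsd_f64(sum);               /* extract the low lane */
}
#endif

/* Hypothetical dot product sketch: two elements per iteration, then the
 * reduce step above, then a scalar loop for any leftover element. */
static double ddot_sketch(int n, const double *x, const double *y)
{
    double result = 0.0;
    int i = 0;
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();
    for (; i + 1 < n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);    /* unaligned loads for simplicity */
        __m128d vy = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(vx, vy));
    }
    result = reduce_sse2(acc);
#endif
    for (; i < n; i++)
        result += x[i] * y[i];
    return result;
}
```

Since only the low lane of the final sum is needed, the scalar `_mm_add_sd` is enough here, and the sketch builds for an SSE2-only target without `-msse3`.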