[PASS] InjectDoubleBuffer #405
Hi @tqchen, may I ask a question? void Visit_(const Variable* op) final {
We've hit a problem caused by this "touched_.erase(...)". Background: we are trying to combine double buffering with the CUDA WMMA intrinsics; the TIR looks like
Here, the CallNode Then, it will be removed from result of
cc @vinx13 to see if you have some comments
@domin1985 @cee1 This pass can only work on regular buffer accesses (e.g.
Hi @vinx13, is there any way to trigger this pass? (I haven't found any "te" way of adding the annotation "software_pipeline_stage" or "software_pipeline_order" ...)
@cee1 It is only supported in the TIR schedule because block information is needed for the analysis. In TIR, there is a schedule primitive
This enables double-buffered prefetching, which can be useful for prefetching into shared memory. One advantage of double buffering is that the logic explicitly prefetches the next stage's input into the shared memory buffer.
Source
Target
Note
Usually, when a GPU fetches memory, there is a long latency before the data arrives. There are two ways to hide this cost:
There is a tradeoff here. Bigger tiles mean more resources (registers) and more reuse, but make the loading cost harder to hide (because we launch fewer threads). Smaller tiles mean more threads and a loading cost that is easier to hide, but less reuse.
Enabling double buffering allows us to get bigger tiles and more reuse with less reliance on context switching.
So enabling it directly may not speed things up (because the old schedule is tuned to contain enough threads to hide the latency). We might need to enable it and also increase the tile size to get a schedule that has more reuse and still hides the loading cost.
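The prefetch pattern described in this PR can be sketched in plain Python. This is only an illustration of the loop structure, not the code generated by the pass: the function names, the tile lists, and the two-slot `shared` list are all hypothetical, and Python runs the load and the compute sequentially, whereas on a GPU the load into the second slot can overlap with the compute on the first.

```python
def dot_tiles_single_buffer(a_tiles, b_tiles):
    """Baseline: load a tile into one buffer, then compute on it."""
    acc = 0
    for a, b in zip(a_tiles, b_tiles):
        shared = (list(a), list(b))  # load; compute has to wait for this
        acc += sum(x * y for x, y in zip(*shared))
    return acc


def dot_tiles_double_buffer(a_tiles, b_tiles):
    """Same result, but tile i+1 is loaded into the other slot
    before tile i is consumed (hypothetical stand-in for shared memory)."""
    n = len(a_tiles)
    if n == 0:
        return 0
    shared = [None, None]  # two buffer slots instead of one
    # Prologue: fetch the first tile before entering the loop.
    shared[0] = (list(a_tiles[0]), list(b_tiles[0]))
    acc = 0
    for i in range(n):
        # Prefetch the next stage's input into the slot not in use.
        if i + 1 < n:
            shared[(i + 1) % 2] = (list(a_tiles[i + 1]), list(b_tiles[i + 1]))
        # Compute on the tile that was fetched one iteration earlier.
        a, b = shared[i % 2]
        acc += sum(x * y for x, y in zip(a, b))
    return acc
```

The second version needs twice the buffer space, which is the resource cost that has to be weighed against the larger tile size discussed in the note.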