After FFT & Winograd, what next? #110
Comments
Hi @nomi-wei, just a clarification: our fast convnet algorithms use Winograd's convolution algorithms. But the same Shmuel Winograd did co-author the Coppersmith-Winograd fast matrix multiplication algorithm, so the confusion is understandable (I probably should not even mention that Winograd also devised fast DFT algorithms ;-)
@andravin Ha-ha, my bad. Thanks for your clarification. It's really really helpful. ;-) Thanks again!
Googling for dp4a turns up a thread with Scott Gray in it as the first hit :-) https://devtalk.nvidia.com/default/topic/934562/cuda-programming-and-performance/nvidia-pascal-geforce-gtx-1080-amp-gtx-1070/post/4889687/ So I would say he's aware of it :-)
I was actually pondering dabbling with ints way back in 2014 http://computer-go.org/pipermail/computer-go/2014-December/007105.html ... but it's just one of many things that never survived contact with finite-hours-in-the-day :-)
Considering the effort involved in making GPUs work, and work quickly, I would think the first thing to do might be to demonstrate, using normal CPU code, that you can get OK results? You could just fire up torch, and create
A few questions which occur:
Hmmm, I'm simply reciting back to you the questions that were put to me when I mentioned the idea myself :-) http://computer-go.org/pipermail/computer-go/2014-December/007106.html
Thanks @scott-gray @andravin for the awesome Winograd work. That really makes small conv kernels run super fast!
And the cuDNN team implemented them so rapidly and nicely, which really makes life much easier. Good job! @jdemouth
After these fancy ideas, I can't help wondering: what can we do next to speed up training?
Following the path of mathematics-based optimization of matrix multiplication, with Winograd we have likely reached the roof. On top of Winograd, I know François Le Gall did some great work around 2014, but there was no breakthrough improvement. (Edit: my bad, thanks @andravin for the reminder.)
Another interesting thing is the lack of FP16x2 support on GP104, which we were really looking forward to; instead, we got full-throughput dp4a, a very powerful int8 computation capability.
I think that in theory 8 bits are enough to carry the information with quantization. How to make good use of dp4a in training is the interesting question.
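To make the dp4a idea concrete, here is a rough CPU-side sketch in plain Python. It emulates what a dp4a-style instruction computes (a 4-way dot product of 8-bit integers added to a 32-bit accumulator) and uses it for a dot product of quantized floats. The helper names `dp4a`, `quantize`, and `int8_dot`, and the simple symmetric per-vector scaling, are assumptions made for illustration only; they are not taken from cuDNN or any other library, and real hardware packs the four bytes into one 32-bit register and wraps on overflow, which this sketch does not model.

```python
def dp4a(a, b, c=0):
    """Emulate a 4-way int8 dot product with accumulate: c + sum(a[i] * b[i]).

    Hypothetical helper: inputs are lists of four signed 8-bit values.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4a takes exactly four values per operand")
    if any(not -128 <= v <= 127 for v in (*a, *b)):
        raise ValueError("operands must fit in a signed 8-bit integer")
    return c + sum(x * y for x, y in zip(a, b))


def quantize(values, scale):
    """Symmetric int8 quantization (an assumed scheme): round(v / scale), clamped."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def int8_dot(xs, ws):
    """Approximate the float dot product of xs and ws using only int8 products."""
    # One scale per vector, chosen so the largest magnitude maps to 127.
    sx = max(abs(v) for v in xs) / 127.0 or 1.0
    sw = max(abs(v) for v in ws) / 127.0 or 1.0
    qx, qw = quantize(xs, sx), quantize(ws, sw)
    # Pad with zeros so the length is a multiple of four.
    pad = (-len(qx)) % 4
    qx += [0] * pad
    qw += [0] * pad
    acc = 0
    for i in range(0, len(qx), 4):
        acc = dp4a(qx[i:i + 4], qw[i:i + 4], acc)
    # Undo both scales to get back to the float domain.
    return acc * sx * sw


xs = [0.5, -1.0, 0.25, 0.75, 0.1]
ws = [1.0, 0.5, -0.5, 0.25, 2.0]
exact = sum(x * w for x, w in zip(xs, ws))
print(exact, int8_dot(xs, ws))
```

This only covers a forward-style dot product. Whether gradients and weight updates survive the same treatment during training is exactly the open question here.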
So I wonder if @scott-gray @andravin @jdemouth @hughperkins @bhack @soumith etc. could share any ideas about this?
Sorry that I can't cc all of you lovely people in the community who care about and contribute to DL performance. Any ideas are warmly welcome!
@soumith, if you think this isn't the proper place to discuss this topic, please close it, and sorry for bothering you. ;)