speed up LCE-A kernel #1910
Conversation
This pull request was exported from Phabricator. Differential Revision: D47118335
Codecov Report
@@           Coverage Diff           @@
##              main    #1910   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files          173      173
  Lines        15232    15264    +32
=========================================
+ Hits         15232    15264    +32
Summary:
Pull Request resolved: pytorch#1910
X-link: facebook/Ax#1694

The current implementation is very slow. This is particularly problematic when the number of contexts is large. This diff yields huge speed ups (>3 orders of magnitude) by removing nested for loops in `forward` and instead using batched computation. This also caches the context_covar in eval mode, with the additional requirement that the parameters for each context are contiguous and in the same order.

* 14x speed up with 8 contexts and 1,154x speed up with 128 contexts (CPU)
* 22x speed up with 8 contexts and 3,370x speed up with 128 contexts (CUDA)
* Without this diff, it takes 5 seconds on CPU and 12 seconds with CUDA for a single forward pass with 128 contexts.

Current implementation:
* 8 contexts:
  * Forward pass: 20ms (CPU), 45.2ms (CUDA)
  * Roundtrip: 39.1ms (CPU), 99.7ms (CUDA)
* 128 contexts:
  * Forward pass: 5.08s (CPU), 12s (CUDA)
  * Roundtrip: 14.2s (CPU), 26.7s (CUDA)

New implementation:
* 8 contexts:
  * Forward pass: 1.44ms (CPU), 2.05ms (CUDA)
  * Roundtrip: 2.22ms (CPU), 4.65ms (CUDA)
* 128 contexts:
  * Forward pass: 4.4ms (CPU), 3.56ms (CUDA)
  * Roundtrip: 6.97ms (CPU), 5.34ms (CUDA)

Differential Revision: D47118335
fbshipit-source-id: 4faf47e8919ce7a6d31f24c0488ccaad59ccc021
82491ca to fb1c2d0
fb1c2d0 to 86e456b
86e456b to f2adb78
f2adb78 to f7d77a3
f7d77a3 to 95d31df
95d31df to b836717
b836717 to ed2ecf1
This pull request has been merged in 7eb847a.
Summary:
The current implementation is very slow. This is particularly problematic when the number of contexts is large.
This diff yields huge speed ups (>3 orders of magnitude) by removing nested for loops in `forward` and instead using batched computation. This also caches the context_covar in eval mode, with the additional requirement that the parameters for each context are contiguous and in the same order. Illustrative sketches of the batched computation and the caching pattern are included below.

* 14x speed up with 8 contexts and 1,154x speed up with 128 contexts (CPU)
* 22x speed up with 8 contexts and 3,370x speed up with 128 contexts (CUDA)
* Without this diff, it takes 5 seconds on CPU and 12 seconds with CUDA for a single forward pass with 128 contexts.
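To make the batched-computation point concrete, here is a minimal, self-contained PyTorch sketch. It is not the LCEAKernel code: the RBF-style kernel and the per-context embeddings are made-up stand-ins, and the point is only the contrast between per-pair nested loops and a single batched evaluation over all context pairs.

```python
# Toy illustration only -- not the actual LCEAKernel implementation.
# `emb` is a hypothetical per-context embedding; the kernel is a plain
# RBF over embeddings, chosen just to show looped vs. batched evaluation.
import torch

torch.manual_seed(0)
n_contexts, emb_dim = 8, 4
emb = torch.randn(n_contexts, emb_dim)


def context_covar_looped(emb: torch.Tensor) -> torch.Tensor:
    """Nested Python loops: one kernel evaluation per (i, j) pair."""
    n = emb.shape[0]
    covar = torch.empty(n, n)
    for i in range(n):
        for j in range(n):
            covar[i, j] = torch.exp(-(emb[i] - emb[j]).pow(2).sum())
    return covar


def context_covar_batched(emb: torch.Tensor) -> torch.Tensor:
    """One batched computation over all (i, j) pairs at once."""
    sq_dists = torch.cdist(emb, emb).pow(2)  # (n, n) pairwise squared distances
    return torch.exp(-sq_dists)


# Both give the same covariance matrix, but the batched version issues a
# handful of vectorized ops instead of n^2 tiny ones.
assert torch.allclose(context_covar_looped(emb), context_covar_batched(emb), atol=1e-5)
```

Vectorizing this way avoids dispatching many tiny per-pair operations, which is consistent with the even larger speed ups reported on CUDA, where per-op launch overhead is higher.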
Current implementation:
* 8 contexts:
  * Forward pass: 20ms (CPU), 45.2ms (CUDA)
  * Roundtrip: 39.1ms (CPU), 99.7ms (CUDA)
* 128 contexts:
  * Forward pass: 5.08s (CPU), 12s (CUDA)
  * Roundtrip: 14.2s (CPU), 26.7s (CUDA)

New implementation:
* 8 contexts:
  * Forward pass: 1.44ms (CPU), 2.05ms (CUDA)
  * Roundtrip: 2.22ms (CPU), 4.65ms (CUDA)
* 128 contexts:
  * Forward pass: 4.4ms (CPU), 3.56ms (CUDA)
  * Roundtrip: 6.97ms (CPU), 5.34ms (CUDA)
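Timings like the forward-pass and roundtrip numbers above can be collected with a small harness along these lines. This is a generic sketch, not the benchmark script used for this PR; it assumes "roundtrip" means a forward pass followed by a backward pass, and it uses a stand-in `nn.Linear` model rather than the actual kernel.

```python
# Generic micro-benchmark sketch -- not the script behind the numbers above.
# Assumes "roundtrip" = forward + backward; the model here is a stand-in.
import time

import torch
from torch import nn


def time_call_ms(fn, device: str, n_repeats: int = 10) -> float:
    """Average wall-clock time of fn() in milliseconds."""
    fn()  # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA ops are async; sync before and after timing
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_repeats * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)  # stand-in for the model under test
x = torch.randn(1024, 128, device=device)

forward_ms = time_call_ms(lambda: model(x), device)
roundtrip_ms = time_call_ms(lambda: model(x).sum().backward(), device)
print(f"forward: {forward_ms:.2f} ms, roundtrip: {roundtrip_ms:.2f} ms")
```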
Differential Revision: D47118335
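The eval-mode caching can be sketched in the same spirit. This is only an illustration of the caching pattern described in the summary, not BoTorch's implementation; `CachedContextCovar` and its embedding parameter are hypothetical names invented for the example.

```python
# Sketch of the eval-mode caching pattern -- not the actual LCEAKernel code.
# The cache is only valid while the per-context parameters stay contiguous
# and in a fixed order, which is the requirement noted in the summary.
from typing import Optional

import torch
from torch import nn


class CachedContextCovar(nn.Module):
    def __init__(self, n_contexts: int, emb_dim: int) -> None:
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_contexts, emb_dim))
        self._cached_covar: Optional[torch.Tensor] = None

    def _compute(self) -> torch.Tensor:
        # Same batched toy kernel as in the earlier sketch.
        return torch.exp(-torch.cdist(self.emb, self.emb).pow(2))

    def context_covar(self) -> torch.Tensor:
        if self.training:
            # Training mode: parameters change every step, so never cache.
            return self._compute()
        if self._cached_covar is None:
            # Eval mode: compute once, then reuse across forward passes.
            self._cached_covar = self._compute()
        return self._cached_covar

    def train(self, mode: bool = True) -> "CachedContextCovar":
        # Invalidate the cache whenever train/eval mode flips.
        self._cached_covar = None
        return super().train(mode)


module = CachedContextCovar(n_contexts=8, emb_dim=4).eval()
covar_a = module.context_covar()
covar_b = module.context_covar()  # served from the cache
assert covar_a is covar_b
```

Caching only in eval mode keeps training gradients correct (the parameters change every optimization step) while avoiding repeated recomputation at inference time.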