speed up LCE-A kernel #1910
Conversation
This pull request was exported from Phabricator. Differential Revision: D47118335
Codecov Report
@@           Coverage Diff           @@
##              main    #1910   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files          173      173
  Lines        15232    15264    +32
=========================================
+ Hits         15232    15264    +32
Summary:
Pull Request resolved: pytorch#1910
X-link: facebook/Ax#1694

The current implementation is very slow. This is particularly problematic when the number of contexts is large. This diff yields huge speed ups (>3 orders of magnitude) by removing nested for loops in `forward` and instead using batched computation. This also caches the context_covar in eval mode, with the additional requirement that the parameters for each context are contiguous and in the same order.

* 14x speed up with 8 contexts and 1,154x speed up with 128 contexts (CPU)
* 22x speed up with 8 contexts and 3,370x speed up with 128 contexts (CUDA)
* Without this diff, it takes 5 seconds on CPU and 12 seconds with CUDA for a single forward pass with 128 contexts.

Current implementation:
* 8 contexts:
  * Forward pass: 20ms (CPU), 45.2ms (CUDA)
  * Roundtrip: 39.1ms (CPU), 99.7ms (CUDA)
* 128 contexts:
  * Forward pass: 5.08s (CPU), 12s (CUDA)
  * Roundtrip: 14.2s (CPU), 26.7s (CUDA)

New implementation:
* 8 contexts:
  * Forward pass: 1.44ms (CPU), 2.05ms (CUDA)
  * Roundtrip: 2.22ms (CPU), 4.65ms (CUDA)
* 128 contexts:
  * Forward pass: 4.4ms (CPU), 3.56ms (CUDA)
  * Roundtrip: 6.97ms (CPU), 5.34ms (CUDA)

Differential Revision: D47118335
fbshipit-source-id: 4faf47e8919ce7a6d31f24c0488ccaad59ccc021
82491ca to fb1c2d0
fb1c2d0 to 86e456b
86e456b to f2adb78
f2adb78 to f7d77a3
f7d77a3 to 95d31df
95d31df to b836717
b836717 to ed2ecf1
This pull request has been merged in 7eb847a.
Summary:
The current implementation is very slow. This is particularly problematic when the number of contexts is large.
This diff yields huge speed ups (>3 orders of magnitude) by removing nested for loops in `forward` and instead using batched computation. This also caches the context_covar in eval mode, with the additional requirement that the parameters for each context are contiguous and in the same order. Illustrative sketches of the batched computation and the caching pattern are included below.

* 14x speed up with 8 contexts and 1,154x speed up with 128 contexts (CPU)
* 22x speed up with 8 contexts and 3,370x speed up with 128 contexts (CUDA)
* Without this diff, it takes 5 seconds on CPU and 12 seconds with CUDA for a single forward pass with 128 contexts.
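To make the batched-computation point concrete, here is a minimal, self-contained PyTorch sketch. It is not the LCEAKernel code: the RBF-style kernel and the per-context embeddings are made-up stand-ins, and the point is only the contrast between per-pair nested loops and a single batched evaluation over all context pairs.

```python
# Toy illustration only -- not the actual LCEAKernel implementation.
# `emb` is a hypothetical per-context embedding; the kernel is a plain
# RBF over embeddings, chosen just to show looped vs. batched evaluation.
import torch

torch.manual_seed(0)
n_contexts, emb_dim = 8, 4
emb = torch.randn(n_contexts, emb_dim)


def context_covar_looped(emb: torch.Tensor) -> torch.Tensor:
    """Nested Python loops: one kernel evaluation per (i, j) pair."""
    n = emb.shape[0]
    covar = torch.empty(n, n)
    for i in range(n):
        for j in range(n):
            covar[i, j] = torch.exp(-(emb[i] - emb[j]).pow(2).sum())
    return covar


def context_covar_batched(emb: torch.Tensor) -> torch.Tensor:
    """One batched computation over all (i, j) pairs at once."""
    sq_dists = torch.cdist(emb, emb).pow(2)  # (n, n) pairwise squared distances
    return torch.exp(-sq_dists)


# Both give the same covariance matrix, but the batched version issues a
# handful of vectorized ops instead of n^2 tiny ones.
assert torch.allclose(context_covar_looped(emb), context_covar_batched(emb), atol=1e-5)
```

Vectorizing this way avoids dispatching many tiny per-pair operations, which is consistent with the even larger speed ups reported on CUDA, where per-op launch overhead is higher.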
Current implementation:
* 8 contexts:
  * Forward pass: 20ms (CPU), 45.2ms (CUDA)
  * Roundtrip: 39.1ms (CPU), 99.7ms (CUDA)
* 128 contexts:
  * Forward pass: 5.08s (CPU), 12s (CUDA)
  * Roundtrip: 14.2s (CPU), 26.7s (CUDA)

New implementation:
* 8 contexts:
  * Forward pass: 1.44ms (CPU), 2.05ms (CUDA)
  * Roundtrip: 2.22ms (CPU), 4.65ms (CUDA)
* 128 contexts:
  * Forward pass: 4.4ms (CPU), 3.56ms (CUDA)
  * Roundtrip: 6.97ms (CPU), 5.34ms (CUDA)
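Timings like the forward-pass and roundtrip numbers above can be collected with a small harness along these lines. This is a generic sketch, not the benchmark script used for this PR; it assumes "roundtrip" means a forward pass followed by a backward pass, and it uses a stand-in `nn.Linear` model rather than the actual kernel.

```python
# Generic micro-benchmark sketch -- not the script behind the numbers above.
# Assumes "roundtrip" = forward + backward; the model here is a stand-in.
import time

import torch
from torch import nn


def time_call_ms(fn, device: str, n_repeats: int = 10) -> float:
    """Average wall-clock time of fn() in milliseconds."""
    fn()  # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA ops are async; sync before and after timing
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_repeats * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)  # stand-in for the model under test
x = torch.randn(1024, 128, device=device)

forward_ms = time_call_ms(lambda: model(x), device)
roundtrip_ms = time_call_ms(lambda: model(x).sum().backward(), device)
print(f"forward: {forward_ms:.2f} ms, roundtrip: {roundtrip_ms:.2f} ms")
```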
Differential Revision: D47118335
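The eval-mode caching can be sketched in the same spirit. This is only an illustration of the caching pattern described in the summary, not BoTorch's implementation; `CachedContextCovar` and its embedding parameter are hypothetical names invented for the example.

```python
# Sketch of the eval-mode caching pattern -- not the actual LCEAKernel code.
# The cache is only valid while the per-context parameters stay contiguous
# and in a fixed order, which is the requirement noted in the summary.
from typing import Optional

import torch
from torch import nn


class CachedContextCovar(nn.Module):
    def __init__(self, n_contexts: int, emb_dim: int) -> None:
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_contexts, emb_dim))
        self._cached_covar: Optional[torch.Tensor] = None

    def _compute(self) -> torch.Tensor:
        # Same batched toy kernel as in the earlier sketch.
        return torch.exp(-torch.cdist(self.emb, self.emb).pow(2))

    def context_covar(self) -> torch.Tensor:
        if self.training:
            # Training mode: parameters change every step, so never cache.
            return self._compute()
        if self._cached_covar is None:
            # Eval mode: compute once, then reuse across forward passes.
            self._cached_covar = self._compute()
        return self._cached_covar

    def train(self, mode: bool = True) -> "CachedContextCovar":
        # Invalidate the cache whenever train/eval mode flips.
        self._cached_covar = None
        return super().train(mode)


module = CachedContextCovar(n_contexts=8, emb_dim=4).eval()
covar_a = module.context_covar()
covar_b = module.context_covar()  # served from the cache
assert covar_a is covar_b
```

Caching only in eval mode keeps training gradients correct (the parameters change every optimization step) while avoiding repeated recomputation at inference time.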