Implement pure OpenCL batch hashing. #78

Merged
merged 1 commit into from
Jan 20, 2021

@porcuquine porcuquine commented Jan 16, 2021

This PR implements GPU batch hashing in pure OpenCL, in the new proteus module. (Proteus and Triton are both moons of Neptune, hence the name.)

This work is intended to introduce no change in behavior when the gpu feature flag is provided. If the opencl feature flag is provided instead, a new BatcherType, BatcherType::OpenCL, can be used in place of BatcherType::GPU.
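As a rough illustration of how feature-gated selection like this can look, here is a minimal sketch. It is not the crate's actual code: the enum below is a hypothetical stand-in, the variant names GPU and OpenCL come from the description above, and the CPU variant and the preferred_batcher helper are assumptions made for the example.

```rust
// Hypothetical stand-in for the crate's selector type. Only the GPU and
// OpenCL variant names come from the PR text; CPU is an assumed fallback.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatcherType {
    #[cfg(feature = "gpu")]
    GPU,
    #[cfg(feature = "opencl")]
    OpenCL,
    CPU,
}

/// Hypothetical helper: pick whichever accelerated batcher was compiled in,
/// preferring OpenCL, and fall back to the CPU otherwise.
pub fn preferred_batcher() -> BatcherType {
    #[cfg(feature = "opencl")]
    return BatcherType::OpenCL;

    #[cfg(all(feature = "gpu", not(feature = "opencl")))]
    return BatcherType::GPU;

    #[cfg(not(any(feature = "gpu", feature = "opencl")))]
    return BatcherType::CPU;
}
```

Built with neither feature, only the CPU variant exists and preferred_batcher returns it; building with the opencl feature would compile in the OpenCL variant instead.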

This implementation provides the following benefits when compared to the extant neptune-triton GPU implementation:

  • Better performance (almost 2x, see below).
  • Fewer external dependencies (removes dependence on elaborate Futhark code-generation and toolchain).
  • Much less total code.
  • Much lower GPU memory usage.
  • No known problems when multiple batch hashers are used at once (vs. an outstanding bug in the current neptune-triton code path).

Once the opencl feature has been tested and stabilized, it should be made the default, for all of these reasons.

Historical context: although replacing neptune-triton is an obvious next step now, its replacement benefits from the design that went into the Rust interface to neptune-triton, from the development of rust-gpu-tools and cl-ff-gen (neither of which existed at the time of the initial GPU implementation), and from the significant learning that went into neptune-triton's development. Although the current result is simpler, the path to it was not obvious from the outset.

Speedup is almost 2x on column tree building (139.6 s vs. 75.2 s commitment time). See the gbench output below, run on the same 2080Ti for both methods.

gbench with gpu feature:

RUST_LOG=info cargo run --release --features gpu,blst --no-default-features
    Finished release [optimized] target(s) in 0.07s
     Running `/home/porcuquine/dev/neptune/target/release/gbench`
[2021-01-16T00:07:05Z INFO  gbench] KiB: 4194304
[2021-01-16T00:07:05Z INFO  gbench] leaves: 134217728
[2021-01-16T00:07:05Z INFO  gbench] max column batch size: 400000
[2021-01-16T00:07:05Z INFO  gbench] max tree batch size: 700000
[2021-01-16T00:07:05Z INFO  gbench] GPU[Selector: BatcherType::GPU] --> Run 0
[2021-01-16T00:07:05Z INFO  gbench] GPU[Selector: BatcherType::GPU]: Creating ColumnTreeBuilder
[2021-01-16T00:07:05Z INFO  neptune::triton::cl] getting default futhark context
[2021-01-16T00:07:05Z INFO  neptune::triton::cl] getting context for ~Index(0)
[2021-01-16T00:07:06Z INFO  neptune::triton::cl] device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f1f000b5590)), device: Device(DeviceId(0x7f1f000b5ae0)) }
[2021-01-16T00:07:11Z INFO  gbench] GPU[Selector: BatcherType::GPU]: ColumnTreeBuilder created
[2021-01-16T00:07:11Z INFO  gbench] GPU[Selector: BatcherType::GPU]: Using effective batch size 400000 to build columns
[2021-01-16T00:07:11Z INFO  gbench] GPU[Selector: BatcherType::GPU]: adding column batches
[2021-01-16T00:07:11Z INFO  gbench] GPU[Selector: BatcherType::GPU]: start commitment
...............................................................................................................................................................................................................................................................................................................................................
[2021-01-16T00:09:15Z INFO  gbench] GPU[Selector: BatcherType::GPU]: adding final column batch and building tree
[2021-01-16T00:09:31Z INFO  gbench] GPU[Selector: BatcherType::GPU]: end commitment
[2021-01-16T00:09:31Z INFO  gbench] GPU[Selector: BatcherType::GPU]: commitment time: 139.632183641s

gbench with opencl feature:

RUST_LOG=info cargo run --release --features opencl,blst --no-default-features
    Finished release [optimized] target(s) in 0.06s
     Running `/home/porcuquine/dev/neptune/target/release/gbench`
[2021-01-16T00:19:55Z INFO  gbench] KiB: 4194304
[2021-01-16T00:19:55Z INFO  gbench] leaves: 134217728
[2021-01-16T00:19:55Z INFO  gbench] max column batch size: 400000
[2021-01-16T00:19:55Z INFO  gbench] max tree batch size: 700000
[2021-01-16T00:19:55Z INFO  gbench] GPU[Selector: BatcherType::OpenCL] --> Run 0
[2021-01-16T00:19:55Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: Creating ColumnTreeBuilder
[2021-01-16T00:19:56Z INFO  neptune::proteus::gpu] device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f7d240b5510)), device: Device(DeviceId(0x7f7d240b5a60)) }
[2021-01-16T00:19:58Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: ColumnTreeBuilder created
[2021-01-16T00:19:58Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: Using effective batch size 400000 to build columns
[2021-01-16T00:19:58Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: adding column batches
[2021-01-16T00:19:58Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: start commitment
...............................................................................................................................................................................................................................................................................................................................................
[2021-01-16T00:21:04Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: adding final column batch and building tree
[2021-01-16T00:21:14Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: end commitment
[2021-01-16T00:21:14Z INFO  gbench] GPU[Selector: BatcherType::OpenCL]: commitment time: 75.21889048s

@porcuquine porcuquine force-pushed the feat/opencl branch 8 times, most recently from 93a8f10 to 2d6acaa Compare January 16, 2021 04:36
@dignifiedquire
Contributor

Benchmarks on RTX3090

  • GPU (current): 97.686103731s
  • OpenCL (this PR): 52.741545444s
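To make the "almost 2x" figure concrete, the following sketch computes the ratio of the reported commitment times for both cards. The timings are copied from the reports above; the function and constant names are illustrative and do not come from the PR.

```rust
/// Ratio of baseline time to candidate time; values above 1.0 mean the
/// candidate is faster.
pub fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// (card, seconds with the `gpu` feature, seconds with the `opencl` feature),
/// copied from the commitment times reported in this thread.
pub const RUNS: [(&str, f64, f64); 2] = [
    ("RTX 2080 Ti", 139.632183641, 75.21889048),
    ("RTX 3090", 97.686103731, 52.741545444),
];

/// Format one line per card, e.g. "RTX 2080 Ti: 1.86x".
pub fn report() -> Vec<String> {
    RUNS.iter()
        .map(|(card, old, new)| format!("{}: {:.2}x", card, speedup(*old, *new)))
        .collect()
}
```

Both cards land at roughly 1.85x to 1.86x, which matches the "almost 2x" claim in the description.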

@porcuquine porcuquine force-pushed the feat/opencl branch 2 times, most recently from f39156c to 678b26c Compare January 19, 2021 17:48
@dignifiedquire dignifiedquire left a comment

two small notes, nice work

CHANGELOG.md Outdated
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://book.async.rs/overview

## Unreleased

## 2.4.1 - 2021-1-15
Contributor

shouldn’t this go under unreleased technically?

Contributor Author

I'll change it. I was originally hoping to just release this version immediately after.

GPU(GPUBatchHasher<A>),
#[cfg(not(feature = "gpu"))]
Contributor

shouldn’t this be not any gpu or opencl?

Contributor Author

Hmmm... I think that whole variant can be removed now. My latest does that and seems to build fine on macos now.
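For readers unfamiliar with the predicate the reviewer is suggesting, here is a minimal sketch of gating on "neither gpu nor opencl". The Backend enum, the NoAccelerator placeholder, and the helper function are hypothetical and are not taken from the PR, where the variant in question was ultimately removed.

```rust
// Hypothetical placeholder type; not taken from the PR.
#[derive(Debug, PartialEq)]
pub struct NoAccelerator;

// Hypothetical enum used only to show where the predicate goes.
#[derive(Debug, PartialEq)]
pub enum Backend {
    Cpu,
    // Compiled only when neither backend feature is enabled.
    #[cfg(not(any(feature = "gpu", feature = "opencl")))]
    Unavailable(NoAccelerator),
}

/// True when at least one accelerated backend feature was enabled at build time.
pub fn accelerator_compiled_in() -> bool {
    cfg!(any(feature = "gpu", feature = "opencl"))
}
```

With `not(feature = "gpu")` alone, the placeholder variant would also be compiled into an opencl-only build, which is the situation the review comment points at.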
