Optimize opencl and make it default gpu feature. #1420

porcuquine · 2021-02-25T03:18:11Z

#1397 added support for neptune's opencl feature but did not demonstrate the expected performance gain. (The expectation was ~2x. See: argumentcomputer/neptune#78.)

Apparently the bottleneck was the read_range provided by merkletree::store::disk, which ended up calling PoseidonDomain::from_slice on every Fr element when reading from disk. The cleanest fix is probably to optimize read_range itself. However, I was not able to find a simple way to make this change in the face of the layers of generic types.

Instead, I took advantage of definite knowledge that the underlying data is Fr and used an unsafe transmute from bytes.
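
For illustration, here is a minimal sketch of what reinterpreting a byte buffer as fixed-size elements can look like. It is not the code from this PR: `Elem` and `bytes_as_elems` are hypothetical names, and `Elem` is only a stand-in for a 32-byte element, not the real Fr type.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for a 32-byte field element (four 64-bit limbs).
/// This is NOT the real `Fr` type.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Elem([u64; 4]);

/// Reinterpret a byte buffer as a slice of `Elem` without copying.
/// Returns `None` if the buffer is misaligned or not a whole number of elements.
fn bytes_as_elems(bytes: &[u8]) -> Option<&[Elem]> {
    let aligned = (bytes.as_ptr() as usize) % align_of::<Elem>() == 0;
    let whole = bytes.len() % size_of::<Elem>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for `Elem`, the length is a whole number
    // of elements, and every bit pattern is a valid `[u64; 4]`.
    let elems = unsafe {
        std::slice::from_raw_parts(
            bytes.as_ptr() as *const Elem,
            bytes.len() / size_of::<Elem>(),
        )
    };
    Some(elems)
}

fn main() {
    // Start from u64-backed storage so the byte view is aligned for `Elem`.
    let limbs: Vec<u64> = (1..=8).collect();
    let bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(limbs.as_ptr() as *const u8, limbs.len() * size_of::<u64>())
    };
    let elems = bytes_as_elems(bytes).expect("aligned, whole elements");
    assert_eq!(elems, &[Elem([1, 2, 3, 4]), Elem([5, 6, 7, 8])]);
}
```

A reinterpretation like this only yields correct values if the stored bytes already match the type's in-memory representation; otherwise an explicit conversion is still required.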

As previously discussed, this PR makes neptune/opencl the default feature when the gpu feature is active — and removes the gpu2 feature. Whether this should be merged immediately or wait until the gpu2 flag has been released and widely tested depends on whether we already have confidence in it. As far as I know, there is no reason to believe neptune/opencl is problematic: it's dramatically simpler and performs better — but I will let @cryptonemo and/or @dignifiedquire make the decision.

As shown below, column tree building now takes about 71 seconds per base tree, and regular tree building takes about 11 seconds per base tree. This compares with 75 and 10 seconds on the isolated neptune benchmark (gbench). Total time for building both trees is now just over 11 minutes, which is as originally expected.

2021-02-25T01:43:13.825 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree c using the GPU
2021-02-25T01:43:13.825 INFO storage_proofs_porep::stacked::vanilla::proof > Building column hashes
2021-02-25T01:43:13.940 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f914809c1c0)), device: Device(DeviceId(0x7f91480b60e0)) }
2021-02-25T01:44:26.966 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2021-02-25T01:45:38.013 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2021-02-25T01:46:49.541 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2021-02-25T01:48:01.341 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2021-02-25T01:49:12.919 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2021-02-25T01:50:24.656 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2021-02-25T01:51:35.644 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2021-02-25T01:52:46.195 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2021-02-25T01:52:53.150 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f914809c1c0)), device: Device(DeviceId(0x7f91480b60e0)) }
2021-02-25T01:52:55.778 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2021-02-25T01:53:06.627 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2021-02-25T01:53:17.852 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2021-02-25T01:53:30.120 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2021-02-25T01:53:41.409 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2021-02-25T01:53:53.343 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2021-02-25T01:54:02.641 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2021-02-25T01:54:11.945 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2021-02-25T01:54:21.940 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done

cryptonemo · 2021-02-25T12:42:51Z

storage-proofs-porep/src/stacked/vanilla/proof.rs

-                        let encoded_data = last_layer_labels
-                            .read_range(start..end)
-                            .expect("failed to read layer range")
+                        let mut layer_bytes = vec![0u8; (end - start) * std::mem::size_of::<Fr>()];


Since the code is being optimized, I'm concerned about keeping this large allocation around. I know it's not worse than what was there previously, but I suspect we can both reduce it and gain performance by using mmap sort of like this: https://github.com/filecoin-project/bellperson/blob/master/src/groth16/mapped_params.rs#L141

porcuquine · 2021-02-27T17:44:38Z

I think the lifecycle test I ran for this was misconfigured and failed to exercise it — and that this is actually giving the wrong result. I'll try to fix next week.

NOTE: it's not great this was not caught by CI. We should probably at least run the test that would have caught this in CI. I'll adjust the env vars on the relevant CI job too.

porcuquine · 2021-03-02T02:44:16Z

I changed this to now just call bytes_into_fr — since we do indeed need that transformation. The first pass was not fast enough. It showed results somewhere between the starting point and the initial benchmark posted for this PR above. By switching to do the conversion in parallel, I got performance to be comparable to the first version, so I think we're good here now.
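
As a rough sketch of the idea of converting in parallel (not the code in this PR), the following splits a byte buffer into chunks and converts each chunk on its own thread using only the standard library. `Elem`, `bytes_into_elem`, and `convert_parallel` are hypothetical names; `bytes_into_elem` merely stands in for a checked conversion such as bytes_into_fr and does not reproduce its behavior.

```rust
use std::thread;

/// Hypothetical stand-in for a field element; NOT the real `Fr`.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Elem([u64; 4]);

/// Hypothetical stand-in for a checked byte-to-element conversion:
/// decodes four little-endian 64-bit limbs from exactly 32 bytes.
fn bytes_into_elem(bytes: &[u8]) -> Result<Elem, String> {
    if bytes.len() != 32 {
        return Err(format!("expected 32 bytes, got {}", bytes.len()));
    }
    let mut limbs = [0u64; 4];
    for (limb, chunk) in limbs.iter_mut().zip(bytes.chunks_exact(8)) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        *limb = u64::from_le_bytes(buf);
    }
    Ok(Elem(limbs))
}

/// Convert a byte buffer into elements, splitting the work across up to
/// `workers` scoped threads (requires Rust 1.63 or later).
fn convert_parallel(bytes: &[u8], workers: usize) -> Result<Vec<Elem>, String> {
    if bytes.len() % 32 != 0 {
        return Err("buffer is not a whole number of elements".into());
    }
    let n = bytes.len() / 32;
    let mut out = vec![Elem::default(); n];
    if n == 0 {
        return Ok(out);
    }
    // Elements per worker, rounded up so every element is covered.
    let workers = workers.max(1);
    let per = (n + workers - 1) / workers;
    thread::scope(|s| {
        // Spawn all workers first, then join them and surface the first error.
        let handles: Vec<_> = out
            .chunks_mut(per)
            .zip(bytes.chunks(per * 32))
            .map(|(dst, src)| {
                s.spawn(move || -> Result<(), String> {
                    for (d, b) in dst.iter_mut().zip(src.chunks_exact(32)) {
                        *d = bytes_into_elem(b)?;
                    }
                    Ok(())
                })
            })
            .collect();
        handles
            .into_iter()
            .try_for_each(|h| h.join().expect("worker panicked"))
    })?;
    Ok(out)
}

fn main() {
    // Two elements' worth of bytes: limbs 1..=8 in little-endian order.
    let bytes: Vec<u8> = (1u64..=8).flat_map(|l| l.to_le_bytes()).collect();
    let elems = convert_parallel(&bytes, 4).expect("conversion failed");
    assert_eq!(elems, vec![Elem([1, 2, 3, 4]), Elem([5, 6, 7, 8])]);
}
```

Each worker writes to a disjoint slice of the output, so no locking is needed; the per-element cost stays the same, but it is spread over several cores.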

I tried modifying the CI config to use the GPU tree and column builders when running a lifecycle test using the GPU, but CI is not configured to use GPUs at all, apparently. Lifecycle test is passing locally, though.

Here's a benchmark showing a total running time of 11:21 for tree building. That's about 13 seconds slower than before (11:08). We are indeed doing more work with the added conversion, so a small penalty is not surprising.

2021-03-02T02:07:10.400 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree c using the GPU
2021-03-02T02:07:10.400 INFO storage_proofs_porep::stacked::vanilla::proof > Building column hashes
2021-03-02T02:07:10.551 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f819809c1c0)), device: Device(DeviceId(0x7f81980b60e0)) }
2021-03-02T02:08:26.882 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2021-03-02T02:09:40.463 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2021-03-02T02:10:54.277 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2021-03-02T02:12:08.194 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2021-03-02T02:13:20.888 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2021-03-02T02:14:31.574 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2021-03-02T02:15:42.719 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2021-03-02T02:16:53.711 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2021-03-02T02:17:00.716 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f819809c1c0)), device: Device(DeviceId(0x7f81980b60e0)) }
2021-03-02T02:17:03.539 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2021-03-02T02:17:14.632 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2021-03-02T02:17:26.127 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2021-03-02T02:17:38.286 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2021-03-02T02:17:50.742 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2021-03-02T02:18:01.777 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2021-03-02T02:18:11.560 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2021-03-02T02:18:21.320 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2021-03-02T02:18:31.414 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done
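
As a quick check of the totals quoted in this thread, the elapsed time of each run can be computed from the first and last timestamps of its log. A minimal sketch, with illustrative helper names, assuming both timestamps fall on the same day and carry exactly three fractional digits:

```rust
/// Parse the time-of-day part of a log timestamp such as
/// "2021-02-25T01:43:13.825" into milliseconds since midnight.
/// Assumes exactly three fractional digits, as in the logs above.
fn millis_of_day(ts: &str) -> Option<u64> {
    let time = ts.split('T').nth(1)?;
    let mut parts = time.split(':');
    let h: u64 = parts.next()?.parse().ok()?;
    let m: u64 = parts.next()?.parse().ok()?;
    let (s, ms) = parts.next()?.split_once('.')?;
    let s: u64 = s.parse().ok()?;
    let ms: u64 = ms.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1000 + ms)
}

/// Elapsed milliseconds between two same-day timestamps.
fn elapsed_ms(start: &str, end: &str) -> Option<u64> {
    millis_of_day(end)?.checked_sub(millis_of_day(start)?)
}

fn main() {
    // First and last log lines of the two benchmark runs in this thread.
    let first = elapsed_ms("2021-02-25T01:43:13.825", "2021-02-25T01:54:21.940").unwrap();
    let second = elapsed_ms("2021-03-02T02:07:10.400", "2021-03-02T02:18:31.414").unwrap();
    println!(
        "first: {} ms, second: {} ms, diff: {} ms",
        first,
        second,
        second - first
    );
}
```

This gives about 11:08 for the first run and 11:21 for the second, a difference of roughly 13 seconds.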

Here's a passing lifecycle test run on a machine with a GPU:

➜  filecoin-proofs git:(feat/optimize-tree-building) ✗ FIL_PROOFS_USE_GPU_TREE_BUILDER=1 FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1 cargo test --release --features=blst,gpu --no-default-features -- --ignored lifecycle_2k
    Finished release [optimized] target(s) in 0.11s
     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/filecoin_proofs-63e948d1681f8da0

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out

     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/api-67208300f75bdb4d

running 2 tests
test test_seal_lifecycle_2kib_porep_id_v1_1_base_8 ... ok
test test_seal_lifecycle_2kib_porep_id_v1_base_8 ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 19 filtered out

     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/constants-1203d7a0c6bab322

porcuquine · 2021-03-02T03:09:10Z

UPDATE: I tried adding a CI test of tree building running on an actual GPU instance. Let's see whether that works.

This commit makes the test actually run.

porcuquine · 2021-03-03T01:54:39Z

I am closing this in favor of #1422. We may want to use this branch as a starting point, or just reopen the PR, when we are eventually ready to eliminate the gpu2 feature flag, since the changes needed to do so are here.

porcuquine force-pushed the feat/optimize-tree-building branch 2 times, most recently from 755f0ba to 3faf8ec Compare February 25, 2021 03:30

porcuquine marked this pull request as ready for review February 25, 2021 03:33

porcuquine requested review from cryptonemo and dignifiedquire as code owners February 25, 2021 03:33

porcuquine force-pushed the feat/optimize-tree-building branch from 3faf8ec to e264a41 Compare February 25, 2021 03:54

cryptonemo reviewed Feb 25, 2021

porcuquine force-pushed the feat/optimize-tree-building branch 2 times, most recently from 6cbdb5f to ff3152b Compare March 2, 2021 02:36

porcuquine force-pushed the feat/optimize-tree-building branch from ff3152b to 9ff7acb Compare March 2, 2021 03:07

porcuquine force-pushed the feat/optimize-tree-building branch 17 times, most recently from 3d4b663 to 6f33252 Compare March 2, 2021 07:50

porcuquine force-pushed the feat/optimize-tree-building branch 6 times, most recently from 9f82f1c to 17364e7 Compare March 2, 2021 08:49

Optimize opencl and make it default gpu feature.

6d8259b

porcuquine force-pushed the feat/optimize-tree-building branch from 17364e7 to 6d8259b Compare March 2, 2021 08:54

Fix CI

bea4507

This commit makes the test actually run.

porcuquine mentioned this pull request Mar 2, 2021

Optimize tree-building for neptune/opencl (gpu2 flag). #1422

Merged

porcuquine closed this Mar 3, 2021

porcuquine deleted the feat/optimize-tree-building branch September 16, 2021 09:05
