Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aviad/blake2s #576

Open
wants to merge 25 commits into
base: V2
Choose a base branch
from
Open

Aviad/blake2s #576

wants to merge 25 commits into from

Conversation

aviadingo
Copy link
Contributor

Describe the changes

adds Blake2s cuda capability.

Copy link
Contributor

@ChickenLover ChickenLover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job with the PR. I left a bunch of style related comments. In theory they can be ignored as we are merging this to V2

*/

#pragma once
typedef unsigned char BYTE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move these inside blake2s namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

THROW_ICICLE_ERR(
IcicleError_t::InvalidArgument,
"Hash max preimage length does not match merkle tree arity multiplied by digest elements");
// if (compression.preimage_max_length < tree_config.arity * tree_config.digest_elements)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just delete those at this point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

__device__ __forceinline__ void cuda_blake2s_init_state(cuda_blake2s_ctx_t* ctx)
{
memcpy(ctx->state, ctx->chain, BLAKE2S_CHAIN_LENGTH);
// ctx->state[8] = ctx->t0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these commented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's from an old version. deleted

return a;
}

__device__ uint32_t cuda_blake2s_ROTR32(uint32_t a, uint8_t b) { return (a >> b) | (a << (32 - b)); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth to add __inline__

cudaMalloc(&cuda_outdata, BLAKE2S_BLOCK_SIZE * n_batch);
assert(keylen <= 32);

// CUDA_BLAKE2S_CTX ctx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

WORD block = (n_batch + thread - 1) / thread;
kernel_blake2s_hash<<<block, thread>>>(cuda_indata, inlen, cuda_outdata, n_batch, BLAKE2S_BLOCK_SIZE);
cudaMemcpy(out, cuda_outdata, BLAKE2S_BLOCK_SIZE * n_batch, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation does not support async. All of our other primitives do. So maybe worth adding HashConfig as an input and changing all the functions to their async alternatives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcm_cuda_blake2s_hash_batch is not a part of the API, it's only for unit tests.
run_hash_many_kernel() is used in the same way as our other implementations

kernel_blake2s_hash<<<block, thread>>>(cuda_indata, inlen, cuda_outdata, n_batch, BLAKE2S_BLOCK_SIZE);
cudaMemcpy(out, cuda_outdata, BLAKE2S_BLOCK_SIZE * n_batch, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
cudaError_t error = cudaGetLastError();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use our error-management functions (you can find an example in any of our primitives)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +12 to +14
extern "C" cudaError_t blake2s_cuda(
BYTE* input, BYTE* output, WORD number_of_blocks, WORD input_block_size, WORD output_block_size, HashConfig& config)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that the parameter order in blake2s_cuda is a bit different from keccak256_cuda (and other keccak functions).

keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)

To keep things consistent, so my suggestion smt like this:

blake2s_cuda(BYTE* input, WORD input_block_size, WORD number_of_blocks, WORD output_block_size, BYTE* output, HashConfig& config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right. I'm not sure if it is worth changing it since this is usually not exposed to the user and it will require updates to the rust signatures as well

@jeremyfelder jeremyfelder self-requested a review November 5, 2024 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants