TODO.gpu

Last Reviewed: Seth T. 2022-10-19

Quality of Life

1. Verify kernel_info.maxThreadsPerBlock <= TPB_DEFAULT in BIT select loop.
2. Add info on what GPU kernels are available.
3. Copy over `s_bit` to the GPU in chucks of 2^20 to reduce gpu memory alloc
4. Improve Peak memory reporting during GPU runs
   See https://stackoverflow.com/questions/11631191/why-does-the-cuda-runtime-reserve-80-gib-virtual

Big improvements

* Automatic tuning for TPB, gpucurves
  gpu_throughput_test could be automated in the code and run when B1 > THRESHOLD
  Getting the wrong gpucurves / TPB pair can cost 2x/4x performance.

* Flag to sleep between kernel calls
  This reduces performance but can help the responsiveness of the computer.
  Even 5ms of pause (~7% performance) made the system seem much more stable.

* Some benchmark (gpu_throughput_test.sh?) to prevent performance regressions.
  Ideally it would record <git commit, GPU, registers, performance at BITS/CURVES> 

* Consider what changes would be needed to run multiple numbers at a time.
  `set_p_2p`, `findfactor`, and `process_results` seem easy to change
  `np0` would have to become an array.
  `n_log2` becames `max_n_log2`.

Testing

1. Improve carry bit testing (see overflow test in check_gpuecm.sage)
2. Ask a C++ person and verify that GPU and CPU both use same endianness
   This could affect `to_mpz`, `from_mpz`, `allocate_and_set_s_bits`, and `set_p_2p`
   This is possibly handled by `endian` parameter in mpz_export

Several things have been tried to improve performance that didn't turn out. These are recorded
so we don't forget.

1. Branchless
    if(carry) { cgbn_sub(r, r, modulus) }
  can be replaced with
    cgbn_sub(r, r, zero_or_modulus[carry]);
  In theory this should help the different threads all stay alligned, in practice it didn't

2. Reducing number of reserved carry bits
  CARRY_BITS can be reduced form 6 to 2(?) be checking carry bit in addition to overflow
  after cgbn_add. This increase the size of number that can be run in cgbn_kernel.
  There's some performance penenalty for this so the code is uncommitted, but this is a big
  improvement for numbers 506-510 and 1018-1022 bits.

3. Removing the bn_t variables CB,DA,AA,BB,k,dK
  Only 2 (or possible 3) temporary variables are needed. This change didn't impact performance
  and hurt readability so was backed out. It's always possible that it could reduce registers

4. Add fast squaring to CGBN, see https://github.com/NVlabs/CGBN/issues/19