Merge pull request #1589 from ifdefelse/eip-1057
Update EIP-1057 to match current ProgPoW spec
gcolvin committed Nov 23, 2018
2 parents 2b8c3e4 + bf4566e commit 063b860
Showing 1 changed file (EIPS/eip-1057.md) with 122 additions and 90 deletions.
---
eip: 1057
title: ProgPoW, a Programmatic Proof-of-Work
author: IfDefElse <ifdefelse@protonmail.com>
discussions-to: https://ethereum-magicians.org/t/eip-progpow-a-programmatic-proof-of-work/272
status: Draft
type: Standards Track
The following is a proposal for an alternate proof-of-work algorithm - **“ProgPoW”** …

## Abstract

The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.

For Ethereum - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block, and undermine the distributed security.

ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

This EIP presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

Until Ethereum transitions to a pure proof-of-stake model, proof-of-work will continue …

Ethash allows for the creation of an ASIC that is roughly twice as efficient as a commodity GPU. Ethash’s memory accesses are paired with a very small amount of fixed compute. Most of a GPU’s capacity and complexity sits idle, wasting power, while waiting for DRAM accesses. A specialized ASIC can implement a much smaller (and cheaper) compute engine that burns much less power.

As miner rewards are reduced with Casper FFG, it will remain profitable to mine on a specialized ASIC long after GPUs have exited the network. This will make it easier for an entity that has access to private ASICs to stage a 51% attack on the Ethereum network.

## Specification

In contrast to Ethash, the changes detailed below make ProgPoW dependent on the …

**Increases the DRAM read from 128 bytes to 256 bytes.**

*The DRAM read from the DAG is the same as Ethash’s, but with the size increased to `256 bytes`. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.*
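As a quick check of the arithmetic, the default parameters listed below give exactly one 256-byte entry per DAG access (16 lanes, each loading 4 `uint32` values):

```cpp
// Sanity check of the access width implied by the defaults below:
// PROGPOW_LANES (16) * PROGPOW_DAG_LOADS (4) * sizeof(uint32_t) (4 bytes) = 256 bytes per access.
static_assert(16 * 4 * sizeof(uint32_t) == 256, "each DAG access reads one 256-byte entry");
```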

The DAG file is generated according to traditional Ethash specifications.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

* `PROGPOW_PERIOD`: Number of blocks before changing the random program; default is `50`.
* `PROGPOW_LANES`: The number of parallel lanes that coordinate to calculate a single hash instance; default is `16`.
* `PROGPOW_REGS`: The register file usage size; default is `32`.
* `PROGPOW_DAG_LOADS`: Number of uint32 loads from the DAG per lane; default is `4`.
* `PROGPOW_CACHE_BYTES`: The size of the cache; default is `16 x 1024`.
* `PROGPOW_CNT_DAG`: The number of DAG accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
* `PROGPOW_CNT_CACHE`: The number of cache accesses per loop; default is `12`.
* `PROGPOW_CNT_MATH`: The number of math operations per loop; default is `20`.

The random program changes every `PROGPOW_PERIOD` blocks (default `50`, roughly 12.5 minutes) to ensure the hardware executing the algorithm is fully programmable. If the program only changed every DAG epoch (roughly 5 days) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.
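As a small worked example (the block numbers here are hypothetical), every block in the same period maps to the same `prog_seed` and therefore to the same random program:

```cpp
// prog_seed is the period index, i.e. block_number / PROGPOW_PERIOD (see progpow_search below).
uint64_t block_number = 7500123;                       // hypothetical block
uint64_t prog_seed    = block_number / PROGPOW_PERIOD; // = 150002 with the default period of 50
// Blocks 7500100 through 7500149 all share prog_seed == 150002, i.e. the same random program.
```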

ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.
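For reference, the `fnv1a` helper used throughout the pseudocode below is the standard 32-bit FNV1a step (offset basis `0x811c9dc5`, prime `0x1000193`); a minimal sketch:

```cpp
const uint32_t FNV_PRIME = 0x1000193;

// Merge the 32-bit value d into the running hash h and return the updated hash.
// Note the FNV1a ordering: XOR first, then multiply (plain FNV1 multiplies first).
uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * FNV_PRIME;
}
```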

```cpp
typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

// http://www.cse.yorku.ca/~oz/marsaglia-rng.html
uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t MWC = ((st.z << 16) + st.w);
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return ((MWC^st.jcong) + st.jsr);
}
```
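The body of `fill_mix` falls inside a collapsed portion of this diff. For reference, the sketch below follows the ProgPoW reference implementation: FNV1a expands the 64-bit seed and the lane id into a per-lane KISS99 state, which then fills every mix register. The exact seeding order shown here is an assumption; the authoritative version is in the linked ProgPOW repository.

```cpp
// Sketch of per-lane mix initialization (seeding order assumed from the reference code)
void fill_mix(uint64_t seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS])
{
    // Use FNV1a to expand the per-hash seed and the lane id into a per-lane KISS99 state
    uint32_t fnv_hash = 0x811c9dc5;
    kiss99_t st;
    st.z = fnv1a(fnv_hash, seed);
    st.w = fnv1a(fnv_hash, seed >> 32);
    st.jsr = fnv1a(fnv_hash, lane_id);
    st.jcong = fnv1a(fnv_hash, lane_id);
    // Use KISS99 to fill all PROGPOW_REGS mix registers for this lane
    for (int i = 0; i < PROGPOW_REGS; i++)
        mix[i] = kiss99(st);
}
```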

Like Ethash, Keccak is used to seed the sequence for each nonce and to produce the final result. The keccak-f800 variant is used, as its 32-bit word size matches the native word size of modern GPUs. The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding. The result of keccak is treated as a 256-bit big-endian number - that is, result byte 0 is the MSB of the value.
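One consequence of this convention is that a fast difficulty check only needs the leading 64 bits of the hash, which is what the `bswap` near the top of `progpow_search` extracts. The names in the short illustration below (`result`, `target64`) are illustrative only, not part of the specification:

```cpp
// The first two 32-bit words of the keccak output form the most significant 64 bits
// of the 256-bit big-endian value, so a cheap preliminary comparison is:
uint64_t top64 = (uint64_t)bswap(result.uint32s[0]) << 32 | bswap(result.uint32s[1]);
bool maybe_valid = (top64 <= target64);  // target64: the upper 64 bits of the 256-bit boundary
```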

```cpp
hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t digest)
{
    uint32_t st[25];

    for (int i = 0; i < 25; i++)
        st[i] = 0;
    for (int i = 0; i < 8; i++)
        st[i] = header.uint32s[i];
    st[8] = seed;
    st[9] = seed >> 32;
    for (int i = 0; i < 8; i++)
        st[10+i] = digest.uint32s[i];

    for (int r = 0; r < 22; r++)
        keccak_f800_round(st, r);

    hash32_t ret;
    for (int i = 0; i < 8; i++)
        ret.uint32s[i] = st[i];
    return ret;
}
```
The flow of the overall algorithm is:
* A keccak hash of the header + nonce to create a seed
* Use the seed to generate initial mix data
* Loop multiple times, each time hashing random loads and random math into the mix data
* Hash all the mix data into a single 256-bit value
* A final keccak hash that is compared against the target
```cpp
bool progpow_search(
    const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
    const uint64_t nonce,
    const hash32_t header,
    const hash32_t target, // miner can use a uint64_t target, doesn't need the full 256 bit target
    const uint32_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
)
{
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
    hash32_t digest;
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = 0;

    // keccak(header..nonce)
    hash32_t seed_256 = keccak_f800_progpow(header, nonce, digest);
    // endian swap so byte 0 of the hash is the MSB of the value
    uint64_t seed = (uint64_t)bswap(seed_256.uint32s[0]) << 32 | bswap(seed_256.uint32s[1]);

    // initialize mix for all lanes
    for (int l = 0; l < PROGPOW_LANES; l++)
        fill_mix(seed, l, mix[l]);

    // execute the randomly generated inner loop
    for (int i = 0; i < PROGPOW_CNT_DAG; i++)
        progPowLoop(prog_seed, i, mix, dag);

    // Reduce mix data to a per-lane 32-bit digest
    uint32_t digest_lane[PROGPOW_LANES];
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        digest_lane[l] = 0x811c9dc5;
        for (int i = 0; i < PROGPOW_REGS; i++)
            fnv1a(digest_lane[l], mix[l][i]);
    }
    // Reduce all lanes to a single 256-bit digest
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = 0x811c9dc5;
    for (int l = 0; l < PROGPOW_LANES; l++)
        fnv1a(digest.uint32s[l%8], digest_lane[l]);
    // keccak(header .. keccak(header..nonce) .. digest);
    return (keccak_f800_progpow(header, seed, digest) <= target);
}
```

The inner loop uses FNV and KISS99 to generate a random sequence from the `prog_seed`. This random sequence determines which mix state is accessed and what random math is performed. Since the `prog_seed` changes relatively infrequently it is expected that `progPowLoop` will be compiled while mining instead of interpreted on the fly.
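For illustration, a host-side mining loop only needs to regenerate and recompile the inner loop when the period index changes; `progpow_compile_kernel` and the other identifiers below are hypothetical placeholders for whatever code-generation path a miner uses, not part of this specification:

```cpp
// Hypothetical host-side control flow: the random program is rebuilt only when
// the period index (block_number / PROGPOW_PERIOD) changes, roughly every 12.5 minutes.
uint64_t current_period = (uint64_t)-1;
while (mining) {
    uint64_t period = block_number / PROGPOW_PERIOD;
    if (period != current_period) {
        progpow_compile_kernel(period);  // regenerate and JIT-compile progPowLoop for this period
        current_period = period;
    }
    search_nonces();                     // run the compiled kernel over a batch of nonces
}
```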

```cpp

kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_cache[PROGPOW_REGS])
{
    kiss99_t prog_rnd;
    uint32_t fnv_hash = 0x811c9dc5;
    prog_rnd.z = fnv1a(fnv_hash, prog_seed);
    prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32);
    prog_rnd.jsr = fnv1a(fnv_hash, prog_seed);
    prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32);
    // Create a random sequence of mix destinations for merge() and mix sources for cache reads
    // guarantees every destination merged once
    // guarantees no duplicate cache reads, which could be optimized away
    // Uses Fisher-Yates shuffle
    for (int i = 0; i < PROGPOW_REGS; i++)
    {
        mix_seq_dst[i] = i;
        mix_seq_cache[i] = i;
    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
        int j;
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_dst[i], mix_seq_dst[j]);
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_cache[i], mix_seq_cache[j]);
    }
    return prog_rnd;
}
```

The main loop:
```cpp
// Helper to get the next value in the per-program random sequence
#define rnd() (kiss99(prog_rnd))
// Helper to pick a random mix location
#define mix_src() (rnd() % PROGPOW_REGS)
// Helper to access the sequence of mix destinations
#define mix_dst() (mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS])
// Helper to access the sequence of cache sources
#define mix_cache() (mix_seq_cache[(mix_seq_cache_cnt++)%PROGPOW_REGS])

void progPowLoop(
    const uint64_t prog_seed,
    const uint32_t loop,
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
    const uint32_t *dag)
{
    // All lanes share a base address for the global load
    // Global offset uses mix[0] to guarantee it depends on the load result
    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % (DAG_BYTES / (PROGPOW_LANES*PROGPOW_DAG_LOADS*sizeof(uint32_t)));
    // Lanes can execute in parallel and will be convergent
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        // global load to the 256 byte DAG entry
        // every lane can access every part of the entry
        uint32_t data_g[PROGPOW_DAG_LOADS];
        uint32_t offset_l = offset_g * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES;
        for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
            data_g[i] = dag[offset_l * PROGPOW_DAG_LOADS + i];

        // initialize the seed and mix destination sequence
        int mix_seq_dst[PROGPOW_REGS];
        int mix_seq_cache[PROGPOW_REGS];
        int mix_seq_dst_cnt = 0;
        int mix_seq_cache_cnt = 0;
        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq_dst, mix_seq_cache);

        int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
        for (int i = 0; i < max_i; i++)
        {
            if (i < PROGPOW_CNT_CACHE)
            {
                // Cached memory access
                // lanes access random 32-bit locations within the first portion of the DAG
                uint32_t offset = mix[l][mix_cache()] % (PROGPOW_CACHE_BYTES/sizeof(uint32_t));
                uint32_t data = dag[offset];
                merge(mix[l][mix_dst()], data, rnd());
            }
            if (i < PROGPOW_CNT_MATH)
            {
                // Random Math
                uint32_t data = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
                merge(mix[l][mix_dst()], data, rnd());
            }
        }
        // Consume the global load data at the very end of the loop to allow full latency hiding
        // Always merge into mix[0] to feed the offset calculation
        merge(mix[l][0], data_g[0], rnd());
        for (int i = 1; i < PROGPOW_DAG_LOADS; i++)
            merge(mix[l][mix_dst()], data_g[i], rnd());
    }
}
```
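The `merge()` and `math()` helpers used above are defined in a portion of the file that this diff view collapses. The sketch below shows their general structure - a KISS99-derived value selects among entropy-preserving merge operations and among cheap GPU math operations - but the exact operation list, case ordering, and rotation handling are assumptions here; the authoritative definitions are in the linked ProgPOW repository. `ROTL32`, `ROTR32`, `mul_hi`, `clz`, and `popcount` map to single GPU instructions.

```cpp
// Structural sketch only - see the reference repository for the authoritative definitions.
// merge() folds new data into an accumulator using operations that preserve entropy.
void merge(uint32_t &a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: a = (a * 33) + b; break;
    case 1: a = (a ^ b) * 33; break;
    case 2: a = ROTL32(a, ((r >> 16) % 31) + 1) ^ b; break; // rotation amount assumed; avoids a rotate-by-0 no-op
    case 3: a = ROTR32(a, ((r >> 16) % 31) + 1) ^ b; break;
    }
}

// math() picks one of several single-cycle GPU operations based on the random value r.
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0:  return a + b;
    case 1:  return a * b;
    case 2:  return mul_hi(a, b);              // upper 32 bits of the 64-bit product
    case 3:  return min(a, b);
    case 4:  return ROTL32(a, b);
    case 5:  return ROTR32(a, b);
    case 6:  return a & b;
    case 7:  return a | b;
    case 8:  return a ^ b;
    case 9:  return clz(a) + clz(b);           // count of leading zeros
    default: return popcount(a) + popcount(b); // count of set bits
    }
}
```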

## Rationale

ProgPoW utilizes almost all parts of a commodity GPU, excluding:
Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency.

## Backwards Compatibility

This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve since twice as much memory is loaded per hash.

## Implementation

Please refer to the official code located at [ProgPOW](https://github.com/ifdefelse/ProgPOW) for the full code, implemented in the standard ethminer.

## Copyright
