Merge pull request #1589 from ifdefelse/eip-1057
Update EIP-1057 to match current ProgPoW spec
gcolvin committed Nov 23, 2018
2 parents 2b8c3e4 + bf4566e commit 063b860
Showing 1 changed file (EIPS/eip-1057.md) with 122 additions and 90 deletions.
---
eip: 1057
title: ProgPoW, a Programmatic Proof-of-Work
author: IfDefElse <ifdefelse@protonmail.com>
discussions-to: https://ethereum-magicians.org/t/eip-progpow-a-programmatic-proof-of-work/272
status: Draft
type: Standards Track
The following is a proposal for an alternate proof-of-work algorithm - **“ProgPoW”** …

## Abstract

The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.

For Ethereum - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block, and undermine the distributed security.

ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

This EIP presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

Until Ethereum transitions to a pure proof-of-stake model, proof-of-work will continue …

Ethash allows for the creation of an ASIC that is roughly twice as efficient as a commodity GPU. Ethash’s memory accesses are paired with a very small amount of fixed compute. Most of a GPU’s capacity and complexity sits idle, wasting power, while waiting for DRAM accesses. A specialized ASIC can implement a much smaller (and cheaper) compute engine that burns much less power.

As miner rewards are reduced with Casper FFG, it will remain profitable to mine on a specialized ASIC long after GPUs have exited the network. This will make it easier for an entity that has access to private ASICs to stage a 51% attack on the Ethereum network.

## Specification

In contrast to Ethash, the changes detailed below make ProgPoW dependent on the …

**Increases the DRAM read from 128 bytes to 256 bytes.**

*The DRAM read from the DAG is the same as Ethash’s, but with the size increased to `256 bytes`. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.*
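As a quick check of the arithmetic, the default parameters listed below give exactly one 256-byte entry per DAG access (16 lanes, each loading 4 `uint32` values):

```cpp
// Sanity check of the access width implied by the defaults below:
// PROGPOW_LANES (16) * PROGPOW_DAG_LOADS (4) * sizeof(uint32_t) (4 bytes) = 256 bytes per access.
static_assert(16 * 4 * sizeof(uint32_t) == 256, "each DAG access reads one 256-byte entry");
```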

The DAG file is generated according to traditional Ethash specifications.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

* `PROGPOW_PERIOD`: Number of blocks before changing the random program; default is `50`.
* `PROGPOW_LANES`: The number of parallel lanes that coordinate to calculate a single hash instance; default is `16`.
* `PROGPOW_REGS`: The register file usage size; default is `32`.
* `PROGPOW_DAG_LOADS`: Number of uint32 loads from the DAG per lane; default is `4`.
* `PROGPOW_CACHE_BYTES`: The size of the cache; default is `16 x 1024`.
* `PROGPOW_CNT_DAG`: The number of DAG accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
* `PROGPOW_CNT_CACHE`: The number of cache accesses per loop; default is `12`.
* `PROGPOW_CNT_MATH`: The number of math operations per loop; default is `20`.

The random program changes every `PROGPOW_PERIOD` blocks (default `50`, roughly 12.5 minutes) to ensure the hardware executing the algorithm is fully programmable. If the program only changed every DAG epoch (roughly 5 days) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.
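As a small worked example (the block numbers here are hypothetical), every block in the same period maps to the same `prog_seed` and therefore to the same random program:

```cpp
// prog_seed is the period index, i.e. block_number / PROGPOW_PERIOD (see progpow_search below).
uint64_t block_number = 7500123;                       // hypothetical block
uint64_t prog_seed    = block_number / PROGPOW_PERIOD; // = 150002 with the default period of 50
// Blocks 7500100 through 7500149 all share prog_seed == 150002, i.e. the same random program.
```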

ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.
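For reference, the `fnv1a` helper used throughout the pseudocode below is the standard 32-bit FNV1a step (offset basis `0x811c9dc5`, prime `0x1000193`); a minimal sketch:

```cpp
const uint32_t FNV_PRIME = 0x1000193;

// Merge the 32-bit value d into the running hash h and return the updated hash.
// Note the FNV1a ordering: XOR first, then multiply (plain FNV1 multiplies first).
uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * FNV_PRIME;
}
```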

```cpp
typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

// http://www.cse.yorku.ca/~oz/marsaglia-rng.html
uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t MWC = ((st.z << 16) + st.w);
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return ((MWC^st.jcong) + st.jsr);
}
```
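The body of `fill_mix` falls inside a collapsed portion of this diff. For reference, the sketch below follows the ProgPoW reference implementation: FNV1a expands the 64-bit seed and the lane id into a per-lane KISS99 state, which then fills every mix register. The exact seeding order shown here is an assumption; the authoritative version is in the linked ProgPOW repository.

```cpp
// Sketch of per-lane mix initialization (seeding order assumed from the reference code)
void fill_mix(uint64_t seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS])
{
    // Use FNV1a to expand the per-hash seed and the lane id into a per-lane KISS99 state
    uint32_t fnv_hash = 0x811c9dc5;
    kiss99_t st;
    st.z = fnv1a(fnv_hash, seed);
    st.w = fnv1a(fnv_hash, seed >> 32);
    st.jsr = fnv1a(fnv_hash, lane_id);
    st.jcong = fnv1a(fnv_hash, lane_id);
    // Use KISS99 to fill all PROGPOW_REGS mix registers for this lane
    for (int i = 0; i < PROGPOW_REGS; i++)
        mix[i] = kiss99(st);
}
```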

Like Ethash, Keccak is used to seed the sequence for each nonce and to produce the final result. The keccak-f800 variant is used, as its 32-bit word size matches the native word size of modern GPUs. The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding. The result of keccak is treated as a 256-bit big-endian number - that is, result byte 0 is the MSB of the value.
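One consequence of this convention is that a fast difficulty check only needs the leading 64 bits of the hash, which is what the `bswap` near the top of `progpow_search` extracts. The names in the short illustration below (`result`, `target64`) are illustrative only, not part of the specification:

```cpp
// The first two 32-bit words of the keccak output form the most significant 64 bits
// of the 256-bit big-endian value, so a cheap preliminary comparison is:
uint64_t top64 = (uint64_t)bswap(result.uint32s[0]) << 32 | bswap(result.uint32s[1]);
bool maybe_valid = (top64 <= target64);  // target64: the upper 64 bits of the 256-bit boundary
```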

```cpp
hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t digest)
{
    uint32_t st[25];

    for (int i = 0; i < 25; i++)
        st[i] = 0;
    for (int i = 0; i < 8; i++)
        st[i] = header.uint32s[i];
    st[8] = seed;
    st[9] = seed >> 32;
    for (int i = 0; i < 8; i++)
        st[10+i] = digest.uint32s[i];

    for (int r = 0; r < 22; r++)
        keccak_f800_round(st, r);

    hash32_t ret;
    for (int i = 0; i < 8; i++)
        ret.uint32s[i] = st[i];
    return ret;
}
```
The flow of the overall algorithm is:
* A keccak hash of the header + nonce to create a seed
* Use the seed to generate initial mix data
* Loop multiple times, each time hashing random loads and random math into the mix data
* Hash all the mix data into a single 256-bit value
* A final keccak hash that is compared against the target
```cpp
bool progpow_search(
    const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
    const uint64_t nonce,
    const hash32_t header,
    const hash32_t target, // miner can use a uint64_t target, doesn't need the full 256 bit target
    const uint32_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
)
{
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
    hash32_t digest;
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = 0;

    // keccak(header..nonce)
    hash32_t seed_256 = keccak_f800_progpow(header, nonce, digest);
    // endian swap so byte 0 of the hash is the MSB of the value
    uint64_t seed = (uint64_t)bswap(seed_256.uint32s[0]) << 32 | bswap(seed_256.uint32s[1]);

    // initialize mix for all lanes
    for (int l = 0; l < PROGPOW_LANES; l++)
        fill_mix(seed, l, mix[l]);

    // execute the randomly generated inner loop
    for (int i = 0; i < PROGPOW_CNT_DAG; i++)
        progPowLoop(prog_seed, i, mix, dag);

    // Reduce mix data to a per-lane 32-bit digest
    uint32_t digest_lane[PROGPOW_LANES];
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        digest_lane[l] = 0x811c9dc5;
        for (int i = 0; i < PROGPOW_REGS; i++)
            fnv1a(digest_lane[l], mix[l][i]);
    }
    // Reduce all lanes to a single 256-bit digest
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = 0x811c9dc5;
    for (int l = 0; l < PROGPOW_LANES; l++)
        fnv1a(digest.uint32s[l%8], digest_lane[l]);
    // keccak(header .. keccak(header..nonce) .. digest);
    return (keccak_f800_progpow(header, seed, digest) <= target);
}
```

The inner loop uses FNV and KISS99 to generate a random sequence from the `prog_seed`. This random sequence determines which mix state is accessed and what random math is performed. Since the `prog_seed` changes relatively infrequently it is expected that `progPowLoop` will be compiled while mining instead of interpreted on the fly.
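For illustration, a host-side mining loop only needs to regenerate and recompile the inner loop when the period index changes; `progpow_compile_kernel` and the other identifiers below are hypothetical placeholders for whatever code-generation path a miner uses, not part of this specification:

```cpp
// Hypothetical host-side control flow: the random program is rebuilt only when
// the period index (block_number / PROGPOW_PERIOD) changes, roughly every 12.5 minutes.
uint64_t current_period = (uint64_t)-1;
while (mining) {
    uint64_t period = block_number / PROGPOW_PERIOD;
    if (period != current_period) {
        progpow_compile_kernel(period);  // regenerate and JIT-compile progPowLoop for this period
        current_period = period;
    }
    search_nonces();                     // run the compiled kernel over a batch of nonces
}
```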

```cpp

kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_cache[PROGPOW_REGS])
{
    kiss99_t prog_rnd;
    uint32_t fnv_hash = 0x811c9dc5;
    prog_rnd.z = fnv1a(fnv_hash, prog_seed);
    prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32);
    prog_rnd.jsr = fnv1a(fnv_hash, prog_seed);
    prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32);
    // Create a random sequence of mix destinations for merge() and mix sources for cache reads
    // guarantees every destination merged once
    // guarantees no duplicate cache reads, which could be optimized away
    // Uses Fisher-Yates shuffle
    for (int i = 0; i < PROGPOW_REGS; i++)
    {
        mix_seq_dst[i] = i;
        mix_seq_cache[i] = i;
    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
        int j;
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_dst[i], mix_seq_dst[j]);
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_cache[i], mix_seq_cache[j]);
    }
    return prog_rnd;
}
```

The main loop:
```cpp
// Helper to get the next value in the per-program random sequence
#define rnd() (kiss99(prog_rnd))
// Helper to pick a random mix location
#define mix_src() (rnd() % PROGPOW_REGS)
// Helper to access the sequence of mix destinations
#define mix_dst() (mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS])
// Helper to access the sequence of cache sources
#define mix_cache() (mix_seq_cache[(mix_seq_cache_cnt++)%PROGPOW_REGS])

void progPowLoop(
    const uint64_t prog_seed,
    const uint32_t loop,
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
    const uint32_t *dag)
{
    // All lanes share a base address for the global load
    // Global offset uses mix[0] to guarantee it depends on the load result
    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % (DAG_BYTES / (PROGPOW_LANES*PROGPOW_DAG_LOADS*sizeof(uint32_t)));
    // Lanes can execute in parallel and will be convergent
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        // global load to the 256 byte DAG entry
        // every lane can access every part of the entry
        uint32_t data_g[PROGPOW_DAG_LOADS];
        uint32_t offset_l = offset_g * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES;
        for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
            data_g[i] = dag[offset_l * PROGPOW_DAG_LOADS + i];

        // initialize the seed and mix destination sequence
        int mix_seq_dst[PROGPOW_REGS];
        int mix_seq_cache[PROGPOW_REGS];
        int mix_seq_dst_cnt = 0;
        int mix_seq_cache_cnt = 0;
        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq_dst, mix_seq_cache);

        int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
        for (int i = 0; i < max_i; i++)
        {
            if (i < PROGPOW_CNT_CACHE)
            {
                // Cached memory access
                // lanes access random 32-bit locations within the first portion of the DAG
                uint32_t offset = mix[l][mix_cache()] % (PROGPOW_CACHE_BYTES/sizeof(uint32_t));
                uint32_t data = dag[offset];
                merge(mix[l][mix_dst()], data, rnd());
            }
            if (i < PROGPOW_CNT_MATH)
            {
                // Random Math
                uint32_t data = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
                merge(mix[l][mix_dst()], data, rnd());
            }
        }
        // Consume the global load data at the very end of the loop to allow full latency hiding
        // Always merge into mix[0] to feed the offset calculation
        merge(mix[l][0], data_g[0], rnd());
        for (int i = 1; i < PROGPOW_DAG_LOADS; i++)
            merge(mix[l][mix_dst()], data_g[i], rnd());
    }
}
```
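The `merge()` and `math()` helpers used above are defined in a portion of the file that this diff view collapses. The sketch below shows their general structure - a KISS99-derived value selects among entropy-preserving merge operations and among cheap GPU math operations - but the exact operation list, case ordering, and rotation handling are assumptions here; the authoritative definitions are in the linked ProgPOW repository. `ROTL32`, `ROTR32`, `mul_hi`, `clz`, and `popcount` map to single GPU instructions.

```cpp
// Structural sketch only - see the reference repository for the authoritative definitions.
// merge() folds new data into an accumulator using operations that preserve entropy.
void merge(uint32_t &a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: a = (a * 33) + b; break;
    case 1: a = (a ^ b) * 33; break;
    case 2: a = ROTL32(a, ((r >> 16) % 31) + 1) ^ b; break; // rotation amount assumed; avoids a rotate-by-0 no-op
    case 3: a = ROTR32(a, ((r >> 16) % 31) + 1) ^ b; break;
    }
}

// math() picks one of several single-cycle GPU operations based on the random value r.
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0:  return a + b;
    case 1:  return a * b;
    case 2:  return mul_hi(a, b);              // upper 32 bits of the 64-bit product
    case 3:  return min(a, b);
    case 4:  return ROTL32(a, b);
    case 5:  return ROTR32(a, b);
    case 6:  return a & b;
    case 7:  return a | b;
    case 8:  return a ^ b;
    case 9:  return clz(a) + clz(b);           // count of leading zeros
    default: return popcount(a) + popcount(b); // count of set bits
    }
}
```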

## Rationale

ProgPoW utilizes almost all parts of a commodity GPU, excluding:
Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency.

## Backwards Compatibility

This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve since twice as much memory is loaded per hash.

## Implementation

Please refer to the official code located at [ProgPOW](https://github.com/ifdefelse/ProgPOW) for the full code, implemented in the standard ethminer.

## Copyright
