Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

During fast/light sync, check seal on a small subset of headers #1010

Merged
merged 2 commits into from
Jul 13, 2018

Conversation

gsalgado
Copy link
Contributor

@gsalgado gsalgado commented Jul 12, 2018

For every header batch, pick a random subset and check the seal only on those.

Before this change: Imported 192 headers in 1.33 seconds, new head: #3583663

After this change: Imported 192 headers in 0.53 seconds, new head: #3585558

Closes: #932

@gsalgado gsalgado force-pushed the issue-932 branch 3 times, most recently from 93abba7 to c97ea99 Compare July 12, 2018 11:02
@@ -46,7 +46,7 @@
],
'test': [
"hypothesis==3.44.26",
"pytest~=3.2",
"pytest~=3.3",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed because we now have some tests that use the caplog fixture, which is not available in 3.2

if seal_check_frequency == 1:
headers_to_check_seal = all_indices
else:
headers_to_check_seal = random.sample(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went as far as writing up a big comment with a secure version of random.sample implemented using secrets.choice. But then I deleted it because it felt a bit overly paranoid/tinfoil-hat. But then I still wanted to raise the issue.

I respect that there is very little likelihood that we're exposing attack surface by using random.sample here, but given the potential severity of a security exploit in this type of software, how would you feel about replacing this with such a small utility that uses secure sources of randomness. My basic implementation is below:

def samp(pop: Set, size: int) -> Set:  # could probably use a generic for the items in the set `Set[G]`
    if size >= len(pop):
        return pop
    rejects = set()
    for _ in range(len(pop) - size):
        # tuple cast is needed to use `secrets.choice`
        rejects.add(secrets.choice(tuple(pop.difference(rejects))))
    return pop.difference(rejects)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially wrote this using secrets.choice(), but changed it to random because we need to be python3.5 compatible as this is in the evm package, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct it seems. 👍

if header.parent_hash != parent.hash:
raise ValidationError(
"Invalid header chain; {} has parent {}, but expected {}".format(
header, header.parent_hash, parent.hash))
vm_class = self.get_vm_class_for_block_number(header.block_number)
vm_class.validate_header(header, parent)
if i in headers_to_check_seal:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantic nitpick

When I see inclusion checks I default to wanting the population object to be a set type. Thoughts on casting headers_to_check_seal to a set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

if i in headers_to_check_seal:
check_seal = True
else:
check_seal = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just embed the call to vm_class.validate_header(header, parent, check_seal=True) here and get rid of the check_seal variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, validate_header() does more than checking the seal, and we still want to perform those for every header -- it's only the seal check that we want to skip sometimes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, duh, now I understand what you mean. Will do that

p2p/constants.py Outdated
# (https://github.com/ethereum/go-ethereum/pull/1889#issue-47241762), but in order to err on the
# side of caution, we use 192/4 to ensure we check 4 headers out of every batch (of length
# MAX_HEADERS_FETCH).
FAST_SYNC_SEAL_CHECK_FREQUENCY = 192 // 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's clearer to specify the period here, like:

# check one in 4 headers
FAST_SYNC_SEAL_CHECK_PERIOD = 4

Then, we aren't assuming this value is used with 192 headers, either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI, geth checks 1 in 100, and as I mentioned below we're doing 1 in 48. Why would we do 1 in 4 instead?

Copy link
Contributor

@carver carver Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I see now that the 'frequency` name was throwing me off. So I think my only issue with this is that it should be renamed to period. Frequency would be 1/period, ~0.02.

Edit: frequency weakly implies that the checks are cyclical instead of random, and period implies that even more strongly. Maybe the best solution is to just invert the number and keep it as frequency. Or come up with an alternative name for frequency like chance or probability or likelihood

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about seal_check_random_sample_rate?

            sample_size = len(all_indices) // seal_check_random_sample_rate
            headers_to_check_seal = set(random.sample(all_indices, sample_size))

headers_to_check_seal = all_indices
else:
headers_to_check_seal = random.sample(
all_indices, len(all_indices) // seal_check_frequency)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's written as if the seal_check constant is a period. Otherwise, 192 // (192/4) would only check 4 headers total in a chain of 192.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I intended. In geth they check every 100-th header, and we're doing every 48th

evm/vm/base.py Outdated
@@ -712,7 +713,8 @@ def validate_block(self, block):
)

@classmethod
def validate_header(cls, header: BlockHeader, parent_header: BlockHeader) -> None:
def validate_header(
cls, header: BlockHeader, parent_header: BlockHeader, check_seal: bool) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to default check_seal to True, which should remove some of the boilerplate changes.

Copy link
Contributor

@carver carver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the code as choosing too few headers.

I misread the comment about geth. 👍 to the logic.

Copy link
Contributor

@carver carver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 with a rename from frequency.

Other options would be:

  • FAST_SYNC_SEAL_CHECK_PERIOD
  • FAST_SYNC_SEAL_CHECK_INVERSE_FREQUENCY
  • FAST_SYNC_SEAL_CHECK_ONE_IN_N
  • FAST_SYNC_SEAL_CHECK_EVERY
  • FAST_SYNC_SEAL_CHECK_SKIP (although this is semantically off-by-one)

OR to invert the value to be a frequency (plus the related logic of course), which is easier to name.

Edit: frequency weakly implies that the checks are cyclical instead of random, and period implies that even more strongly. Maybe the best solution is to just invert the number and keep it as frequency. We can drop the cyclical implication entirely with an alternative name like chance or probability or likelihood.

gsalgado added 2 commits July 13, 2018 07:50
For every header batch, pick a random subset and check the seal only on
those.
@gsalgado gsalgado merged commit 54f6ad6 into ethereum:master Jul 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants