
Add fetch_candidate_head #265


Merged
merged 37 commits into ethereum:sharding from sharding_vmc_handler_handle_logs on Feb 2, 2018

Conversation

@mhchia mhchia commented Jan 12, 2018

What was wrong?

As #242 described, get_next_log and fetch_candidate_head are needed.

How was it fixed?

get_next_log: web3.eth.filter provides get_new_entries(), which returns the latest "unread" logs from the RPC server, so any logs fetched through it are guaranteed to be new and unseen. Because get_next_log should return only one latest log at a time, the other logs fetched from get_new_entries are saved to a stack (VMC.new_collation_added_logs) and treated as "new" logs waiting to be returned by future get_next_log calls. The stack is maintained as follows: whenever get_next_log is called, first call get_new_entries and push the new logs onto the top of the stack, then return the log on top, which is the latest unseen log.
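A minimal sketch of this scheme (assuming a per-shard web3 filter stored in a hypothetical collation_added_filter attribute; the other names follow the description above):

def get_next_log(self, shard_id):
    # Fetch whatever logs the RPC server has not handed us yet and push them
    # onto the per-shard stack (oldest first, newest on top).
    new_logs = self.collation_added_filter[shard_id].get_new_entries()
    self.new_collation_added_logs[shard_id].extend(new_logs)
    if len(self.new_collation_added_logs[shard_id]) == 0:
        raise NextLogUnavailable("No new logs available")
    # The top of the stack is the latest unseen log.
    return self.new_collation_added_logs[shard_id].pop()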

fetch_candidate_head: implemented the same way as described in doc.md.
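Roughly, the doc.md algorithm looks like the following (a paraphrased sketch using this PR's per-shard variable names, not the exact diff):

def fetch_candidate_head(self, shard_id):
    # First try to return an unchecked log with the score we are currently
    # checking, in order of oldest to most recent.
    for idx in reversed(range(len(self.unchecked_logs[shard_id]))):
        if self.unchecked_logs[shard_id][idx]['score'] == self.current_score[shard_id]:
            return self.unchecked_logs[shard_id].pop(idx)
    # Otherwise advance to the next `is_new_head == True` log, stashing the
    # logs we skip over so they can be checked later.
    while True:
        log_entry = self.get_next_log(shard_id)
        if log_entry['is_new_head']:
            break
        self.unchecked_logs[shard_id].append(log_entry)
    self.current_score[shard_id] = log_entry['score']
    return log_entry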

TODOs

  • Use eth_getLogs over filter.xxx, maintaining new logs ourselves

Cute Animal Picture

put a cute animal picture link inside the parentheses

@mhchia mhchia changed the title Sharding vmc handler handle logs Add get_next_log and fetch_candidate_head Jan 12, 2018

mhchia commented Jan 12, 2018

Note: The reason I currently use raw_log over CollationAdded.log is that bytes is only allowed to be at most 32 bytes in the event declaration in vyper, while the size of collation headers is far larger than that. It should be changed once vyper fixes this issue.

@mhchia mhchia changed the title Add get_next_log and fetch_candidate_head Add fetch_candidate_head Jan 12, 2018

# Update collations_with_score
self.collations_with_score[shard_id][_score][self.num_collations_with_score[shard_id][_score]] = entire_header_hash
self.num_collations_with_score[shard_id][_score] = self.num_collations_with_score[shard_id][_score] + 1
Contributor

Hmm, I believe this PR will be merged sooner than my next PR about VMC, so I think it's good to remove all the collations_with_score and num_collations_with_score code in this PR, since we're using logs now. 🙂

Contributor Author

@mhchia mhchia Jan 13, 2018

No problem. 👍 Thank you

),
)
)
def test_vmc_fetch_candidate_head(vmc,

# For handling logs filtering
# Event:
# CollationAdded(indexed uint256 shard, bytes collationHeader, bool isNewHead, uint256 score)
collation_added_topic = "0x" + keccak(b"CollationAdded(int128,bytes4096,bool,int128)").hex()
Member

I'd case this as COLLATION_ADDED_TOPIC to make it clear that it's a constant value.

Contributor Author

No problem. Will do.

# For handling logs filtering
# Event:
# CollationAdded(indexed uint256 shard, bytes collationHeader, bool isNewHead, uint256 score)
collation_added_topic = "0x" + keccak(b"CollationAdded(int128,bytes4096,bool,int128)").hex()
Member

You can use eth_utils.event_signature_to_log_topic to compute this topic value.
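For example (same event signature as above; encode_hex adds the "0x" prefix):

from eth_utils import (
    encode_hex,
    event_signature_to_log_topic,
)

COLLATION_ADDED_TOPIC = encode_hex(
    event_signature_to_log_topic("CollationAdded(int128,bytes4096,bool,int128)")
)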

Contributor Author

It's cool! will change to it.

self.current_checking_score[shard_id] = None

@to_dict
def parse_collation_added_data(self, data):
Member

This looks like a good stand-alone function that could be split off and tested in isolation.

Contributor Author

Sorry, I don't quite understand. Do you mean it should be moved outside the class and tested in isolation?

def _get_new_logs(self, shard_id):
# use `get_new_entries` over `get_all_entries`
# 1. Prevent from the increasing size of logs
# 2. Leave the efforts maintaining `new_logs` in RPC servers
Member

@pipermerriam pipermerriam Jan 12, 2018

I have a pattern I've been wanting to try out that this might be a good candidate for. It's meant to emulate get_new_entries without any of the complication of installing the filter and tracking the filter_id. It looks roughly like this.

from cytoolz import (
    merge,
    partition_all,
)


def log_filter(web3, filter_params, page_size=10):
    # TODO: handle pending/latest/earliest
    from_block = filter_params['fromBlock']
    to_block = filter_params['toBlock']

    # TODO: handle dynamic block ranges to enforce reasonable request sizes.
    for block_number in range(from_block, to_block + 1):
        params = merge(filter_params, {'fromBlock': block_number, 'toBlock': block_number})
        logs = web3.eth.getLogs(params)

        for page in partition_all(page_size, logs):
            yield page

Issue in web3.py here: ethereum/web3.py#551

I suspect you could implement a specialized version of this that would be more robust than relying on filters on the server, eliminating the need to handle things like re-installing the filter if it gets lost.

Contributor Author

@mhchia mhchia Jan 13, 2018

No problem.
Thank you for letting me know about this. I hadn't considered the situation where filters get lost on the server before.
I will do it. :)

Contributor Author

@mhchia mhchia Jan 14, 2018

Since I just found that getLogs has not been implemented in eth_tester, I copied some code from eth_tester.create_log_filter and eth_tester._process_block_logs into a new function eth_tester.get_logs, and made web3.py call it. It now seems to work fine (still needs testing).

One question: for now I plan to save the block_number every time I fetch new logs with eth_getLogs, to identify the latest block up to which our logs have been synced. Should I handle the situation where forks happen? If we take mainchain forks into consideration, it seems more complicated, and we would need to store more state.
Thank you for your wisdom:)

Contributor Author

@mhchia mhchia Jan 15, 2018

@pipermerriam I'm sorry that after thinking for a while, I still have some questions about the code.

  1. Is it correct that we must specify fromBlock and toBlock to ensure the logs we fetch fall within that block range? If so, outside log_filter there should be something recording the range of blocks that have already been queried, right?
  2. Why not just use
    for block_number in range(from_block, to_block + 1):
        params = merge(filter_params, {'fromBlock': block_number, 'toBlock': block_number})
        logs = web3.eth.getLogs(params)
        for log in logs:
            yield log

instead of

    for block_number in range(from_block, to_block + 1):
        params = merge(filter_params, {'fromBlock': block_number, 'toBlock': block_number})
        logs = web3.eth.getLogs(params)
        for page in sliding_window(page_size, logs):
            yield page

? What is the purpose of using sliding_window?
Since there would be a lot of overlap between successive next() calls on log_filter, why not just use yield log?
e.g.

>>> logs = [1,2,3,4,5]
>>> from cytoolz import sliding_window
>>> z = sliding_window(3, logs)
>>> next(z)
(1, 2, 3)
>>> next(z)
(2, 3, 4)
>>> next(z)
(3, 4, 5)

Thank you so much for the directions!

Member

I have no idea why I put sliding_window in there. I think I got my wires crossed while hammering out that example code. Will try to get you answers to other questions asap but I'm out of time today.

Member

Instead of sliding window, it should be toolz.partition_all

Contributor Author

hmm that makes sense.
Thank you so much.
Please take your time. It is no hurry at all!

Member

@pipermerriam pipermerriam Jan 16, 2018

One question: for now I plan to save the block_number every time I fetch new logs with eth_getLogs, to identify the latest block up to which our logs have been synced. Should I handle the situation where forks happen? If we take mainchain forks into consideration, it seems more complicated, and we would need to store more state.

Here is some untested code that won't actually work but should demonstrate the idea:

from eth_utils import (
    reversed_return,
)

@reversed_return
def get_previous_headers(w3, history_size=256):
    block = w3.eth.getBlock('latest')
    yield block['hash']

    # initialize the list of recent hashes
    for _ in range(history_size - 1):
        block = w3.eth.getBlock(block['parentHash'])
        yield block['hash']

        # break the loop if we hit the genesis block.
        if block['number'] == 0:
            break


def check_chain_head(w3, recent_hashes, history_size=256):
    block = w3.eth.getBlock('latest')

    new_hashes = []

    for _ in range(history_size):
        if block['hash'] in recent_hashes:
            break
        new_hashes.append(block['hash'])
        block = w3.eth.getBlock(block['parentHash'])
    else:
        raise Exception('No common ancestor found for block: {0}'.format(block['hash']))

    first_common_ancestor_idx = recent_hashes.index(block['hash'])

    unchanged_hashes = recent_hashes[:first_common_ancestor_idx]
    revoked_hashes = recent_hashes[first_common_ancestor_idx:]

    recent_hashes = recent_hashes + tuple(new_hashes)
    recent_hashes = recent_hashes[-1 * history_size:]
    return revoked_hashes, new_hashes

The idea is to build a utility that will track the most recent hashes in the head of the chain in such a way that it can provide you with two values.

  • the block hashes that were revoked from the head of the chain due to re-org
  • the new block hashes that we haven't yet seen

You can use the revoked hashes to roll back any logs that you processed which are no longer valid (assuming you need to do that).

You can use the new hashes to define the fromBlock and toBlock for fetching any new logs that you haven't yet seen.

Questions? Does this make sense? The code above could probably more easily be written as a class.
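For illustration, an untested class-based sketch of the same idea (the ChainHeadTracker name and method names are hypothetical, not the code in this PR):

class ChainHeadTracker:
    """Tracks recent block hashes so that re-orgs can be detected."""

    def __init__(self, w3, history_size=256):
        self.w3 = w3
        self.history_size = history_size
        # oldest -> newest
        self.recent_hashes = self._build_recent_hashes()

    def _build_recent_hashes(self):
        block = self.w3.eth.getBlock('latest')
        hashes = [block['hash']]
        for _ in range(self.history_size - 1):
            if block['number'] == 0:
                break
            block = self.w3.eth.getBlock(block['parentHash'])
            hashes.append(block['hash'])
        return tuple(reversed(hashes))

    def poll(self):
        """Return (revoked_hashes, new_hashes) since the last call."""
        block = self.w3.eth.getBlock('latest')
        new_hashes = []
        # walk backwards from the head until we hit a block we already know
        for _ in range(self.history_size):
            if block['hash'] in self.recent_hashes:
                break
            new_hashes.append(block['hash'])
            block = self.w3.eth.getBlock(block['parentHash'])
        else:
            raise Exception(
                'No common ancestor found within the last {0} blocks'.format(self.history_size)
            )
        first_common_ancestor_idx = self.recent_hashes.index(block['hash'])
        revoked_hashes = self.recent_hashes[first_common_ancestor_idx + 1:]
        new_hashes = tuple(reversed(new_hashes))  # oldest -> newest
        self.recent_hashes = (
            self.recent_hashes[:first_common_ancestor_idx + 1] + new_hashes
        )[-self.history_size:]
        return revoked_hashes, new_hashes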

Contributor Author

@mhchia mhchia Jan 16, 2018

It's a beautiful and simple idea! This way we can focus on the revoked_changes and unseen_and_canonical_changes, and not worry about the other re-orgs between them. I think I can implement the utility soon based on your code.
Still, one question: why is history_size set to 256? Why shouldn't it be larger or smaller?
I really appreciate your help and patience!

Member

256 was a totally arbitrary number. Should probably be configurable.

yield pipe(
log['data'],
decode_hex,
self.parse_collation_added_data,
Member

Seems like moving the decode_hex into parse_collation_added_data would make this a lot easier to read.

Contributor Author

yep you're right. I have changed it. Thank you:)

raise FilterNotFound(
"CollationAdded filter haven't been set up in shard {}".format(shard_id)
)
for i in range(len(self.unchecked_collation_added_logs[shard_id]) - 1, -1, -1):
Member

I think this can be simplified with:

for log_entry in reversed(self.unchecked_collation_added_logs[shard_id]):
    if log_entry == ...:
        ...

Contributor Author

Thank you for pointing it out. Have changed it.


mhchia commented Jan 18, 2018

@pipermerriam I implemented your algorithm, and made a PR for eth_getLogs ethereum/eth-tester#50
Could you take a look and see whether anything is wrong when you have spare time?
If there is no issue with that PR, I will make a PR in web3.py to enable eth_getLogs for eth_tester in web3.py.
Thank you!

@pipermerriam
Member

👍 will get your eth-tester pr reviewed and merged as soon as the travis-ci log-jam clears.

)


class LogHandler(object):
Member

you can drop the (object) part here since we're in python3.

# ----------> higher score
self.recent_block_hashes = self.get_recent_block_hashes()

def get_recent_block_hashes(self):
Member

This would make a good stand-alone function. I believe there are other places in the code that do this exact same logic. I think one of @hwwhww 's recent PRs introduced similar code.

Contributor

There's the VM.get_prev_hashes(last_block_hash, db) function for the main chain client to build the list of the previous 256 header hashes:

py-evm/evm/vm/base.py

Lines 418 to 431 in 10d29c4

@classmethod
@to_tuple
def get_prev_hashes(cls, last_block_hash, db):
    if last_block_hash == GENESIS_PARENT_HASH:
        return
    block_header = get_block_header_by_hash(last_block_hash, db)
    for _ in range(MAX_PREV_HEADER_DEPTH):
        yield block_header.hash
        try:
            block_header = get_parent_header(block_header, db)
        except (IndexError, BlockNotFound):
            break

Contributor Author

Thank you:)

reversed_recent_hashes = tuple(reversed(recent_hashes))
return reversed_recent_hashes

def check_chain_head(self):
Member

I think this would also make a good stand-alone function, especially because it would be easy to test in isolation which would be nice.

Using this would then look something like this.

last_seen_recent_block_hashes = get_recent_block_hashes(w3, 256)
# .. then we check for changes
revoked_hashes, new_block_hashes = check_chain_head(w3, recent_block_hashes)
new_recent_block_hashes = tuple(concatv(
    last_seen_recent_block_hashes[:-1 * len(revoked_hashes)],
    new_block_hashes,
))[-1 * 256:]

This would reduce the size of this class significantly and probably make the code that actually uses it a bit easier to parse.

Contributor Author

Done. Thank you for the wise instruction.
May I ask how to determine which functions should stay in the class and which should be moved outside LogHandler? Does get_new_logs stay in the class because it is the only interface used by VMCHandler?
Can check_chain_head, get_recent_block_hashes, and preprocess_block_param be moved outside LogHandler because they only depend on web3 and can be used directly, without the log-handling logic?
Sorry for the newbie question. I just want to know the better criteria for designing the class.

Member

No need to apologize and I'm all for asking questions of any kind.

My general rule of thumb is the following loose collection of axioms and guidelines.

  • classes are typically stateful and statefullness is hard to manage.
  • functions (especially pure ones) are easier to reason about than classes.
  • a class with only one function is just a function disguised as a class. This also regularly applies to classes with only 2-3 functions.
  • a method that doesn't use self should almost always be a function.
  • it is ok to include convenience functions using @staticmethod, but normally should only be done if the convenience is external, meaning that outside actors who use this class almost always need the convenience function. If it's only needed internally, it should just be used as a stand-alone function.

Contributor Author

Thank you so much for these guidelines! They are really helpful.
Just to conclude: is it correct that, in general, we should prefer functions over classes (as pointed out in https://github.com/pipermerriam/ethereum-dev-tactical-manual#the-zen-of-the-python-ethereum-team), and that the right time to choose a class over functions is when we have to maintain state alongside a set of functions (and that functions which don't need to live inside the class should be moved out as stand-alone functions where possible)?

'latest': current_block_number,
'pending': current_block_number + 1,
}
return mapping.get(block_param, block_param)
Member

If I'm reading this correctly, this will return block_param as-is if it isn't one of earliest/latest/pending. I think this would be better written as a 4-branch if/elif/else statement which handles each of earliest/latest/pending and raises an exception in the final else, since that should be an error case.
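A hedged sketch of that structure (the helper name follows the preprocess_block_param mentioned elsewhere in this PR; the real helper also accepts plain integers, as a later snippet shows):

def preprocess_block_param(block_param, current_block_number):
    if block_param == 'earliest':
        return 0
    elif block_param == 'latest':
        return current_block_number
    elif block_param == 'pending':
        return current_block_number + 1
    else:
        # unknown string parameters are an error rather than being passed through
        raise ValueError("Unrecognized block parameter: {0}".format(block_param))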

Contributor Author

Ok no problem:) Will fix it.

if isinstance(block_param, int):
return block_param
elif not isinstance(block_param, str):
raise ValueError("block parameter {0} is in wrong type".format(block_param))
Member

Instead of saying, "wrong type" it would likely make for a more helpful error message to invert this and say "must be str type"

new_collation_added_logs = {}
# shard_id -> list
# newer <---------------> older
unchecked_collation_added_logs = {}
Member

Same here about mutable data structures and setting this in the __init__

self.shards.add(shard_id)
self.unchecked_collation_added_logs[shard_id] = []
self.new_collation_added_logs[shard_id] = []
self.current_checking_score[shard_id] = None
Member

I think that you can remove all of this initialization if you use collections.defaultdict instances for all three of these data structures.

self.unchecked_collation_added_logs = collections.defaultdict(list)
self.new_collation_added_logs = collections.defaultdict(list)
self.current_checking_score = {}

Then you can access both self.unchecked_collation_added_logs[shard_id] and self.new_collation_added_logs[shard_id] as if they have already been initialized thanks to the magic of defaultdicts.

This should allow you to remove all of the initialization and ensure_ initialization code.

Contributor Author

It's a good idea. Thank you for pointing that out. I will change it.


def get_next_log(self, shard_id):
new_logs = self._get_new_logs(shard_id)
self.new_collation_added_logs[shard_id] += new_logs
Member

I think .extend(new_logs) is the more canonical way to do this.

# checking in order of oldest to most recent.
self.ensure_shard_variables_initialied(shard_id)

for i in reversed(range(len(self.unchecked_collation_added_logs[shard_id]))):
Member

I think this is more readable as:

# tuple(..) call is to force the iterator to fully evaluate so that the later `.pop(..)` doesn't cause problems.
unchecked_logs = tuple(enumerate(self.unchecked_collation_added_logs[shard_id]))[::-1]
current_score = self.current_checking_score[shard_id]

for idx, log_entry in unchecked_logs:
    if log_entry['score'] == current_score:
        return self.unchecked_collation_added_logs[shard_id].pop(idx)

It's really just moving things around, but it removes one index-based lookup and stores the current score in a local variable, which makes the inner loop more legible and easier to follow.

Contributor Author

Thank you for the suggestion! It is way more readable. Will change it.

self.default_privkey = default_privkey
self.default_sender_address = default_privkey.public_key.to_canonical_address()
self.config = get_sharding_config()
super().__init__(*args, **kwargs)

def init_shard_variables(self, shard_id):
self.shards.add(shard_id)
self.unchecked_collation_added_logs[shard_id] = []
Member

What do you think about renaming these variables to self.unchecked_logs, self.new_logs, and self.current_score? That should make a lot of the lines in this class way shorter and easier to read.

Contributor Author

Will do. Actually, the reason I used such a long name is that I thought there would be multiple types of logs to handle, so I made the name specific.
But since those aren't handled yet, I think it's a good idea to rename them to shorter names for now. Thank you.

Member

@pipermerriam pipermerriam left a comment

I think there are some bugs in this but I'm not sure. Can you look into some of the comments about whether unchecked_logs is correctly ordered?

# older <---------------> newer
self.new_logs = defaultdict(list)
# shard_id -> list
# newer <---------------> older
Member

I think this isn't accurate.

  • Let's say we get logs 1, 2, 3, 4, 5
  • Log 1 is the oldest
  • Log 5 is the newest
  • They come in as two batches, 1, 2, 3 and then when we query again after some blocks, 4, 5

In this case, I think that unchecked_logs will be 3, 2, 1, 5, 4. Maybe I'm reading the code wrong...?

Contributor Author

I think you're right. It is not accurate. I will remove this comment.

Contributor Author

@mhchia mhchia Jan 22, 2018

The fetch_candidate_head algorithm comes from https://github.com/ethereum/sharding/blob/develop/docs/doc.md#fetch-candidate-heads-in-reverse-sorted-order .
Its idea is:

The idea is that this algorithm is guaranteed to check potential head candidates in highest-to-lowest sorted order of score, with the second priority being oldest to most recent.

I am thinking of one example, which follows what you assumed above.
Assume we have batch1 = [1, 2, 3] and batch2 = [4, 5].
First, batch1 arrives and we have new_logs = [1, 2, 3].
Before batch2 arrives, the outcome of fetch_candidate_head will be 3, 2, 1. As soon as batch2 arrives, the result of fetch_candidate_head would be an element from batch2 instead of one from batch1.
So I think for this case the algorithm is right.

However, if we assume batch1 = [1, 1] and batch2 = [2], and batch2 arrives after batch1 and after the first fetch_candidate_head call, the order of the outcome would be 1, 1, 2, which seems inconsistent with the idea of the algorithm: "The idea is that this algorithm is guaranteed to check potential head candidates in highest-to-lowest sorted order of score, with the second priority being oldest to most recent."
I have example code for this.
I think I should confirm the details of this with @vbuterin.


The intended principle is that you want to get logs in reverse logical order, so latest block to oldest block, and within a block last transaction to first transaction, and within a transaction last to first created.
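A small illustration of that ordering, using standard log-entry fields (an illustrative helper, not code from this PR):

def reverse_logical_order(logs):
    # Newest block first; within a block, the last transaction first; within a
    # transaction, the last-emitted log first.
    return sorted(
        logs,
        key=lambda log: (log['blockNumber'], log['transactionIndex'], log['logIndex']),
        reverse=True,
    )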

# is_new_head = true log
while True:
# TODO: currently just raise when there is no log anymore
self.unchecked_logs[shard_id].append(self.get_next_log(shard_id))
Member

Related to my comment about the logs potentially not being maintained in the right order, I think this is the offending line. It seems like the logs should be pushed onto the front of the list rather than appended to the end. If this is the case, then you might consider using a collections.deque, which has fast inserts and pops at both the beginning and end of the list.
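A tiny illustration of the deque idea (the log entries here are placeholders):

from collections import deque

unchecked_logs = deque()
for log_entry in ({'score': 1}, {'score': 2}, {'score': 3}):
    unchecked_logs.appendleft(log_entry)   # O(1) push onto the front, newest first
oldest = unchecked_logs.pop()              # O(1) pop of the oldest entry from the back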

Contributor Author

Sorry about the late reply. I need to think about this for a while tomorrow.
I will soon give an explanation and a plan for what to revise. Thank you for pointing these out!


def __init__(self, *args, default_privkey, **kwargs):
def __init__(self, *args, log_handler, default_privkey, **kwargs):
self.log_handler = log_handler
Member

I think you're going to need one log handler per shard. Since the log handler is stateful, each time you ask for more logs, it only returns the ones for the given shard_id. If you're tracking multiple shard_ids then I think you're going to miss logs.

Contributor Author

Ahh, I didn't think of this case. Thank you for catching that.
Will fix it.

Contributor Author

@mhchia mhchia Jan 21, 2018

Actually, I was thinking we could maintain one recent_block_hashes shared across all shards, but then we would need to store all logs from all shards. That doesn't seem better than keeping one log_handler per shard.

Member

Yeah, one handler per shard seems cleanest despite the added overhead. We can address the performance issues later if needed.


mhchia commented Jan 25, 2018

@pipermerriam
After discussing with @vbuterin, it seems fetch_candidate_head is fine to work that way, if I understood correctly.
A node calls fetch_candidate_head inside GUESS_HEAD after it finds that it is a collator in some period. At that time, the node can save the current state of new_logs, unchecked_logs, and current_score. If a new block arrives, it just restores them to the saved state, appends the new logs to new_logs, and re-runs GUESS_HEAD.
Is this scheme acceptable to you, and do you have any suggestions?

Thank you!

@mhchia mhchia force-pushed the sharding_vmc_handler_handle_logs branch from 953bf0a to 20989f0 Compare January 25, 2018 15:44
@pipermerriam
Member

I'm good with whatever you guys have settled on regarding the proper functionality.

# shard_id -> list
self.unchecked_logs[shard_id] = []
# shard_id -> score
self.current_score[shard_id] = None
Member

I think that this entire class would be cleaned up by implementing one more data structure to house all of the shard data.

class ShardTracker:
    shard_id = ...
    log_handler = ...
    new_logs = ...
    unchecked_logs = ...
    current_score = ...

I suspect that some, if not most of the business logic for updating the shard state could then be moved into that object at which point this class would just manage all of the individual instances of that class.

Thoughts? I think it would clean this class up in a significant way.

Contributor Author

@mhchia mhchia Jan 26, 2018

I think that is a good idea. I will do it.
Just two questions about it.

  1. Should I put ShardTracker in another file shard_tracker.py?
  2. Since all the logic regarding shard logs is moved to ShardTracker (including get_next_log and fetch_candidate_head), if there is a need for an API in VMCHandler like fetch_candidate_head, may I just add a method of the same name in VMCHandler that simply calls ShardTracker.fetch_candidate_head? (Though I think there might only be new methods like guess_head in VMC to be implemented in the future.)

Much appreciated!

Contributor Author

Moreover, may I request a bumpversion in web3.py, to resolve the error in CI?
Thank you!

Member

  1. It's probably fine to start with it here in this file or to put it in its own module. Up to you. shard_tracker.py seems like a fine choice as well.

  2. If this makes VMCHandler obsolete and you think you're able to do what you need with only the ShardTracker class, then it's probably a good idea to do away with it. Really depends on how these APIs will be used and whether you need an abstraction to track many shards and whether it still makes sense for that to be a class.

Contributor Author

@pipermerriam Thank you for the great suggestions! I have moved that logic to ShardTracker, currently in the same module (vmc_handler) as VMCHandler.
For now I just make get_next_log and fetch_candidate_head directly call the corresponding shard_tracker. I think I can decide what to remove and how to restructure ShardTracker and VMCHandler after GUESS_HEAD and CREATE_COLLATION are implemented, which will be done in the next stage.

@pipermerriam
Member

I believe you can bump the web3 version locally here in your branch to fix CI

Member

@pipermerriam pipermerriam left a comment

The only required changes are removing the use of mutable data structures as class attributes. Everything else is cleanup. This looks very close to mergeable.

@@ -40,6 +40,10 @@ class NextLogUnavailable(Exception):
pass


class ShardNotTracked(Exception):
Member

Alternate name suggestion: UnknownShard

Contributor Author

It's better, Thank you:)

self.log_handlers[shard_id] = log_handler
self.init_log_variables(shard_id)
self.new_logs = []
self.unchecked_logs = []
Member

These should be set in the constructor (and set to None here in the class definition). Mutable data structures as class attributes are bad, as they are shared across all instances of the class.
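A quick demonstration of the pitfall (illustrative class names):

class BadTracker:
    new_logs = []               # class attribute: one list shared by every instance


class GoodTracker:
    new_logs = None             # declared on the class for readability

    def __init__(self):
        self.new_logs = []      # per-instance list created in the constructor


a, b = BadTracker(), BadTracker()
a.new_logs.append('log-1')
assert b.new_logs == ['log-1']  # b sees a's entry -- the shared-state bug

c, d = GoodTracker(), GoodTracker()
c.new_logs.append('log-1')
assert d.new_logs == []         # each instance keeps its own list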

Contributor Author

I think they are already set in the constructor?
I think setting them to None here is a good idea.
Is the purpose of doing this to explicitly show their existence?

Member

Ignore my comment. I didn't see that this was within __init__.py (thought I was looking at the class body).

The reason for setting them to None on the class is entirely convention to improve readability. I find it helpful to be able to look at the class definition and get a full view of all of the various properties the class has. It's harder to gather that information if you have to dig it out of code, even simple code like this in the constructor.

self.unchecked_logs[shard_id].append(self.get_next_log(shard_id))
if self.unchecked_logs[shard_id][-1]['is_new_head'] is True:
self.unchecked_logs.append(self.get_next_log())
if self.unchecked_logs[-1]['is_new_head'] is True:
Member

If i'm reading this logic right.

  • retrieve the new log with get_next_log()
  • append to self.unchecked_logs
  • if newest log is_new_head == True
    • break from loop and pop off newest log.

I think this is simpler written as:

while True:
    log_entry = self.get_new_log()
    if log_entry['is_new_head']:
        break
    else:
        self.unchecked_logs.append(log_entry)

...

It avoids appending and then popping by only appending in the case that the new log isn't the new head.

Contributor Author

you're right. Thank you for pointing it out.
I've modified the code.


# TODO: currently just calls `shard_tracker.get_next_log`
def get_next_log(self, shard_id):
if shard_id not in self.shard_trackers:
Member

Instead of repeating this logic, I would suggest using self.get_shard_tracker(shard_id) here to retrieve the tracker.

Contributor Author

Thank you for pointing that out.
Will change it.


# TODO: currently just calls `shard_tracker.fetch_candidate_head`
def fetch_candidate_head(self, shard_id):
if shard_id not in self.shard_trackers:
Member

Same here about removing this logic in favor of using self.get_shard_tracker(shard_id)

Contributor Author

Will change it. Thank you:)

yield 'topics', topics

def get_new_logs(self, address=None, topics=None):
revoked_hashes, new_block_hashes = check_chain_head(
Member

I may have already asked this question. Does something need to be done with the revoked hashes as well? Should the tracker be doing anything with them to revert its state in the event that a log entry it has already processed is revoked?

Contributor Author

@mhchia mhchia Jan 31, 2018

Actually, I'm not sure about this. Please let me think about it more thoroughly. Sorry about that.
My current opinion: if we just want to mimic the behavior of the version using filter.get_new_entries, I think we don't need to handle the revocation.
Do you have any suggestions? Thank you!

Member

Unfortunately, I'm not familiar enough with the underlying algorithm to be able to provide a suggestion on what the right thing to do here is. I just noticed that the revoked data wasn't used and wanted to be sure that wasn't an oversight. Feel free to mark it with a TODO comment and open an issue to track it, and we can merge with the understanding that we'll need to address it down the road.

Contributor Author

That's fine. Thank you so much for keeping up with this part.
Yeah, you're right, it is worth keeping an eye on. I marked it with a TODO and opened issue #344.

Since most of them are test-related, we leave only `create_vmc_tx` and
`get_vmc_json` in vmc_utils.py, which specify the vmc tx and the
contract path, and move the others to `test_vmc_handler.py`
TODO
        - Still needs to test `fetch_candidate_head`
        - Refactor `fixture.py`
mhchia added 20 commits January 30, 2018 20:05
- Use event_signature_to_log_topic
- Modify parse_collation_added_data
- Change the range in `fetch_candidate_head` to reversed
TODO:
        - Add `filter_log`
And change `preprocess_block_param` to if-else
And fix defaultdict(None) to defaultdict(lambda: None)
new_collation_added_logs -> new_logs
unchecked_collation_added_logs -> unchecked_logs
current_checking_score -> current_score
- check_chain_head
- get_recent_block_hashes
- preprocess_block_param
- Maybe we can use one `log_handler` for all shards
- Re-run `GUESS_HEAD` when every new block arrives, to solve the problem
  of `fetch_candidate_head`
- Modify `fetch_candidate_head`, to avoid unnecessary `push` and `pop`
when the log is new head
- Explicitly declare the logs variables in ShardTracker
- Modify `ShardNotTracked` to `UnknownShard`
@mhchia mhchia force-pushed the sharding_vmc_handler_handle_logs branch from 8683a38 to 360ea86 Compare January 31, 2018 16:24

hwwhww commented Jan 31, 2018

Mind writing a brief document in the PR description?

to_canonical_address,
to_checksum_address,
)

from evm.vm.forks.spurious_dragon.transactions import (
SpuriousDragonTransaction,
Contributor

Just curious, why use SpuriousDragonTransaction instead of ByzantiumTransaction?

Contributor Author

Thank you for pointing that out!
I will change it.

logger = logging.getLogger("evm.chain.sharding.LogHandler")

def __init__(self, w3, history_size=256):
self.history_size = history_size
Contributor

For the LogHandler or higher-level (ShardTracker, VMC) APIs, is LogHandler.history_size a fixed capacity of previous block hashes/headers from the head block, set only at initialization?
If so, we may need a more flexible method that lets us query older block hashes (more than the previous 256 block headers) dynamically during the guess_head algorithm. The reason is that we want to verify as many collations as we can.

@mhchia mhchia merged commit 8171501 into ethereum:sharding Feb 2, 2018
@mhchia mhchia deleted the sharding_vmc_handler_handle_logs branch February 28, 2018 08:42