Parallel computation of node transition probabilities #114
Conversation
node2vec/node2vec.py
```python
def _precompute_probabilities(self):
    """
    Precomputes transition probabilities for each node.
    """
    nodes = list(self.graph.nodes())
    if not self.quiet:
        nodes = tqdm(nodes, desc='Computing transition probabilities')

    with ThreadPoolExecutor(max_workers=self.workers) as executor:
        futures = [executor.submit(self._compute_node_probabilities, source) for source in nodes]
        for future in as_completed(futures):
            future.result()
```
I'm a bit worried about this piece of code when trying to parallelize. Because of Q and P, essentially to compute transition probabilities we start with `source`, then we go to `current_node` and compute the probabilities for this `current_node`. When you parallelize over `source`, you can mistakenly reach the same `current_node` from two different `source`s and overwrite previously calculated probabilities.
If I recall correctly, that is the original reason why I did not succeed in parallelizing this. I'm sure there is a way, but I'm not sure this is it.
Please tell me what you think about that.
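To make the failure mode concrete, here is a minimal stdlib sketch; the dict name `d_graph` and the toy graph are made up for illustration, loosely modeled on probabilities being keyed by the walked-to node rather than the source:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shared store: probabilities keyed by the walked-to node
# (current_node), not by the source the walk started from.
d_graph = {}

def compute_from_source(source, neighbors):
    # Two sources that share a neighbor both write that neighbor's entry,
    # so whichever thread runs last silently overwrites the other.
    for current_node in neighbors:
        d_graph[current_node] = f'probs computed via source {source}'

graph = {'a': ['x'], 'b': ['x']}   # 'x' is a common neighbor of 'a' and 'b'
with ThreadPoolExecutor(max_workers=2) as ex:
    for s, nbrs in graph.items():
        ex.submit(compute_from_source, s, nbrs)
```

After this runs, `d_graph['x']` holds whichever source's write landed last, which is exactly the nondeterministic overwrite described above.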
Yes, this is true. A thread-safe approach could work. I will try it and report back.
I'm not sure thread safety is the only issue. I'm concerned that even if a node is handled by only one thread, then as it updates its neighbors, another node on a later iteration that shares a common neighbor with the previously computed node will overwrite what the previous iteration produced, and this should be addressed.
The approach I had in mind is using a dictionary of locks for each node to prevent the probabilities computed for a node from being overwritten by another thread. Let me know what you think
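A rough sketch of that per-node lock dictionary, using only stdlib `threading`; all names here are hypothetical, not the library's actual internals:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

shared_probs = {}                # node -> its transition probabilities
node_locks = {}                  # node -> its dedicated lock
locks_guard = threading.Lock()   # protects lazy creation of per-node locks

def lock_for(node):
    # Guard first-time lock creation with one global lock so two threads
    # can't race to create different locks for the same node.
    with locks_guard:
        return node_locks.setdefault(node, threading.Lock())

def compute_from_source(source, neighbors):
    for current_node in neighbors:
        with lock_for(current_node):
            # Holding the node's lock, write only if nobody computed it yet,
            # so an earlier result is never overwritten.
            shared_probs.setdefault(current_node, f'computed once, first via {source}')

graph = {'a': ['x', 'y'], 'b': ['x', 'z']}   # 'x' is shared between sources
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(compute_from_source, s, nbrs) for s, nbrs in graph.items()]
    for f in as_completed(futures):
        f.result()   # surface any worker exceptions
```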
Interesting, would this prevent overwriting in parallel, or will it mark a node as calculated and then prevent re-calculating and overwriting? I believe the latter is preferable, to make it faster.
Also, since this is hard to evaluate by only looking at the code, I think we need to do a sanity check: take an example graph, run it the old way and the new way, and make sure the results are quite similar by comparing the structure of the resulting embedding space, verifying that the same nodes have similar neighborhoods, etc.
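The "mark a node as calculated, then skip" variant could be sketched like this (again with hypothetical names): one global lock guards only the tiny claim step, so each node's probabilities are computed exactly once and never recomputed or overwritten:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

probs = {}                       # node -> its transition probabilities
computed = set()                 # nodes already claimed by some thread
claim_guard = threading.Lock()   # makes check-and-mark atomic

def claim(node):
    # Atomically mark the node as calculated; False means another thread
    # already owns it, so the caller skips the (expensive) recomputation.
    with claim_guard:
        if node in computed:
            return False
        computed.add(node)
        return True

def compute_from_source(source, neighbors):
    for current_node in neighbors:
        if claim(current_node):
            probs[current_node] = f'probs for {current_node}'

graph = {'a': ['x', 'y'], 'b': ['x', 'z']}
with ThreadPoolExecutor(max_workers=2) as ex:
    for s, nbrs in graph.items():
        ex.submit(compute_from_source, s, nbrs)
# leaving the with-block waits for all workers to finish
```

Compared to holding a per-node lock around every write, losing the claim just skips the work, which is where the speed-up would come from.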
It would be great if you could supply code like that to demonstrate that the changes that you introduce still keep the result quite similar
Yes, I agree we should do a sanity check.
How consistent should the output be, given that the embedding is not usually deterministic? I did a quick check and ran the model using the old method twice. Do you have a suggestion on how best to do the comparison?
Running the model the old way the first time
This is the output of which command?
To do a sanity check, I would create an "obvious graph" where one node is very close to a neighborhood and some other nodes are really far away. The sanity check should then show that in both cases the far nodes do not have, in their vicinity, the neighborhood that is close to the one node.
Hi, I was on break.
I will try to implement your suggestion.
The previous output is from these commands:
```python
# Embed nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Look for most similar nodes
model.wv.most_similar('2')
```
I would:
1. Generate a graph, but not a uniformly random one: one that has some kind of asymmetric structure. Save it so we can use it for testing.
2. Run the old model, choose a random node number, let's say `7`, and look for the 3 most similar and 3 least similar nodes to `7`. Save them and their distances.
3. Run the old model 100 times (in a loop), from scratch each time, on the same graph; for each iteration save the distances to the 3 most similar and 3 least similar nodes we got from step 2. Now we have a distribution of expected distances from each of the 6 nodes to node `7`.
4. Run the new model, check the distances from `7` to the 6 nodes, and see that statistically it is likely they come from the same distribution.
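The final statistical check could be sketched with a simple z-score rule over the sampled distances; the threshold and the sample values below are illustrative assumptions, not real measurements:

```python
import statistics

def within_expected_range(samples, new_value, z_max=3.0):
    # Accept the new model's distance if it lies within z_max standard
    # deviations of the old model's sampled distances for the same pair.
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return new_value == mean
    return abs(new_value - mean) / stdev <= z_max

# e.g. distances from node 7 to one of its 3 most similar nodes,
# collected across repeated runs of the old model (made-up numbers)
old_runs = [0.91, 0.93, 0.90, 0.92, 0.94, 0.89, 0.92, 0.93, 0.91, 0.90]
within_expected_range(old_runs, 0.92)   # consistent with the old runs
within_expected_range(old_runs, 0.40)   # far outside the sampled range
```

A proper two-sample test would be stricter, but a z-score cutoff per node pair is probably enough to catch the gross overwriting bugs discussed above.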
I implemented the parallel computation of the node transition probabilities using concurrent.futures.