
Parallel computation of node transition probabilities #114

Open: wants to merge 5 commits into master

Conversation

instabaines

I implemented the parallel computation of the node transition probabilities using concurrent.futures.

Comment on lines 127 to 138
def _precompute_probabilities(self):
    """
    Precomputes transition probabilities for each node.
    """
    nodes = list(self.graph.nodes())
    if not self.quiet:
        nodes = tqdm(nodes, desc='Computing transition probabilities')

    with ThreadPoolExecutor(max_workers=self.workers) as executor:
        # One task per source node; iterating the completed futures
        # re-raises any exception a worker hit.
        futures = [executor.submit(self._compute_node_probabilities, source) for source in nodes]
        for future in as_completed(futures):
            future.result()
Owner

I'm a bit worried about this piece of code when trying to parallelize. Because of Q and P, to compute the transition probabilities we start at source, then go to current_node and compute the probabilities for that current_node. When you parallelize over source, two different sources can reach the same current_node, and one thread will overwrite the probabilities the other just calculated.

If I recall correctly, that is the original reason why I did not succeed in parallelizing this. I'm sure there is a way, but I'm not sure this is it.

Please tell me what you think about that.
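
For illustration, a minimal self-contained sketch of the hazard; the names are illustrative, not the library's actual internals. Two sources that share a neighbor both write an entry for the same current_node, and whichever thread writes last wins:

import networkx as nx
from concurrent.futures import ThreadPoolExecutor

graph = nx.path_graph(4)  # 0-1-2-3: sources 0 and 2 share neighbor 1
d_graph = {}              # shared state: current_node -> probabilities

def compute_node_probabilities(source):
    for current_node in graph.neighbors(source):
        # Two sources that reach the same current_node race here; the
        # last writer silently replaces the other's result.
        d_graph[current_node] = {'written_by': source}

with ThreadPoolExecutor(max_workers=2) as executor:
    list(executor.map(compute_node_probabilities, graph.nodes()))

print(d_graph[1])  # reflects only whichever source got there last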

Author

Yes, this is true. A thread-safe approach could work. I will try it and get back to you.

Owner

I'm not sure thread safety is the only issue. I'm concerned that even if a node is handled by only one thread while it updates its neighbors, on a later iteration another node that shares a neighbor with the previously computed node will overwrite what the previous iteration wrote, and this should be addressed.

Author

The approach I had in mind is a dictionary with one lock per node, to prevent the probabilities computed for a node from being overwritten by another thread. Let me know what you think.
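
A rough sketch of that idea, with illustrative names (not the PR's actual code):

import threading
import networkx as nx
from concurrent.futures import ThreadPoolExecutor

graph = nx.path_graph(4)
node_locks = {node: threading.Lock() for node in graph.nodes()}  # one lock per node
d_graph = {}

def compute_node_probabilities(source):
    for current_node in graph.neighbors(source):
        with node_locks[current_node]:
            # Only one thread at a time may touch this node's entry.
            d_graph.setdefault(current_node, {})[source] = 'placeholder probabilities'

with ThreadPoolExecutor(max_workers=2) as executor:
    list(executor.map(compute_node_probabilities, graph.nodes()))

Note that this serializes updates per node but does not, by itself, stop a later task from recomputing and replacing an entry; it only makes each individual update atomic.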

Owner

Interesting. Would this prevent overwriting in parallel, or would it mark a node as calculated and then prevent re-calculating and overwriting? I believe the latter is preferable, to make it faster.
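
The mark-then-skip variant could look roughly like this (an illustrative sketch with made-up names, assuming each node's entry only needs to be computed once):

import threading

computed = set()                    # nodes whose probabilities are already done
computed_lock = threading.Lock()
d_graph = {}

def compute_once(current_node):
    with computed_lock:
        if current_node in computed:
            return                  # already calculated: skip, do not overwrite
        computed.add(current_node)
    # Reached by exactly one thread per node, so nothing is ever recomputed.
    d_graph[current_node] = 'placeholder probabilities'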

Also, since it is hard to evaluate by only looking at the code, I think we need to do a sanity check: take an example graph, run it the old way and the new way, and make sure the results are quite similar by comparing the structure of the resulting embedding spaces, verifying that the same nodes have similar neighborhoods, etc.

Owner

It would be great if you could supply code like that, to demonstrate that the changes you introduce still keep the result quite similar.

Author

Yes, I agree we should do a sanity check.
How consistent should the output be, given that the embedding is not usually deterministic? As a quick check, I ran the model using the old method twice. Do you have a suggestion on how best to do the comparison?

Running the model the old way the first time: [screenshot]

Running the model the old way the second time: [screenshot]

Running the model using the new way: [screenshot]

Owner

This is the output of which command?

To do a sanity check, I would create an "obvious graph" where one node is very close to a neighborhood and some other nodes are really far away. Then the sanity check should show that, in both the old and the new run, the far nodes do not have that close neighborhood in their vicinity.
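
One possible shape for such an "obvious graph" (an assumption for illustration, not test code from this repo):

import networkx as nx

graph = nx.complete_graph(5)                             # nodes 0-4: a tight neighborhood
graph.add_edges_from((i, i + 1) for i in range(4, 15))   # nodes 5-15: a long, far-away chain

# With both implementations, nodes 5-15 should not appear among the
# most similar nodes to members of the 0-4 clique.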

Author

Hi, I was on a break.
I will try to implement your suggestion.

The previous output is from these commands:

# `node2vec` is a Node2Vec instance constructed on the example graph

# Embed nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Look for most similar nodes
model.wv.most_similar('2')

Owner

I would:

  1. Generate a graph, but not a uniformly random one: one that has some kind of asymmetric structure. Save it so we can use it for testing.
  2. Run the old model, choose a random node number, let's say 7, and look for the 3 most similar and 3 least similar nodes to 7. Save them and their distances.
  3. Run the old model 100 times (in a loop), from scratch each time, on the same graph; for each iteration, save the distances to the 3 most similar and 3 least similar nodes we got from step 2. Now we have a distribution of the expected distance from each of the 6 nodes to node number 7.
  4. Run the new model, check the distances from 7 to the 6 nodes, and see that statistically they are likely to come from the same distribution (see the sketch after this list).
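
A hedged sketch of steps 2 through 4, using cosine similarity from the gensim word vectors as the distance measure; fit_old and fit_new are hypothetical helpers that run the sequential and the parallel implementation on the saved graph:

import numpy as np

probe, k = '7', 3

# Step 2: rank all other nodes by similarity to the probe with the old model.
old_model = fit_old(graph)  # hypothetical helper: fit using the sequential code
ranked = old_model.wv.most_similar(probe, topn=graph.number_of_nodes() - 1)
targets = [n for n, _ in ranked[:k]] + [n for n, _ in ranked[-k:]]

# Step 3: refit the old model 100 times to get a distribution of similarities.
samples = {n: [] for n in targets}
for _ in range(100):
    m = fit_old(graph)
    for n in targets:
        samples[n].append(m.wv.similarity(probe, n))

# Step 4: the new model's similarities should fall inside, say, the central
# 95% of each empirical distribution.
new_model = fit_new(graph)  # hypothetical helper: fit using the parallel code
for n in targets:
    lo, hi = np.percentile(samples[n], [2.5, 97.5])
    assert lo <= new_model.wv.similarity(probe, n) <= hi, f'node {n} out of range'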
