Parallel computation of node transition probabilities #114
Conversation
node2vec/node2vec.py
```python
def _precompute_probabilities(self):
    """
    Precomputes transition probabilities for each node.
    """
    nodes = list(self.graph.nodes())
    if not self.quiet:
        nodes = tqdm(nodes, desc='Computing transition probabilities')

    with ThreadPoolExecutor(max_workers=self.workers) as executor:
        futures = [executor.submit(self._compute_node_probabilities, source) for source in nodes]
        for future in as_completed(futures):
            future.result()
```
I'm a bit worried about this piece of code when trying to parallelize. Because of Q and P, essentially to compute transition probabilities we start with `source`, then we go to `current_node` and compute the probabilities for this `current_node`. When you parallelize over `source`, you can mistakenly reach the same `current_node` from two different `source`s and overwrite previously calculated probabilities.
If I recall correctly, that is the original reason why I did not succeed in parallelizing this. I'm sure there is a way, but I'm not sure this is it.
Please tell me what you think about that.
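To make the failure mode concrete, here is a minimal stdlib sketch; the dict name `d_graph` and the toy graph are made up for illustration, loosely modeled on probabilities being keyed by the walked-to node rather than the source:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shared store: probabilities keyed by the walked-to node
# (current_node), not by the source the walk started from.
d_graph = {}

def compute_from_source(source, neighbors):
    # Two sources that share a neighbor both write that neighbor's entry,
    # so whichever thread runs last silently overwrites the other.
    for current_node in neighbors:
        d_graph[current_node] = f'probs computed via source {source}'

graph = {'a': ['x'], 'b': ['x']}   # 'x' is a common neighbor of 'a' and 'b'
with ThreadPoolExecutor(max_workers=2) as ex:
    for s, nbrs in graph.items():
        ex.submit(compute_from_source, s, nbrs)
```

After this runs, `d_graph['x']` holds whichever source's write landed last, which is exactly the nondeterministic overwrite described above.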
Yes, this is true. A thread-safe approach could work. I will try it and report back.
I'm not sure thread safety is the only issue. I'm concerned that even if a node is handled by only one thread, then as it updates its neighbors, another node on a later iteration that shares a common neighbor with the previously computed node will overwrite what the previous iteration produced, and this should be addressed.
The approach I had in mind is using a dictionary of locks for each node to prevent the probabilities computed for a node from being overwritten by another thread. Let me know what you think
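A rough sketch of that per-node lock dictionary, using only stdlib `threading`; all names here are hypothetical, not the library's actual internals:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

shared_probs = {}                # node -> its transition probabilities
node_locks = {}                  # node -> its dedicated lock
locks_guard = threading.Lock()   # protects lazy creation of per-node locks

def lock_for(node):
    # Guard first-time lock creation with one global lock so two threads
    # can't race to create different locks for the same node.
    with locks_guard:
        return node_locks.setdefault(node, threading.Lock())

def compute_from_source(source, neighbors):
    for current_node in neighbors:
        with lock_for(current_node):
            # Holding the node's lock, write only if nobody computed it yet,
            # so an earlier result is never overwritten.
            shared_probs.setdefault(current_node, f'computed once, first via {source}')

graph = {'a': ['x', 'y'], 'b': ['x', 'z']}   # 'x' is shared between sources
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(compute_from_source, s, nbrs) for s, nbrs in graph.items()]
    for f in as_completed(futures):
        f.result()   # surface any worker exceptions
```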
Interesting, would this prevent overwriting in parallel, or will it mark a node as calculated and then prevent re-calculating and overwriting? I believe the latter is preferable, to make it faster.
Also, since this is hard to evaluate by only looking at the code, I think we need to do a sanity check: take an example graph, run it the old way and the new way, and make sure the results are quite similar by comparing the structure of the resulting embedding space, verifying that the same nodes have similar neighborhoods, etc.
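The "mark a node as calculated, then skip" variant could be sketched like this (again with hypothetical names): one global lock guards only the tiny claim step, so each node's probabilities are computed exactly once and never recomputed or overwritten:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

probs = {}                       # node -> its transition probabilities
computed = set()                 # nodes already claimed by some thread
claim_guard = threading.Lock()   # makes check-and-mark atomic

def claim(node):
    # Atomically mark the node as calculated; False means another thread
    # already owns it, so the caller skips the (expensive) recomputation.
    with claim_guard:
        if node in computed:
            return False
        computed.add(node)
        return True

def compute_from_source(source, neighbors):
    for current_node in neighbors:
        if claim(current_node):
            probs[current_node] = f'probs for {current_node}'

graph = {'a': ['x', 'y'], 'b': ['x', 'z']}
with ThreadPoolExecutor(max_workers=2) as ex:
    for s, nbrs in graph.items():
        ex.submit(compute_from_source, s, nbrs)
# leaving the with-block waits for all workers to finish
```

Compared to holding a per-node lock around every write, losing the claim just skips the work, which is where the speed-up would come from.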
It would be great if you could supply code like that to demonstrate that the changes that you introduce still keep the result quite similar
Yes, I agree we should do a sanity check.
How consistent should the output be, given that the embedding is not usually deterministic? I did a quick check and ran the model using the old method twice. Do you have a suggestion on how best to do the comparison?
Running the model the old way the first time
This is the output of which command?
To do a sanity check, I would create an "obvious graph" where one node is very close to a neighborhood and some other nodes are really far away. The sanity check should then show that in both cases the far nodes do not have, in their vicinity, the neighborhood that is close to the one node.
Hi, I was on break.
I will try to implement your suggestion.
The previous output is from these commands:
```python
# Embed nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Look for most similar nodes
model.wv.most_similar('2')
```
I would:
1. Generate a graph, but not a uniformly random one: one that has some kind of asymmetric structure. Save it so we can use it for testing.
2. Run the old model, choose a random node number, let's say `7`, and look for the 3 most similar and 3 least similar nodes to `7`. Save them and their distances.
3. Run the old model 100 times (in a loop), from scratch each time, on the same graph; for each iteration save the distances to the 3 most similar and 3 least similar nodes we got from step 2. Now we have a distribution of expected distances from each of the 6 nodes to node `7`.
4. Run the new model, check the distances from `7` to the 6 nodes, and see that statistically it is likely they come from the same distribution.
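The final statistical check could be sketched with a simple z-score rule over the sampled distances; the threshold and the sample values below are illustrative assumptions, not real measurements:

```python
import statistics

def within_expected_range(samples, new_value, z_max=3.0):
    # Accept the new model's distance if it lies within z_max standard
    # deviations of the old model's sampled distances for the same pair.
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return new_value == mean
    return abs(new_value - mean) / stdev <= z_max

# e.g. distances from node 7 to one of its 3 most similar nodes,
# collected across repeated runs of the old model (made-up numbers)
old_runs = [0.91, 0.93, 0.90, 0.92, 0.94, 0.89, 0.92, 0.93, 0.91, 0.90]
within_expected_range(old_runs, 0.92)   # consistent with the old runs
within_expected_range(old_runs, 0.40)   # far outside the sampled range
```

A proper two-sample test would be stricter, but a z-score cutoff per node pair is probably enough to catch the gross overwriting bugs discussed above.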
I implemented the parallel computation of the node transition probabilities using concurrent.futures.