Seed random number generators #70

Open
eric-czech opened this issue Feb 11, 2022 · 10 comments
Labels
help wanted Extra attention is needed

Comments

@eric-czech

Awesome project @RemyLau, thanks for sharing your work on it!

Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.

I haven't tried it yet, but I'd assume after a brief look at the code that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True) though. Have you already figured out how to make that work?

@RemyLau
Contributor

RemyLau commented Feb 11, 2022

Hi @eric-czech, thanks for the issue! This is an excellent suggestion and good practice for reproducibility! I've wanted to implement this as well (#23), but as you said, I'm not exactly sure how it will work with @njit(parallel=True). I'll give it a try and see!

@RemyLau RemyLau linked a pull request Feb 11, 2022 that will close this issue
@RemyLau
Contributor

RemyLau commented Feb 11, 2022

Hi @eric-czech, I've created a PR that introduces the random state option for random walk generation #71. Could you check that out and see if that is sufficient?

@eric-czech
Author

Thanks @RemyLau! I'll give it a try soon and report back.

@RemyLau RemyLau removed a link to a pull request Feb 12, 2022
@eric-czech
Author

No good on #71 unfortunately (with a caveat):

import networkx as nx
import pandas as pd
g = nx.random_geometric_graph(100, .1)
pd.DataFrame([
    dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
    for e in g.edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)

# Run once (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
# Run a second time (2 workers)
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 2
!cmp /tmp/edges1.emb /tmp/edges2.emb
# /tmp/edges1.emb /tmp/edges2.emb differ: byte 8, line 2

But it does work with only one worker now, which wasn't the case before:

!pecanpy --input /tmp/edges.csv --output /tmp/edges1.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!pecanpy --input /tmp/edges.csv --output /tmp/edges2.emb --task pecanpy --dimensions 64 --mode FirstOrderUnweighted --random_state 1 --workers 1
!cmp /tmp/edges1.emb /tmp/edges2.emb
# All good

Do you know if numba will let you pass a RandomState as an input (for node2vec_walks)? I wonder if that would work.

@eric-czech
Author

eric-czech commented Feb 16, 2022

🤔 actually I don't think that would work either even if it did let you do that (which it doesn't).

@RemyLau
Contributor

RemyLau commented Feb 16, 2022

@eric-czech I think this might be an issue with gensim word2vec rather than with the random walk generation. I've explicitly tested the reproducibility of the walks. I'm happy to find out how to control the gensim word2vec random seed as well, but before that, could you check whether the walks (not necessarily the final embeddings) are consistent between runs?

To generate walks, you could use the following (hopefully bug-free 🤞) code snippet:

from pecanpy import pecanpy

g = pecanpy.FirstOrderUnweighted(random_state=1)
g.read_edg(path_to_edg, weighted=False, directed=False)
walks = g.simulate_walks(num_walks=10, walk_length=80)

@RemyLau
Contributor

RemyLau commented Feb 16, 2022

As for the gensim word2vec random state, there's a seed parameter that we can set for this purpose. However, the gensim docs note that to get fully deterministic, reproducible results, we need to do two things:

  1. Use a single thread
  2. Set the PYTHONHASHSEED environment variable before launching Python

I think for now I'll just set the seed parameter to the value passed to the pecanpy CLI, and leave it up to the user to do the two things above... I'll see if I can get consistent results by doing these.
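
To illustrate the second requirement, here is a minimal sketch (not part of PecanPy or gensim) showing that PYTHONHASHSEED has to be set in the environment before the interpreter starts. The helper name `string_hash_in_child` and the node label `N42` are made up for this example. It launches a fresh Python process with the variable fixed and prints the hash of a string, which is the quantity that otherwise changes from one process to the next.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str) -> str:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('N42')."""
    # Copy the current environment and pin the hash seed for the child only.
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('N42'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Two separate processes launched with the same hash seed agree on the hash.
first = string_hash_in_child("0")
second = string_hash_in_child("0")
print(first == second)
```

In a shell, the equivalent is to prefix the command with the variable assignment, for example `PYTHONHASHSEED=0 pecanpy ...`, with the remaining arguments as in the commands earlier in this thread. Setting `os.environ["PYTHONHASHSEED"]` inside an already running process does not change that process's string hashing.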

@eric-czech
Author

> To generate walks, you could use the following (hopefully bug-free 🤞) code snippet

Gave it a shot but no luck. I tried:

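import networkx as nx
import pandas as pd
from pecanpy import pecanpy

# Write a random graph as a tab-separated edge list
pd.DataFrame([
    dict(n1=f'N{e[0]}', n2=f'N{e[1]}')
    for e in nx.random_geometric_graph(100, .1).edges
]).to_csv('/tmp/edges.csv', sep='\t', index=False, header=False)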

g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks1 = g.simulate_walks(num_walks=10, walk_length=80)

g = pecanpy.FirstOrderUnweighted(random_state=0, workers=1)
g.read_edg('/tmp/edges.csv', weighted=False, directed=False)
walks2 = g.simulate_walks(num_walks=10, walk_length=80)

(pd.Series(walks1) == pd.Series(walks2)).value_counts()
# False    900
# True      40
# dtype: int64

If I dump that into a script that just generates the walks from a pre-existing /tmp/edges.csv, then I get identical walks with numba.set_num_threads(1) at the beginning of the script. It doesn't work with more than one worker/thread, though.

> I've explicitly tested the reproducibility of the walks

Makes sense given set_num_threads(1).

Overall I suppose it's not that big of a deal if the Word2Vec part can't be parallelized. Thanks for checking that in the docs. It's a bummer though!

Let me know if you find anything else but feel free to close this otherwise.

@eric-czech
Author

For posterity, my example above with two runs in the same Python process does work if numba.set_num_threads(1) is called first.

@RemyLau
Contributor

RemyLau commented Feb 16, 2022

Thanks a lot, @eric-czech! For the time being, I haven't come up with a good solution to this reproducibility issue with multi-threading. I'll keep this issue open for now, and hopefully I'll find something later to mitigate it (at least for the random walk part).
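
One possible direction, sketched here in plain Python rather than numba and not taken from PecanPy's code, is to give every walk its own generator seeded from the base seed and the walk index. The walk contents then no longer depend on which thread runs which walk. Numba documents that each thread has its own independent random stream, which is consistent with a single seed not fixing the multi-threaded output. The toy graph, the function names, and the seed-mixing constant below are all hypothetical, and whether the same idea carries over to @njit(parallel=True) is untested here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy undirected graph as an adjacency list (hypothetical data for illustration).
GRAPH = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}


def random_walk(base_seed, walk_index, start, length):
    """First-order unweighted walk driven by a generator private to this walk."""
    # The seed depends only on (base_seed, walk_index), never on the thread.
    rng = random.Random(base_seed * 1_000_003 + walk_index)
    path = [start]
    while len(path) < length:
        path.append(rng.choice(GRAPH[path[-1]]))
    return path


def simulate_walks(base_seed, num_walks, walk_length, workers):
    """Run all walks on a thread pool and return them in walk-index order."""
    starts = sorted(GRAPH)
    tasks = [
        (base_seed, i, starts[i % len(starts)], walk_length)
        for i in range(num_walks)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in task order regardless of thread scheduling.
        return list(pool.map(lambda task: random_walk(*task), tasks))


single = simulate_walks(base_seed=1, num_walks=20, walk_length=10, workers=1)
multi = simulate_walks(base_seed=1, num_walks=20, walk_length=10, workers=4)
print(single == multi)
```

The trade-off is one generator initialization per walk, which is cheap next to the walk itself for the walk lengths used in this thread. This only addresses the walk generation step; the word2vec training step would still need a single worker to be deterministic, as noted above.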

@RemyLau RemyLau added the help wanted Extra attention is needed label Feb 16, 2022