Parallelism for makeIdentifiers #11

Open
hayesall opened this issue Jul 20, 2018 · 1 comment

@hayesall
Member

A large amount of the running time tends to be spent in parse.makeIdentifiers(), which is essentially a triple-nested for loop over blocks, sentences, and words.

Previously this was "resolved" by wrapping the outer loop with tqdm to estimate how long the process would take. This did not actually change anything but likely would make someone feel better about the situation.


joblib may be a viable way to execute the outer loop in parallel:

from joblib import Parallel, delayed
from tqdm import tqdm

def foo(block, blockID):
    """
    :param block: The current block to be processed (list of lists).
    :param blockID: Index of the current block (int).
    """
    return [blockID]

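# Placeholder data: in practice this is the list of blocks generated earlier.
Blocks = list(range(5000))
# One task per block; n_jobs=-1 uses every available core.
facts = Parallel(n_jobs=-1)(delayed(foo)(block, i) for i, block in enumerate(tqdm(Blocks)))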

In the short example above, Blocks stands in for the list of blocks generated earlier. foo(block, blockID) would be similar to the current parse.makeIdentifiers() method, except that blockID is passed in as a parameter instead of being a counter that is incremented at the end of each outer-loop iteration.
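A slightly fuller sketch of that idea, under some assumptions: each block is taken to be a list of sentences and each sentence a list of words, and the worker simply returns (block, sentence, word) index triples. The names identifiers_for_block, identifiers_sequential, and identifiers_parallel are hypothetical, and the returned triples are a stand-in for whatever parse.makeIdentifiers() actually produces.

```python
from itertools import chain


def identifiers_for_block(block, block_id):
    """Return (block, sentence, word) index triples for one block.

    Hypothetical stand-in for the body of the outer loop. Assumes `block`
    is a list of sentences and each sentence is a list of words.
    """
    ids = []
    for sentence_id, sentence in enumerate(block, start=1):
        for word_id, _word in enumerate(sentence, start=1):
            ids.append((block_id, sentence_id, word_id))
    return ids


def identifiers_sequential(blocks):
    """Reference version: process the blocks one after another."""
    per_block = [
        identifiers_for_block(block, i) for i, block in enumerate(blocks, start=1)
    ]
    return list(chain.from_iterable(per_block))


def identifiers_parallel(blocks, n_jobs=-1):
    """Same result as the sequential version, with one joblib task per block."""
    # Imported lazily so the functions above stay usable without joblib.
    from joblib import Parallel, delayed

    per_block = Parallel(n_jobs=n_jobs)(
        delayed(identifiers_for_block)(block, i)
        for i, block in enumerate(blocks, start=1)
    )
    # joblib returns results in submission order, so flattening keeps
    # the identifiers in the same order as the sequential version.
    return list(chain.from_iterable(per_block))
```

Because the worker receives its block index as an argument and touches no shared counter, the blocks can be processed in any order and the sequential version doubles as a check on the parallel one.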

@hayesall
Member Author

Current progress is on batflyer/rnlp (parallel). I did a short round of testing to estimate the sort of performance gains that we might expect, graphed below.

Both plots used the same corpus and were run on my local machine.

  • The top graph uses blockSize=1.
  • The bottom graph uses blockSize=2.
  • The x-axis varies the number of cores.
  • The y-axis shows the time (in seconds) taken to process the blocks.

[Figure: time_vs_cores (processing time in seconds against number of cores, for blockSize=1 and blockSize=2)]
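For anyone wanting to repeat this kind of measurement, a minimal timing harness might look like the sketch below. It is not the script used for the plots above; time_by_cores and its parameters are hypothetical names, and run is assumed to be any callable that takes a core count and performs the work (for instance, a wrapper around the Parallel(n_jobs=...) call from the first comment).

```python
import time


def time_by_cores(run, core_counts, repeats=3):
    """Return {n_cores: best wall-clock seconds} for calling run(n_cores).

    The best of `repeats` runs is kept to reduce noise from other
    processes on the machine.
    """
    timings = {}
    for n_cores in core_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(n_cores)
            best = min(best, time.perf_counter() - start)
        timings[n_cores] = best
    return timings
```

The resulting dictionary maps directly onto the axes described above: keys on the x-axis (cores), values on the y-axis (seconds).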
