Add Multiprocess execution #82
base: master
Conversation
I have run the exact same tests as Travis on a couple of different machines without any problem...
What's surprising is that both failed tests have exactly the same error.
The test is also OK on my laptop... Could this error come from a different library version? Travis uses the most recent packages and we may have older ones.
Looks like forcing it to use only one process works for now... At least I had it run on Travis twice with success.
Hi @eSoares,
For your information, here is the multiprocessing I use, which gives much better speedup and scalability. Indeed, each core computes a whole SNR vs BER curve and then has more or less the same computation time. I use p_tqdm. As explained earlier, I have to turn off numpy's multithreading.

```python
from commpy.links import link_performance
from p_tqdm import p_map

# Models as a list
models = []
# Define each model to test here...

# Helper function
def perf(model):
    return link_performance(model, SNRs, nb_it, nb_err, chunk, code_rate)

BERs = p_map(perf, models, num_cpus=min(12, nb_test))
```

This is only my point of view. I am fully open to your opinions and arguments.
Hello Bastian, I agree with some of your concerns.
I can try to set the seed inside each process; that should make the multiprocessing reproducible and more stable. If this PR goes forward, I will make this change.
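A minimal sketch of what per-process seeding could look like, assuming one worker per SNR point; the helper name, the seed-per-SNR scheme, and the use of numpy's legacy global RNG are illustrative assumptions, not code from this PR:

```python
import numpy as np
from multiprocessing import Pool

def simulate_one_snr(args):
    # Hypothetical worker: seed numpy deterministically before simulating one SNR,
    # so results are reproducible regardless of process scheduling.
    snr_index, snr_db, base_seed = args
    np.random.seed(base_seed + snr_index)  # fixed per-SNR seed
    # ... run the link simulation for this SNR point here ...
    return snr_db, np.random.rand()        # placeholder result

if __name__ == "__main__":
    SNRs = [0.0, 2.0, 4.0, 6.0]
    tasks = [(i, snr, 12345) for i, snr in enumerate(SNRs)]
    with Pool() as pool:
        print(pool.map(simulate_one_snr, tasks))
```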
Did not know about this library, but it looks like a great improvement! Will use it in the future! 👍
I guess the best strategy would be to parallelize the iterations inside link_performance.
Well, it is up for debate whether such elements should exist in the library or not. Even if they are not in the core part, they should at least exist in some examples.
From what I found, it is particularly problematic with matrix multiplication [1] (which is not used?), but the rest of the operations were not problematic. But I'm no expert in mixing numpy with multiprocessing, so I'll take your word that it should be used with care :)
Thank you, it is a nice library that I did not know of. 😃 About your example code in particular, that is kind of what I do, but instead of parallelizing multiple models (since I only want to test one), I parallelize multiple SNRs (and MCS in the case of WiFi). In the end, the answer to "what makes sense to parallelize" depends on what you're comparing and, of course, how many cores you can throw at it. If you have enough cores, parallelizing each SNR may take more advantage of the available computing. Still, even if this PR doesn't move forward, I think that at least an example and a set of instructions on being careful with numpy and multiprocessing should come out of it, so that less knowledgeable people (like myself) don't make the mistakes you are pointing out! 😃

1 - https://stackoverflow.com/a/15647474
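For reference, one common way to apply the numpy workaround mentioned in this thread is to cap the BLAS thread pools before spawning worker processes. The snippet below is a generic sketch using standard environment variables, not something taken from this PR:

```python
import os

# Limit the threads used by common BLAS backends *before* importing numpy,
# so each worker process does not spawn its own pool of BLAS threads.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from multiprocessing import Pool

def heavy_matmul(size):
    # Matrix multiplication is the operation reported as problematic in [1].
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return float(np.dot(a, b).sum())

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(heavy_matmul, [200, 200, 200, 200]))
```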
Do you think that it's related to the seed? Here are the results for the erroneous test from the Travis logs:
I agree that this should be somewhere either as a core element or as an example with guidelines. The latter looks better to me since this would be the place to explain the workaround with numpy's multithreading. However, adding multiprocessing as a core element is not irrational at all. @veeresht, do you have a long-term opinion?
I ran into the problem when propagating through these lines (376 to 379 at commit 2a60fc5).
Same here, but I redirect the progress bar to a log file so that I have a costless insight into the computation time of each iteration.
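For anyone reading along, redirecting a progress bar to a log file can be done with tqdm's file argument; this is a generic sketch, not necessarily the exact setup used in the comment above:

```python
from time import sleep
from tqdm import tqdm

# Write the progress bar to a log file instead of the terminal, so a long
# simulation can be monitored without keeping a console attached.
with open("progress.log", "w") as log_file:
    for _ in tqdm(range(100), file=log_file, mininterval=1.0):
        sleep(0.01)  # placeholder for one simulation iteration
```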
If you test several MCS at once, it means that you're testing several models at once. In CommPy, a ...

Anyway, if you have enough cores to parallelize on each model and still have some cores available, you can split the workload on the iterations:

```python
# Create a tuple of models
models = (model,) * nb_core

# Helper function
def perf(model):
    return link_performance(model, SNRs, nb_it // nb_core, nb_err // nb_core, chunk, code_rate)

# Split the computation and aggregate the result
partial_BERs = p_map(perf, models)
np.array(partial_BERs).mean(0)
```
Adds a LinkModel that performs the simulation with multiple processes, one process per SNR.
The implementation works as a drop-in replacement: you just need to import multiprocess_links instead of links.
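A rough sketch of the drop-in usage described above; the module path commpy.multiprocess_links and the commented-out argument names are assumptions based on CommPy's existing links API, not verified against this PR:

```python
# Before: from commpy.links import LinkModel, link_performance
# After: only the import changes, the rest of the script stays the same.
from commpy.multiprocess_links import LinkModel, link_performance

# Model construction and the call are unchanged, e.g. (placeholder arguments):
# model = LinkModel(modulate, channel, receive, num_bits_symbol, constellation, Es)
# BERs = link_performance(model, SNRs, send_max, err_min, send_chunk, code_rate)
```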
Also adds a multiprocess Wifi80211, which can run one process per SNR, or go further and run one process per MCS × SNR combination. An example of usage is available!
This multiprocessing helps a lot: in the included example, the classic single-process run takes about 3 min 7 s, while the multiprocess run (on my 8-core work machine) takes about 40 seconds.
Because of the limitations of multiprocessing in Python, any parameter passed to the process that executes the simulation needs to be pickle-serializable. This causes some problems when passing custom functions for receiving, decoding, and so on, but as long as they are defined at the top level of the file and not nested (a function defined inside another function), they should work.
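A small illustration of the pickling constraint described above; the function names are made up for the example:

```python
import pickle

# Top-level function: picklable, so it can be passed to a worker process.
def my_receiver(received_symbols):
    return received_symbols  # placeholder for a real demodulation/decoding step

def build_nested_receiver():
    # Nested function: pickle cannot locate it by its qualified name,
    # so multiprocessing will refuse to send it to a worker.
    def nested_receiver(received_symbols):
        return received_symbols
    return nested_receiver

pickle.dumps(my_receiver)  # works
try:
    pickle.dumps(build_nested_receiver())
except (AttributeError, pickle.PicklingError) as err:
    print("cannot pickle nested function:", err)
```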