
[Question] What prevents DefaultCommunicator from working with multi-node MPI #744

Open
XapaJIaMnu opened this issue Oct 10, 2020 · 1 comment


@XapaJIaMnu
Contributor

Hey,

I have been experimenting with CPU training (without CUDA/NCCL) and ran into this ABORT statement:
https://github.com/marian-nmt/marian-dev/blob/master/src/training/communicator.h#L128

However, I don't immediately see which part of the implementation is missing for full multi-node MPI support. I tried commenting out the ABORT and training appears to work. Are gradients not actually exchanged between nodes? Git blame shows @frankseide added it.

Could you guys let me know what part I should add to properly fix this issue?
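For context, here is a minimal sketch of the kind of cross-node gradient exchange I would expect the communicator to perform, using a plain `MPI_Allreduce`. This is illustrative only, not Marian's actual API; the buffer name and layout are made up for the example:

```cpp
// Minimal sketch: average a flattened gradient buffer across MPI processes.
// NOT Marian's API -- `grads` is a hypothetical stand-in for a gradient tensor.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int worldSize = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  std::vector<float> grads(1024, 1.0f); // stand-in for local gradients

  // Sum gradients across all processes in place, then scale to the mean.
  MPI_Allreduce(MPI_IN_PLACE, grads.data(),
                static_cast<int>(grads.size()),
                MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  for (auto& g : grads)
    g /= worldSize;

  MPI_Finalize();
  return 0;
}
```

If DefaultCommunicator only aggregates gradients across local devices, something along these lines would presumably be needed at the node boundary.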

Cheers,

Nick

@emjotde
Member

emjotde commented Nov 10, 2020

We talked over e-mail, but basically you are welcome to own that part.
