[WIP] Add generic multiple machine launcher #202

Open
wants to merge 3 commits into
base: main

@gmuraru (Contributor) commented Dec 8, 2020

Types of changes

This still needs to be tested on some AWS instances.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

Give users an easier way to deploy on multiple machines (not necessarily AWS instances).

How Has This Been Tested (if it applies)

Run the following command. So far it has only been tested on 2 machines (personal laptops).

 python scripts/multiple_machines/generic_launcher.py \
          --ip_addresses=192.168.100.12,192.168.100.14 \
          --ssh_user=george \
          --ssh_key_file="/home/george/.ssh/id_rsa.pub" \
          --aux_files=examples/mpc_linear_svm/mpc_linear_svm.py \
          examples/mpc_linear_svm/launcher.py \
          --features 50 \
          --examples 100 \
          --epochs 50 \
          --lr 0.5 \
          --skip_plaintext
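
For readers who want to see how a command line of this shape can be parsed, here is a minimal sketch. The flag names are taken from the command above; everything else (the `build_parser` helper, the comma-splitting of `--ip_addresses` and `--aux_files`, and treating the launcher path as a positional followed by pass-through arguments) is an assumption for illustration and may differ from the script in this PR.

```python
import argparse
from typing import List


def _comma_list(value: str) -> List[str]:
    """Split a comma-separated flag value into a list, dropping empty items."""
    return [item for item in value.split(",") if item]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names come from the PR description,
    # their types and defaults are assumptions.
    parser = argparse.ArgumentParser(
        description="Launch a script on several machines over SSH",
        allow_abbrev=False,
    )
    parser.add_argument("--ip_addresses", type=_comma_list, required=True)
    parser.add_argument("--ssh_user", type=str, required=True)
    parser.add_argument("--ssh_key_file", type=str, required=True)
    parser.add_argument("--aux_files", type=_comma_list, default=[])
    # The script to run remotely, followed by everything that should be
    # forwarded to it untouched (e.g. --features 50 --epochs 50).
    parser.add_argument("training_script", type=str)
    parser.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    return build_parser().parse_args(argv)
```

With this layout the world size is simply `len(args.ip_addresses)`, which matches the "Running world size 2" line in the output for two addresses.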

And got the following output:

Running world size 2 with ip_addresses: ['192.168.100.12', '192.168.100.14']
Connecting to 192.168.100.12...
Connected to 192.168.100.12
Connecting to 192.168.100.14...
Connected to 192.168.100.14
Remote path tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6
Uploading `examples/mpc_linear_svm/mpc_linear_svm.py` to 192.168.100.12 as tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6/mpc_linear_svm.py...
Uploading `examples/mpc_linear_svm/launcher.py` to 192.168.100.12 as tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6/launcher.py...
`examples/mpc_linear_svm/mpc_linear_svm.py` uploaded to 192.168.100.12
`examples/mpc_linear_svm/launcher.py` uploaded to 192.168.100.12
Uploading `examples/mpc_linear_svm/launcher.py` to 192.168.100.14 as tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6/launcher.py...
Uploading `examples/mpc_linear_svm/mpc_linear_svm.py` to 192.168.100.14 as tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6/mpc_linear_svm.py...
`examples/mpc_linear_svm/launcher.py` uploaded to 192.168.100.14
`examples/mpc_linear_svm/mpc_linear_svm.py` uploaded to 192.168.100.14
[192.168.100.12 STDOUT] total 24
[192.168.100.12 STDOUT] drwxrwxr-x   2 george george  4096 Dec  9 01:32 .
[192.168.100.12 STDOUT] drwxr-xr-x 179 george george 12288 Dec  9 01:32 ..
[192.168.100.12 STDOUT] -rwxrwxr-x   1 george george  2526 Dec  9 01:32 launcher.py
[192.168.100.12 STDOUT] -rw-rw-r--   1 george george  3185 Dec  9 01:32 mpc_linear_svm.py
[192.168.100.14 STDOUT] total 52
[192.168.100.14 STDOUT] drwxrwxr-x   2 george george  4096 dec  9 01:32 .
[192.168.100.14 STDOUT] drwxr-x--x 484 george george 36864 dec  9 01:32 ..
[192.168.100.14 STDOUT] -rwxrwxr-x   1 george george  2526 dec  9 01:32 launcher.py
[192.168.100.14 STDOUT] -rw-rw-r--   1 george george  3185 dec  9 01:32 mpc_linear_svm.py
Run command: export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=192.168.100.12; export MASTER_PORT=29500; export RANK=0;export GLOO_SOCKET_IFNAME=wlp59s0; cd tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6 ;  ./launcher.py --features 50 --examples 100 --epochs 50 --lr 0.5 --skip_plaintext
Run command: export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=192.168.100.12; export MASTER_PORT=29500; export RANK=1;export GLOO_SOCKET_IFNAME=wlp2s0; cd tmp-dir-a39bb3de-39ad-11eb-8752-836d041ba8e6 ;  ./launcher.py --features 50 --examples 100 --epochs 50 --lr 0.5 --skip_plaintext
[192.168.100.12 STDOUT] 2020-12-09 01:32:31,664 - 1151987 - root - INFO - ==================
[192.168.100.12 STDOUT] 2020-12-09 01:32:31,664 - 1151987 - root - INFO - DistributedCommunicator with rank 0
[192.168.100.12 STDOUT] 2020-12-09 01:32:31,664 - 1151987 - root - INFO - ==================
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,411 - 1151987 - root - INFO - World size = 2
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,446 - 1151987 - root - INFO - ==================
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,446 - 1151987 - root - INFO - CrypTen Training
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,446 - 1151987 - root - INFO - ==================
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,652 - 1151987 - root - INFO - Epoch 0 --- Training Accuracy 46.00%
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,726 - 1151987 - root - INFO -     Time 0.278729 (0.278729)
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,886 - 1151987 - root - INFO - Epoch 1 --- Training Accuracy 49.00%
[192.168.100.12 STDOUT] 2020-12-09 01:32:32,951 - 1151987 - root - INFO -     Time 0.225659 (0.252194)
..........
[192.168.100.12 STDOUT] 2020-12-09 01:32:45,940 - 1151987 - root - INFO - Epoch 49 --- Training Accuracy 100.00%
[192.168.100.12 STDOUT] 2020-12-09 01:32:46,004 - 1151987 - root - INFO -     Time 0.240453 (0.270870)
[192.168.100.12 STDOUT] 2020-12-09 01:32:46,005 - 1151987 - root - INFO - CrypTen Weights:
[192.168.100.12 STDOUT] 2020-12-09 01:32:46,030 - 1151987 - root - INFO - tensor([[ 0.6267,  0.0714, -0.8281,  1.1058,  2.2261,  1.0264, -0.2439,  0.3458,
[192.168.100.12 STDOUT]          -0.2066,  1.1105, -0.4470, -0.9671,  0.8811, -0.0458, -0.3212, -0.4173,
[192.168.100.12 STDOUT]          -0.1075, -0.4312,  1.6203,  0.9271,  1.3630, -0.4866, -0.4703,  0.1390,
[192.168.100.12 STDOUT]          -0.1463, -1.2916,  0.5051, -0.2045,  0.1124, -0.6340,  0.1265, -0.7409,
[192.168.100.12 STDOUT]           0.1304,  1.9369, -1.3592, -0.3962, -1.2901,  1.4486, -0.7696,  0.2924,
[192.168.100.12 STDOUT]           0.8608,  0.3240,  0.0914, -0.4166,  0.7992,  0.2739,  0.5351,  0.6998,
[192.168.100.12 STDOUT]           0.6881,  0.6510]])
[192.168.100.12 STDOUT] 2020-12-09 01:32:46,033 - 1151987 - root - INFO - CrypTen Bias:
[192.168.100.12 STDOUT] 2020-12-09 01:32:46,040 - 1151987 - root - INFO - tensor([0.2052])
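
The two `Run command:` lines in the output above differ only in `RANK` and `GLOO_SOCKET_IFNAME`; every other variable is shared, and `MASTER_ADDR` is the first address in the list. A minimal sketch of how such a per-rank command string could be assembled is shown below. The helper names `build_env` and `build_run_command` are hypothetical and this is not the code from the PR; only the environment variable names and values are taken from the log.

```python
import shlex
from typing import Dict, List


def build_env(rank: int, world_size: int, master_addr: str,
              master_port: int = 29500, iface: str = "") -> Dict[str, str]:
    """Environment for one party; variable names mirror the log above."""
    env = {
        "WORLD_SIZE": str(world_size),
        "RENDEZVOUS": "env://",
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
    }
    if iface:
        # Network interface Gloo should bind to on this particular machine.
        env["GLOO_SOCKET_IFNAME"] = iface
    return env


def build_run_command(env: Dict[str, str], remote_dir: str,
                      script: str, script_args: List[str]) -> str:
    """Join exports, a `cd`, and the script call into one shell line."""
    exports = "; ".join(f"export {k}={shlex.quote(v)}" for k, v in env.items())
    argv = " ".join(shlex.quote(a) for a in [f"./{script}", *script_args])
    return f"{exports}; cd {shlex.quote(remote_dir)}; {argv}"
```

Quoting every value with `shlex.quote` is a design choice of this sketch: it leaves plain values such as `env://` or an IP address unchanged but protects arguments that contain spaces or shell metacharacters.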

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot added the CLA Signed label Dec 8, 2020
@@ -23,8 +23,6 @@
import logging
import os

from examples.multiprocess_launcher import MultiProcessLauncher
Contributor Author

This change was needed so that this module does not have to be sent when running this example on multiple machines.

We could also send this file, but then we would also need to create the directory structure. (examples ...)
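
To illustrate the trade-off described here: in the test output the files are uploaded flat into the temporary remote directory, so a package-style import such as `examples.multiprocess_launcher` cannot resolve there. A minimal sketch of the two options follows. The helper `plan_uploads` and its `keep_structure` parameter are hypothetical, not part of this PR; the sketch only computes remote paths and the directories that would have to be created, and performs no network access.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def plan_uploads(local_files: List[str], remote_root: str,
                 keep_structure: bool = False) -> Tuple[Dict[str, str], List[str]]:
    """Map each local file to a remote path and list the remote dirs needed.

    keep_structure=False: flat layout, only the file name is kept.
    keep_structure=True: the relative path (e.g. examples/...) is recreated.
    """
    root = PurePosixPath(remote_root)
    uploads: Dict[str, str] = {}
    dirs = {root}
    for local in local_files:
        rel = PurePosixPath(local)
        if rel.is_absolute() or ".." in rel.parts:
            raise ValueError(f"expected a relative path inside the repo: {local}")
        target_rel = rel if keep_structure else PurePosixPath(rel.name)
        uploads[local] = str(root / target_rel)
        # Every ancestor of the target below the root has to exist remotely.
        for parent in target_rel.parents:
            dirs.add(root / parent)
    # Shallow directories first, so parents can be created before children.
    ordered = sorted(dirs, key=lambda p: (len(p.parts), str(p)))
    return uploads, [str(d) for d in ordered]
```

With the flat layout only the root directory is needed, which is why dropping the import is the simpler change; keeping the structure adds one remote directory per path component.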

@facebook-github-bot (Contributor) left a comment

@knottb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

knottb pushed a commit to gmuraru/CrypTen that referenced this pull request Dec 18, 2020
…arty and save_from_party respectively (facebookresearch#202)

Summary:
Pull Request resolved: fairinternal/CrypTen#202

Change current crypten.load and crypten.save functions to load_from_party and save_from_party respectively.

Reviewed By: knottb

Differential Revision: D21026301

fbshipit-source-id: 7ed8a8b483432caa826198867d22a54542393178
@gmuraru (Contributor, Author) commented May 29, 2021

I think this can be closed since it was imported into Phabricator, right? @knottb
