Fix writer initialization bug affecting horovod TF #68
Merged
Commits (32, all by rahul003):

- d8bbfe9 Fix bug affecting horovod TF runs wrt initialization of writers. Also…
- 5514ea5 util function refactor
- 70c38ae Fix lack of support for PS
- 577dbcf load json once
- 05ddc62 rename var
- d50b90d merge lack of ps support branch
- d9d4cb5 Remove unnecessary set num workers in constructor, it is always set i…
- 5ef155d Merge branch 'fix_lack_of_support' into horovod
- 2328197 Restructure order of calls of distribution strategy setting
- 7cf37d2 Fix mirrored strategy CPU
- 72f16f9 Fix config tests
- 95c5522 Fix return of all writers for mirrored
- d833f6c cleanup
- 90b2c92 Save all workers bug handled for mirrored strategy
- 5eb6451 Fix mirrored strategy all workers bug
- e1287d0 Remove prints from tests
- 2c6fb1b Fix None being returned for writers
- d797a0a Merge branch 'master' of https://github.com/awslabs/sagemaker-debugge…
- 101a167 Add horovod examples
- 1b35974 Add horovod ZCC tests
- 2a6d25a Fix model dir path for estimator and keras examples. Also fix estimat…
- e20a7f5 Change num tensors saved
- aefa9ba change model dir for keras test
- c81e7dd Address review comments
- 9fb4197 Switch to horovodrun
- ac48e45 Place config file in outdir
- 73e9724 Fix horovodrun command
- 4cb6425 copy env dict so next job is not affected
- 7ed5296 Add comments to scripts explaining modifications to enable SMD
- 74d5575 Added asserts for prepared tensors when needed. Refactored distributi…
- ffa1d2d revert to get_distribution_strategy
- 01bb276 Merge master
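The commit "copy env dict so next job is not affected" addresses a classic test-harness pitfall: mutating a shared environment dict for one training job leaks settings into the next. A minimal sketch of the pattern in plain Python — `launch_training_job` and the `SMDEBUG_CONFIG_FILE_PATH` key are illustrative stand-ins, not the PR's actual test code:

```python
import os

def launch_training_job(extra_env):
    # Copy the process environment so per-job mutations do not leak
    # into subsequent jobs launched from the same test process.
    env = dict(os.environ)
    env.update(extra_env)
    return env  # would be passed to subprocess.Popen(..., env=env)

first = launch_training_job({"SMDEBUG_CONFIG_FILE_PATH": "/tmp/job1.json"})
second = launch_training_job({})

# The first job sees its config; the parent environment is untouched.
assert first["SMDEBUG_CONFIG_FILE_PATH"] == "/tmp/job1.json"
```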
New file (+158 lines):

```python
"""
This script is a simple MNIST training script which uses Horovod and Tensorflow's Keras interface.
It has been orchestrated with SageMaker Debugger hook to allow saving tensors during training.
Here, the hook has been created using its constructor to allow running this locally for your experimentation.
When you want to run this script in SageMaker, it is recommended to create the hook from json file.
Please see scripts in either 'sagemaker_byoc' or 'sagemaker_official_container' folder based on your use case.

This script has been adapted from an example in Horovod repository https://github.com/uber/horovod
"""
# Standard Library
import argparse
import math
import os

# Third Party
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential

# First Party
import smdebug.tensorflow as smd


def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    elif v.lower() in ("no", "false", "f", "n", "0"):
        return False
    else:
        raise argparse.ArgumentTypeError("Boolean value expected.")


def main(args):
    # Horovod: initialize Horovod.
    hvd.init()

    if not args.use_only_cpu:
        # Horovod: pin GPU to be used to process local rank (one GPU per process)
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = str(hvd.local_rank())
    else:
        config = None

    K.set_session(tf.Session(config=config))

    batch_size = 128
    num_classes = 10

    # Horovod: adjust number of epochs based on number of GPUs.
    epochs = int(math.ceil(args.num_epochs / hvd.size()))

    # Input image dimensions
    img_rows, img_cols = 28, 28

    # The data, shuffled and split between train and test sets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    if K.image_data_format() == "channels_first":
        x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
        x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
        x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)

    x_train = x_train.astype("float32")
    x_test = x_test.astype("float32")
    x_train /= 255
    x_test /= 255
    print("x_train shape:", x_train.shape)
    print(x_train.shape[0], "train samples")
    print(x_test.shape[0], "test samples")

    # Convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation="softmax"))

    # Horovod: adjust learning rate based on number of GPUs.
    opt = keras.optimizers.Adadelta(1.0 * hvd.size())

    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt)

    ##### Enabling SageMaker Debugger ###########
    # creating hook
    smd_hook = smd.KerasHook(
        out_dir=args.out_dir,
        save_config=smd.SaveConfig(save_interval=args.save_interval),
        include_collections=["weights", "gradients"],
        include_workers=args.include_workers,
    )

    ##### Enabling SageMaker Debugger ###########
    # wrapping optimizer so hook can identify gradients
    opt = smd_hook.wrap_optimizer(opt)

    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=["accuracy"])

    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        ##### Enabling SageMaker Debugger ###########
        # adding smd hook as a callback
        smd_hook,
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(
            keras.callbacks.ModelCheckpoint(os.path.join(args.model_dir, "checkpoint-{epoch}.h5"))
        )

    model.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        callbacks=callbacks,
        epochs=epochs,
        verbose=1 if hvd.rank() == 0 else 0,
        validation_data=(x_test, y_test),
    )
    score = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", score[0])
    print("Test accuracy:", score[1])
```
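Two Horovod idioms in the script scale hyperparameters by the worker count: the epoch count is divided by `hvd.size()` (every worker runs the full dataset each epoch, so this keeps total passes roughly constant) and the learning rate is multiplied by it (the effective batch size grows with the worker count). The same arithmetic in plain Python, with illustrative worker counts:

```python
import math

def scale_for_workers(num_epochs, base_lr, num_workers):
    # Divide epochs across workers: each worker sees the full dataset,
    # so total dataset passes stay roughly constant.
    epochs = int(math.ceil(num_epochs / num_workers))
    # Scale the learning rate with the effective batch size
    # (per-worker batch size times number of workers).
    lr = base_lr * num_workers
    return epochs, lr

print(scale_for_workers(5, 1.0, 4))  # 4 workers: fewer epochs, larger lr
```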
Review comment: Add some code at the end demonstrating creating/reading from a trial? Unless this is meant to be used in SageMaker with builtin rules and no custom analysis.
```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_only_cpu", type=str2bool, default=False)
    parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs to train for")
    parser.add_argument("--out_dir", type=str)
    parser.add_argument("--save_interval", type=int, default=500)
    parser.add_argument("--include_workers", type=str, default="one")
    parser.add_argument("--model_dir", type=str, default="/tmp/mnist_model")
    args = parser.parse_args()

    main(args)
```
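The `str2bool` helper in the script exists because `argparse`'s `type=bool` treats any non-empty string (including `"False"`) as truthy. A standalone demonstration of why the converter is needed:

```python
import argparse

def str2bool(v):
    # Convert common boolean spellings from the command line into real bools.
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    elif v.lower() in ("no", "false", "f", "n", "0"):
        return False
    else:
        raise argparse.ArgumentTypeError("Boolean value expected.")

parser = argparse.ArgumentParser()
parser.add_argument("--use_only_cpu", type=str2bool, default=False)

print(parser.parse_args(["--use_only_cpu", "yes"]).use_only_cpu)    # True
print(parser.parse_args(["--use_only_cpu", "False"]).use_only_cpu)  # False
print(parser.parse_args([]).use_only_cpu)                           # False (default)

# The pitfall str2bool avoids: plain bool() on a string is truthy
# for anything non-empty, so type=bool would silently misparse "False".
print(bool("False"))  # True
```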
examples/tensorflow/sagemaker_byoc/horovod_keras_mnist.py (+152 lines):
```python
"""
This script is a simple MNIST training script which uses Horovod and Tensorflow's Keras interface.
It has been orchestrated with SageMaker Debugger hooks to allow saving tensors during training.
These hooks have been instrumented to read from json configuration that SageMaker will put in the training container.
Configuration provided to the SageMaker python SDK when creating a job will be passed on to the hook.
This allows you to use the same script with differing configurations across different runs.
If you use an official SageMaker Framework container (i.e. AWS Deep Learning Container), then
you do not have to orchestrate your script as below. Hooks will automatically be added in those environments.
For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md

This script has been adapted from an example in Horovod repository https://github.com/uber/horovod
"""
# Standard Library
import argparse
import math
import os

# Third Party
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential

# First Party
import smdebug.tensorflow as smd


def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    elif v.lower() in ("no", "false", "f", "n", "0"):
        return False
    else:
        raise argparse.ArgumentTypeError("Boolean value expected.")


def main(args):
    # Horovod: initialize Horovod.
    hvd.init()

    if not args.use_only_cpu:
        # Horovod: pin GPU to be used to process local rank (one GPU per process)
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = str(hvd.local_rank())
    else:
        config = None

    K.set_session(tf.Session(config=config))

    batch_size = 128
    num_classes = 10

    # Horovod: adjust number of epochs based on number of GPUs.
    epochs = int(math.ceil(args.num_epochs / hvd.size()))

    # Input image dimensions
    img_rows, img_cols = 28, 28

    # The data, shuffled and split between train and test sets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    if K.image_data_format() == "channels_first":
        x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
        x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
        x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)

    x_train = x_train.astype("float32")
    x_test = x_test.astype("float32")
    x_train /= 255
    x_test /= 255
    print("x_train shape:", x_train.shape)
    print(x_train.shape[0], "train samples")
    print(x_test.shape[0], "test samples")

    # Convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation="softmax"))

    # Horovod: adjust learning rate based on number of GPUs.
    opt = keras.optimizers.Adadelta(1.0 * hvd.size())

    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt)

    ##### Enabling SageMaker Debugger ###########
    # Create hook from the configuration provided through sagemaker python sdk
    smd_hook = smd.KerasHook.create_from_json_file()

    ##### Enabling SageMaker Debugger ###########
    # wrap the optimizer so the hook can identify the gradients
    opt = smd_hook.wrap_optimizer(opt)

    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=["accuracy"])

    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        ##### Enabling SageMaker Debugger ###########
        # pass smd_hook as a callback
        smd_hook,
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(
            keras.callbacks.ModelCheckpoint(os.path.join(args.model_dir, "checkpoint-{epoch}.h5"))
        )

    model.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        callbacks=callbacks,
        epochs=epochs,
        verbose=1 if hvd.rank() == 0 else 0,
        validation_data=(x_test, y_test),
    )
    score = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", score[0])
    print("Test accuracy:", score[1])


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_only_cpu", type=str2bool, default=False)
    parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs to train for")
    parser.add_argument("--model_dir", type=str, default="/tmp/mnist_model")
    args = parser.parse_args()

    main(args)
```
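The local example's `include_workers` flag (`"one"` vs `"all"`) sits at the heart of the writer-initialization fixes in this PR: only selected workers should create event writers. A toy model of that gating, with illustrative names rather than smdebug's internal API:

```python
def worker_saves_tensors(rank, include_workers="one"):
    # "one": only the chief worker (rank 0) gets a writer;
    # "all": every worker writes its own tensor files.
    if include_workers == "all":
        return True
    if include_workers == "one":
        return rank == 0
    raise ValueError(f"unsupported include_workers value: {include_workers!r}")

ranks = range(4)
print([r for r in ranks if worker_saves_tensors(r, "one")])  # only the chief
print([r for r in ranks if worker_saves_tensors(r, "all")])  # every worker
```

The bugs fixed here (e.g. "Fix return of all writers for mirrored", "Fix None being returned for writers") are what happens when this gating interacts badly with a distribution strategy's view of worker ranks.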