Sorry for the delay; I think that works. Just to make sure we cover our bases, please look at the following to make Amazon SageMaker training easy:
• Create a dataset-root/model-root location so we can choose where we grab datasets from (rather than assuming they are in a certain location).
• Create a flag to specify where we save model-artifacts and checkpoints
• Ensure training works on SageMaker Training (sometimes there are problems with Horovod and downloading pre-trained models).
• Do one of the following two options:
1. Modify the arguments so that SageMaker training works more easily.
a. Modify the input arguments so they align with the arguments SageMaker needs for training:
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])
Plus some sort of checkpoint-save flag that lets us choose where checkpoints are written.
2. Probably the better solution (for integration into more platforms):
a. Create a “SageMaker Train” script that wraps our scripts and provides the functionality needed for the first option (primarily a list of environment variables that it reads and passes along).
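To make the wrapper idea concrete, here is a rough sketch of what such a script could look like. It is only an illustration under some assumptions: the wrapped script name (`train.py`), its flags (`--dataset-root`, `--test-root`, `--model-root`), the `--checkpoint-dir` flag, and the local fallback paths are all placeholders I made up, not the repo's actual interface. The one real difference from the snippet above is that it uses `os.environ.get(...)` with a fallback, so the same script would not raise a `KeyError` when run outside SageMaker.

```python
import argparse
import os
import subprocess
import sys


def build_parser(env=None):
    """Parser whose defaults come from SageMaker's SM_* variables when they are set."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical SageMaker wrapper")
    # .get() with a local fallback avoids a KeyError when running outside SageMaker.
    parser.add_argument("--model-dir", type=str,
                        default=env.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", type=str,
                        default=env.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test", type=str,
                        default=env.get("SM_CHANNEL_TEST", "./data/test"))
    # Assumed flag name; the request above only asks for "some sort of" checkpoint flag.
    parser.add_argument("--checkpoint-dir", type=str, default="./checkpoints")
    return parser


def build_command(args, script="train.py"):
    """Translate the parsed arguments into a command line for the wrapped script.

    The script name and its flags are placeholders, not the repo's real interface.
    """
    return [
        sys.executable, script,
        "--dataset-root", args.train,
        "--test-root", args.test,
        "--model-root", args.model_dir,
        "--checkpoint-dir", args.checkpoint_dir,
    ]


def main(argv=None):
    # parse_known_args leaves unrecognized arguments in `extra`,
    # which are forwarded unchanged to the wrapped script.
    args, extra = build_parser().parse_known_args(argv)
    return subprocess.call(build_command(args) + extra)
```

In an actual entry-point file this would end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard. Keeping the SageMaker-specific lookups in one wrapper means the underlying scripts only need generic path flags, which is what should make them easier to reuse on other platforms.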
Let me know if I need to add more clarity. I'd be more than willing to help write the blog once we have worked out those kinks.
To drive usability of the platform, I propose that we make all scripts support SageMaker.