Skip logging the input tensors to the loss block #86
Conversation
Why are we doing this?
Also linking the previous PR for MXNet: #64
@vandanavk Could you implement this using these methods: https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#methods-on-a-collection
Ya, saw this and assumed it was discussed and decided. Blocking the input is only required when the loss is written for smexperiments and others.
Can you also update the description section with the contents that are emitted to Minerva.
How are the inputs of the loss block even going to Minerva? That might mean the losses collection is not configured correctly.
For PyTorch, including inputs and outputs is handled in _prepare_collections().
RFC @jarednielsen @rahul003 @Vikas-kum: this is how the code could look for functional loss. If input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option. After these changes, the output to SM metrics looks something like the PR description above. @Vikas-kum
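To make the "check before queuing for SM metrics" concrete, here is a minimal, self-contained sketch; the names (ScalarRecord, queue_scalar) are illustrative stand-ins, not the actual smdebug internals:

```python
from collections import namedtuple

# Illustrative stand-in for the hook's cached scalar entry.
ScalarRecord = namedtuple("ScalarRecord", "name value sm_metric")

def queue_scalar(cache, tensor_name, tensor_val):
    """Queue a scalar, marking loss-block inputs so they are never sent to SM metrics."""
    # Assumption: loss-block inputs carry an "_input" marker in their tensor names.
    sm_metric = "_input" not in tensor_name
    cache.append(ScalarRecord(tensor_name, tensor_val, sm_metric))

cache = []
queue_scalar(cache, "CrossEntropyLoss_output_0", 2.35)  # emitted to SM metrics
queue_scalar(cache, "CrossEntropyLoss_input_0", 0.12)   # cached, but not emitted
```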
Thanks. Why is there no iteration number in the first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253} What is the iteration number, by the way? Is it step_num? If yes, shouldn't it be 0, 100, 200, etc.?
This is because smexperiments had a check and iteration 0 wasn't getting recorded. I had reported it to Owen, so I think this must have been fixed in the latest version. I will get hold of the latest version of the package and test.
This is the latest, @Vikas-kum:
smdebug/core/hook.py (outdated)
tensor_name, tensor_val, sm_metric=True, write_tb=False, write_event=False
)
self.scalar_cache.append(scalar_obj)
if "_input" not in tensor_name:
Let's not hardcode it like this. How about, in the register_loss fn:
self.collection_manager.get_collection('losses').add_module_tensors(loss_module)
This by default only saves the outputs.
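A hedged sketch of what that suggestion could look like (written as a free function here for illustration; add_module_tensors defaulting to outputs only follows the Collection API docs linked above):

```python
def register_loss(hook, loss_module):
    """Add only the loss module's output tensors to the 'losses' collection."""
    losses = hook.collection_manager.get_collection("losses")
    # Per the linked docs, the defaults are inputs=False, outputs=True, so the
    # loss inputs are not captured unless a user opts in explicitly, e.g.
    # losses.add_module_tensors(loss_module, inputs=True).
    losses.add_module_tensors(loss_module)
```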
So inputs will not be saved by default for loss?
Also, this alone won't work, because the default regex for the losses collection is [Ll]oss, so input tensors are included. Then we'd have to change add_module_tensors() to overwrite the existing regex.
What if the regex for the losses collection is changed to exclude inputs there? Then we won't need to add the "_input" check for all the scalar tensors.
See how the regex for weights excludes the ones which have both weights and gradients in them, as gradient tensor names are of the format gradient/weight/x.
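For illustration, a regex in that spirit (my example pattern, not necessarily the exact one in the collection manager): match weight tensors but reject names that start with "gradient":

```python
import re

# Negative lookahead drops gradient tensors such as "gradient/conv1.weight".
weight_pattern = re.compile(r"^(?!gradient).*weight")

print(bool(weight_pattern.search("conv1.weight")))           # True
print(bool(weight_pattern.search("gradient/conv1.weight")))  # False
```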
We'll have to do this in register_default_collections then. Any regex mentioned after this initial register_default_collections call appends to the existing regex.
If the user explicitly wants input tensors to be included, they can call add_module_tensors() later. If input is explicitly added this way, then it will go to SM metrics too.
No, change it here itself:
self.get(CollectionKeys.LOSSES).include("[Ll]oss")
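One possible way to tighten that include pattern (an assumption for illustration, not necessarily what was merged) is a lookahead that rejects names containing "_input":

```python
import re

# Hypothetical tightened losses pattern: match [Ll]oss tensors unless the
# remainder of the name contains "_input".
loss_pattern = re.compile(r"(?!.*_input)[Ll]oss")

for name in ["CrossEntropyLoss_output_0", "CrossEntropyLoss_input_0"]:
    print(name, bool(loss_pattern.search(name)))
# CrossEntropyLoss_output_0 True
# CrossEntropyLoss_input_0 False
```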
Never mind, that's what you referred to as well. Yeah
Isn't the input a non-scalar? How is it going to Minerva? Ah, I see, because of the mean.
Are you saving every step? Shouldn't the iteration number be 0, 100, 200 by default? I guess we are recording the loss every 100 steps by default, right?
This output is from executing tests/pytorch/test_loss.py::test_register_loss_module.
Thanks for clarifying. Let the training run for 10 steps of train and 10 steps of eval for 2 epochs (1 epoch is 10 steps of train and 10 steps of eval).
@Vikas-kum here's the log with the current code: https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4
@vandanavk I see that the iteration number is -1 for some values, and also the same (metricName, iterationNumber) pair is duplicated a few times. I see that the iteration number is being set to mode_step, which is causing this duplication. I don't know the right way to address this, as even global step is not exactly the cleanest thing to do, since the graph will have merged metrics across modes if we do that. It looks like SM Metrics doesn't have the right abstraction here. But we should at least not be sending an iteration number of -1; can you take a look at that?
There are some issues with the current implementation if I look at the results. As I suspected, the iteration number is repeated. :) One way is to append the mode to the loss tensor name, so that the metric can be shown as TRAIN_LOSS, EVAL_LOSS, and GLOBAL_LOSS when emitting to Minerva. Also, please look at the -1 iteration number; it needs to be handled. @rahul003 @leleamol @jarednielsen I think the above issue will be present in all the frameworks. Could we please make sure we handle this issue across every framework?
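A minimal sketch of that suggestion (illustrative function and field names, not the smdebug implementation): suffix the metric name with the mode so the TRAIN/EVAL/GLOBAL series don't collide, and refuse to emit negative iteration numbers:

```python
def sm_metric_record(tensor_name, value, mode, mode_step, timestamp):
    """Build an SM-metrics record, or return None if the step is invalid."""
    if mode_step < 0:
        return None  # never send IterationNumber: -1
    return {
        "MetricName": f"{tensor_name}_{mode}",  # e.g. CrossEntropyLoss_output_0_TRAIN
        "Value": value,
        "Timestamp": timestamp,
        "IterationNumber": mode_step,
    }
```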
@Vikas-kum self.writer is None, so TRAIN mode step -1 is not logged. But when set to EVAL mode, this check fails because the writer is valid, and iteration number -1 is written.
No, this is a bug. Please refer to my comment here; I suggested a way to handle the mode.
@Vikas-kum The output now looks like this: https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4
@rahul003 the behavior in terms of logging inputs with the code in this PR is the following:
the earlier option was:
I think that's not a problem, because the input to the loss block doesn't have a place in the losses collection. We can guide users to add that module to a different collection, say
So this should work for MXNet as well, right? Can you add a test for MXNet too?
The step number fix can come in a different PR.
This reverts commit fd9155c.
Write correct mode for each scalar in write_scalars(). Increment the global mode step number irrespective of which mode it is.
This reverts commit 8476c5f. We don't need to close writers anymore because the mode is stored in the scalar cache.
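To illustrate the idea behind these commits (illustrative types, not the actual smdebug classes): each cached scalar remembers the mode it was recorded under, so write_scalars() can emit the right series later without closing and reopening writers on mode switches:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CachedScalar:
    name: str
    value: float
    mode: str       # "TRAIN", "EVAL", or "GLOBAL"
    mode_step: int  # step counter within that mode

def write_scalars(cache: List[CachedScalar], emit: Callable[[str, float, int], None]) -> None:
    """Flush cached scalars, emitting each one under its recorded mode."""
    for scalar in cache:
        emit(f"{scalar.name}_{scalar.mode}", scalar.value, scalar.mode_step)
    cache.clear()
```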
Is this ready now?
@Vikas-kum
@rahul003 @Vikas-kum All review comments addressed.
Thanks. I see XGBoost is not tested. Can you run and check zero-code-change XGBoost?
@Vikas-kum this is the output https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5 when tests/xgboost/test_hook.py is executed (this is the combined output of all 13 tests).
Why are there repetitions here? train-rmse_GLOBAL is repeated for iterations. Also, the timestamp ordering is weird: 1577487854.368032 comes after 1577487854.3779159. {"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.3779159, "IterationNumber": 0}
Vandana confirmed that the repetitions are from different test cases: https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5
* Skip logging the input tensors to the loss block
* Add loss inputs for PT functional loss
* Append mode name
* Write correct mode for each scalar in write_scalars()
* Increment global mode step number irrespective of which mode it is

(cherry picked from commit cfcfd7a)
Description of changes:
The PR includes the following change: we will only log the output of loss blocks (a usage sketch follows the logs below).
Frameworks tested:
Loss module metrics logged for SM metrics before this PR
Loss module metrics logged for SM metrics with this PR
Log for an entire epoch of training (using TRAIN and EVAL modes):
https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4
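For reference, a minimal PyTorch usage sketch of the loss registration this change affects; the out_dir, model, and data here are illustrative, and the method names follow the smdebug docs (treat this as a sketch, not the test code from this PR):

```python
import torch
import torch.nn as nn
import smdebug.pytorch as smd

# Illustrative output directory; in a SageMaker job this is configured for you.
hook = smd.Hook(out_dir="/tmp/smdebug_demo", include_collections=["losses"])

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
hook.register_module(model)
hook.register_loss(criterion)  # with this PR, only the loss output scalars go to SM metrics

inputs = torch.randn(4, 10)
labels = torch.randint(0, 2, (4,))
loss = criterion(model(inputs), labels)
loss.backward()
```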
Style and formatting:
I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.