Conversation

vandanavk
Contributor

@vandanavk vandanavk commented Dec 4, 2019

Description of changes:

The PR includes

  • a change that prevents logging the input tensors to a loss block;
    only the output of loss blocks will be logged
  • specifying the mode in ScalarCache so that write_scalars() can
    retrieve the correct mode name and step number
  • incrementing the global step number irrespective of the mode
  • modifying the save_scalar tests accordingly (test output after this PR)
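The ScalarCache/mode change above can be sketched roughly like this; field and function names are illustrative assumptions, not the actual smdebug implementation:

```python
from dataclasses import dataclass


@dataclass
class ScalarCacheEntry:
    """One cached scalar, remembering the mode it was captured in."""
    name: str
    value: float
    mode: str        # e.g. "TRAIN", "EVAL", "GLOBAL"
    mode_step: int   # step number within that mode
    sm_metric: bool = True


def write_scalars(cache):
    # Emit one record per cached scalar, using the stored mode and
    # mode step instead of whatever mode is active at write time.
    return [
        {"MetricName": f"{e.name}_{e.mode}", "IterationNumber": e.mode_step}
        for e in cache
        if e.sm_metric
    ]
```

With a GLOBAL-mode entry this yields names like `CrossEntropyLoss_output_0_GLOBAL`, matching the post-PR output shown below.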

Frameworks tested:

  • MXNet
  • PyTorch
  • Tensorflow KerasHook
  • TensorFlow SessionHook
  • XGBoost

Loss module metrics logged for SM metrics before this PR

cat 62064.json 
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.032386019825935364, "Timestamp": 1575495078851}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078852}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.299854040145874, "Timestamp": 1575495078852}
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.026646751910448074, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.188162088394165, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.012318896129727364, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.9457279443740845, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_input_0", "Value": 0.020270144566893578, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.3983503580093384, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_input_0", "Value": 0.14442947506904602, "Timestamp": 1575495078877, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078877, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.11063049733638763, "Timestamp": 1575495078877, "IterationNumber": 4}

Loss module - metrics logged for SM metrics with this PR

{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.33306884765625, "Timestamp": 1575672948.792166, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.227065086364746, "Timestamp": 1575672948.800318, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.0023903846740723, "Timestamp": 1575672948.804405, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 1.5051748752593994, "Timestamp": 1575672948.8086379, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 0.22368499636650085, "Timestamp": 1575672948.812764, "IterationNumber": 4}

Log for an entire epoch of training (using TRAIN and EVAL modes):
https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jarednielsen
Contributor

Why are we doing this?

@rahul003
Contributor

rahul003 commented Dec 4, 2019

Also linking the previous PR for MXNet #64
which I had approved. But in hindsight this is the wrong approach if you want to save losses by default. Defaulting to output of loss block makes sense for the losses collection, but the input to loss block is a valuable debugging tool. It should be possible to save them.

@rahul003
Contributor

rahul003 commented Dec 4, 2019

@vandanavk Could you implement this using these methods: https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#methods-on-a-collection

@vandanavk
Contributor Author

Also linking the previous PR for MXNet #64
which I had approved. But in hindsight this is the wrong approach if you want to save losses by default. Defaulting to output of loss block makes sense for the losses collection, but the input to loss block is a valuable debugging tool. It should be possible to save them.

Yes, I saw this and assumed it had been discussed and decided. Blocking the input is only required when the loss is written for smexperiments and the like.

@Vikas-kum
Contributor

Can you also update the description section with the content that is emitted to Minerva?

@rahul003
Contributor

rahul003 commented Dec 4, 2019

How are the inputs of the loss block even going to Minerva? That might mean the losses collection is not correctly configured.

@vandanavk
Contributor Author

vandanavk commented Dec 4, 2019

How are the inputs of loss block even going to Minerva? That might mean losses collection is not correctly configured

For PyTorch, inputs and outputs are included in _prepare_collections().

@vandanavk
Contributor Author

RFC @jarednielsen @rahul003 @Vikas-kum: this is how the code could look for functional loss. If input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.

After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

@Vikas-kum
Contributor

Vikas-kum commented Dec 4, 2019

RFC @jarednielsen @rahul003 @Vikas-kum this is how the code could look like for functional loss. if input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.

After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

Thanks. Why is there no iteration number in the first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

What is the iteration number, by the way? Is it step_num? If yes, shouldn't it be 0, 100, 200, etc.?

@vandanavk
Contributor Author

Thanks. Why is there no iteration number in first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

This is because smexperiments had a check like

if IterationNumber:
    set iteration_number

and iteration 0 wasn't getting recorded, since 0 is falsy. I had reported it to Owen, so this may have been fixed in the latest version. I'll get hold of the latest version of the package and test.
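A minimal repro of that kind of check, assuming the behavior described above (this is illustrative, not the actual smexperiments source):

```python
def record_metric(metric, iteration_number=None):
    """Buggy variant: drops iteration 0 because 0 is falsy."""
    entry = {"MetricName": metric}
    if iteration_number:  # False for 0, so step 0 is silently skipped
        entry["IterationNumber"] = iteration_number
    return entry


def record_metric_fixed(metric, iteration_number=None):
    """Fixed variant: compare against None so iteration 0 is recorded."""
    entry = {"MetricName": metric}
    if iteration_number is not None:
        entry["IterationNumber"] = iteration_number
    return entry
```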

@vandanavk
Copy link
Contributor Author

vandanavk commented Dec 4, 2019


This is the latest @Vikas-kum :

cat 70028.json 
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.166755437850952, "Timestamp": 1575496628.148496, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.065483331680298, "Timestamp": 1575496628.156893, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.8483023643493652, "Timestamp": 1575496628.1635468, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.375758409500122, "Timestamp": 1575496628.168522, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.20699751377105713, "Timestamp": 1575496628.173862, "IterationNumber": 4}

tensor_name, tensor_val, sm_metric=True, write_tb=False, write_event=False
)
self.scalar_cache.append(scalar_obj)
if "_input" not in tensor_name:
Contributor

@rahul003 rahul003 Dec 4, 2019

Let's not hardcode it like this. How about

register_loss fn:
   self.collection_manager.get_collection('losses').add_module_tensors(loss_module) 

this by default only saves the outputs
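A rough sketch of that suggestion, using a minimal stand-in Collection class; the inputs/outputs defaults are assumptions for illustration, not smdebug's actual signature:

```python
class Collection:
    """Minimal stand-in for a smdebug-style Collection (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.module_specs = []

    def add_module_tensors(self, module_name, inputs=False, outputs=True):
        # Defaulting to outputs-only matches "this by default only saves
        # the outputs" from the comment above; callers can opt in to inputs.
        self.module_specs.append((module_name, inputs, outputs))


losses = Collection("losses")
losses.add_module_tensors("CrossEntropyLoss")  # outputs only, by default
```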

Contributor Author

@vandanavk vandanavk Dec 4, 2019

So inputs will not be saved by default for loss?

Contributor Author

Also, this alone won't work, because the default regex for the loss collection is [Ll]oss, so input tensors are included. Then we'd have to change add_module_tensors() to overwrite the existing regex.

Contributor

What if the regex for the losses collection is changed to exclude inputs there? Then we won't need to add the "not _input" check for all the scalar tensors.

Contributor

See how the regex for weights excludes the names that contain both "weight" and "gradient", since gradient tensor names are of the format

gradient/weight/x
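For illustration, a hedged sketch of how a losses regex could exclude loss-block inputs with a negative lookahead, in the same spirit as the weights regex described above (the exact pattern and tensor names here are examples, not the repository's actual regex):

```python
import re

# Match [Ll]oss tensor names, but reject any name containing "_input".
LOSS_REGEX = re.compile(r"^(?!.*_input)[A-Za-z]*[Ll]oss.*")

names = [
    "CrossEntropyLoss_output_0",
    "CrossEntropyLoss_input_0",
]
matches = [n for n in names if LOSS_REGEX.match(n)]
```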

Contributor Author

@vandanavk vandanavk Dec 4, 2019

We'll have to do this in register_default_collections() then. Any regex mentioned after this initial register_default_collections() call appends to the existing regex.
If the user explicitly wants input tensors to be included, they can call add_module_tensors() later. If input is explicitly added this way, it will go to SM metrics too.

Contributor

No, change it here itself

self.get(CollectionKeys.LOSSES).include("[Ll]oss")

Contributor

Never mind, that's what you referred to as well. Yeah

Contributor

@rahul003 rahul003 Dec 4, 2019

Isn't the input a non-scalar? How is it going to Minerva? Ah, I see: because of the mean.

@Vikas-kum
Contributor


Are you saving every step? Shouldn't the iteration number be 0, 100, 200 by default? I guess we record the loss every 100 steps by default, right?

@vandanavk
Contributor Author

Are you saving every step ? Shouldn't iteration number be 0,100,200 by default. I guess we are recording loss every 100 steps by default, right?

This output is from running tests/pytorch/test_loss.py::test_register_loss_module.

@Vikas-kum
Contributor

This output is on execution of tests/pytorch/test_loss.py::test_register_loss_module

Thanks for clarifying.
Can you please let me know the result of this test case: training with TRAIN, EVAL, and GLOBAL modes set. What is the output to Minerva?

Let training run for 2 epochs, where 1 epoch is 10 TRAIN steps followed by 10 EVAL steps.

@vandanavk
Contributor Author


@Vikas-kum here's the log with the current code. https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4.
Some issues/improvements that I see: it doesn't differentiate between modes, it doesn't differentiate between epochs, the iteration number can only be an int, and the step number continues across epochs (is this intended behavior?).

@rahul003
Contributor

rahul003 commented Dec 5, 2019

@vandanavk I see that the iteration number is -1 for some values, and also that the same (MetricName, IterationNumber) pair is duplicated a few times.

I see that the iteration number is being set to mode_step, which is causing this duplication. I'm not sure what the right way to address this is; even the global step is not exactly clean, since the graph would then merge metrics across modes. It looks like SM Metrics doesn't have the right abstraction here.

But we should at least not be sending an iteration number of -1. Can you take a look at that?

@Vikas-kum
Contributor

Vikas-kum commented Dec 6, 2019

There are some issues with the current implementation, looking at the results:
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2104947566986084, "Timestamp": 1575587064.571794, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.311326742172241, "Timestamp": 1575587064.579527, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.292689800262451, "Timestamp": 1575587064.589998, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.421755313873291, "Timestamp": 1575587064.608332, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4677681922912598, "Timestamp": 1575587064.615562, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.1791975498199463, "Timestamp": 1575587064.6230772, "IterationNumber": 5}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.246185779571533, "Timestamp": 1575587064.630558, "IterationNumber": 6}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3693621158599854, "Timestamp": 1575587064.639154, "IterationNumber": 7}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.351681709289551, "Timestamp": 1575587064.646904, "IterationNumber": 8}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.289271593093872, "Timestamp": 1575587064.705531, "IterationNumber": -1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3608412742614746, "Timestamp": 1575587064.7141418, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.265277862548828, "Timestamp": 1575587064.717716, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2706689834594727, "Timestamp": 1575587064.721091, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3054704666137695, "Timestamp": 1575587064.724351, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2706704139709473, "Timestamp": 1575587064.728058, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2529993057250977, "Timestamp": 1575587064.731631, "IterationNumber": 5}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.279721736907959, "Timestamp": 1575587064.73516, "IterationNumber": 6}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4128708839416504, "Timestamp": 1575587064.738771, "IterationNumber": 7}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2863287925720215, "Timestamp": 1575587064.74294, "IterationNumber": 8}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3826935291290283, "Timestamp": 1575587064.802319, "IterationNumber": 9}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.36856746673584, "Timestamp": 1575587064.81825, "IterationNumber": 10}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.303138494491577, "Timestamp": 1575587064.82403, "IterationNumber": 11}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4107022285461426, "Timestamp": 1575587064.831247, "IterationNumber": 12}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.280378818511963, "Timestamp": 1575587064.838045, "IterationNumber": 13}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.34610915184021, "Timestamp": 1575587064.844379, "IterationNumber": 14}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.201525926589966, "Timestamp": 1575587064.851093, "IterationNumber": 15}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4224019050598145, "Timestamp": 1575587064.85957, "IterationNumber": 16}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3088531494140625, "Timestamp": 1575587064.867857, "IterationNumber": 17}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2803003787994385, "Timestamp": 1575587064.874088, "IterationNumber": 18}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.377769947052002, "Timestamp": 1575587064.929729, "IterationNumber": 9}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3445422649383545, "Timestamp": 1575587064.939193, "IterationNumber": 10}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.240546703338623, "Timestamp": 1575587064.9428961, "IterationNumber": 11}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.271639347076416, "Timestamp": 1575587064.946501, "IterationNumber": 12}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.297555685043335, "Timestamp": 1575587064.950707, "IterationNumber": 13}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.259658098220825, "Timestamp": 1575587064.955234, "IterationNumber": 14}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2558112144470215, "Timestamp": 1575587064.9623349, "IterationNumber": 15}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.24918532371521, "Timestamp": 1575587064.971052, "IterationNumber": 16}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4208264350891113, "Timestamp": 1575587064.974842, "IterationNumber": 17}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.282519817352295, "Timestamp": 1575587064.980673, "IterationNumber": 18}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.368577003479004, "Timestamp": 1575587065.00563, "IterationNumber": 19}

As I suspected, the iteration number is repeated. :)
The other problem is the iteration number of -1.
Both of these issues need to be resolved.

One way is to append the mode to the loss tensor name, so that the metrics can be shown as TRAIN_LOSS, EVAL_LOSS, and GLOBAL_LOSS when emitting to Minerva.

Also, please look at what the -1 iteration number is; it needs to be handled.

@rahul003 @leleamol @jarednielsen I think the above issue will be present in all the frameworks. Can we please make sure we handle this issue across every framework?
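A minimal sketch of the mode-prefix idea; ModeKeys and the naming format here are illustrative assumptions, not the actual smdebug code:

```python
from enum import Enum


class ModeKeys(Enum):
    TRAIN = "TRAIN"
    EVAL = "EVAL"
    GLOBAL = "GLOBAL"


def metric_name(tensor_name, mode):
    # Prefix the mode so TRAIN step 3 and EVAL step 3 no longer
    # collide under the same (MetricName, IterationNumber) pair.
    return f"{mode.value}_{tensor_name}"
```

With this, TRAIN and EVAL losses become distinct time series in Minerva even though their mode-step sequences both start at 0.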

@vandanavk
Contributor Author


@Vikas-kum
This -1 is an issue in the PyTorch hook (and possibly the MXNet hook too). When forward_pre_hook() is called the very first time,

if self.writer is not None:
    self._close_writers()

self.writer is None, so TRAIN mode step -1 is not logged. But once the mode is set to EVAL, the writer is valid, this check passes, and iteration number -1 is written.
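One possible guard, sketched with hypothetical names rather than the actual hook code: skip emitting cached scalars while the step counter is still -1, i.e. before the first step increment.

```python
def flush_scalar_cache(scalar_names, step):
    """Emit one record per cached scalar, but only for valid steps."""
    if step < 0:
        # The hook fired before the first step increment; emitting now
        # would produce the spurious IterationNumber -1 records.
        return []
    return [
        {"MetricName": name, "IterationNumber": step}
        for name in scalar_names
    ]
```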

@Vikas-kum
Contributor

RFC @jarednielsen @rahul003 @Vikas-kum this is how the code could look like for functional loss. if input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.
After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

Thanks. Why is there no iteration number in first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

This is because smexperiments had a check

if IterationNumber:
    set itertionnumber

and iteration 0 wasn't getting recorded. I had reported it to Owen, so I think this must have been fixed in the latest version. will get a hold of the latest version of the package and test

This is the latest @Vikas-kum :

cat 70028.json 
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.166755437850952, "Timestamp": 1575496628.148496, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.065483331680298, "Timestamp": 1575496628.156893, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.8483023643493652, "Timestamp": 1575496628.1635468, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.375758409500122, "Timestamp": 1575496628.168522, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.20699751377105713, "Timestamp": 1575496628.173862, "IterationNumber": 4}

Are you saving every step? Shouldn't the iteration number be 0, 100, 200 by default? I guess we are recording loss every 100 steps by default, right?

This output is on execution of tests/pytorch/test_loss.py::test_register_loss_module

Thanks for clarifying.
Can you please let me know the result of this test?
Test case:
Training has TRAIN, EVAL, and GLOBAL modes set. What is the output to minerva?
Let training run for 10 steps of train and 10 steps of eval for 2 epochs (1 epoch is 10 steps of train and 10 steps of eval).

@Vikas-kum here's the log with the current code. https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4.
Some issues/improvements that I see: it doesn't differentiate between modes, doesn't differentiate between epochs, the iteration number can only be an int, and the step number continues across epochs (is this intended behavior?).

No, this is a bug. Please refer to the comment here; I suggested a way to handle mode.
#86 (comment)

@vandanavk
Copy link
Contributor Author

@Vikas-kum The output now looks like this https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4

@vandanavk
Copy link
Contributor Author

@rahul003 the behavior in terms of logging inputs is the following with the code in this PR:

  • Loss input is not logged by default (only the output is). Input can be included with an explicit call to add_module_tensors(), but if this is done, the loss input will be logged to SM metrics too.

The earlier option was:

  • Add a check to exclude all "_input" tensors from SM metrics. In this case, loss inputs and outputs are generally logged but are excluded only from SM metrics.

@rahul003
Copy link
Contributor

rahul003 commented Dec 6, 2019

I think that's not a problem, because the input to the loss block doesn't have a place in the losses collection. We can guide users to add that module to a different collection, say
hook.get_collection('outputs').add_module_tensors(loss_module, inputs=True, outputs=False)

Copy link
Contributor

@rahul003 rahul003 left a comment

So this should work for MXNet as well, right? Can you add a test for MXNet too?
The step number fix can come in a different PR.

This reverts commit 8476c5f.

We don't need to close writers anymore because the mode is stored in the scalar cache.
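A sketch of that idea: each cached scalar carries the mode it was captured in along with its mode step, so write_scalars() can emit the correct mode name and step without closing and reopening writers on every mode switch. The field and function names here are illustrative, not the actual smdebug implementation.

```python
from dataclasses import dataclass

@dataclass
class ScalarCacheEntry:
    name: str
    value: float
    mode: str        # "TRAIN" / "EVAL" / "GLOBAL"
    mode_step: int

def write_scalars(cache):
    # Emit one metric record per cached scalar, with the mode
    # baked into the metric name and the per-mode step number.
    out = []
    for entry in cache:
        out.append({
            "MetricName": f"{entry.name}_{entry.mode}",
            "Value": entry.value,
            "IterationNumber": entry.mode_step,
        })
    return out
```

With this, a TRAIN scalar and an EVAL scalar captured at the same wall-clock time still land in distinct metric series with their own step counters.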
@rahul003
Copy link
Contributor

Is this ready now?

@vandanavk
Copy link
Contributor Author

A couple of things:

  1. In the end, can you run a simple training with SageMaker and verify what metrics are getting emitted to rhinestone?
    I guess you will have to build your image (we have scripts to do that) and provide that custom image in the SM API.
  2. In the description, please check which frameworks have been tested explicitly.

@Vikas-kum
For 1., tested on SageMaker. For a test training script, the following metrics were emitted:
scalar/pt_num_steps_GLOBAL
CrossEntropyLoss_output_0_TRAIN
CrossEntropyLoss_output_0_EVAL

@vandanavk
Copy link
Contributor Author

@rahul003 @Vikas-kum All review comments addressed.

@Vikas-kum
Copy link
Contributor

Thanks. I see XGBoost is not tested. Can you run and check zero-code-change XGBoost?

@vandanavk
Copy link
Contributor Author

vandanavk commented Dec 27, 2019

@Vikas-kum this is the output https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5 when tests/xgboost/test_hook.py is executed (this is the output of all 13 tests executed)

@Vikas-kum
Copy link
Contributor

Why are there repetitions here? train-rmse_GLOBAL is repeated for the same iteration. Also, the timestamp ordering is weird: 1577487854.368032 comes after 1577487854.3779159.

{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.3779159, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.3785, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.367563, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.368032, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.352, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.352299, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.341634, "IterationNumber": 0}

@Vikas-kum Vikas-kum closed this Dec 28, 2019
@Vikas-kum Vikas-kum reopened this Dec 28, 2019
Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Vandana confirmed that the repetitions are from different test cases. https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5

@Vikas-kum Vikas-kum merged commit cfcfd7a into master Dec 28, 2019
@Vikas-kum Vikas-kum deleted the loss_input branch December 28, 2019 21:02
ddavydenko pushed a commit that referenced this pull request Jan 2, 2020
* Skip logging the input tensors to the loss block

* Add loss inputs for PT functional loss

* append mode name

* Write correct mode for each scalar in write_scalars()
* Increment global mode step number irrespective of which mode it is

(cherry picked from commit cfcfd7a)

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants