Conversation

vandanavk
Contributor

@vandanavk vandanavk commented Dec 4, 2019

Description of changes:

The PR includes

  • a change that prevents logging the input tensors to a loss block;
    only the output of loss blocks will be logged
  • specifying the mode in ScalarCache so that write_scalars() can
    retrieve the correct mode name and step number
  • incrementing the global step number irrespective of the mode
  • modifying the save_scalar tests accordingly (test output after this PR)
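The ScalarCache/mode change above can be sketched roughly like this; field and function names are illustrative assumptions, not the actual smdebug implementation:

```python
from dataclasses import dataclass


@dataclass
class ScalarCacheEntry:
    """One cached scalar, remembering the mode it was captured in."""
    name: str
    value: float
    mode: str        # e.g. "TRAIN", "EVAL", "GLOBAL"
    mode_step: int   # step number within that mode
    sm_metric: bool = True


def write_scalars(cache):
    # Emit one record per cached scalar, using the stored mode and
    # mode step instead of whatever mode is active at write time.
    return [
        {"MetricName": f"{e.name}_{e.mode}", "IterationNumber": e.mode_step}
        for e in cache
        if e.sm_metric
    ]
```

With a GLOBAL-mode entry this yields names like `CrossEntropyLoss_output_0_GLOBAL`, matching the post-PR output shown below.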

Frameworks tested:

  • MXNet
  • PyTorch
  • Tensorflow KerasHook
  • TensorFlow SessionHook
  • XGBoost

Loss module metrics logged for SM metrics before this PR

cat 62064.json 
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.032386019825935364, "Timestamp": 1575495078851}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078852}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.299854040145874, "Timestamp": 1575495078852}
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.026646751910448074, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.188162088394165, "Timestamp": 1575495078860, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_input_0", "Value": -0.012318896129727364, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.9457279443740845, "Timestamp": 1575495078865, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_input_0", "Value": 0.020270144566893578, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.3983503580093384, "Timestamp": 1575495078872, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_input_0", "Value": 0.14442947506904602, "Timestamp": 1575495078877, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_input_1", "Value": 0.0, "Timestamp": 1575495078877, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.11063049733638763, "Timestamp": 1575495078877, "IterationNumber": 4}

Loss module - metrics logged for SM metrics with this PR

{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.33306884765625, "Timestamp": 1575672948.792166, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.227065086364746, "Timestamp": 1575672948.800318, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 2.0023903846740723, "Timestamp": 1575672948.804405, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 1.5051748752593994, "Timestamp": 1575672948.8086379, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0_GLOBAL", "Value": 0.22368499636650085, "Timestamp": 1575672948.812764, "IterationNumber": 4}

Log for an entire epoch of training (using TRAIN and EVAL modes):
https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jarednielsen
Contributor

Why are we doing this?

@rahul003
Contributor

rahul003 commented Dec 4, 2019

Also linking the previous PR for MXNet #64
which I had approved. But in hindsight this is the wrong approach if you want to save losses by default. Defaulting to output of loss block makes sense for the losses collection, but the input to loss block is a valuable debugging tool. It should be possible to save them.

@rahul003
Contributor

rahul003 commented Dec 4, 2019

@vandanavk Could you implement this using these methods: https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#methods-on-a-collection

@vandanavk
Contributor Author

Also linking the previous PR for MXNet #64
which I had approved. But in hindsight this is the wrong approach if you want to save losses by default. Defaulting to output of loss block makes sense for the losses collection, but the input to loss block is a valuable debugging tool. It should be possible to save them.

Yes, I saw this and assumed it had been discussed and decided. Blocking the input is only required when the loss is written for smexperiments and the like.

@Vikas-kum
Contributor

Can you also update the description section with the content that is emitted to Minerva?

@rahul003
Contributor

rahul003 commented Dec 4, 2019

How are the inputs of the loss block even going to Minerva? That might mean the losses collection is not correctly configured.

@vandanavk
Contributor Author

vandanavk commented Dec 4, 2019

How are the inputs of loss block even going to Minerva? That might mean losses collection is not correctly configured

For PyTorch, inputs and outputs are included in _prepare_collections().

@vandanavk
Contributor Author

RFC @jarednielsen @rahul003 @Vikas-kum: this is how the code could look for functional loss. If input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.

After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

@Vikas-kum
Contributor

Vikas-kum commented Dec 4, 2019

RFC @jarednielsen @rahul003 @Vikas-kum this is how the code could look like for functional loss. if input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.

After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

Thanks. Why is there no iteration number in the first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

What is the iteration number, by the way? Is it step_num? If yes, shouldn't it be 0, 100, 200, etc.?

@vandanavk
Contributor Author

Thanks. Why is there no iteration number in first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

This is because smexperiments had a check like

if IterationNumber:
    set iteration_number

and iteration 0 wasn't getting recorded, since 0 is falsy. I had reported it to Owen, so this may have been fixed in the latest version. I'll get hold of the latest version of the package and test.
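A minimal repro of that kind of check, assuming the behavior described above (this is illustrative, not the actual smexperiments source):

```python
def record_metric(metric, iteration_number=None):
    """Buggy variant: drops iteration 0 because 0 is falsy."""
    entry = {"MetricName": metric}
    if iteration_number:  # False for 0, so step 0 is silently skipped
        entry["IterationNumber"] = iteration_number
    return entry


def record_metric_fixed(metric, iteration_number=None):
    """Fixed variant: compare against None so iteration 0 is recorded."""
    entry = {"MetricName": metric}
    if iteration_number is not None:
        entry["IterationNumber"] = iteration_number
    return entry
```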

@vandanavk
Copy link
Contributor Author

vandanavk commented Dec 4, 2019


This is the latest @Vikas-kum :

cat 70028.json 
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.166755437850952, "Timestamp": 1575496628.148496, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.065483331680298, "Timestamp": 1575496628.156893, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.8483023643493652, "Timestamp": 1575496628.1635468, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.375758409500122, "Timestamp": 1575496628.168522, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.20699751377105713, "Timestamp": 1575496628.173862, "IterationNumber": 4}

tensor_name, tensor_val, sm_metric=True, write_tb=False, write_event=False
)
self.scalar_cache.append(scalar_obj)
if "_input" not in tensor_name:
Contributor

@rahul003 rahul003 Dec 4, 2019

Let's not hardcode it like this. How about

register_loss fn:
   self.collection_manager.get_collection('losses').add_module_tensors(loss_module) 

this by default only saves the outputs
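A rough sketch of that suggestion, using a minimal stand-in Collection class; the inputs/outputs defaults are assumptions for illustration, not smdebug's actual signature:

```python
class Collection:
    """Minimal stand-in for a smdebug-style Collection (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.module_specs = []

    def add_module_tensors(self, module_name, inputs=False, outputs=True):
        # Defaulting to outputs-only matches "this by default only saves
        # the outputs" from the comment above; callers can opt in to inputs.
        self.module_specs.append((module_name, inputs, outputs))


losses = Collection("losses")
losses.add_module_tensors("CrossEntropyLoss")  # outputs only, by default
```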

Contributor Author

@vandanavk vandanavk Dec 4, 2019

So inputs will not be saved by default for loss?

Contributor Author

Also, this alone won't work, because the default regex for the loss collection is [Ll]oss, so input tensors are included. Then we'd have to change add_module_tensors() to overwrite the existing regex.

Contributor

What if the regex for the losses collection is changed to exclude inputs there? Then we won't need to add the "not _input" check for all the scalar tensors.

Contributor

See how the regex for weights excludes the names that contain both "weight" and "gradient", since gradient tensor names are of the format

gradient/weight/x
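For illustration, a hedged sketch of how a losses regex could exclude loss-block inputs with a negative lookahead, in the same spirit as the weights regex described above (the exact pattern and tensor names here are examples, not the repository's actual regex):

```python
import re

# Match [Ll]oss tensor names, but reject any name containing "_input".
LOSS_REGEX = re.compile(r"^(?!.*_input)[A-Za-z]*[Ll]oss.*")

names = [
    "CrossEntropyLoss_output_0",
    "CrossEntropyLoss_input_0",
]
matches = [n for n in names if LOSS_REGEX.match(n)]
```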

Contributor Author

@vandanavk vandanavk Dec 4, 2019

We'll have to do this in register_default_collections() then. Any regex mentioned after this initial register_default_collections() call appends to the existing regex.
If the user explicitly wants input tensors to be included, they can call add_module_tensors() later. If input is explicitly added this way, it will go to SM metrics too.

Contributor

No, change it here itself

self.get(CollectionKeys.LOSSES).include("[Ll]oss")

Contributor

Never mind, that's what you referred to as well. Yeah

Contributor

@rahul003 rahul003 Dec 4, 2019

Isn't the input a non-scalar? How is it going to Minerva? Ah, I see: because of the mean.

@Vikas-kum
Contributor


Are you saving every step? Shouldn't the iteration number be 0, 100, 200 by default? I guess we record the loss every 100 steps by default, right?

@vandanavk
Contributor Author

Are you saving every step ? Shouldn't iteration number be 0,100,200 by default. I guess we are recording loss every 100 steps by default, right?

This output is from running tests/pytorch/test_loss.py::test_register_loss_module.

@Vikas-kum
Contributor

This output is on execution of tests/pytorch/test_loss.py::test_register_loss_module

Thanks for clarifying.
Can you please let me know the result of this test case: training with TRAIN, EVAL, and GLOBAL modes set. What is the output to Minerva?

Let training run for 2 epochs, where 1 epoch is 10 TRAIN steps followed by 10 EVAL steps.

@vandanavk
Contributor Author


@Vikas-kum here's the log with the current code. https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4.
Some issues/improvements that I see: it doesn't differentiate between modes, it doesn't differentiate between epochs, the iteration number can only be an int, and the step number continues across epochs (is this intended behavior?).

@rahul003
Contributor

rahul003 commented Dec 5, 2019

@vandanavk I see that the iteration number is -1 for some values, and also that the same (MetricName, IterationNumber) pair is duplicated a few times.

I see that the iteration number is being set to mode_step, which is causing this duplication. I'm not sure what the right way to address this is; even the global step is not exactly clean, since the graph would then merge metrics across modes. It looks like SM Metrics doesn't have the right abstraction here.

But we should at least not be sending an iteration number of -1. Can you take a look at that?

@Vikas-kum
Contributor

Vikas-kum commented Dec 6, 2019

There are some issues with the current implementation, looking at the results:
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2104947566986084, "Timestamp": 1575587064.571794, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.311326742172241, "Timestamp": 1575587064.579527, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.292689800262451, "Timestamp": 1575587064.589998, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.421755313873291, "Timestamp": 1575587064.608332, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4677681922912598, "Timestamp": 1575587064.615562, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.1791975498199463, "Timestamp": 1575587064.6230772, "IterationNumber": 5}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.246185779571533, "Timestamp": 1575587064.630558, "IterationNumber": 6}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3693621158599854, "Timestamp": 1575587064.639154, "IterationNumber": 7}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.351681709289551, "Timestamp": 1575587064.646904, "IterationNumber": 8}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.289271593093872, "Timestamp": 1575587064.705531, "IterationNumber": -1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3608412742614746, "Timestamp": 1575587064.7141418, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.265277862548828, "Timestamp": 1575587064.717716, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2706689834594727, "Timestamp": 1575587064.721091, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3054704666137695, "Timestamp": 1575587064.724351, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2706704139709473, "Timestamp": 1575587064.728058, "IterationNumber": 4}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2529993057250977, "Timestamp": 1575587064.731631, "IterationNumber": 5}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.279721736907959, "Timestamp": 1575587064.73516, "IterationNumber": 6}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4128708839416504, "Timestamp": 1575587064.738771, "IterationNumber": 7}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2863287925720215, "Timestamp": 1575587064.74294, "IterationNumber": 8}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3826935291290283, "Timestamp": 1575587064.802319, "IterationNumber": 9}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.36856746673584, "Timestamp": 1575587064.81825, "IterationNumber": 10}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.303138494491577, "Timestamp": 1575587064.82403, "IterationNumber": 11}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4107022285461426, "Timestamp": 1575587064.831247, "IterationNumber": 12}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.280378818511963, "Timestamp": 1575587064.838045, "IterationNumber": 13}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.34610915184021, "Timestamp": 1575587064.844379, "IterationNumber": 14}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.201525926589966, "Timestamp": 1575587064.851093, "IterationNumber": 15}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4224019050598145, "Timestamp": 1575587064.85957, "IterationNumber": 16}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3088531494140625, "Timestamp": 1575587064.867857, "IterationNumber": 17}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2803003787994385, "Timestamp": 1575587064.874088, "IterationNumber": 18}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.377769947052002, "Timestamp": 1575587064.929729, "IterationNumber": 9}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.3445422649383545, "Timestamp": 1575587064.939193, "IterationNumber": 10}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.240546703338623, "Timestamp": 1575587064.9428961, "IterationNumber": 11}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.271639347076416, "Timestamp": 1575587064.946501, "IterationNumber": 12}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.297555685043335, "Timestamp": 1575587064.950707, "IterationNumber": 13}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.259658098220825, "Timestamp": 1575587064.955234, "IterationNumber": 14}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.2558112144470215, "Timestamp": 1575587064.9623349, "IterationNumber": 15}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.24918532371521, "Timestamp": 1575587064.971052, "IterationNumber": 16}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.4208264350891113, "Timestamp": 1575587064.974842, "IterationNumber": 17}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.282519817352295, "Timestamp": 1575587064.980673, "IterationNumber": 18}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.368577003479004, "Timestamp": 1575587065.00563, "IterationNumber": 19}

As I suspected, the iteration number is repeated. :)
The other problem is the iteration number of -1.
Both of these issues need to be resolved.

One way is to append the mode to the loss tensor name, so that the metrics can be shown as TRAIN_LOSS, EVAL_LOSS, and GLOBAL_LOSS when emitting to Minerva.

Also, please look at what the -1 iteration number is; it needs to be handled.

@rahul003 @leleamol @jarednielsen I think the above issue will be present in all the frameworks. Can we please make sure we handle this issue across every framework?
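A minimal sketch of the mode-prefix idea; ModeKeys and the naming format here are illustrative assumptions, not the actual smdebug code:

```python
from enum import Enum


class ModeKeys(Enum):
    TRAIN = "TRAIN"
    EVAL = "EVAL"
    GLOBAL = "GLOBAL"


def metric_name(tensor_name, mode):
    # Prefix the mode so TRAIN step 3 and EVAL step 3 no longer
    # collide under the same (MetricName, IterationNumber) pair.
    return f"{mode.value}_{tensor_name}"
```

With this, TRAIN and EVAL losses become distinct time series in Minerva even though their mode-step sequences both start at 0.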

@vandanavk
Contributor Author


@Vikas-kum
This -1 is an issue in the PyTorch hook (and possibly the MXNet hook too). When forward_pre_hook() is called the very first time,

if self.writer is not None:
    self._close_writers()

self.writer is None, so TRAIN mode step -1 is not logged. But once the mode is set to EVAL, the writer is valid, this check passes, and iteration number -1 is written.
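One possible guard, sketched with hypothetical names rather than the actual hook code: skip emitting cached scalars while the step counter is still -1, i.e. before the first step increment.

```python
def flush_scalar_cache(scalar_names, step):
    """Emit one record per cached scalar, but only for valid steps."""
    if step < 0:
        # The hook fired before the first step increment; emitting now
        # would produce the spurious IterationNumber -1 records.
        return []
    return [
        {"MetricName": name, "IterationNumber": step}
        for name in scalar_names
    ]
```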

@Vikas-kum
Contributor

RFC @jarednielsen @rahul003 @Vikas-kum this is how the code could look like for functional loss. if input tensors need to be blocked only for SM metrics, then adding a check before queuing for SM metrics is an option.
After these changes, the output to SM metrics looks something like the PR description above @Vikas-kum

Thanks. Why is there no iteration number in first line? {"MetricName": "CrossEntropyLoss_output_0", "Value": 2.349432945251465, "Timestamp": 1575491421253}

This is because smexperiments had a check

if IterationNumber:
    set itertionnumber

and iteration 0 wasn't getting recorded. I had reported it to Owen, so I think this must have been fixed in the latest version. will get a hold of the latest version of the package and test

This is the latest @Vikas-kum :

cat 70028.json 
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.166755437850952, "Timestamp": 1575496628.148496, "IterationNumber": 0}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 2.065483331680298, "Timestamp": 1575496628.156893, "IterationNumber": 1}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.8483023643493652, "Timestamp": 1575496628.1635468, "IterationNumber": 2}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 1.375758409500122, "Timestamp": 1575496628.168522, "IterationNumber": 3}
{"MetricName": "CrossEntropyLoss_output_0", "Value": 0.20699751377105713, "Timestamp": 1575496628.173862, "IterationNumber": 4}

Are you saving every step? Shouldn't the iteration number be 0, 100, 200 by default? I guess we are recording loss every 100 steps by default, right?

This output is on execution of tests/pytorch/test_loss.py::test_register_loss_module

Thanks for clarifying.
Can you please let me know the result of this test?
Test case:
Training has TRAIN, EVAL, and GLOBAL modes set. What is the output to minerva?
Let training run for 10 steps of train and 10 steps of eval for 2 epochs (1 epoch is 10 steps of train and 10 steps of eval).

@Vikas-kum here's the log with the current code. https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4.
Some issues/improvements that I see: it doesn't differentiate between modes, doesn't differentiate between epochs, the iteration number can only be an int, and the step number continues across epochs (is this intended behavior?).

No, this is a bug. Please refer to the comment here; I suggested a way to handle mode.
#86 (comment)

@vandanavk
Copy link
Contributor Author

@Vikas-kum The output now looks like this https://gist.github.com/vandanavk/4c72140e55042e5bef310e1a5d31c2b4

@vandanavk
Copy link
Contributor Author

@rahul003 the behavior in terms of logging inputs is the following with the code in this PR:

  • Loss input is not logged by default (only the output is). Input can be included with an explicit call to add_module_tensors(), but if this is done, the loss input will be logged to SM metrics too.

The earlier option was:

  • Add a check to exclude all "_input" tensors from SM metrics. In this case, loss inputs and outputs are generally logged but are excluded only from SM metrics.

@rahul003
Copy link
Contributor

rahul003 commented Dec 6, 2019

I think that's not a problem, because the input to the loss block doesn't have a place in the losses collection. We can guide users to add that module to a different collection, say
hook.get_collection('outputs').add_module_tensors(loss_module, inputs=True, outputs=False)

Copy link
Contributor

@rahul003 rahul003 left a comment

So this should work for MXNet as well, right? Can you add a test for MXNet too?
The step number fix can come in a different PR.

This reverts commit 8476c5f.

We don't need to close writers anymore because the mode is stored in the scalar cache.
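A sketch of that idea: each cached scalar carries the mode it was captured in along with its mode step, so write_scalars() can emit the correct mode name and step without closing and reopening writers on every mode switch. The field and function names here are illustrative, not the actual smdebug implementation.

```python
from dataclasses import dataclass

@dataclass
class ScalarCacheEntry:
    name: str
    value: float
    mode: str        # "TRAIN" / "EVAL" / "GLOBAL"
    mode_step: int

def write_scalars(cache):
    # Emit one metric record per cached scalar, with the mode
    # baked into the metric name and the per-mode step number.
    out = []
    for entry in cache:
        out.append({
            "MetricName": f"{entry.name}_{entry.mode}",
            "Value": entry.value,
            "IterationNumber": entry.mode_step,
        })
    return out
```

With this, a TRAIN scalar and an EVAL scalar captured at the same wall-clock time still land in distinct metric series with their own step counters.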
@rahul003
Copy link
Contributor

Is this ready now?

@vandanavk
Copy link
Contributor Author

A couple of things:

  1. In the end, can you run a simple training with SageMaker and verify what metrics are getting emitted to rhinestone?
    I guess you will have to build your image (we have scripts to do that) and provide that custom image in the SM API.
  2. In the description, please check which frameworks have been tested explicitly.

@Vikas-kum
For 1., tested on SageMaker. For a test training script, the following metrics were emitted:
scalar/pt_num_steps_GLOBAL
CrossEntropyLoss_output_0_TRAIN
CrossEntropyLoss_output_0_EVAL

@vandanavk
Copy link
Contributor Author

@rahul003 @Vikas-kum All review comments addressed.

@Vikas-kum
Copy link
Contributor

Thanks. I see XGBoost is not tested. Can you run and check zero-code-change XGBoost?

@vandanavk
Copy link
Contributor Author

vandanavk commented Dec 27, 2019

@Vikas-kum this is the output https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5 when tests/xgboost/test_hook.py is executed (this is the output of all 13 tests executed)

@Vikas-kum
Copy link
Contributor

Why are there repetitions here? train-rmse_GLOBAL is repeated for the same iteration. Also, the timestamp ordering is weird: 1577487854.368032 comes after 1577487854.3779159.

{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.3779159, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.3785, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.367563, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.368032, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.352, "IterationNumber": 0}
{"MetricName": "test-rmse_GLOBAL", "Value": 0.490306, "Timestamp": 1577487854.352299, "IterationNumber": 0}
{"MetricName": "train-rmse_GLOBAL", "Value": 0.389416, "Timestamp": 1577487854.341634, "IterationNumber": 0}

@Vikas-kum Vikas-kum closed this Dec 28, 2019
@Vikas-kum Vikas-kum reopened this Dec 28, 2019
Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Vandana confirmed that the repetitions are from different test cases. https://gist.github.com/vandanavk/bdc4cb43d11f033b518d96b7d8a4def5

@Vikas-kum Vikas-kum merged commit cfcfd7a into master Dec 28, 2019
@Vikas-kum Vikas-kum deleted the loss_input branch December 28, 2019 21:02
ddavydenko pushed a commit that referenced this pull request Jan 2, 2020
* Skip logging the input tensors to the loss block

* Add loss inputs for PT functional loss

* append mode name

* Write correct mode for each scalar in write_scalars()
* Increment global mode step number irrespective of which mode it is

(cherry picked from commit cfcfd7a)

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants