
AdvancedProfiler error #1236

Closed
dumitrescustefan opened this issue Mar 25, 2020 · 9 comments · Fixed by #1267
Labels: bug (Something isn't working), help wanted (Open to be worked on)
@dumitrescustefan

Hi, as others have pointed out, the Profiler doesn't seem to work (it prints nothing), and trying out the AdvancedProfiler as in https://pytorch-lightning.readthedocs.io/en/latest/profiler.html like:

from pytorch_lightning.profiler import AdvancedProfiler

profiler = AdvancedProfiler(output_filename="prof.txt")
trainer = Trainer(profiler=profiler)  # plus other params here

gives me the following error:

Validation sanity check: 50it [00:00, 212.11it/s]
Traceback (most recent call last):
  File "/Users/sdumitre/work/style/training.py", line 177, in <module>
    main(hparams)
  File "/Users/sdumitre/work/style/training.py", line 77, in main
    trainer.fit(model)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in fit
    self.run_pretrain_routine(model)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 810, in run_pretrain_routine
    _, _, _, callback_metrics, _ = self.process_output(eval_results)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 117, in process_output
    callback_metrics[k] = v.item()
ValueError: only one element tensors can be converted to Python scalars
                                                 
Process finished with exit code 1

Any pointers?
My env: torch 1.4 installed with pip, Python 3.7, no GPU, on macOS.

Thanks for the great lib you're developing!

@dumitrescustefan added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 25, 2020
@williamFalcon (Contributor)

@jeremyjordan

@jeremyjordan (Contributor) commented Mar 26, 2020

Hi @dumitrescustefan thanks for submitting this issue.

as others have pointed out, the Profiler doesn't seem to work

If there are other GitHub issues, please reference them here.

the Profiler doesn't seem to work (it prints nothing)

Did you configure logging? I usually do this in the root __init__.py of my projects.

e.g.

import logging
logging.basicConfig(level=logging.INFO)

Finally, this seems unrelated to the AdvancedProfiler.

 File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 117, in process_output
    callback_metrics[k] = v.item()
ValueError: only one element tensors can be converted to Python scalars
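That error can be reproduced in isolation: `.item()` only converts single-element tensors to Python scalars. A minimal sketch of the same behaviour using NumPy's analogous `.item()` (an illustration, not Lightning code):

```python
import numpy as np

# .item() converts a single-element array to a Python scalar ...
scalar = np.array([3.0]).item()

# ... but raises ValueError for anything larger, analogous to torch's
# "only one element tensors can be converted to Python scalars"
try:
    np.array([1.0, 2.0]).item()
    raised = False
except ValueError:
    raised = True
```

So any metric returned from a step has to be reduced to a single element before the logger calls `.item()` on it.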

Can you show what your training_step and validation_step code looks like?

@Ir1d (Contributor) commented Mar 26, 2020

By the way, I got a ValueError too, but not in the profiler; I was using loguru instead of logging.

@dumitrescustefan (Author)

@jeremyjordan thanks for the tips! I put logging into `__init__.py` and tried with the "basic" profiler again; now I get the same error.

Here are the train/val_steps:

    def training_step(self, batch: tuple, batch_nb: int, *args, **kwargs) -> dict:
        x_tensor, x_lengths, y_tensor = batch

        model_out = self.forward(x_tensor, x_lengths)
        loss_val = self.loss(model_out, y_tensor)

        tqdm_dict = {"train_loss": loss_val}
        output = OrderedDict(
            {"loss": loss_val, "progress_bar": tqdm_dict, "log": tqdm_dict}
        )
        return output

    def validation_step(self, batch: tuple, batch_nb: int, *args, **kwargs) -> dict:
        x_tensor, x_lengths, y_tensor = batch

        model_out = self.forward(x_tensor, x_lengths)
        loss_val = self.loss(model_out, y_tensor)

        output = OrderedDict({"val_loss": loss_val})

        return output

All the code (except the forward and the model params) is copy-pasted from a lightning tutorial. Without the profiler everything seems to work okay. The trainer is initialised with:

trainer = Trainer(
        logger=setup_testube_logger(),
        checkpoint_callback=True,
        early_stop_callback=early_stop_callback,
        default_save_path="experiments/",
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        use_amp=hparams.use_16bit,
        max_epochs=hparams.max_epochs,
        min_epochs=hparams.min_epochs,
        accumulate_grad_batches=hparams.accumulate_grad_batches,
        log_gpu_memory=hparams.log_gpu_memory,
        val_percent_check=hparams.val_percent_check,
        profiler=True  # <-- this is what I added
    )

IMHO, shouldn't the profiler be agnostic to what I do in my code? The built-in profiler is actually one of the main features that made me try Lightning, and I would be most grateful to have it working :) Please tell me what piece of code I can provide.

The model aims to predict a set of n values (floats) from a number of sentences (embedded with BPE as ints). A sentence-level RNN encodes each sentence, then a "document"-level RNN runs over the sentence encodings; this feeds into a hidden->n linear layer, and the loss is MSELoss(). This is a baseline I created and want to build on, but I need to get past these initial errors. I don't know if this info is useful; I can provide all the code if required.

Thanks!

@dumitrescustefan (Author)

My bad. I changed two things at once: I added the profiler, and I also set reduction='none' in MSELoss but forgot to do the reduction myself in my loss function. The loss therefore returned a 2D tensor instead of a scalar, which broke pytorch_lightning/trainer/logging.py and led me to believe it was a logging error. I'm usually careful not to change more than a single item per run, but hey, blame the new lib I'm learning to use :)
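A pure-Python sketch of the pitfall (a hypothetical `mse` helper, not the actual model code): with `reduction='none'` the loss comes back elementwise, and the caller must reduce it to a scalar before returning it to Lightning.

```python
# Hypothetical helper illustrating the reduction pitfall.
def mse(pred, target, reduction="mean"):
    errs = [(p - t) ** 2 for p, t in zip(pred, target)]
    if reduction == "none":
        # Elementwise losses: a sequence, not a scalar.
        # Returning this from a step is what breaks .item() in logging.py.
        return errs
    # Reduce to a single scalar, which is what the logger expects.
    return sum(errs) / len(errs)

per_element = mse([1.0, 2.0], [0.0, 0.0], reduction="none")  # [1.0, 4.0]
scalar_loss = mse([1.0, 2.0], [0.0, 0.0])                    # 2.5
```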

I'm not closing this issue because, even though the AdvancedProfiler now works (it dumps to a file), the basic one still doesn't print anything on screen, even after setting level=DEBUG.

If logging is required, maybe the docs could be updated ( https://pytorch-lightning.readthedocs.io/en/latest/profiler.html ).

Should I do anything more besides adding logging to init?

import logging
logging.basicConfig(level=logging.DEBUG)

and in the Trainer object:

profiler=True

Thanks, and sorry for the time taken on my mistake!

@jeremyjordan (Contributor)

IMHO, shouldn't the profiler be agnostic to what I do in the code?

yes! that's why i was confused about your error :)

then I also added reduction='none' (in MSELoss) and in the loss function I forgot to do the reduction myself; thus the loss function returned a 2D tensor instead of a scalar, which broke the pytorch_lightning/trainer/logging.py

but this makes perfect sense


Actually the in-built profiler is one of the main features that made me try out Lightning. I would be most grateful to have it work :)

that's great to hear! definitely want to help you get this figured out.

i tried reproducing your error but it's working for me - check out this colab notebook
https://colab.research.google.com/drive/1wSwMd5xGb36zdNS-5yk6ptHNEXqa1IMi

could you perhaps share a colab notebook where this is failing?

btw I agree we should add a note to the documentation about enabling the logger; I believe we used to configure logging within the library, but that was removed at one point.

@Borda (Member) commented Mar 27, 2020

A similar question about the missing profiling table was also raised by @dumitrescustefan.
I would suggest also adding a profiler.summary() method that returns the stats as a string,
so we avoid this confusion with logging init and a user can simply call:

print(trainer.profiler.summary())
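A minimal sketch of what such a `summary()` could look like (a hypothetical `TinyProfiler`, not the actual Lightning implementation): record wall-clock durations per action and render them as a string on demand, independent of any logging configuration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TinyProfiler:
    """Hypothetical profiler: records wall-clock durations per action."""

    def __init__(self):
        self.recorded = defaultdict(list)

    @contextmanager
    def profile(self, action):
        # Time the enclosed block and store the duration under `action`.
        start = time.monotonic()
        try:
            yield
        finally:
            self.recorded[action].append(time.monotonic() - start)

    def summary(self):
        # Render the collected stats as a plain string (no logging needed).
        lines = ["Action | Mean (s) | Calls"]
        for action, durations in sorted(self.recorded.items()):
            mean = sum(durations) / len(durations)
            lines.append(f"{action} | {mean:.6f} | {len(durations)}")
        return "\n".join(lines)

profiler = TinyProfiler()
for _ in range(3):
    with profiler.profile("training_step"):
        time.sleep(0.001)
report = profiler.summary()
```

With this shape, `print(profiler.summary())` works whether or not the application ever configured logging.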

@williamFalcon @jeremyjordan ^^

@williamFalcon (Contributor) commented Mar 27, 2020

  1. If logging is required, let's just auto-configure it for the user? Not doing so goes against our principles.

  2. I think a problem with the logger is that it prints when training completes. If you stop training early (early stopping), or just Ctrl-C, I suspect you won't see it?

  3. Adding that print suggestion is, I think, yet another thing the user has to remember, which goes against our principles haha.

If (2) is true, then let's just make a limit of 2 epochs when the profiler is enabled?

Another option is to always run the basic profiler for the sanity check. Then profiler=True would run it for training as well. But I think the sanity-check profiler won't reflect the true speed because it doesn't backprop?

@jeremyjordan can we prioritize making these fixes as this is a key feature?

@jeremyjordan (Contributor)

@williamFalcon

if logging is required, let’s just auto configure it for the user? not doing so goes against our principles.

At the time the feature was merged, we were configuring logging. I went back to the branch and verified that it worked out of the box; a later PR (#767) removed this.

i think a problem with the logger is that i think it prints when training completes. if you stop training early (early stopping), or just ctrl c, i suspect you won’t see it?

Actually, we still do show it :) in both cases (early stopping + keyboard interrupt).

can we prioritize making these fixes as this is a key feature?

Yeah, for sure. As I understand it, this should just involve adding the logging config back in.
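A sketch of what "adding the logging config back" might look like on the library side (an assumption about the fix, not the actual PR): attach a default handler to the library's own logger so profiler output is visible without any user setup.

```python
import logging

# Hypothetical library-side default: give the library's logger a handler
# so profiler summaries are visible even if the app never calls basicConfig().
lib_logger = logging.getLogger("pytorch_lightning")
lib_logger.setLevel(logging.INFO)
if not lib_logger.handlers:
    lib_logger.addHandler(logging.StreamHandler())
```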
