
AdvancedProfiler error #1236

Closed
dumitrescustefan opened this issue Mar 25, 2020 · 9 comments · Fixed by #1267
Labels: bug (Something isn't working), help wanted (Open to be worked on)
@dumitrescustefan

Hi, as others have pointed out, the Profiler doesn't seem to work (it prints nothing), and trying out the AdvancedProfiler as in https://pytorch-lightning.readthedocs.io/en/latest/profiler.html like:

from pytorch_lightning.profiler import AdvancedProfiler

profiler = AdvancedProfiler(output_filename="prof.txt")
trainer = Trainer(profiler=profiler)  # plus other params here

gives me the following error:

Validation sanity check: 50it [00:00, 212.11it/s]
Traceback (most recent call last):
  File "/Users/sdumitre/work/style/training.py", line 177, in <module>
    main(hparams)
  File "/Users/sdumitre/work/style/training.py", line 77, in main
    trainer.fit(model)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in fit
    self.run_pretrain_routine(model)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 810, in run_pretrain_routine
    _, _, _, callback_metrics, _ = self.process_output(eval_results)
  File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 117, in process_output
    callback_metrics[k] = v.item()
ValueError: only one element tensors can be converted to Python scalars
                                                 
Process finished with exit code 1

Any pointers?
My env: torch 1.4 installed with pip, Python 3.7, no GPU, on macOS.

Thanks for the great lib you're developing!

@dumitrescustefan added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 25, 2020
@williamFalcon (Contributor)

@jeremyjordan

@jeremyjordan (Contributor) commented Mar 26, 2020

Hi @dumitrescustefan thanks for submitting this issue.

as others have pointed out, the Profiler doesn't seem to work

If there are other GitHub issues, please reference them here.

the Profiler doesn't seem to work (it prints nothing)

Did you configure logging? I usually do this in the root __init__.py of my projects.

e.g.

import logging
logging.basicConfig(level=logging.INFO)

Finally, this seems unrelated to the AdvancedProfiler.

 File "/Users/sdumitre/virtual/p3/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 117, in process_output
    callback_metrics[k] = v.item()
ValueError: only one element tensors can be converted to Python scalars
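That error can be reproduced in isolation: `.item()` only converts single-element tensors to Python scalars. A minimal sketch of the same behaviour using NumPy's analogous `.item()` (an illustration, not Lightning code):

```python
import numpy as np

# .item() converts a single-element array to a Python scalar ...
scalar = np.array([3.0]).item()

# ... but raises ValueError for anything larger, analogous to torch's
# "only one element tensors can be converted to Python scalars"
try:
    np.array([1.0, 2.0]).item()
    raised = False
except ValueError:
    raised = True
```

So any metric returned from a step has to be reduced to a single element before the logger calls `.item()` on it.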

Can you show what your training_step and validation_step code looks like?

@Ir1d (Contributor) commented Mar 26, 2020

By the way, I got a ValueError too, but not in the profiler; I was using loguru instead of logging.

@dumitrescustefan (Author)

@jeremyjordan thanks for the tips! I put logging into `__init__.py` and tried with the "basic" profiler again; now I get the same error.

Here are the train/val_steps:

    def training_step(self, batch: tuple, batch_nb: int, *args, **kwargs) -> dict:
        x_tensor, x_lengths, y_tensor = batch

        model_out = self.forward(x_tensor, x_lengths)
        loss_val = self.loss(model_out, y_tensor)

        tqdm_dict = {"train_loss": loss_val}
        output = OrderedDict(
            {"loss": loss_val, "progress_bar": tqdm_dict, "log": tqdm_dict}
        )
        return output

    def validation_step(self, batch: tuple, batch_nb: int, *args, **kwargs) -> dict:
        x_tensor, x_lengths, y_tensor = batch

        model_out = self.forward(x_tensor, x_lengths)
        loss_val = self.loss(model_out, y_tensor)

        output = OrderedDict({"val_loss": loss_val})

        return output

All the code (except the forward and the model params) is copy-pasted from a lightning tutorial. Without the profiler everything seems to work okay. The trainer is initialised with:

trainer = Trainer(
        logger=setup_testube_logger(),
        checkpoint_callback=True,
        early_stop_callback=early_stop_callback,
        default_save_path="experiments/",
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        use_amp=hparams.use_16bit,
        max_epochs=hparams.max_epochs,
        min_epochs=hparams.min_epochs,
        accumulate_grad_batches=hparams.accumulate_grad_batches,
        log_gpu_memory=hparams.log_gpu_memory,
        val_percent_check=hparams.val_percent_check,
        profiler=True  # <-- this is what I added
    )

IMHO, shouldn't the profiler be agnostic to what I do in my code? The built-in profiler is actually one of the main features that made me try Lightning, and I would be most grateful to have it working :) Please tell me what piece of code I can provide.

The model aims to predict a set of n values (floats) from a number of sentences (embedded with BPE as ints). A sentence-level RNN encodes each sentence, then a "document"-level RNN runs over the sentence encodings; this feeds into a hidden->n linear layer, and the loss is MSELoss(). This is a baseline I created and want to build on, but I need to get past these initial errors. I don't know if this info is useful; I can provide all the code if required.

Thanks!

@dumitrescustefan (Author)

My bad. I changed two things at once: I added the profiler, and I also set reduction='none' in MSELoss but forgot to do the reduction myself in my loss function. The loss therefore returned a 2D tensor instead of a scalar, which broke pytorch_lightning/trainer/logging.py and led me to believe it was a logging error. I'm usually careful not to change more than a single item per run, but hey, blame the new lib I'm learning to use :)
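A pure-Python sketch of the pitfall (a hypothetical `mse` helper, not the actual model code): with `reduction='none'` the loss comes back elementwise, and the caller must reduce it to a scalar before returning it to Lightning.

```python
# Hypothetical helper illustrating the reduction pitfall.
def mse(pred, target, reduction="mean"):
    errs = [(p - t) ** 2 for p, t in zip(pred, target)]
    if reduction == "none":
        # Elementwise losses: a sequence, not a scalar.
        # Returning this from a step is what breaks .item() in logging.py.
        return errs
    # Reduce to a single scalar, which is what the logger expects.
    return sum(errs) / len(errs)

per_element = mse([1.0, 2.0], [0.0, 0.0], reduction="none")  # [1.0, 4.0]
scalar_loss = mse([1.0, 2.0], [0.0, 0.0])                    # 2.5
```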

I'm not closing this issue because, even though the AdvancedProfiler now works (it dumps to a file), the basic one still doesn't print anything on screen, even after setting level=DEBUG.

If logging is required, maybe the docs could be updated ( https://pytorch-lightning.readthedocs.io/en/latest/profiler.html ).

Should I do anything more besides adding logging to init?

import logging
logging.basicConfig(level=logging.DEBUG)

and in the Trainer object:

profiler=True

Thanks, and sorry for the time taken on my mistake!

@jeremyjordan (Contributor)

IMHO, shouldn't the profiler be agnostic to what I do in the code?

yes! that's why i was confused about your error :)

then I also added reduction='none' (in MSELoss) and in the loss function I forgot to do the reduction myself; thus the loss function returned a 2D tensor instead of a scalar, which broke the pytorch_lightning/trainer/logging.py

but this makes perfect sense


Actually the in-built profiler is one of the main features that made me try out Lightning. I would be most grateful to have it work :)

that's great to hear! definitely want to help you get this figured out.

i tried reproducing your error but it's working for me - check out this colab notebook
https://colab.research.google.com/drive/1wSwMd5xGb36zdNS-5yk6ptHNEXqa1IMi

could you perhaps share a colab notebook where this is failing?

btw I agree we should add a note to the documentation about enabling the logger; I believe we used to configure logging within the library, but that was removed at one point.

@Borda (Member) commented Mar 27, 2020

A similar question about the missing profiling table was also raised by @dumitrescustefan.
I would suggest also adding a profiler.summary() method that returns the stats as a string,
so we avoid this confusion with logging init and a user can simply call:

print(trainer.profiler.summary())
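A minimal sketch of what such a `summary()` could look like (a hypothetical `TinyProfiler`, not the actual Lightning implementation): record wall-clock durations per action and render them as a string on demand, independent of any logging configuration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TinyProfiler:
    """Hypothetical profiler: records wall-clock durations per action."""

    def __init__(self):
        self.recorded = defaultdict(list)

    @contextmanager
    def profile(self, action):
        # Time the enclosed block and store the duration under `action`.
        start = time.monotonic()
        try:
            yield
        finally:
            self.recorded[action].append(time.monotonic() - start)

    def summary(self):
        # Render the collected stats as a plain string (no logging needed).
        lines = ["Action | Mean (s) | Calls"]
        for action, durations in sorted(self.recorded.items()):
            mean = sum(durations) / len(durations)
            lines.append(f"{action} | {mean:.6f} | {len(durations)}")
        return "\n".join(lines)

profiler = TinyProfiler()
for _ in range(3):
    with profiler.profile("training_step"):
        time.sleep(0.001)
report = profiler.summary()
```

With this shape, `print(profiler.summary())` works whether or not the application ever configured logging.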

@williamFalcon @jeremyjordan ^^

@williamFalcon (Contributor) commented Mar 27, 2020

  1. If logging is required, let's just auto-configure it for the user? Not doing so goes against our principles.

  2. I think a problem with the logger is that it prints when training completes. If you stop training early (early stopping), or just Ctrl-C, I suspect you won't see it?

  3. Adding that print suggestion is, I think, yet another thing the user has to remember, which goes against our principles haha.

If (2) is true, then let's just make a limit of 2 epochs when the profiler is enabled?

Another option is to always run the basic profiler for the sanity check. Then profiler=True would run it for training as well. But I think the sanity-check profiler won't reflect the true speed because it doesn't backprop?

@jeremyjordan can we prioritize making these fixes as this is a key feature?

@jeremyjordan (Contributor)

@williamFalcon

if logging is required, let’s just auto configure it for the user? not doing so goes against our principles.

At the time the feature was merged, we were configuring logging. I went back to the branch and verified that it worked out of the box; a later PR (#767) removed this.

i think a problem with the logger is that i think it prints when training completes. if you stop training early (early stopping), or just ctrl c, i suspect you won’t see it?

Actually, we still do show it :) in both cases (early stopping + keyboard interrupt).

can we prioritize making these fixes as this is a key feature?

Yeah, for sure. As I understand it, this should just involve adding the logging config back in.
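A sketch of what "adding the logging config back" might look like on the library side (an assumption about the fix, not the actual PR): attach a default handler to the library's own logger so profiler output is visible without any user setup.

```python
import logging

# Hypothetical library-side default: give the library's logger a handler
# so profiler summaries are visible even if the app never calls basicConfig().
lib_logger = logging.getLogger("pytorch_lightning")
lib_logger.setLevel(logging.INFO)
if not lib_logger.handlers:
    lib_logger.addHandler(logging.StreamHandler())
```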
