Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use SSLOnlineEvaluator callback #309

Closed
FrankXinqi opened this issue Oct 25, 2020 · 13 comments · Fixed by #329
Closed

How to use SSLOnlineEvaluator callback #309

FrankXinqi opened this issue Oct 25, 2020 · 13 comments · Fixed by #329
Assignees
Labels
bug Something isn't working question Further information is requested
Milestone

Comments

@FrankXinqi
Copy link

❓ Questions and Help

Before asking:

Lightning Forum Posted Question: https://forums.pytorchlightning.ai/t/how-to-use-sslonlineevaluator/309

What is your question?

I study the code provided here SSL SimCLR on colab, and implemented a similar code.

I have modified the datamodule to load mine, and I can run the code. However, if I use SSLOnlineEvaluator. I cannot make it work. If I specify the callbacks for model, the error shows:
TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given

Code

I have the following code:

def to_device(batch, device):
    (img1, _), y = batch
    img1 = img1.to(device)
    y = y.to(device)
    return img1, y

online_finetuner = SSLOnlineEvaluator(z_dim=2048 * 2 * 2, num_classes = 7)
online_finetuner.to_device = to_device

lr_logger = LearningRateMonitor()

callbacks = [online_finetuner, lr_logger]

# pick data
rafdb_height=100
batch_size=64

# data
dm = RAFDBDataModule(num_workers=0)
dm.train_transforms = SimCLRTrainDataTransform(input_height=rafdb_height)
dm.val_transforms = SimCLREvalDataTransform(input_height=rafdb_height)
dm.test_transforms = SimCLREvalDataTransform(input_height=rafdb_height)

# model
model = SimCLR(num_samples=dm.num_samples, batch_size=batch_size)

# fit
trainer = pl.Trainer(gpus=1, max_epochs=20, callbacks=callbacks)
trainer.fit(model, dm)

Full Output information:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s] Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/train_SimCLR.py", line 236, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given
Epoch 0:   0%|          | 0/383 [00:02<?, ?it/s]

Process finished with exit code 1

What have you tried?

Try to put OnlineEvaluator as the only callback, but it does not work as well.

What's your environment?

  • OS: Win10
  • Packaging [e.g. pip, conda]
  • Version:
    python 3.8.6
    pytorch 1.6.0
    pytorch-lightning 1.0.2 py_0 conda-forge
    pytorch-lightning-bolts 0.2.5 pypi_0 pypi
@awaelchli awaelchli transferred this issue from Lightning-AI/pytorch-lightning Oct 25, 2020
@awaelchli
Copy link
Contributor

I believe this PR #277 fixed this

@awaelchli awaelchli added fix fixing issues... question Further information is requested and removed fix fixing issues... labels Oct 25, 2020
@FrankXinqi
Copy link
Author

Cool. I see that an 'output' arg was added. I will install the latest version from GitHub.

Just an offset question: If I do not want to install the blots package again by
pip install git+https://github.com/PytorchLightning/pytorch-lightning-bolts.git@master --upgrade, instead I download the package from GitHub, put it in my project folder, and just call any function by its relative path. Would this work with this kind of big project?

@awaelchli
Copy link
Contributor

I never tried that but you should be able to make it work. You probably have to add it to the python path or similar to have the imports working.
We don't recommend this, because you won't get any updates and have to manually repeat the process if you need to get new version.

@FrankXinqi
Copy link
Author

FrankXinqi commented Oct 26, 2020

Cool. There is a new bug, when I upgrade to the master version. The network maltiplication has a dimension not match as

  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

The output information

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s] Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/train_SimCLR.py", line 276, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 85, in on_train_batch_end
    mlp_preds = pl_module.non_linear_evaluator(representations)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\models\self_supervised\evaluator.py", line 30, in forward
    logits = self.block_forward(x)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s]

This bug only happens if I add callback. I works fine for a few time(not sure if it is step, iterations ...), but then it gives input.shape=32,100352, weight.t().shape=8192,1024.

@FrankXinqi
Copy link
Author

FrankXinqi commented Oct 26, 2020

The above mentioned issue is with my customed datamodule. Here is a different issue with official CIFAR10.

The issue seems to relate with CUDA, not sure if the problem is on my side. Just to make it clear, all codes run well without callbacks passed in.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Files already downloaded and verified
Files already downloaded and verified

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/1562 [00:00<?, ?it/s] C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/example/SimCLR.py", line 36, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 95, in on_train_batch_end
    acc = accuracy(mlp_preds, y, num_classes=trainer.datamodule.num_classes)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 286, in accuracy
    tps, fps, tns, fns, sups = stat_scores_multiple_classes(
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 197, in stat_scores_multiple_classes
    num_classes = get_num_classes(pred=pred, target=target, num_classes=num_classes)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 96, in get_num_classes
    num_target_classes = int(target.max().detach().item() + 1)
RuntimeError: CUDA error: device-side assert triggered
Epoch 0:   0%|          | 0/1562 [00:00<?, ?it/s]

@awaelchli awaelchli added the fix fixing issues... label Nov 1, 2020
@ananyahjha93 ananyahjha93 mentioned this issue Nov 2, 2020
8 tasks
@ananyahjha93 ananyahjha93 self-assigned this Nov 3, 2020
@ananyahjha93
Copy link
Contributor

@FrankXinqi what is the number of classes being passed to SSLOnlineEvaluator. What looks to me is the case when num_classes being passed is different than the class labels found in target tensor.

@ananyahjha93
Copy link
Contributor

@FrankXinqi for the first issue, I am merging a fix which should solve this issue. The output of resnet now is a flattened 2048 dimensional feature vector which is taken by the online evaluation callback.

@FrankXinqi
Copy link
Author

@FrankXinqi what is the number of classes being passed to SSLOnlineEvaluator. What looks to me is the case when num_classes being passed is different than the class labels found in target tensor.

My cutomized dataset has 7 output classes. For CIFAR10, it put is as 10. However, neither my own dataset or pl CIFAR10 do not work.

@FrankXinqi
Copy link
Author

@FrankXinqi for the first issue, I am merging a fix which should solve this issue. The output of resnet now is a flattened 2048 dimensional feature vector which is taken by the online evaluation callback.

Many thanks. For the first issue, do you mean this issue "RuntimeError: mat1 dim 1 must match mat2 dim 0"? If so, it may take some time to merge your PR to the master branch. Would you mind pointing out the solution, so I can fix it locally?

Thx.

@ananyahjha93
Copy link
Contributor

@FrankXinqi changed quite a few things around, like a whole revamp of simclr, difficult to point to a single change.

#329

@FrankXinqi
Copy link
Author

@FrankXinqi changed quite a few things around, like a whole revamp of simclr, difficult to point to a single change.

#329

Thanks. Seems like there was a merge 14 days ago, but not recently. Wanna to know if I reinstall the lastest one from Github, will the problem be fixed?

@ananyahjha93
Copy link
Contributor

@FrankXinqi

https://github.com/PyTorchLightning/pytorch-lightning-bolts/tree/fix/simclr

This is the branch that will be merged. The PR will close this issue because we have removed all the complexity of SSLOnlineEvaluator callback. Feel free to open new issues based on the new code base if something doesn't work still.

@FrankXinqi
Copy link
Author

Thanks. I will try it later and response to this issue still, if there are other bugs met.

@Borda Borda added this to the v0.3 milestone Jan 18, 2021
@Borda Borda added bug Something isn't working and removed fix fixing issues... labels Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants