Not saving ckpt.tar.gz checkpoint #371

Open
IsauraMaria96 opened this issue Jul 4, 2024 · 8 comments

@IsauraMaria96

Hi,

Thanks for the great tool. I recently installed CellBender on an Ubuntu server, and I've been having a problem where the ckpt.tar.gz checkpoint is not saved, so the tool is unable to complete the process. Has anyone else had this problem? Thanks a lot.

Full log is attached: Error.log

System description:

  • Model: Dell Inc. Precision 5860
  • RAM: 128.0 GiB
  • CPU: Intel Xeon w5-2465X x32
  • OS: Ubuntu 22.04.4 LTS

Log:
cellbender:remove-background: Command:
cellbender remove-background --cuda --input /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5 --output /home/neurofisiologia/DatosRefinados.h5
cellbender:remove-background: CellBender 0.3.2
cellbender:remove-background: (Workflow hash 346ca8efb8)
cellbender:remove-background: 2024-07-03 09:51:30
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 38606 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 33572 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 3741
cellbender:remove-background: Prior on counts for empty droplets is 295
cellbender:remove-background: Excluding 8942 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 24630 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 147
cellbender:remove-background: Using 2575 probable cell barcodes, plus an additional 10272 barcodes, and 71346 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmprc18nole
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001] average training loss: 6661.9639
cellbender:remove-background: [epoch 002] average training loss: 6034.6147 (3.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 114 epochs
cellbender:remove-background: [epoch 003] average training loss: 5427.5748
cellbender:remove-background: [epoch 004] average training loss: 5108.2541
cellbender:remove-background: [epoch 005] average training loss: 4923.2049
cellbender:remove-background: [epoch 005] average test loss: 4961.6450
cellbender:remove-background: [epoch 006] average training loss: 4714.6100
cellbender:remove-background: [epoch 007] average training loss: 4658.0975
cellbender:remove-background: [epoch 008] average training loss: 4686.5494
cellbender:remove-background: [epoch 009] average training loss: 4636.7682
cellbender:remove-background: [epoch 010] average training loss: 4599.6784
cellbender:remove-background: [epoch 010] average test loss: 4674.5524
cellbender:remove-background: [epoch 011] average training loss: 4629.7057
cellbender:remove-background: [epoch 012] average training loss: 4552.1350
cellbender:remove-background: [epoch 013] average training loss: 4496.8647
cellbender:remove-background: [epoch 014] average training loss: 4308.9377
cellbender:remove-background: [epoch 015] average training loss: 4275.1747
cellbender:remove-background: [epoch 015] average test loss: 4324.3197
cellbender:remove-background: [epoch 016] average training loss: 4261.2428
cellbender:remove-background: [epoch 017] average training loss: 4251.0613
cellbender:remove-background: [epoch 018] average training loss: 4228.1749
cellbender:remove-background: [epoch 019] average training loss: 4206.0814
cellbender:remove-background: [epoch 020] average training loss: 4197.3849
cellbender:remove-background: [epoch 020] average test loss: 4191.5520
cellbender:remove-background: [epoch 021] average training loss: 4190.3577
cellbender:remove-background: [epoch 022] average training loss: 4154.5904
cellbender:remove-background: [epoch 023] average training loss: 4119.1000
cellbender:remove-background: [epoch 024] average training loss: 4101.0069
cellbender:remove-background: [epoch 025] average training loss: 4077.4471
cellbender:remove-background: [epoch 025] average test loss: 4076.8579
cellbender:remove-background: [epoch 026] average training loss: 4079.1548
cellbender:remove-background: [epoch 027] average training loss: 4060.0420
cellbender:remove-background: [epoch 028] average training loss: 4041.2950
cellbender:remove-background: [epoch 029] average training loss: 4023.0368
cellbender:remove-background: [epoch 030] average training loss: 4001.7430
cellbender:remove-background: [epoch 030] average test loss: 3975.9369
cellbender:remove-background: [epoch 031] average training loss: 3994.5689
cellbender:remove-background: [epoch 032] average training loss: 3992.0950
cellbender:remove-background: [epoch 033] average training loss: 3986.7607
cellbender:remove-background: [epoch 034] average training loss: 3997.4167
cellbender:remove-background: [epoch 035] average training loss: 3991.3141
cellbender:remove-background: [epoch 035] average test loss: 3993.9262
cellbender:remove-background: [epoch 036] average training loss: 3998.2393
cellbender:remove-background: [epoch 037] average training loss: 3989.8854
cellbender:remove-background: [epoch 038] average training loss: 3982.2416
cellbender:remove-background: [epoch 039] average training loss: 3980.3234
cellbender:remove-background: [epoch 040] average training loss: 3984.4739
cellbender:remove-background: [epoch 040] average test loss: 3973.4658
cellbender:remove-background: [epoch 041] average training loss: 3974.9065
cellbender:remove-background: [epoch 042] average training loss: 3984.2641
cellbender:remove-background: [epoch 043] average training loss: 3975.1879
cellbender:remove-background: [epoch 044] average training loss: 3971.4374
cellbender:remove-background: [epoch 045] average training loss: 3974.2532
cellbender:remove-background: [epoch 045] average test loss: 3950.0547
cellbender:remove-background: [epoch 046] average training loss: 3970.9828
cellbender:remove-background: [epoch 047] average training loss: 3964.1729
cellbender:remove-background: [epoch 048] average training loss: 3962.0764
cellbender:remove-background: [epoch 049] average training loss: 3971.4048
cellbender:remove-background: [epoch 050] average training loss: 3970.0651
cellbender:remove-background: [epoch 050] average test loss: 3958.7704
cellbender:remove-background: [epoch 051] average training loss: 3973.9497
cellbender:remove-background: [epoch 052] average training loss: 3970.4156
cellbender:remove-background: [epoch 053] average training loss: 3965.1261
cellbender:remove-background: [epoch 054] average training loss: 3975.3828
cellbender:remove-background: [epoch 055] average training loss: 3969.5423
cellbender:remove-background: [epoch 055] average test loss: 3932.6834
cellbender:remove-background: [epoch 056] average training loss: 3964.7342
cellbender:remove-background: [epoch 057] average training loss: 3967.4058
cellbender:remove-background: [epoch 058] average training loss: 3971.9959
cellbender:remove-background: [epoch 059] average training loss: 3960.5551
cellbender:remove-background: [epoch 060] average training loss: 3964.4331
cellbender:remove-background: [epoch 060] average test loss: 3967.9076
cellbender:remove-background: [epoch 061] average training loss: 3965.4153
cellbender:remove-background: [epoch 062] average training loss: 3962.5914
cellbender:remove-background: [epoch 063] average training loss: 3965.0319
cellbender:remove-background: [epoch 064] average training loss: 3965.6907
cellbender:remove-background: [epoch 065] average training loss: 3960.0795
cellbender:remove-background: [epoch 065] average test loss: 3945.7927
cellbender:remove-background: [epoch 066] average training loss: 3964.4541
cellbender:remove-background: [epoch 067] average training loss: 3968.9065
cellbender:remove-background: [epoch 068] average training loss: 3958.4191
cellbender:remove-background: [epoch 069] average training loss: 3963.3575
cellbender:remove-background: [epoch 070] average training loss: 3954.3709
cellbender:remove-background: [epoch 070] average test loss: 4007.2453
cellbender:remove-background: [epoch 071] average training loss: 3958.2268
cellbender:remove-background: [epoch 072] average training loss: 3961.9567
cellbender:remove-background: [epoch 073] average training loss: 3968.9788
cellbender:remove-background: [epoch 074] average training loss: 3962.2250
cellbender:remove-background: [epoch 075] average training loss: 3967.0552
cellbender:remove-background: [epoch 075] average test loss: 3997.2249
cellbender:remove-background: [epoch 076] average training loss: 3955.0682
cellbender:remove-background: [epoch 077] average training loss: 3960.1321
cellbender:remove-background: [epoch 078] average training loss: 3966.0317
cellbender:remove-background: [epoch 079] average training loss: 3953.0031
cellbender:remove-background: [epoch 080] average training loss: 3957.0243
cellbender:remove-background: [epoch 080] average test loss: 4002.7144
cellbender:remove-background: [epoch 081] average training loss: 3963.4742
cellbender:remove-background: [epoch 082] average training loss: 3964.5696
cellbender:remove-background: [epoch 083] average training loss: 3967.0997
cellbender:remove-background: [epoch 084] average training loss: 3967.0555
cellbender:remove-background: [epoch 085] average training loss: 3969.6566
cellbender:remove-background: [epoch 085] average test loss: 4005.2764
cellbender:remove-background: [epoch 086] average training loss: 3979.3970
cellbender:remove-background: [epoch 087] average training loss: 3971.2706
cellbender:remove-background: [epoch 088] average training loss: 3979.9692
cellbender:remove-background: [epoch 089] average training loss: 3991.1880
cellbender:remove-background: [epoch 090] average training loss: 3984.4977
cellbender:remove-background: [epoch 090] average test loss: 4012.9531
cellbender:remove-background: [epoch 091] average training loss: 3979.9608
cellbender:remove-background: [epoch 092] average training loss: 3980.1110
cellbender:remove-background: [epoch 093] average training loss: 3990.6269
cellbender:remove-background: [epoch 094] average training loss: 3987.8105
cellbender:remove-background: [epoch 095] average training loss: 4003.3267
cellbender:remove-background: [epoch 095] average test loss: 4026.3201
cellbender:remove-background: [epoch 096] average training loss: 4011.9168
cellbender:remove-background: [epoch 097] average training loss: 4001.7220
cellbender:remove-background: [epoch 098] average training loss: 4002.4815
cellbender:remove-background: [epoch 099] average training loss: 4014.8439
cellbender:remove-background: [epoch 100] average training loss: 4009.8107
cellbender:remove-background: [epoch 100] average test loss: 4034.6981
cellbender:remove-background: [epoch 101] average training loss: 4001.8132
cellbender:remove-background: [epoch 102] average training loss: 4000.4273
cellbender:remove-background: [epoch 103] average training loss: 4000.4040
cellbender:remove-background: [epoch 104] average training loss: 3996.6345
cellbender:remove-background: [epoch 105] average training loss: 4007.3502
cellbender:remove-background: [epoch 105] average test loss: 4046.1299
cellbender:remove-background: [epoch 106] average training loss: 3994.2900
cellbender:remove-background: [epoch 107] average training loss: 4018.2631
cellbender:remove-background: [epoch 108] average training loss: 3995.7133
cellbender:remove-background: [epoch 109] average training loss: 3984.8872
cellbender:remove-background: [epoch 110] average training loss: 4008.2703
cellbender:remove-background: [epoch 110] average test loss: 4043.1757
cellbender:remove-background: [epoch 111] average training loss: 4017.7784
cellbender:remove-background: [epoch 112] average training loss: 4017.0501
cellbender:remove-background: [epoch 113] average training loss: 4021.3158
cellbender:remove-background: [epoch 114] average training loss: 3994.4110
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: [epoch 115] average training loss: 4016.8244
cellbender:remove-background: [epoch 115] average test loss: 4036.3020
cellbender:remove-background: [epoch 116] average training loss: 4017.2557
cellbender:remove-background: [epoch 117] average training loss: 3996.7196
cellbender:remove-background: [epoch 118] average training loss: 4004.9664
cellbender:remove-background: [epoch 119] average training loss: 4022.4710
cellbender:remove-background: [epoch 120] average training loss: 4019.5331
cellbender:remove-background: [epoch 120] average test loss: 4067.2432
cellbender:remove-background: [epoch 121] average training loss: 4008.7457
cellbender:remove-background: [epoch 122] average training loss: 4001.0307
cellbender:remove-background: [epoch 123] average training loss: 3998.2867
cellbender:remove-background: [epoch 124] average training loss: 4001.8232
cellbender:remove-background: [epoch 125] average training loss: 4055.3543
cellbender:remove-background: [epoch 125] average test loss: 4058.0449
cellbender:remove-background: [epoch 126] average training loss: 4003.1687
cellbender:remove-background: [epoch 127] average training loss: 4017.3536
cellbender:remove-background: [epoch 128] average training loss: 4019.2687
cellbender:remove-background: [epoch 129] average training loss: 4028.9802
cellbender:remove-background: [epoch 130] average training loss: 4018.2229
cellbender:remove-background: [epoch 130] average test loss: 4026.8101
cellbender:remove-background: [epoch 131] average training loss: 4018.8546
cellbender:remove-background: [epoch 132] average training loss: 4002.1382
cellbender:remove-background: [epoch 133] average training loss: 4011.3291
cellbender:remove-background: [epoch 134] average training loss: 4009.5174
cellbender:remove-background: [epoch 135] average training loss: 3999.1352
cellbender:remove-background: [epoch 135] average test loss: 4015.5564
cellbender:remove-background: [epoch 136] average training loss: 3996.2076
cellbender:remove-background: [epoch 137] average training loss: 3995.8721
cellbender:remove-background: [epoch 138] average training loss: 4017.0538
cellbender:remove-background: [epoch 139] average training loss: 4017.7493
cellbender:remove-background: [epoch 140] average training loss: 3998.2958
cellbender:remove-background: [epoch 140] average test loss: 4049.0232
cellbender:remove-background: [epoch 141] average training loss: 3991.3952
cellbender:remove-background: [epoch 142] average training loss: 4022.6591
cellbender:remove-background: [epoch 143] average training loss: 3992.5597
cellbender:remove-background: [epoch 144] average training loss: 4008.8651
cellbender:remove-background: [epoch 145] average training loss: 3992.5097
cellbender:remove-background: [epoch 145] average test loss: 4121.4365
cellbender:remove-background: [epoch 146] average training loss: 4005.6093
cellbender:remove-background: [epoch 147] average training loss: 4021.3828
cellbender:remove-background: [epoch 148] average training loss: 3995.0772
cellbender:remove-background: [epoch 149] average training loss: 3985.9057
cellbender:remove-background: [epoch 150] average training loss: 4004.1677
cellbender:remove-background: [epoch 150] average test loss: 4030.2060
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: 2024-07-03 10:01:02
cellbender:remove-background: Inference procedure complete.
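
Note for readers: the log above reports the checkpoint failure twice, once at the regular checkpoint interval and once at the end of training. A minimal sketch for pulling those points out of a saved log such as the attached Error.log, assuming the log keeps the `[epoch NNN]` and `Could not save checkpoint` wording shown above. The function name is hypothetical and not part of CellBender.

```python
import re

# Patterns taken from the log lines quoted in this issue.
EPOCH_RE = re.compile(r"\[epoch (\d+)\]")
FAILURE_MARKER = "Could not save checkpoint"


def failed_checkpoint_epochs(log_lines):
    """Return the last epoch number seen before each checkpoint failure."""
    failures = []
    last_epoch = None
    for line in log_lines:
        match = EPOCH_RE.search(line)
        if match:
            last_epoch = int(match.group(1))
        elif FAILURE_MARKER in line:
            failures.append(last_epoch)
    return failures


# A few lines copied from the log above, used as sample input.
sample = [
    "cellbender:remove-background: [epoch 114] average training loss: 3994.4110",
    "cellbender:remove-background: Saving a checkpoint...",
    "cellbender:remove-background: Could not save checkpoint",
    "cellbender:remove-background: [epoch 115] average training loss: 4016.8244",
    "cellbender:remove-background: [epoch 150] average test loss: 4030.2060",
    "cellbender:remove-background: Saving a checkpoint...",
    "cellbender:remove-background: Could not save checkpoint",
]
print(failed_checkpoint_epochs(sample))  # [114, 150]
```

To scan a whole file, pass an open file object instead of the sample list, for example `failed_checkpoint_epochs(open("Error.log"))`.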

@Sepidehsheybani

Same problem here: I am not able to save checkpoints.
Traceback (most recent call last):
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 628, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 840, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object
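
Note for readers: both tracebacks in this thread end in `pickler.dump(obj)`, so the failure comes from pickling an object graph that contains a weak reference. The wording differs only by Python version ('weakref' under the python3.8 environment, 'weakref.ReferenceType' under python3.11). The sketch below reproduces the same `TypeError` with the standard library alone. It is an illustration of the error class, not a diagnosis: `Target` and `try_pickle` are hypothetical names, and nothing here identifies which attribute of CellBender's model object holds the weak reference.

```python
import pickle
import weakref


class Target:
    """Hypothetical object used only as the target of a weak reference."""


def try_pickle(obj):
    """Return None if obj pickles, otherwise the TypeError message."""
    try:
        pickle.dumps(obj)
    except TypeError as err:
        return str(err)
    return None


target = Target()

# A plain container of numbers pickles without complaint.
print(try_pickle({"weights": [1.0, 2.0]}))  # None

# The same container fails as soon as it also holds a weak reference.
# The message mentions 'weakref'; its exact wording varies by Python version.
print(try_pickle({"weights": [1.0, 2.0], "owner": weakref.ref(target)}))
```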

@lesolano

No solution, but I am encountering the same error. I have tested on v0.3.2, v0.3.0 and v0.2.2. Version 0.2.2 produces expected outputs, while the more recent versions produce the errors seen above.

@abbey-green

I am also experiencing the same issue.

@aimutishammy

Same error.

@abbey-green

I got it working. It's a version issue: I am using Python 3.7.12, CellBender 0.3.0, and torch 1.13.1.
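
Note for readers: a minimal sketch for comparing an environment against the combination reported in the comment above. The reported versions come from this thread only and are not an official compatibility table. The function names are hypothetical, and the `pkg_resources` fallback assumes setuptools is installed on Python 3.7, where `importlib.metadata` is not available.

```python
import sys

try:  # Python 3.8 and later
    from importlib.metadata import PackageNotFoundError
    from importlib.metadata import version as _dist_version
except ImportError:  # Python 3.7: fall back to setuptools' pkg_resources
    from pkg_resources import DistributionNotFound as PackageNotFoundError
    from pkg_resources import get_distribution

    def _dist_version(name):
        return get_distribution(name).version

# Combination reported as working in this thread (an assumption, not a guarantee).
REPORTED_WORKING = {"python": "3.7.12", "cellbender": "0.3.0", "torch": "1.13.1"}


def installed_versions(packages=("cellbender", "torch")):
    """Map 'python' and each package name to its installed version, or None."""
    found = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            found[name] = _dist_version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def _base(version_string):
    """Drop a local build suffix such as '+cu117' before comparing."""
    return version_string.split("+", 1)[0] if version_string else None


def differences(found, expected=REPORTED_WORKING):
    """Return {name: (installed, reported)} for every entry that differs."""
    return {
        name: (found.get(name), want)
        for name, want in expected.items()
        if _base(found.get(name)) != want
    }


if __name__ == "__main__":
    found = installed_versions()
    for name, value in found.items():
        print(f"{name}: {value or 'not installed'}")
    for name, (have, want) in differences(found).items():
        print(f"differs from reported combination: {name} is {have}, thread reports {want}")
```

Other packages can be included by passing their names, for example `installed_versions(("cellbender", "torch", "scipy"))`, which gives output that is easy to paste into a comment.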

@Sepidehsheybani

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

Thank you, I will try it.

@aimutishammy

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

This combination works for me. Thanks!

@antsmer

antsmer commented Aug 26, 2024

> I got it working- it's a version error. I am using python 3.7.12, cellbender version 0.3.0, torch 1.13.1

Would anyone who got it working mind sharing which scipy version they are using? With these three at the versions listed, I get the error 'ValueError: row index exceeds matrix dimensions', which I'm hoping will be a quick fix once I switch to the correct scipy version. Thanks!
