Scalars, Plots, and Debug Samples hang and stop logging with large git diff and Tensorboard Images #1312

Open
Inquisitive-ME opened this issue Aug 12, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Inquisitive-ME

Describe the bug

If a repository has a large git diff and the script logs TensorBoard images, ClearML logging hangs and never recovers.

To reproduce

Clone the repo https://github.com/Inquisitive-ME/clearml_hang_example and run clearml_example.py.
Console logging keeps working, and the first epoch of scalars sometimes shows up, but after that no Scalars, Plots, or Debug Samples are logged.

Expected behaviour

I expect ClearML to correctly log and not get stuck

Environment

  • Server type: self-hosted and app.clear.ml
  • All
  • ClearML Server Version: All
  • Python Version: 3.10
  • OS: Linux and macOS (haven't tested Windows)

Related Discussion

If this continues a slack thread, please provide a link to the original slack thread.

Inquisitive-ME added the bug label Aug 12, 2024
@ainoam
Collaborator

ainoam commented Aug 13, 2024

Thanks for reporting @Inquisitive-ME - We'll take a look.

@eugen-ajechiloae-clearml
Collaborator

Hi @Inquisitive-ME! Are you able to see an auxiliary_git_diff artifact reported to the task? This artifact stores the git diff, so the upload might take a while depending on the size of the diff and your network. If you don't see it, the file is likely still uploading and some reports are waiting for it to finish, which is why logging appears to hang.
If you don't need the git diff, I recommend adding the large files to .gitignore or setting sdk.development.store_uncommitted_code_diff: false in clearml.conf.
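
To judge whether the diff is big enough to matter, it can help to measure it before starting a run. The following is a minimal sketch, not ClearML's own logic: it assumes that `git diff HEAD` roughly approximates what ends up in the auxiliary_git_diff artifact, and the helper names `uncommitted_diff_bytes` and `human_size` are hypothetical.

```python
import subprocess


def uncommitted_diff_bytes(repo_dir=".", run=subprocess.run):
    """Return the size in bytes of `git diff HEAD` for the given repository.

    Assumption: `git diff HEAD` approximates the diff that ClearML stores;
    the exact command ClearML runs may differ.
    """
    result = run(
        ["git", "diff", "HEAD"],
        cwd=str(repo_dir),
        capture_output=True,  # collect stdout as bytes instead of printing it
        check=True,           # raise CalledProcessError if git fails
    )
    return len(result.stdout)


def human_size(num_bytes):
    """Format a byte count using binary units, capped at GiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Running `print(human_size(uncommitted_diff_bytes(".")))` from the repository root prints the size of the uncommitted diff. A result in the tens of megabytes or more suggests the diff is worth excluding or disabling.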

@Inquisitive-ME
Author

Inquisitive-ME commented Aug 30, 2024

I do see the file and it appears to be uploaded. However, there are still no debug samples. I have had experiments running for days. It NEVER works. This is a bug, not an issue of needing to wait longer.

Setting sdk.development.store_uncommitted_code_diff: false does fix the problem, but this is still a bug in the software, and it is very annoying to deal with.
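
For anyone landing here with the same symptom, the workaround would look roughly like this in clearml.conf. This is a sketch that assumes the usual nested layout of that file and the spelling `store_uncommitted_code_diff` (double "t") used in ClearML's configuration reference, so check it against your installed version. The rest of the `sdk` section is omitted.

```
sdk {
    development {
        # Do not store the uncommitted git diff with the task
        store_uncommitted_code_diff: false
    }
}
```

Note that with this setting the task no longer records uncommitted changes, so reproducing a run from the UI requires the code to be committed.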
