Task is already marked stopped when the callback from Task.register_abort_callback is called #1330

Open
mads-oestergaard opened this issue Sep 18, 2024 · 0 comments
Labels
bug Something isn't working


mads-oestergaard commented Sep 18, 2024

Describe the bug

It is not possible to modify the task (e.g. update and upload a checkpoint) from a callback registered with Task.register_abort_callback, because the task is already marked stopped by the time the callback is called.

Trying to save a checkpoint in the callback produces the following warning, and the model is not updated:
2024-09-13 12:12:27,581 - clearml.model - WARNING - Could not update last created model in Task b281b21329e3470ebc8959e831f28ff8, Task status 'stopped' cannot be updated

To reproduce

Register a callback on the current task using something like this:

from clearml import Task

# Defined inside a method of the extended ModelCheckpoint, so `self` and
# `trainer` are taken from the enclosing scope.
def on_abort_callback() -> None:
    print("Saving last checkpoint")
    trainer.save_checkpoint(
        self.last_filepath,
        weights_only=self.save_weights_only,
    )

    # Ensure that the trainer stops gracefully
    trainer.should_stop = True

print("Registering model checkpoint abort callback")
Task.current_task().register_abort_callback(on_abort_callback)

where trainer is a PyTorch Lightning Trainer and the callback is registered inside a subclass of Lightning's ModelCheckpoint.

Expected behaviour

When a task is aborted, it should be possible to upload a model checkpoint to the ClearML server from within the abort callback function.

The current workaround is to mark the task in_progress while saving the checkpoint and then mark it stopped again afterwards. Not intuitive :-)
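For anyone hitting the same problem, here is a rough sketch of what such a workaround could look like. It assumes the status is flipped with Task.mark_started(force=True) and Task.mark_stopped(force=True); the issue does not say which calls were actually used, so treat this as one possible approach. The helper names temporarily_in_progress and make_abort_callback are hypothetical, and save_checkpoint stands in for whatever saves the checkpoint (e.g. a wrapper around trainer.save_checkpoint).

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporarily_in_progress(task) -> Iterator[None]:
    # Assumption: `task` is a clearml.Task. Forcing the status back to
    # in_progress lets the SDK update models while the block runs.
    task.mark_started(force=True)
    try:
        yield
    finally:
        # Put the task back into the stopped state, even if saving failed.
        task.mark_stopped(force=True)


def make_abort_callback(task, save_checkpoint: Callable[[], None]) -> Callable[[], None]:
    # Hypothetical helper: wraps a checkpoint-saving function so that it
    # runs while the task is temporarily marked in_progress.
    def on_abort_callback() -> None:
        with temporarily_in_progress(task):
            save_checkpoint()

    return on_abort_callback
```

The returned function could then be passed to Task.current_task().register_abort_callback(...) in place of the plain callback above. Whether forcing the status like this has side effects on the server side has not been verified here.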

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.16.4
  • ClearML Server Version: 1.15.0
  • Python Version: 3.10.13
  • OS (Windows \ Linux \ Macos): linux

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1726571061754989
