The Problem
Providers mostly don't report the errors they encounter back to the core system. This means that if something fails, the result can be an indefinite "hang". In my case, the hang leaves the trigger marked as started when it hasn't actually been started. In the docker provider, for example, errors are caught and simply ignored (example).
I'm not familiar with the codebase, so please let me know if there are any mistakes. I spent some time tracing through the providers and found a few examples like the following where, effectively, a request to deploy the trigger is just sent but never followed up on:
https://github.com/triggerdotdev/trigger.dev/blob/main/apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts#L544-L560
I'm guessing that the provider needs some way to return an error code, which core can then "bubble up" by changing the deployment status. I didn't want to attempt any changes without creating an issue first, as I'm missing a lot of context. I'm also not sure how PRs such as #1470 interact with this issue, for example.
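A minimal sketch of that idea, to make the suggestion concrete. Every name here (`ProviderResult`, `Provider`, `FakeDockerProvider`, `startRun`, the status values and the `IMAGE_PULL_FAILED` code) is hypothetical and not taken from the trigger.dev codebase. It only illustrates a provider returning a result that core inspects before marking anything as started:

```typescript
// Hypothetical types for illustration only; not trigger.dev's real API.
type RunStatus = "PENDING" | "STARTED" | "FAILED";

// The provider reports success or a structured error instead of swallowing it.
type ProviderResult =
  | { ok: true }
  | { ok: false; code: string; message: string };

interface Provider {
  create(image: string): ProviderResult;
}

interface Run {
  id: string;
  image: string;
  status: RunStatus;
  error?: string;
}

// Stand-in provider that fails when the image is missing from the registry.
class FakeDockerProvider implements Provider {
  private readonly availableImages: Set<string>;

  constructor(availableImages: Set<string>) {
    this.availableImages = availableImages;
  }

  create(image: string): ProviderResult {
    if (!this.availableImages.has(image)) {
      return { ok: false, code: "IMAGE_PULL_FAILED", message: `cannot pull ${image}` };
    }
    return { ok: true };
  }
}

// Core marks the run STARTED only once the provider confirms success;
// otherwise the error bubbles up into the run's status.
function startRun(provider: Provider, run: Run): Run {
  const result = provider.create(run.image);
  if (result.ok) {
    return { ...run, status: "STARTED" };
  }
  return { ...run, status: "FAILED", error: `${result.code}: ${result.message}` };
}
```

With something along these lines, a missing image would surface as a failed run rather than one that stays "running" forever. How this would map onto the real provider and queue code is exactly the context I'm missing.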
Example Reproduction
I originally reported this in #1476, but moved it here once I realised my issue was a symptom of the wider problem described above. I encountered this while setting up authentication for my self-hosted docker registry. Trigger would try to deploy a task, and the docker provider would fail to run it because it couldn't download the image. Trigger would then hang indefinitely, as it was unaware that the docker provider had failed to run the task.
During my testing last night I added a scheduled task that runs every 20 minutes. I then forgot about it and was messing around with some other things in trigger. After some sleep, I came back to it and noticed a long list of "running" scheduled tasks. Upon further investigation, I found that before going to sleep I had made an incomplete deployment, which left a docker image missing from the registry. This led to the same situation: trigger tries to deploy the image and fails, but trigger has no idea. The result was the long list of running tasks, the longest of which had been hanging for ~14 hours.
Screenshot of the runs list
For reference, when the task is working correctly it takes ~2 seconds from start to finish.
Thanks for digging into this and creating two very thorough issues!
The self-hosting story isn't great at the moment, this being part of the problem. We're going to make some big changes in the next couple of months. One of those changes means providers will go away completely - there will only be a single runtime-agnostic image to run per deployment.
There's currently no easy way to fail a task from the provider. Any changes here would also touch parts that affect our cloud deployment.
I think for those reasons, it makes more sense not to touch this at all currently. This will be fixed by the new self-hosted setup.
However, what you could do in the meantime to prevent this and other issues is to set a max duration on your tasks. This would at least prevent tasks from running forever and would prompt you to investigate any underlying causes, including provider errors. Not ideal, but I think it's the best we can do for now.
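To illustrate why a max duration bounds the damage, here is a rough, self-contained sketch. It is not the Trigger.dev API or how the platform enforces the limit; `RunRecord`, `enforceMaxDuration` and the `TIMED_OUT` status are hypothetical names used only to show the effect on runs that never actually started:

```typescript
// Hypothetical shapes for illustration; not Trigger.dev's data model.
interface RunRecord {
  id: string;
  startedAtMs: number;
  status: "RUNNING" | "COMPLETED" | "TIMED_OUT";
}

// Marks any run that is still RUNNING past maxDurationMs as TIMED_OUT.
// Runs that finished, or are still within the limit, are returned unchanged.
function enforceMaxDuration(
  runs: RunRecord[],
  nowMs: number,
  maxDurationMs: number
): RunRecord[] {
  return runs.map((run): RunRecord =>
    run.status === "RUNNING" && nowMs - run.startedAtMs > maxDurationMs
      ? { ...run, status: "TIMED_OUT" }
      : run
  );
}
```

For a task that normally finishes in ~2 seconds, even a generous limit would turn a 14-hour hang into a visible failure shortly after the limit passes.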