HTTP 500 Internal server error when cancelling a job, new jobs stuck in waiting on Docker #406
I'm running into pretty much the exact same issue with my Kubernetes-deployed version, but with an Elasticsearch bucket. I initially thought it was a DB issue as well, but I no longer think so. I realized there was an error in my bucket configuration, but correcting it didn't change the error at all. I also don't think that failing to cancel a job should have anything to do with the DB. Have you had any luck figuring it out?
@regel any suggestions for next troubleshooting steps?

To add to this (not sure if it's related): when I shell into the pod in Kubernetes, I don't have a user assigned. As I mentioned in my previous comment, changing the database info doesn't seem to have had any effect. Is there any way I can test this to rule it out? I validated that the credentials are correct, but I'm not sure how to confirm that Loud ML is actually able to access the database.

Adding my reproduction to this ticket; I get the same result as the OP: using the 1.6.0 Docker image, I am unable to start or cancel jobs without errors. Below are the creation steps for each resource and their output, followed by the stack trace from the logs. The job stays in waiting until I try to cancel it; when I do, it errors and then stays in canceling until I restart.

Model Creation:
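Roughly like the following (a sketch of the Loud ML REST API as I understand it from the docs: default port 8077, `POST /models`; the model name, bucket name, and feature fields are placeholders, not my exact values):

```sh
# Create a model; Loud ML listens on port 8077 by default.
curl -X POST http://localhost:8077/models \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "my-model",
    "type": "donut",
    "default_bucket": "my-bucket",
    "bucket_interval": "5m",
    "interval": 60,
    "offset": 30,
    "span": 24,
    "max_evals": 10,
    "features": [
      {"name": "avg_value", "metric": "avg", "field": "value", "default": null}
    ]
  }'
```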
Bucket Creation:
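Along these lines (again a sketch; I show an InfluxDB bucket to match the OP's setup, and the address, database, and names are placeholders — the fields differ for other bucket types):

```sh
# Register the bucket the model reads from and writes to.
curl -X POST http://localhost:8077/buckets \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "my-bucket",
    "type": "influxdb",
    "addr": "influxdb:8086",
    "database": "telemetry"
  }'
```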
Train Model:
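Training runs as an asynchronous job; as I understand the API, the response body is the job id (the time range below is illustrative):

```sh
# Kick off training; note the returned job id for the next steps.
curl -X POST 'http://localhost:8077/models/my-model/_train?from=now-30d&to=now'
```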
Check job to see status:
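Checked with something like this, where the job id is whatever `_train` returned:

```sh
JOB_ID='<job id returned by _train>'  # placeholder
curl "http://localhost:8077/jobs/$JOB_ID"
```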
Stack trace:
Attempt to cancel job:
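The cancel request, assuming the `_cancel` endpoint from the docs; this is the call that comes back with HTTP 500:

```sh
curl -X POST "http://localhost:8077/jobs/$JOB_ID/_cancel"
```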
No stack trace is generated when the job fails to cancel.
Job is initially waiting:
Then I try to cancel it:
When I check, it's actually trying to cancel it, but it's stuck at cancelling forever:
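I'm watching it with a polling loop along these lines (a sketch; the job id is whatever `_train` returned), and the state never leaves canceling:

```sh
# Poll the job state every few seconds.
while true; do
  curl -s "http://localhost:8077/jobs/$JOB_ID"; echo
  sleep 5
done
```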
For reference:
I suspect it may have something to do with the database?
How do I check whether it's actually connected and writing to the DB?
Externally, I've already confirmed that I can reach myinfluxdb, write data to it, etc.
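For example, with the standard InfluxDB 1.x HTTP endpoints (the host and credentials below are placeholders for my values):

```sh
# Liveness: /ping returns 204 when InfluxDB is reachable.
curl -sS -o /dev/null -w '%{http_code}\n' http://myinfluxdb:8086/ping

# Auth check: run a query with the same credentials Loud ML is configured with.
curl -G http://myinfluxdb:8086/query \
  -u "$INFLUX_USER:$INFLUX_PASS" \
  --data-urlencode 'q=SHOW DATABASES'
```

Running the same two commands from inside the Loud ML pod would also show whether in-cluster connectivity differs from my external access.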
I also see this weird Python exception in the logs when I try to run the training:
Any pointers would be appreciated. I also tried the nightly and 1.5.0 images; they all show the same error.