Separate logging and queueing? #34

Closed
ShuyangCao opened this issue Dec 20, 2022 · 3 comments
Comments

@ShuyangCao

Thanks again for your work. Your tool helps me push out a lot of great work. Feel free to check out my website.

Recently, our workstation has had an unstable connection to the GPUs (it might be a driver issue). Basically, nvidia-smi would return

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

When this issue occurs, the ts session breaks down and restarts. Access to the previous session is lost, but the jobs launched by that session are still running, and we can no longer track their logging output with ts -t.

I guess it might be better to separate logging and queueing, so that the logging module does not depend on the GPU status and can still work when a GPU error occurs.

Thanks!

@justanhduc
Owner

Hey @ShuyangCao. Nicely done! Keep up the good work with ts!

In fact, logging is handled by the client, not the server, so you can still see your progress via the offline log files in /tmp; -t/-c simply read from these files. You can sort the files to find the one you want. When the server crashes, all information about the jobs is lost, so it's impossible to use -t/-c [jobid] anymore.
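
For example, something like this should surface the most recent logs (a minimal sketch, assuming the default ts-out.* naming of the output files in /tmp, which changes if you redirect the output yourself):

# list the ts output files, newest first
ls -lt /tmp/ts-out.* | head
# follow the newest one while its job is still writing
tail -f "$(ls -t /tmp/ts-out.* | head -n 1)"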

Above all, the crash should not happen in the first place. This is probably due to poor error handling of the GPU query. I pushed a simple fix for this in the branch gpu-err. Please check and see whether the error is handled now.
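
To try the branch, something along these lines should work (a rough sketch; the exact build and install commands are whatever the README prescribes for your setup):

# fetch and switch to the branch containing the fix
git fetch origin gpu-err
git checkout gpu-err
# rebuild and reinstall ts as described in the README, then restart the
# ts server so the patched binary is the one handling GPU queries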

Btw, there's a log file in /tmp whose name has the format socket-ts.<uid>.error. Could you please let me know the error message that the server gave when the query was unsuccessful?
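
For instance, assuming <uid> is your numeric user id:

cat /tmp/socket-ts.$(id -u).error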

justanhduc added a commit that referenced this issue Dec 21, 2022
@ShuyangCao
Author

Thanks! Yes, I can still check the files in /tmp, but I am not sure which job each file corresponds to.

The error messages are:

-------------------Error
 Msg: Error calling recv_msg in c_check_version
 errno 104, "Connection reset by peer"
date Tue Dec 20 12:38:25 2022
pid 82645
type CLIENT
-------------------Error
 Msg: Failed to get GPU handle for GPU 3: GPU is lost
 errno 2, "No such file or directory"
date Wed Dec 21 07:04:18 2022
pid 4351
type SERVER

@justanhduc
Owner

Thanks for the reply @ShuyangCao. The first error is probably caused by a message from an orphan client sent to a restarted server; this is harmless in most cases. The second error means the GPU query returned a NULL pointer, which caused the server crash you experienced. The patch I pushed should be able to handle this error.

Please let me know if there's still any problem.
