Separate logging and queueing? #34

Closed
ShuyangCao opened this issue Dec 20, 2022 · 3 comments
Comments

@ShuyangCao

Thanks again for your work. Your tool helps me push out a lot of great work. Feel free to check out my website.

Recently, our workstation has had an unstable connection to the GPUs (it might be a driver issue). Basically, nvidia-smi would return

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

When this issue occurs, the ts session breaks down and restarts. Access to the previous session is lost, but the jobs launched by that session are still running, and we can no longer track their logging output with ts -t.

I guess it might be better to separate logging and queueing, so that the logging module does not depend on the GPU status and can still work when a GPU error occurs.

Thanks!

@justanhduc
Owner

Hey @ShuyangCao. Nicely done! Keep up the good work with ts!

In fact, logging is handled by the client, not the server, so you can still see your progress via the offline log files in /tmp; -t/-c simply read from these files. You can sort the files to find the one you want. When the server crashes, all information about the jobs is lost, so it's impossible to use -t/-c [jobid] anymore.
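
For example, something like this should surface the most recent logs (a minimal sketch, assuming the default ts-out.* naming of the output files in /tmp, which changes if you redirect the output yourself):

# list the ts output files, newest first
ls -lt /tmp/ts-out.* | head
# follow the newest one while its job is still writing
tail -f "$(ls -t /tmp/ts-out.* | head -n 1)"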

Above all, the crash should not happen in the first place. This is probably due to poor error handling of the GPU query. I pushed a simple fix for this in the branch gpu-err. Please check and see whether the error is handled now.
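
To try the branch, something along these lines should work (a rough sketch; the exact build and install commands are whatever the README prescribes for your setup):

# fetch and switch to the branch containing the fix
git fetch origin gpu-err
git checkout gpu-err
# rebuild and reinstall ts as described in the README, then restart the
# ts server so the patched binary is the one handling GPU queries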

Btw, there's a log file in /tmp whose name has the format socket-ts.<uid>.error. Could you please let me know the error message that the server gave when the query was unsuccessful?
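
For instance, assuming <uid> is your numeric user id:

cat /tmp/socket-ts.$(id -u).error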

justanhduc added a commit that referenced this issue Dec 21, 2022
@ShuyangCao
Author

Thanks! Yes, I can still check the files in /tmp, but I am not sure which job each file corresponds to.

The error messages are:

-------------------Error
 Msg: Error calling recv_msg in c_check_version
 errno 104, "Connection reset by peer"
date Tue Dec 20 12:38:25 2022
pid 82645
type CLIENT
-------------------Error
 Msg: Failed to get GPU handle for GPU 3: GPU is lost
 errno 2, "No such file or directory"
date Wed Dec 21 07:04:18 2022
pid 4351
type SERVER

@justanhduc
Owner

Thanks for the reply @ShuyangCao. The first error is probably caused by a message from an orphan client sent to a restarted server; this is harmless in most cases. The second error means the GPU query returned a NULL pointer, which caused the server crash you experienced. The patch I pushed should be able to handle this error.

Please let me know if there's still any problem.
