
Hang after timeout in timeoutWait #24

Closed
ebfe opened this issue Nov 15, 2014 · 4 comments

@ebfe
Contributor

ebfe commented Nov 15, 2014

After the timeout condition in proctl.timeoutWait is triggered, one of the subsequent
next/step/continue commands hangs with dlv blocked in a wait syscall. I believe the
problem is that in the timeout case timeoutWait leaves a goroutine running that is
still executing a wait syscall. We then have two goroutines racing for the next
SIGTRAP, and if the abandoned timeoutWait goroutine wins, dlv hangs.

repro:

$ cat sleep.go
package main

import (
    "syscall"
)

func main() {
    ts := syscall.Timespec{Sec: 3}
    syscall.Nanosleep(&ts, nil)
}


$ (echo -e "break syscall.Nanosleep\ncontinue"; yes step) | dlv -run
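
A minimal sketch of the timeout-wait pattern described above, assuming a helper along these lines (hypothetical, not Delve's actual timeoutWait code): the blocking wait runs in a goroutine so it can be raced against a timer, and when the timer wins the goroutine is abandoned but stays parked in the wait syscall, free to steal the next SIGTRAP from the debugger's main wait loop.

package main

import (
    "syscall"
    "time"
)

type waitResult struct {
    pid    int
    status syscall.WaitStatus
    err    error
}

// timeoutWaitSketch (hypothetical) races a blocking Wait4 against a
// timer. On timeout the goroutine below is abandoned but remains
// blocked in Wait4, so it may consume the next stop event that another
// waiter was expecting.
func timeoutWaitSketch(pid int, timeout time.Duration) (waitResult, bool) {
    ch := make(chan waitResult, 1)
    go func() {
        var status syscall.WaitStatus
        wpid, err := syscall.Wait4(pid, &status, 0, nil)
        ch <- waitResult{wpid, status, err} // may fire long after the caller gave up
    }()
    select {
    case r := <-ch:
        return r, true
    case <-time.After(timeout):
        return waitResult{}, false // leaked goroutine is still waiting
    }
}
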
@ebfe
Contributor Author

ebfe commented Nov 15, 2014

Apparently there is a second failure mode where it doesn't hang completely, but every step command takes ~2 seconds (the timeoutWait timeout).

@derekparker
Member

Yeah, that timeoutWait is a bit of a crutch. It handles the case where procfs reports a thread as trace-stopped, a single step is requested, and the thread then resumes a scheduler-induced sleep, leaving Delve blocked in a wait for that thread.

I'm thinking through some ideas on how to handle this situation better; that will fix the second failure mode and should eliminate the race you're describing. There is also a small race in timeoutWait when an actual pid is provided: a SIGSTOP and a SIGTRAP can race, and the wait executed by the goroutine may get the SIGTRAP and ignore it.

Also, there is a bug where, if the pid is -1 (to wait on all threads) and the timeout condition is met, a goroutine is left indefinitely in a blocking wait. That is definitely an issue, but currently no callers of timeoutWait pass -1 as the pid.
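
A rough illustration of the SIGSTOP/SIGTRAP race described here, using a hypothetical haltAndWait helper (not Delve's code): the waiter only looks for the SIGSTOP it asked for and throws away anything else, so a SIGTRAP that arrives first is lost.

package main

import (
    "fmt"
    "syscall"
)

// haltAndWait (hypothetical) sends SIGSTOP to a traced thread and waits
// for the corresponding stop. If a breakpoint delivers SIGTRAP at the
// same moment, the wait may return that SIGTRAP first, and it gets
// discarded below, losing a trap the debugger needed to observe.
func haltAndWait(tid int) error {
    if err := syscall.Kill(tid, syscall.SIGSTOP); err != nil {
        return err
    }
    var status syscall.WaitStatus
    for {
        if _, err := syscall.Wait4(tid, &status, 0, nil); err != nil {
            return err
        }
        if status.Stopped() && status.StopSignal() == syscall.SIGSTOP {
            return nil // got the stop we asked for
        }
        // Any other stop (e.g. the racing SIGTRAP) is dropped here.
        fmt.Printf("ignoring stop with signal %v\n", status.StopSignal())
    }
}
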

derekparker added a commit that referenced this issue Nov 25, 2014
Remove any assumption that a wait syscall on a thread id after a
continue will return. Any time we continue a thread, wait for activity
from any thread, because the scheduler may well have switched contexts
on us due to syscall entrance, channel op, etc...

There are several more things to be done here including:

* Potential tracking of goroutine id as we jump around to thread
  contexts.
* Potential of selectively choosing threads to operate on based on the
  internal M data structures, ensuring that our M has an active G.

This commit partially fixes #23 and #24, however there are still some
random hangs that happen and need to be ironed out.
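
A minimal sketch of the approach the commit message describes, assuming a Linux ptrace backend (continueAndWaitAny is illustrative, not the commit's actual code): resume a specific thread, but collect the next stop from whichever thread reports one.

package main

import "syscall"

// continueAndWaitAny resumes one traced thread but then waits for a
// stop from any thread (pid -1), since the Go scheduler may have moved
// the goroutine of interest onto a different OS thread. WALL asks the
// kernel to report events from all of the tracee's clones. Ptrace calls
// must come from the OS thread that attached (runtime.LockOSThread).
func continueAndWaitAny(tid int) (stopped int, status syscall.WaitStatus, err error) {
    if err = syscall.PtraceCont(tid, 0); err != nil {
        return 0, 0, err
    }
    stopped, err = syscall.Wait4(-1, &status, syscall.WALL, nil)
    return stopped, status, err
}
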
@derekparker
Member

Can you test and confirm this is no longer an issue? I just pushed some fixes up to master that should resolve it, and possibly #23 as well. If these issues seem to be fixed, please close this one out.

@ebfe
Contributor Author

ebfe commented Nov 28, 2014

Seems fixed as of afa3a9c

@ebfe ebfe closed this as completed Nov 28, 2014