Add thread support to debugger #64364
Conversation
Requires #63891.
Requires #64369 (not strictly required any more; without it the UI is just a little less intuitive; workaround in place).
Status updates: good progress on C#, still blocked on Extensions, waiting for input.
Update: ported to the C# dotnet6 branch and rebased. Lost the ability to debug unhandled exceptions, since I have not found where unhandled exceptions go in the dotnet6 code.
Updates:
Not sure how to fix that, but
// Could be reading any channel, so don't do it if any of them are blocked.
// REVISIT: this isn't completely correct. If Main is blocked on thread.join, it will never respond to any
// requests and we need to give up, so we do need per-channel flow control, i.e. multiple sockets or our
// own flow control (e.g. enet) or just conceptually tear down the Main connection and discard all Main
// messages.
Is there a reason why we can't:
- Always poll events from main.
- Only poll events from threads while they are debug-broken (or always poll but never pop messages unless debugging).
- When debug-broken:
  - If inside a thread, drop all messages that are not "core:" (which is the debugging part) or not in a list of "thread-safe" prefixes.
  - If inside main, process all messages regularly.
- Otherwise, process all messages from main (if main is blocked and no thread is debugging, messages will simply queue up until main becomes available again or a thread reaches a breakpoint/error).
We could also then (in a separate PR) further tap into Thread/Mutex/Semaphore to poll_events from main on a regular basis in debug builds (like we do for infinite loops in scripts).
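A minimal standalone sketch of the polling policy proposed above; `Message`, `DebuggerQueue` and the handler are hypothetical stand-ins, only the "core:" prefix idea is taken from the comment:

```cpp
#include <deque>
#include <mutex>
#include <string>

struct Message {
    std::string name; // e.g. "core:step", "scene:inspect_object"
    // payload omitted
};

class DebuggerQueue {
    std::deque<Message> queue;
    std::mutex mutex;

public:
    void push(const Message &m) {
        std::lock_guard<std::mutex> lock(mutex);
        queue.push_back(m);
    }

    // Main processes everything; a debug-broken worker thread only handles
    // "core:" (debugger) messages and drops the rest; other threads never pop.
    template <typename Handler>
    void service(bool is_main_thread, bool debug_broken, Handler handle) {
        if (!is_main_thread && !debug_broken) {
            return; // threads never pop messages unless they are debug-broken
        }
        std::lock_guard<std::mutex> lock(mutex);
        while (!queue.empty()) {
            Message m = std::move(queue.front());
            queue.pop_front();
            if (is_main_thread || m.name.rfind("core:", 0) == 0) {
                handle(m);
            }
            // Non-"core:" messages popped by a broken worker thread are dropped.
        }
    }
};
```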
If you service a queue from more than one thread, messages can be processed out of order. I felt that would be a weaker mechanism than what I built, so I would only choose that if it buys me something significant in return. I don't fully understand what the benefit would be of doing it the way that you suggest. I think it would generally work though, yes.
Right now, the broken thread (focused thread) does the debugging related core messages in order, and main does the other stuff when it can. It isn't perfect for sure. For example, when you try to use the remote scene debugger, usually main isn't available until you step execute, which pretty much sucks. If we wanted to make that nicer though, we would have to do some sort of explicit handshaking with main where the debug thread wakes up main just to service captures but main isn't otherwise allowed to run the main loop so the game doesn't progress. That would be an evolution I could see making sense, but it is a bit harder to get right. [edit: Wrong, better idea below]
Actually, scratch that last part... Main could totally spin on the thread barrier and still process debug messages while another thread is broken. Threads other than main would block when "held" as they do now, but main would be "held" by spinning in a loop that processes the main queue. Then the only time the UI stops updating the remote scene debugger would be if main is REALLY blocked (OS semaphore, mutex, etc.)
That would be a great enhancement, I think.
I went back and tested the behavior of Main when it is held in GDScript code, and yes, this is a bug fix (not an enhancement). Main just blocked on the "held threads" barrier and didn't process anything, so remote inspection generally didn't work when debugging non-main threads (unless Main was idle).
I implemented the change outlined above to let Main spin and process messages, but nothing else, as soon as it executes in the script engine(s). In a "normal" program, it will enter this state immediately when another thread is being step executed. That makes it possible to remote inspect nodes while debugging other threads.
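A rough sketch of that behavior; the hook names (`any_thread_broken`, `service_main_queue`, `block_on_held_barrier`) are invented for illustration, the real logic lives in the debugger and the script engines:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> any_thread_broken{ false };

// Hypothetical stand-ins for the real debugger hooks.
void service_main_queue() { /* process queued debugger messages for Main */ }
void block_on_held_barrier() { /* what worker threads do while "held" */ }

void hold_current_thread(bool is_main_thread) {
    if (!is_main_thread) {
        // Worker threads block as before.
        block_on_held_barrier();
        return;
    }
    // Main never blocks on the barrier: it spins, servicing only debugger
    // messages, so the remote scene tree / inspector keep updating while
    // another thread is broken. The main loop itself does not advance.
    while (any_thread_broken.load()) {
        service_main_queue();
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
```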
@@ -192,11 +257,14 @@ Error RemoteDebuggerPeerTCP::connect_to_host(const String &p_host, uint16_t p_po

void RemoteDebuggerPeerTCP::_thread_func(void *p_ud) {
    // Update in time for 144hz monitors
    // FIXME that precision is nonsense, we don't have that kind of resolution from sleeping (ms on Windows, scheduling intervals, etc.)
Not sure if outdated, but 6.9 ms is the expected frame time for a 144 Hz monitor.
For this use case, I doubt it's really a problem if the actual sleep duration ends up being 6ms, 7ms, 10ms or anything in between.
Yeah, maybe we should make this a define somewhere (e.g. THREAD_POLL_DELAY), we use this magic number in more than one place IIRC.
I am sorry my comment is garbage :). It is technically entirely correct but not at all clear. Let me try again:
REVISIT: Whoever designed this needs to look at this again. The code makes a claim that sleeping for this delay would result in waking often enough for 144Hz updates. However, this is very unlikely to be correct. For example, a Windows default installation has a timer resolution of around 64Hz. Also, there is the fact that a normal user mode thread in a non-real time OS may not even be able to get scheduled with close to that accuracy. Therefore, it is very likely that whatever problem is supposed to be addressed by this delay value is actually not being handled. Specifying it at this precision is very misleading when the underlying primitive is actually a user mode sleep at ms resolution.
void OS_Windows::delay_usec(uint32_t p_usec) const {
if (p_usec < 1000) {
Sleep(1);
} else {
Sleep(p_usec / 1000);
}
}
I don't know what problem is being solved, so I left this comment but never got back to it. I should have written a better comment. I don't know if anyone ever looks at REVISIT, so maybe it should be a FIXME if you folks agree.
> For this use case, I doubt it's really a problem if the actual sleep duration ends up being 6ms, 7ms, 10ms or anything in between.
To be clear: if I am correct and the timer resolution is equivalent to 64 Hz, then sleeping at all (any longer than just yielding the CPU) would result in a maximum update rate of 64 Hz, which clearly does not service 144 fps as intended.
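Back-of-the-envelope numbers behind that point; the ~15.6 ms figure assumes the default Windows timer period (timeBeginPeriod() can lower it):

```cpp
#include <cstdio>

int main() {
    const double target_hz = 144.0;
    const double frame_time_ms = 1000.0 / target_hz; // ~6.94 ms requested
    const double timer_period_ms = 1000.0 / 64.0;    // ~15.6 ms default Windows timer period
    // A user-mode Sleep() generally wakes on a timer tick, so the effective
    // update rate is bounded by the timer period, not by the requested delay.
    std::printf("requested %.2f ms (%.0f Hz), achievable >= %.2f ms (~%.0f Hz)\n",
            frame_time_ms, target_hz, timer_period_ms, 1000.0 / timer_period_ms);
    return 0;
}
```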
Status update:
- GDScript: works
- Visual script: was working, will be ported to extension
- Extension script: needs work on the extension API
- C#: works but needs more usage of the dotnet6 port
On my end the general approach seems OK; I think @Faless needs to OK it, given he is more familiar with the debugger code.
As I also explained in the contributors chat, I think adding channels to the RemoteDebuggerPeer is an unnecessary complication, and it will actually introduce more issues than it solves.
First of all, it forces every RemoteDebuggerPeer implementation (TCP, WebSocket, etc.) to implement channels.
This is an incredible amount of overhead when it could be done (assuming separate queues are needed) by wrapping around "poll" so it routes messages to the appropriate queue, without requiring the lower-level implementation to know this much about the debugger protocol (i.e. holding the queues in the RemoteDebugger, not the RemoteDebuggerPeer).
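A rough sketch of that routing-at-the-upper-layer alternative; `Peer` and `Debugger` are simplified stand-ins, not the engine's RemoteDebuggerPeer / RemoteDebugger classes, and the prefix-based lane choice is illustrative:

```cpp
#include <deque>
#include <map>
#include <string>

struct Message {
    std::string name; // "core:...", "scene:...", user captures, ...
};

// The transport stays a dumb, ordered pipe: it only knows how to poll
// complete messages.
struct Peer {
    std::deque<Message> incoming;
    bool poll(Message &r_msg) {
        if (incoming.empty()) {
            return false;
        }
        r_msg = std::move(incoming.front());
        incoming.pop_front();
        return true;
    }
};

// The debugger layer wraps poll() and routes by prefix, so the peer never
// has to know anything about channels or the protocol.
struct Debugger {
    Peer peer;
    std::map<std::string, std::deque<Message>> queues; // e.g. "core", "main"

    void poll_and_route() {
        Message msg;
        while (peer.poll(msg)) {
            const std::string queue = msg.name.rfind("core:", 0) == 0 ? "core" : "main";
            queues[queue].push_back(msg);
        }
    }
};
```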
More importantly though, there seems to be this idea that we want "separate concurrent network streams" active in the engine debugger (in relation to godotengine/godot-proposals#4952 and SCTP, AMQP, or even using multiple TCP sockets).
This is not the case: there is one debugging session, and it is really important that messages are received in the order they are sent. Introducing concurrent streams just makes things more confusing and error prone without solving any real issue.
There is no real reason to do flow control at the "wire level", because the "wire" is never stressed; the issues you are encountering are at the application level.
And this brings us to the topic of queueing messages that cannot be serviced by threads. Your main point about this was the need to dequeue messages from the threads themselves, because main could be held up by a join, mutex, etc.
But in that scenario main cannot service those queued messages anyway, so they would simply pile up.
At some point the queue will fill up and you'll have to discard messages anyway, and when main unlocks, the editor will also be swamped by replies to the piled-up messages.
So, in the end, discarding messages when main is locked that way is simply inevitable.
To wrap this up:
- If we decide we really need separate queues, we should hold them, and route messages to them, in the RemoteDebugger, not the RemoteDebuggerPeer (but again, in the end, it will still cause problems in the scenario they are supposed to solve).
- Otherwise, and this is a simpler solution, we should just have threads only service messages while debugging and focused, and discard messages that are not thread safe.
In any case, we will need to be able to mark captures as thread-safe via a bool, so each capture (including user defined ones) can decide if it can handle being called by a thread or not.
Captures marked as thread safe will receive messages while being serviced by threads, while messages for non-thread-safe captures will be dropped when being serviced by a thread (or moved to the appropriate queue if we go that way).
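A small sketch of what such a thread-safe flag could look like; the registry below is simplified and hypothetical, not the engine's actual capture API:

```cpp
#include <functional>
#include <map>
#include <string>

struct Capture {
    std::function<void(const std::string &)> handler;
    bool thread_safe = false; // may this capture run on a non-main thread?
};

struct CaptureRegistry {
    std::map<std::string, Capture> captures;

    void register_capture(const std::string &prefix,
            std::function<void(const std::string &)> handler, bool thread_safe) {
        captures[prefix] = Capture{ std::move(handler), thread_safe };
    }

    // Dispatch a message; when servicing from a worker thread, drop (or queue)
    // anything whose capture is not marked thread-safe.
    void dispatch(const std::string &prefix, const std::string &payload, bool on_worker_thread) {
        auto it = captures.find(prefix);
        if (it == captures.end()) {
            return;
        }
        if (on_worker_thread && !it->second.thread_safe) {
            return; // dropped (or re-queued for main, if separate queues exist)
        }
        it->second.handler(payload);
    }
};
```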
It specifically avoids having to know anything about the protocol by letting the upper level (RemoteDebugger) sort the messages into the right "lanes." I don't think there are any "etc.", so it is just TCP and WebSockets. We already have an IP-based implementation (ENet) in the product if you don't like using just plain TCP connections. I implemented RemoteDebuggerPeerWebSocket by just ignoring the channels and running the protocol as before. That's because I didn't know exactly what the thread model would be for web going forward, so I made it work just like before (still implementing the new API, of course).
The concurrent streams are: control of the thread being debugged, versus the main thread being used to interrogate various other things like the remote scene debugger / inspector. Within each of those, the order is very important, which is why I argued against your proposal to use both the worker thread and the main thread to service the same queue depending on the situation. Channels are the way you keep things in order within a class of messages while not blocking other classes of messages. Yeah, it's really just independent streams that happen to share a connection (in some protocols).
Absolutely not. That is what flow control is. The receiving side of the connection stops reading because there are no resources available to service requests (threads, concurrency limits, buffers, whatever the "server" has), which in this implementation is the max size of the queued requests. When it stops reading, the TCP protocol communicates this fact to the sending side and causes the sending socket to become non-writable (when buffers are full). That's how the sending side can use poll (or select in the old days) to know when it is supposed to resume sending more stuff. That's how every TCP server works. We just can't do that here without per-channel flow control, because we still want to let the core messages flow while blocking further main messages. Yes, we would do that at the application level by sending a message back using the transport thread; I just would rather not have that sort of code when TCP already does it.
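A minimal illustration of that back-pressure mechanism: a bounded queue whose fullness decides whether the transport thread reads from the socket at all. Names are illustrative; the kernel-buffer behavior described above is what TCP itself provides.

```cpp
#include <deque>
#include <string>

struct Request {
    std::string data;
};

struct BoundedReceiver {
    std::deque<Request> pending;
    size_t max_pending = 64;

    // Called from the transport thread each poll. Returning false means
    // "don't read from the socket right now"; unread bytes accumulate in the
    // kernel buffers until the peer's socket stops being writable -- that is
    // the flow control TCP already gives us for free.
    bool want_read() const {
        return pending.size() < max_pending;
    }

    void on_request(Request r) {
        pending.push_back(std::move(r));
    }
};
```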
Your whole discussion sounds like "I would have done it differently", and that's OK. What are the actual defects that you are reporting? I said I would stick around long enough to correct any defects, but I can't justify spending time/money to implement what I would consider a worse solution for you. If you have any actual code review notes like "hey, there is a race condition here" or "this code is wrong", then great. But I really don't know what you are trying to achieve with this thread. Do you just want this to not get merged because you hate it? That's fine, then just have it merged to a branch (like dotnet6 was) and then I can go. If you do want it merged, then let's please focus on actual bugs. I think it works really well as written. You will probably hate the style in places, and you will probably say that we don't need the configurable tree view at this time. That happened because I was going to implement the PRs for thread naming that are there, and respond to a request to make these dialogs dockable/floatable. At that point, the same dialog will need to accommodate a lot of columns or a few columns, so I just went ahead and built that. If you hate that too, then just rip it out.
There are; the RemoteDebuggerPeer exists on purpose so other methods can be implemented: named pipe, reverse TCP, whatever.
Area maintainers are entitled to envision a different implementation than the one suggested by a contributor, because they have a good understanding of how this fits the engine design overall. You may not have enough background with the project to get a feel for that. So far this has been your spec that you yourself have implemented as you see fit. I think it's unreasonable to expect it to be merged as-is without maintainers' considerations being taken into account, even if you take responsibility to stick around. If you are insistent that your solution is tangibly better, then my suggestion here would be to wait for more feedback from other contributors and maintainers and see what they think, so we have a better representation of opinions than just two opposite ones.
And when that happens, if the producer on the other side (the editor) keeps producing messages (because we have no control over message generation), it will start filling the outgoing queue.
I addressed that in a previous post. When the main thread channel back-pressures the client, the connection could fire a signal. This would be used in the UI to disable the controls for clicking on remote stuff, so that the user would know they aren't going to be able to look at remote things right now, rather than jamming on the buttons while nothing happens.
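A sketch of that idea, with a plain callback standing in for a Godot signal (all names are hypothetical):

```cpp
#include <functional>

// Fired by the connection when the "main" channel starts or stops
// back-pressuring the editor side.
struct MainChannelBackpressure {
    std::function<void(bool)> on_changed; // true = back-pressured

    void set_backpressured(bool p_active) {
        if (on_changed) {
            on_changed(p_active);
        }
    }
};

// UI side: grey out the remote-inspection controls instead of letting the
// user click buttons that silently do nothing, e.g.:
//   backpressure.on_changed = [&](bool active) { remote_tree_button_disabled = active; };
```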
I appreciate the time you spent to clarify that. I asked Akien if we should try to merge this so at least there would be a working threaded debugger (after 7 years of people complaining about it), even though it is super close to Beta and even though I need to stop working on Godot contributions. I am just trying to maybe have this work benefit some users, rather than it just dying here. I know that's not the best possible flow and it isn't the Godot way. I tried to get design review in the proposal and in the status updates and interim commits I have been providing, but it was just really bad timing, with the main guy being on vacation at the time, and also us apparently having very different backgrounds and never communicating effectively about technical things. Also, I was hard to work with because I was very unhappy with the way things work here, which is why I need to try something else. So yeah, it isn't as intended. I don't use Godot and I don't work for anyone who uses Godot. I have absolutely no reason to try to push this into the product, except that I feel it would be a shame, and it cost a lot to learn all the different affected areas to be able to implement it. I suspect that is why it sat there for 7 years, and so I thought it would be silly to waste it. I am probably over-estimating the value of my contribution because it was moderately challenging for an external contributor. :)
Update to save people some time: we can stop debating this here. I had a good chat with @akien-mga and the pressure is off schedule-wise anyway, so I am much more relaxed now. :) Please merge this to a branch in your repo, so the next developer can use it as a proof of concept or reference or whatever. I won't be keeping it in mine past October 1, 2022.
Why? You said there is no pressure and then you introduce time pressure? 😕
I meant there is no more pressure to try to get this approved. The only way it would have made Beta is if it had been approved with minor fixes, and that obviously did not happen. So now I am just asking you to please archive this in a branch so I can close my PRs and get rid of my fork. My reasons for doing that are private. I'm trying to be considerate by letting you know instead of just disappearing.
Superseded by #76582. Thanks for the contribution nonetheless!
@akien-mga for posterity/search results: the simplified implementation provided by reduz doesn't address the requirements implemented here. There may be a time in the future when additional stuff from here could be leveraged / reimplemented. This clarifies "superseded". :)
Implementation of godotengine/godot-proposals#4952
That proposal contains a write up of the design that is implemented here. While long, it might be worthwhile reading for reviewers looking at the threading of this code.
Status:
- GDScript
- Visual script
- Extension Script (see notes below)
- C# (including new setting 'Debug' for unhandled exception policy)
- Debugger UI
- DAP
- Local Debugger

Extension Script Notes:
To complete the implementation for ScriptLanguageExtension, at least the following changes need to be done to godot-cpp:
- void * argument passing, presumably just this: GDNativeExtensionScriptInstanceInfo, so people don't have to use the raw native interface
- PackedByteArray as an argument type (code missing, that's where I ended up blocking)