Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thread support to debugger #64364

Closed
wants to merge 1 commit into from

Conversation

derammo
Copy link
Contributor

@derammo derammo commented Aug 13, 2022

Implementation of godotengine/godot-proposals#4952

That proposal contains a write up of the design that is implemented here. While long, it might be worthwhile reading for reviewers looking at the threading of this code.

Status:

GDScript

  • working, demonstrable with prototype UI showing threads, and allowing exploring multiple threads

visual script

  • working, including simultaneously with GDScript on another thread
  • will be contributed to the visual script extension once the threaded debugger is in Godot

Extension Script

  • incompletely implemented
  • blocked by various issues in GDExtension

Extension Script Notes

To complete the implementation for ScriptLanguageExtension, at least the following changes need to be done to godot-cpp:

  • implement void * argument passing, presumably just this?
// Specialization for void*, which is used to pass back native objects
template <>
struct PtrToArg<void *> {
	_FORCE_INLINE_ static void *convert(const void *p_ptr) {
		return *(void *const *)p_ptr;
	}
	typedef Object *EncodeT;
	_FORCE_INLINE_ static void encode(void *p_var, void *p_ptr) {
		*(reinterpret_cast<void **>(p_ptr)) = p_var;
	}
};
  • would be nice to wrap GDNativeExtensionScriptInstanceInfo so people don't have to use raw native interface
  • merge godot-cpp 801 (avoiding reference here)
  • implement PackedByteArray as an argument type (code missing, that's where I ended up blocking)

C#

  • operable, including new setting 'Debug' for unhandled exception policy
  • unhandled exception debugging lost due to dotnet6 merge (needs work)
  • could use follow-up work on threading but good enough to merge
  • some code will appear as dead code (not being called) because it is waiting for a hook from the dotnet6 runtime

Debugger UI

  • implemented in C++ (based on the GDScript prototype)
  • uses a namespace because otherwise the class names would get very long at this point

DAP

  • broken, WIP
  • would be relatively easy since API already understands stacks are per-thread

Local Debugger

  • ported to new usage, but only debugs the main thread
  • I was advised that nobody cares about LocalDebugger, so will defer any work on this until asked

@derammo
Copy link
Contributor Author

derammo commented Aug 13, 2022

requires #63891

@derammo
Copy link
Contributor Author

derammo commented Aug 13, 2022

requires #64369

Not strictly required any more, just makes the UI a little less intuitive. Workaround in place:
derammo/godot_debugger_prototype@22f890e

@derammo
Copy link
Contributor Author

derammo commented Aug 20, 2022

status updates: good progress on C#, still blocked on Extensions, waiting for input

@derammo
Copy link
Contributor Author

derammo commented Aug 24, 2022

update: ported to C# dotnet6 and rebased. Lost ability to debug unhandled exceptions, since I have not found where unhandled exceptions go in the dotnet6 code.

@derammo derammo force-pushed the derammo_debugger branch 4 times, most recently from b5abeca to faa61e4 Compare September 2, 2022 17:03
@derammo
Copy link
Contributor Author

derammo commented Sep 2, 2022

updates:

  • ported UI to C++ and polished.
  • DAP inspected: needs work for threading
  • Local Debugger fixed to work for main thread

@derammo derammo force-pushed the derammo_debugger branch 2 times, most recently from ef7a508 to bb811c8 Compare September 4, 2022 10:34
@derammo derammo force-pushed the derammo_debugger branch 7 times, most recently from fc79848 to 3b27c32 Compare September 6, 2022 02:55
@derammo
Copy link
Contributor Author

derammo commented Sep 6, 2022

fixes #2446
fixes #18330

If approved, documentation would need to be updated to reflect it works (see #41590)

does NOT fix #42901, which was mistakenly identified as a threaded debugging issue, but is really a bug in the single threaded debugger in 3.x and should be scheduled to be looked at

@derammo
Copy link
Contributor Author

derammo commented Sep 6, 2022

Not sure how to fix that, but Windows and Buildsystem teams don't need to be assigned.

Comment on lines +161 to +165
// Could be reading any channel, so don't do it if any of them are blocked.
// REVISIT: this isn't completely correct. If Main is blocked on thread.join, it will never respond to any
// requests and we need to give up, so we do need per-channel flow control, i.e. multiple sockets or our
// own flow control (e.g. enet) or just conceptually tear down the Main connection and discard all Main
// messages.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why we can't:

  • Always poll events from main.
  • Only poll events from threads while they are debug-broken (or always poll but never pop messages unless debugging).
  • When debug broken:
  • If inside a thread, drop all messages that are not "core:" (which is the debugging part, or a list of "thread-safe" prefix)
  • If inside main, process all messages regularly.
    else:
  • Process all messages from main (if main is blocked, and no thread is debugging, messages will simply queue up until main returns available or a thread reaches a breakpoint/error).

We could also then (in a separate PR) further tap into Thread/Mutex/Semaphore to poll_events from main at a regular basis in debug builds (like we do for infinite loops in scripts).

Copy link
Contributor Author

@derammo derammo Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you service a queue from more than one thread, messages can be processed out of order. I felt that would be a weaker mechanism than what I built, so I would only choose that if it buys me something significant in return. I don't fully understand what the benefit would be of doing it the way that you suggest. I think it would generally work though, yes.

Right now, the broken thread (focused thread) does the debugging related core messages in order, and main does the other stuff when it can. It isn't perfect for sure. For example, when you try to use the remote scene debugger, usually main isn't available until you step execute, which pretty much sucks. If we wanted to make that nicer though, we would have to do some sort of explicit handshaking with main where the debug thread wakes up main just to service captures but main isn't otherwise allowed to run the main loop so the game doesn't progress. That would be an evolution I could see making sense, but it is a bit harder to get right. [edit: Wrong, better idea below]

Copy link
Contributor Author

@derammo derammo Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, scratch that last part... Main could totally spin on the thread barrier and still process debug messages while another thread is broken. Threads other than main would go block when "held" as they do now but main would be "held" by spinning in a loop that processes the main queue. Then the only time the UI stops updating the remote scene debugger would be if main is REALLY blocked (OS semaphore, mutex, etc.)

That would be a great enhancement, I think.

Copy link
Contributor Author

@derammo derammo Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and tested the behavior of Main when it is held in GDScript code, and yes it was a bug fix (not an enhancement.) It just blocked on the "held threads" barrier and didn't process things, so remote inspection generally didn't work when debugging non-main threads (unless Main was idle.)

I implemented the change outlined above to let Main spin and process messages but nothing else, as soon as it executes in the script engine(s). In a "normal" program, it will immediately enter this state when another thread is being step executed. That fixes being able to remote inspect nodes while debugging other threads.

https://github.com/godotengine/godot/compare/2b55d9f4b2ddc2b13d5977a09177f96390ab5908..217f9316f353f23c8517ef43a22374ce53469002

@@ -192,11 +257,14 @@ Error RemoteDebuggerPeerTCP::connect_to_host(const String &p_host, uint16_t p_po

void RemoteDebuggerPeerTCP::_thread_func(void *p_ud) {
// Update in time for 144hz monitors
// FIXME that precision is nonsense, we don't have that kind of resolution from sleeping (ms on Windows, scheduling intervals, etc.)
Copy link
Collaborator

@Faless Faless Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if outdated but 6.9 ms is the expected frame rate for a 144Hz monitor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this use case, I doubt it's really a problem if the actual sleep duration ends up being 6ms, 7ms, 10ms or anything in between.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe we should make this a define somewhere (e.g. THREAD_POLL_DELAY), we use this magic number in more than one place IIRC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry my comment is garbage :). It is technically entirely correct but not at all clear. Let me try again:

REVISIT: Whoever designed this needs to look at this again. The code makes a claim that sleeping for this delay would result in waking often enough for 144Hz updates. However, this is very unlikely to be correct. For example, a Windows default installation has a timer resolution of around 64Hz. Also, there is the fact that a normal user mode thread in a non-real time OS may not even be able to get scheduled with close to that accuracy. Therefore, it is very likely that whatever problem is supposed to be addressed by this delay value is actually not being handled. Specifying it at this precision is very misleading when the underlying primitive is actually a user mode sleep at ms resolution.

void OS_Windows::delay_usec(uint32_t p_usec) const {
	if (p_usec < 1000) {
		Sleep(1);
	} else {
		Sleep(p_usec / 1000);
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what problem is being solved, so I left this comment but never got back to it. I should have written a better comment. I don't know if anyone ever looks at REVISIT so maybe it is a FIXME if you folks agree.

Copy link
Contributor Author

@derammo derammo Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this use case, I doubt it's really a problem if the actual sleep duration ends up being 6ms, 7ms, 10ms or anything in between.

To be clear: if I am correct and the timer resolution is equivalent to 64Hz, then sleeping at all (any longer than just yielding the CPU) would result in maximum update rate of 64Hz, which clearly does not service 144 fps as intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gdscript works
visual script was working, will be ported to extension
extension script needs work on extension API
C# works but needs more usage by dotnet6 port
@reduz
Copy link
Member

reduz commented Sep 14, 2022

On my end the general approach seems ok, I think @Faless needs to OK to it given he is more familiar with the debugger code.

Copy link
Collaborator

@Faless Faless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I also explained in the contributors chat, I think adding channels to the RemoteDebuggerPeer is an unnecessary complication, and it will actually introduce more issues than it solves.

First of all, it forces every RemoteDebuggerPeer implementation (tcp, websocket, etc) to implement channels.
This is an incredible overhead when it could be done (assuming separate queues are needed) by wrapping around "poll" so it routes messages to the appropriate queue and without requiring the lower level implementation to know this much about the debugger protocol (i.e. holding the queue in the RemoteDebugger, not the RemoteDebuggerPeer).

More importantly though, there seem to be this idea that we want "separate concurrent network streams" active in the engine debugger (in relation to godotengine/godot-proposals#4952 and SCTP, AMQP, or even using multiple TCP sockets).
This is not the case, there is one debugging session, and it is really important that messages are received in the order they are sent, introducing conccurent streams just makes things more confusing and error prone without solving any real issue.
There is no real reason to do flow control at the "wire level", because the "wire" is never stressed, the issues you are encountering are at the application level.

And this brings us to the topic of queueing messages that cannot be serviced by threads. Your main point about this was the need of dequeuing messages from threads themselves, because main could be held up by a join, mutex, etc.
But in that scenario main cannot service those queued messages anyway and they would simply pile up.
But at some point the queue will fill up and you'll have to discard messages anyway, and when it unlocks, the editor will also be swamped by replies to piled up messages.

So, in the end, discarding messages when main is locked that way, is simply inevitable.

To wrap this up:

  • If we decide we really need separate queues, we should hold them, and route messages to the them in the RemoteDebugger, not the RemoteDebuggerPeer (but again, in the end, it will end up still causing problems in the scenario they are supposed to solve).
  • Otherwise, and this is a simpler solution, we should just have threads only service while debugging and focused, and discard messages that are not thread safe.

In any case, we will need to be able to mark captures as thread-safe via a bool, so each capture (including user defined ones) can decide if it can handle being called by a thread or not.
Captures marked as thread safe will receive messages while being serviced by threads, while messages for non-thread-safe captures will be dropped when being serviced by a thread (or moved to the appropriate queue if we go that way).

@derammo
Copy link
Contributor Author

derammo commented Sep 14, 2022

As I also explained in the contributors chat, I think adding channels to the RemoteDebuggerPeer is an unnecessary complication, and it will actually introduce more issues than it solves.

First of all, it forces every RemoteDebuggerPeer implementation (tcp, websocket, etc) to implement channels. This is an incredible overhead when it could be done (assuming separate queues are needed) by wrapping around "poll" so it routes messages to the appropriate queue and without requiring the lower level implementation to know this much about the debugger protocol (i.e. holding the queue in the RemoteDebugger, not the RemoteDebuggerPeer).

It specifically avoids having to know anything about the protocol by letting the upper level (RemoteDebugger) sort the messages in to the right "lanes." I don't think there are any "etc" so it is just TCP and websockets. We already have an IP based implementation (enet) in the product if you don't like to use just plain TCP connections. I implemented RemoteDebuggerPeerWebSocket by just ignoring the channels and running the protocol as before. That's because I didn't know exactly what the thread model would be for web going forward, so I just made it work just like before (stil implementing the new API of course.)

More importantly though, there seem to be this idea that we want "separate concurrent network streams" active in the engine debugger (in relation to godotengine/godot-proposals#4952 and SCTP, AMQP, or even using multiple TCP sockets). This is not the case, there is one debugging session, and it is really important that messages are received in the order they are sent, introducing conccurent streams just makes things more confusing and error prone without solving any real issue. There is no real reason to do flow control at the "wire level", because the "wire" is never stressed, the issues you are encountering are at the application level.

The concurrent streams are the control of the thread being debugged vs the main thread being used to interrogate various other things like remote scene debugger / inspector. Within each of those, the order is very important, which is why I argued against your proposal to use both the worker thread and the main thread to service the same queue depending on the situation. Channels are the way you keep things in order within a class of messages while not blocking other classes of messages. Yeah, it's just independent streams really that happen to share a connection (in some protocols).

And this brings us to the topic of queueing messages that cannot be serviced by threads. Your main point about this was the need of dequeuing messages from threads themselves, because main could be held up by a join, mutex, etc. But in that scenario main cannot service those queued messages anyway and they would simply pile up. But at some point the queue will fill up and you'll have to discard messages anyway, and when it unlocks, the editor will also be swamped by replies to piled up messages.

So, in the end, discarding messages when main is locked that way, is simply inevitable.

Absolutely not. That is what flow control is. The receiving side of the connection stops reading because there are no resources available to service requests (threads, concurrency limits, buffers, whatever the "server" is) which in this implementation is the max size of queued requests. When it stops reading, the TCP protocol communicates this fact to the sending side and causes the sending socket to become non-writable (when buffers are full.) That's how the sending side can use poll (or select in the old days) to know when it is supposed to resume sending more stuff.

That's how every TCP server works. We just can't do that here without the per-channel flow control because we still want to let the core messages flow while blocking more main messages. Yes, we would do that at application level by sending a message back using the transport thread. I just would rather not have that sort of code when TCP already does it.

To wrap this up:

  • If we decide we really need separate queues, we should hold them, and route messages to the them in the RemoteDebugger, not the RemoteDebuggerPeer (but again, in the end, it will end up still causing problems in the scenario they are supposed to solve).
  • Otherwise, and this is a simpler solution, we should just have threads only service while debugging and focused, and discard messages that are not thread safe.

In any case, we will need to be able to mark captures as thread-safe via a bool, so each capture (including user defined ones) can decide if it can handle being called by a thread or not. Captures marked as thread safe will receive messages while being serviced by threads, while messages for non-thread-safe captures will be dropped when being serviced by a thread (or moved to the appropriate queue if we go that way).

Your whole discussion sounds like "I would have done it differently" and that's ok. What are the actual defects that you are reporting? I said I would stick around long enough to correct any defects, but I can't justify spending time/money to implement what I would consider a worse solution for you?

If you have any actual code review notes like "hey there is a race condition here" or "this code is wrong", then great. But I really don't know what you are trying to achieve with this thread. Do you just want this to not get merged because you hate it? That's fine, then just have it merged to a branch (like dotnet6) was and then I can go. If you do want it merged, then let's please focus on actual bugs? I think it works really well as written.

You will probably hate the style in places, and you will probably say that we don't need the configurable tree view at this time. That happened because I was going to implement the PRs for thread naming that are there and respond to a request to make these dialogs dockable/floatable. At that point, the same dialog will need to accommodate a lot of columns or a few columns, so I just went ahead and built that. If you hate that too, then just rip it out.

@Faless
Copy link
Collaborator

Faless commented Sep 14, 2022

I don't think there are any "etc" so it is just TCP and websockets.

There are, the RemoteDebuggerPeer exists is on purpose so other methods can be implemented, named pipe, reverse TCP, whatever.
Specific platforms might require different low level peer for communication.

@YuriSizov
Copy link
Contributor

Your whole discussion sounds like "I would have done it differently" and that's ok. What are the actual defects that you are reporting? I said I would stick around long enough to correct any defects, but I can't justify spending time/money to implement what I would consider a worse solution for you?

Area maintainers are entitled to envision a different implementation than the one suggested by a contributor because they have a good understanding of how this fits the engine design overall. You may not have enough background with the project to get a feel for that. So far this has been your spec that you yourself have implemented as you see fit. I think it's unreasonable to expect it to be merged as is without maintainer's considerations being taken into account, even if you take a responsibility to stick around.

If you are insistent on your solution being tangibly better, then my suggestion here would be to wait for more feedback from other contributors and maintainers and see what they think. So we have a better representation of opinions than just two opposite ones.

@Faless
Copy link
Collaborator

Faless commented Sep 14, 2022

The receiving side of the connection stops reading because there are no resources available to service requests (threads, concurrency limits, buffers, whatever the "server" is) which in this implementation is the max size of queued requests. When it stops reading, the TCP protocol communicates this fact to the sending side and causes the sending socket to become non-writable (when buffers are full.)

And when that happens, if the producer on the other side (editor) keep producing messages (because we have no control over message generation), it will start filling the outgoing queue.
And when the outgoing queue is full, messages will have to be dropped.

@derammo
Copy link
Contributor Author

derammo commented Sep 14, 2022

The receiving side of the connection stops reading because there are no resources available to service requests (threads, concurrency limits, buffers, whatever the "server" is) which in this implementation is the max size of queued requests. When it stops reading, the TCP protocol communicates this fact to the sending side and causes the sending socket to become non-writable (when buffers are full.)

And when that happens, if the producer on the other side (editor) keep producing messages (because we have no control over message generation), it will start filling the outgoing queue. And when the outgoing queue is full, messages will have to be dropped.

I addressed that in a previous post. When the main thread channel back pressures the client, the connection could fire a signal. This would be used in the UI to set the user interface for clicking on remote stuff to disabled, so that the user would know they aren't going to be able to look at remote things right now, rather than jamming on the buttons and nothing happens.

@derammo
Copy link
Contributor Author

derammo commented Sep 14, 2022

Area maintainers are entitled to envision a different implementation than the one suggested by a contributor because they have a good understanding of how this fits the engine design overall. You may not have enough background with the project to get a feel for that. So far this has been your spec that you yourself have implemented as you see fit. I think it's unreasonable to expect it to be merged as is without maintainer's considerations being taken into account, even if you take a responsibility to stick around.

If you are insistent on your solution being tangibly better, then my suggestion here would be to wait for more feedback from other contributors and maintainers and see what they think. So we have a better representation of opinions than just two opposite ones.

I appreciate the time you spent to clarify that. I asked Akien if we should try to merge this so at least there would be a threaded debugger (after 7 years of people complaining about it) that works. Even though it is super close to Beta and even though I need to stop working on Godot contributions. I am just trying to maybe have this work benefit some users, rather than it just dying here. I know that's not the best possible flow and it isn't the Godot way. I tried to get design review in the proposal and status updates and interim commits I have been providing, but it was just really bad timing with the main guy being on vacation at the time and also us apparently having very different backgrounds and never communicating effectively about technical things. Also, I was hard to work with because I was very unhappy with the way things work here, which is why I need to try something else.

So yeah, it isn't as intended. I don't use Godot and I don't work for anyone who uses Godot. I have absolutely no reason to try to push this into the product except that I feel it would be a shame and it cost a lot to learn all the different affected areas to be able to implement it. I suspect that is why it sat there for 7 years and so I thought it would be silly to waste it. I am probably over-estimating the value of my contribution because it was moderately challenging for an external :)

@derammo
Copy link
Contributor Author

derammo commented Sep 14, 2022

Update to save people some time: We can stop debating this here. I had a good chat with @akien-mga and the pressure is off anyway schedule wise, so I am much more relaxed now. :)

Please merge this to a branch in your repo, so the next developer can use it as a proof of concept or reference or whatever. I won't be keeping it in mine past October 1, 2022.

@Zireael07
Copy link
Contributor

Zireael07 commented Sep 15, 2022

I won't be keeping it in mine past October 1, 2022.

Why? You said there is no pressure and then you introduce time pressure? 😕

@derammo
Copy link
Contributor Author

derammo commented Sep 15, 2022

Why? You said there is no pressure and then you introduce time pressure? 😕

I meant there is no more pressure to try to get this approved. The only way it would have made Beta is if it had been approved with minor fixes, and that obviously did not happen. So now I am just saying please archive this in a branch so I can close my PRs and get rid of my fork.

My reasons for doing that are private. I'm trying to be considerate by letting you know instead of just disappearing.

@akien-mga akien-mga modified the milestones: 4.0, 4.x Nov 24, 2022
@akien-mga akien-mga modified the milestones: 4.x, 4.2 Jun 19, 2023
@akien-mga
Copy link
Member

Superseded by #76582. Thanks for the contribution nonetheless!

@derammo
Copy link
Contributor Author

derammo commented Jun 20, 2023

@akien-mga for posterity/search results:

The simplified implementation provided by reduz doesn't address the requirements implemented here. There may be a time in the future where additional stuff from here could be leveraged / reimplemented. This clarifies "superseded." :)

@akien-mga akien-mga modified the milestones: 4.2, 4.x Jun 22, 2023
@YuriSizov YuriSizov removed this from the 4.x milestone Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants