
Should I Ignore k_ESteamNetworkingSocketsDebugOutputType_Bug? #193

Open
Kodpact opened this issue Sep 5, 2021 · 9 comments
Comments

@Kodpact

Kodpact commented Sep 5, 2021

I ran the example chat program with many client connections at the same time, and when I terminated all the clients at once, the server failed asserts, producing debug output of type k_ESteamNetworkingSocketsDebugOutputType_Bug.

AssertMsg1( false, "Processed thinkers %d times -- probably one thinker keeps requesting an immediate wakeup call.", nIterations );
And
AssertMsg( len( m_vecPendingCallbacks ) < 100, "Callbacks backing up and not being checked. Need to check them more frequently!" );

These two asserts were failing for me. Do I need to ignore them, or should I shut the server down? Can I fix these issues?
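(For context: GameNetworkingSockets routes messages like these through a debug-output callback installed via SteamNetworkingUtils()->SetDebugOutputFunction(). Below is a minimal, self-contained sketch of a handler that singles out Bug-severity output rather than ignoring it. The enum is a stand-in mirroring a few values of the real ESteamNetworkingSocketsDebugOutputType so the snippet compiles without the library; in a real build, include the GNS headers and use the library's own types.)

```cpp
#include <cstdio>

// Stand-in mirroring part of ESteamNetworkingSocketsDebugOutputType.
// In a real build, include <steam/steamnetworkingtypes.h> and use the
// library's enum and SetDebugOutputFunction() instead.
enum DebugOutputType {
    k_DebugOutputType_Bug = 1,      // internal invariant violated
    k_DebugOutputType_Error = 2,
    k_DebugOutputType_Warning = 4,
    k_DebugOutputType_Msg = 6,
};

static int g_nBugMessages = 0;

// Handler with the same shape as the callback GNS invokes for its debug
// output. Bug-severity output signals a library-level problem, so it is
// counted and logged loudly instead of being silently dropped.
void DebugOutput(DebugOutputType eType, const char *pszMsg) {
    if (eType == k_DebugOutputType_Bug) {
        ++g_nBugMessages;
        std::fprintf(stderr, "GNS BUG: %s\n", pszMsg);
    } else {
        std::fprintf(stderr, "GNS: %s\n", pszMsg);
    }
}
```

The short answer suggested by the thread is that Bug-severity output should not be ignored: it indicates a failed internal invariant worth reporting upstream.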

@zpostfacto
Contributor

zpostfacto commented Sep 5, 2021

If you can reproduce it, can you debug it? Just put a breakpoint on that assert. Then, instead of breaking out of the thinker loop, use Set Next Statement to let the straggler think one more time, and see who it is and what it is doing.

@zpostfacto
Contributor

(And BTW, that assert is probably an indication of a bug that is worth tracking down and fixing. If you are able to reproduce it and can give me some info, I'd appreciate it.)

@Kodpact
Author

Kodpact commented Sep 6, 2021

I am using the latest release version (1.30); later updates may have fixed this.
Note that I am connecting about 400 clients to the server (in ~10 seconds) and then killing all of them, which causes those assert failures. (Quick off-topic question: I was able to connect only 63 clients from one process. I'm on Windows; is that because of WaitForMultipleObjects, which has a limit of MAXIMUM_WAIT_OBJECTS = 64 handles, or am I doing something wrong?)
The second bug message seems to be about callbacks. It isn't anything dangerous; the queue just gets bigger because the server cannot keep up, but nothing gets corrupted. Once the backlog exceeds 100 the assertion fires. Is 100 really that big? Maybe there should be an option to configure that value.
But the first one seems different. It looks like the thinker enters a deadlock (this condition fails); further debugging shows that the main thread owns the lock and the SteamNetworking thread cannot obtain it.
(screenshot)
I don't have the full picture of what's happening, but I think the code here has a bug. Every time I paused on the breakpoint with nIterations at a big number like 6K+, the size of the queue s_queueThinkers was 1 (which seems telling: I was connecting no more than 400 clients, so how did it reach 6K or 10K?), and it was entering the same path over and over, because the call to
pNextThinker->InternalSetNextThinkTime( usecNow + 1000 );
was not removing the item from s_queueThinkers, so the size stayed 1 in an infinite loop. This line of code has no effect, even though its purpose was to defer the thinker and avoid exactly this kind of spin.
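(The spin described in this comment can be reproduced in miniature. The sketch below is hypothetical; none of the names are taken from the actual GNS source. A single thinker is rescheduled to usecNow + 1000 on every pass, but because the reschedule never removes it from the queue and the loop's wake target lies further in the future, the iteration counter climbs until the guard trips.)

```cpp
#include <cstdint>

// Hypothetical miniature of the thinker service loop described above.
struct Thinker { int64_t usecNextThink; };

// Returns the number of iterations performed before either the thinker
// stopped being due or the infinite-loop guard (10000 iterations) fired.
int RunThinkerLoop(int64_t usecNow, int64_t usecWakeTarget) {
    Thinker t{usecNow};          // queue of size 1, due immediately
    int nIterations = 0;
    while (t.usecNextThink <= usecWakeTarget) {
        ++nIterations;
        if (nIterations > 10000)
            break;               // the "Processed thinkers %d times" assert
        // "Reschedule him for 1ms in the future" -- but this does not
        // remove the thinker from the queue, and when the wake target is
        // beyond usecNow + 1000 the loop condition stays true forever.
        t.usecNextThink = usecNow + 1000;
    }
    return nIterations;
}
```

With a wake target beyond usecNow + 1000 the guard fires; with a nearer target the reschedule actually ends the loop after one pass.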

Here is the code of the chat client/server, modified to create spammer clients.
I'm spawning 63 clients per process and connecting them to the server.

@zpostfacto
Contributor

The newest code only uses one wait object, regardless of the number of sockets, so the 63 socket limit should be fixed. (5b70535)
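(For readers hitting the same ceiling on older versions: MAXIMUM_WAIT_OBJECTS really is 64 on Windows, so a design that waits on one handle per socket has to split the handles into groups of at most 64, typically with one waiting thread per group. A tiny portable sketch of the arithmetic, with a hypothetical helper name:)

```cpp
#include <cstddef>

// On Windows, WaitForMultipleObjects can wait on at most
// MAXIMUM_WAIT_OBJECTS = 64 handles per call. Covering more handles
// requires splitting them into groups (often one thread per group).
constexpr std::size_t kMaxWaitObjects = 64;

// Number of WaitForMultipleObjects calls (or helper threads) needed to
// cover nHandles handles: ceil(nHandles / 64).
std::size_t WaitGroupCount(std::size_t nHandles) {
    return (nHandles + kMaxWaitObjects - 1) / kMaxWaitObjects;
}
```

The single-wait-object design in the newer code sidesteps this entirely.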

Also, are you perhaps simulating fake lag? There was a bug with the fake lag code, but I think it might have been introduced and then fixed (d515f45) since 1.30.

If you reproduce a deadlock, can you see what the main thread is doing? Would love to see a call stack from the main branch.

I recently implemented a P2P stress test, because we had a partner making hundreds of P2P connections (which are more expensive than UDP connections, in general). It did something very similar to what you are doing -- made a bunch of connections at a relatively high rate and watched for any assertions or warnings about long lock times, etc. This led to some optimizations. I think most of the problems I fixed were specific to P2P code, but there might have been some general fixes. I would be interested to know if the problems you are seeing are fixed with the latest code in the main branch.

@Kodpact
Author

Kodpact commented Sep 7, 2021

Also, are you perhaps simulating fake lag? There was a bug with the fake lag code, but I think it might have been introduced and then fixed (d515f45) since 1.30.

No, that's not it (I tested the latest version just to make sure). The item-remove function on s_queueThinkers is only called when usecTargetThinkTime is set to k_nThinkTime_Never in InternalSetNextThinkTime, which never happens once the program enters the deadlock; it keeps iterating infinitely until the infinite-loop guard assertion (nIterations > 10000) fires.
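(Condensed, the scheduling rule being described is that removal from the queue is gated on a sentinel value. The sketch below is a hypothetical miniature, not the real GNS data structure, and the sentinel's value is an assumption: an item leaves the queue only when its next think time is the "never" sentinel; any finite reschedule keeps it queued.)

```cpp
#include <cstdint>
#include <set>

// "Never" sentinel, as in the discussion above (value is an assumption).
constexpr int64_t k_nThinkTime_Never = INT64_MAX;

// Hypothetical miniature of the thinker queue (stand-in for s_queueThinkers),
// keyed by next think time in microseconds.
struct ThinkerQueue {
    std::multiset<int64_t> times;

    // Reschedule an item. The only path that shrinks the queue is the
    // k_nThinkTime_Never sentinel; a finite new time re-inserts the item.
    void InternalSetNextThinkTime(int64_t usecOld, int64_t usecTarget) {
        times.erase(times.find(usecOld));
        if (usecTarget != k_nThinkTime_Never)
            times.insert(usecTarget);
    }
};
```

This matches the observation above: rescheduling to usecNow + 1000 leaves the queue size unchanged, so the loop never drains.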

@zpostfacto
Contributor

OK, thanks. Well, if you are able to tell me what the other threads are doing (call stacks), I'll bet we can get it figured out. Also, I can try to apply your changes and run your test. If you want to turn your change into a more explicit unit test (add some command line parameters, or maybe even just an #ifdef that could be compiled into a separate executable) and submit a PR, then I can reproduce it and add it to the CI configuration. I understand if that's more work than you want to put into this. But if you are interested, I'd definitely appreciate the effort. In any case, I'll try to help get this fixed if you can give me some more info.

@Kodpact
Author

Kodpact commented Sep 7, 2021

This is what the main thread is doing:
(screenshot)
I think you've got it wrong. The problem is not in the locking/unlocking! The lock is released by the main thread after a few ms, so SteamNetworking can continue working; the problem is that the socket thread waits like while (true) instead of sleeping or waiting for the next service tick.

The comment clearly says what it should do: "Reschedule him for 1ms in the future". It just doesn't.
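(The fix being argued for here amounts to sleeping until the next scheduled think time instead of spinning. A hedged sketch of that pattern, with a hypothetical helper name that is not part of GNS:)

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: given the current time and the earliest queued
// think time (microseconds), compute how long the service thread should
// actually sleep instead of spinning in a while (true) loop.
int64_t UsecUntilNextThink(int64_t usecNow, int64_t usecNextThink) {
    // Never sleep a negative amount; overdue work means wake immediately.
    return std::max<int64_t>(0, usecNextThink - usecNow);
}
```

A service thread that sleeps for this duration (e.g. on a condition variable with a timeout, so new work can still wake it early) burns no CPU while waiting out the 1 ms reschedule.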

@zpostfacto
Contributor

Just now getting back to this bug. I don't suppose you have a test harness or other script I can run to reproduce the problem?

@Kodpact
Author

Kodpact commented Oct 4, 2021

It's not easy to reproduce. The best I could come up with is this MSVC project. Run the server from VS, then run clients from the batch script about 10 times; after ~600 successful connections, close all the clients. Wait a few seconds for the timeout, and then it should happen. (GNS is installed from vcpkg.)

project.zip
