Deadlock: Multi-VU many iterations #971
Comments
Investigation
After profiling, we found that the issue was related to the evaluate call that hides the page when page.close is called. Looking at the stack trace of the live goroutines, it became evident that it wasn't a traditional deadlock, but rather a stuck asynchronous process waiting for a response from the chrome browser. My theory was that at a certain point, when chrome is under a lot of load, it stops responding to CDP requests, and that is why we get timeouts in navigations and new page creations. I tested this theory by setting the timeout to a value longer than the max test duration. The theory did not hold: chrome does respond to these requests, it just takes longer to do so. As of now, we're unable to identify the root cause, and we can only guess that:
The current workaround for this "deadlock" issue is to have a timeout, but it doesn't solve the root problem.
Unrelated find #1
Unrelated to this current issue, we also found that logging the goroutine IDs was taking a lot of time (according to pprof), so we removed this step completely in #977.
Unrelated find #2
While working on this theory, I was also able to determine that the compute resources given to the remote chrome instance affect the test (see full investigation results here). We need to ensure that the compute resources for all components are set appropriately so that none of them over-utilise the compute resources, which would affect the outcome and make results inconclusive. |
This new EvaluateWithContext will allow us to pass in a context with a timeout. For issue #971, we're seeing that the evaluate that hides the page on page.close often ends up in a deadlock (waiting for the response from the chrome browser after executing the CDP evaluate command). We are not sure of the root cause of the issue, but this fix will help prevent this deadlock and allow the test run to complete in a timely manner without causing a timeout error.
A temporary fix can be found in #979. |
Could there be a race in handling iteration events? 🤔 From the stack trace, it seems like both
I haven't looked into this deeply, just guessing. So please take it with a grain of salt. |
When running this test, it was running 5 VUs, so it's likely an iteration finishing on one VU, and a |
Right. But, in the meantime, might we be closing/(update: or disconnecting from) the browser and this might be racing with |
I don't have the answer for why it's occurring in this scenario. I was hoping that it was due to the overloaded chrome instance; however, other APIs (such as page.goto) eventually receive a response when the default timeout is set very high, which would mean that chrome does indeed respond to CDP requests when it is under heavy load and doesn't drop messages. When the evaluate on
If the browser for that stuck VU were to close, wouldn't we get an error suggesting that the remote host closed the connection?
|
It seems we need to look deeper into this issue since you have very good points and I don't have the best answer at this moment :( |
After working with this fix against a remote chrome instance, we've now been able to reproduce the issue with
One more point to make is that the remote chrome instances were not under a lot of load. |
Fixed by #1219. |
While running the following test script:
Test script
The test has to be aborted since it goes over the 40m limit. Investigate why this happens and resolve it.
This is the screenshot of the dashboard of the test run. It would seem that all but one of the VUs completed all of their allotted iterations. One VU was stuck in a deadlock, so the test run kept waiting for it to complete an iteration that never finished, which eventually resulted in an abort.
Tasks