Intermittent requests hanging, eventually crashing the tab/browser #20897
We're able to reproduce this on Chrome as well. We recently updated from v7.7.0 to v9.5.3, and we don't see the issue on v7 but do see it on v9.5.3.
We've added debug logging for the browsers. @sharmilajesupaul it'd be interesting to know if you're seeing similar in your Chrome logs!

When remoted in and observing a build becoming stuck, what we see in the network tab is a bunch of requests that are stuck in a 'Pending' state, all showing as 'Stalled' under timings. For our app, these are generally third-party things such as Google reCAPTCHA, Sentry, and so on. I don't believe any of these individual requests is at fault, because it's different ones each time (and I've wasted some time stamping out various ones of them, to no avail). In the case of my original post, it was the request to fetch the spec file itself that was stuck! The devtools docs suggest this happens when things are stuck behind "higher priority requests" - maybe they're stuck behind these security ones that we can see going wrong in debug logs?

In an attempt to resolve this or get more information, I've tried stubbing out the security endpoints in Edge via an intercept.
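A minimal sketch of that kind of stub — the hostnames below are illustrative guesses, not the actual endpoints from this report:

```js
// cypress/support/index.js - stub assumed Edge security endpoints so they
// resolve immediately; hostnames here are placeholders, not confirmed targets
beforeEach(() => {
  cy.intercept({ hostname: 'edge.microsoft.com' }, { statusCode: 200, body: '' });
  cy.intercept({ hostname: 'smartscreen.microsoft.com' }, { statusCode: 200, body: '' });
});
```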
I can see that the intercepts are working in general for other Edge endpoints (from the Cypress logging when an intercept occurs), but they're unsuccessful in preventing the requests to the security endpoints themselves.
We've disabled safe browsing via the following snippet (inspired by #9017):
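A sketch of what such a snippet typically looks like, using Cypress's `before:browser:launch` API to set Chrome's own preference keys (the exact snippet from #9017 may differ):

```js
// cypress/plugins/index.js - disable Chrome Safe Browsing for the test browser
module.exports = (on, config) => {
  on('before:browser:launch', (browser, launchOptions) => {
    if (browser.family === 'chromium') {
      // Chrome reads this key from its preferences file; Cypress exposes it
      // through launchOptions.preferences.default
      launchOptions.preferences.default.safebrowsing = { enabled: false };
    }
    return launchOptions;
  });
};
```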
This has prevented the safe-browsing errors. We'll continue to monitor the data on our end and update if we have any breakthroughs.
@alyssa-glean, if possible, could you run Cypress with the debug logs turned on:
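The exact command wasn't captured here; a typical invocation, assuming the standard `DEBUG` environment variable from the Cypress docs, would be:

```sh
# Capture Cypress debug logs (written to stderr); narrow the namespace,
# e.g. DEBUG=cypress:server:*, if the full output is too noisy
DEBUG=cypress:* npx cypress run 2> cypress-debug.log
```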
@alyssa-glean My apologies, I haven't had the time to pull the logs on this, but this stopped happening for us after updating our version of Chrome. We were running a much older Chrome (89.0.4389.82), which with Cypress v9 was causing this issue for a particular test suite. When we bumped Chrome to 101.0.4951.54 (the latest at the time, on May 4th 2022), the issue seems to have been resolved.
No problem - glad you've found something that worked for you! We did a lot of bumping of browser versions when we first hit this, so far to no avail - but we're a few stops short of 101.x.y.z so perhaps we'll give that a go.
Sure, we'll try this too and get back to you - we have tinkered with the debug logs a bit, but not with these exact flags. We didn't spot anything interesting in them, but we're definitely not the experts 😅
@mschile hi, I work with Alyssa. I have Cypress logs with debugging enabled.

We know a bit more about the issue now: it looks like we are triggering a bug in Chrome's JavaScript engine. Using gdb, we can see that the renderer process is just looping over the same few functions related to JavaScript exceptions and rendering stack frames. We also know that Chrome DevTools does not work properly during a hang, and that the JavaScript profiler hangs if you try to start it. This all suggests to me that the JavaScript engine is stuck in some uninterruptible loop; if I can get Chrome to tell me about JIT'd code locations/unwinding, I might be able to work out what's causing it.

At the moment this is a real heisenbug, as a lot of the things we've tried to make it easier to reproduce or debug have caused it to disappear. So I think this is a JavaScript engine bug, and a very weird one. This was a very long way of saying I'm not sure the Cypress logs will be helpful.
@tonysimpson-sonocent Can you send me an email with the debugging logs to shawn.harris@cypress.io, please? Please list this GitHub issue as the subject line.
Cypress debugging log sent.
@tonysimpson-sonocent Thank you!
Hey @tonysimpson-sonocent. I was able to go through the debug logs. From the crash logs and the logs you sent, it looks like the server is running out of memory, which is possible since the browser is headful in CI and video recording is turned on. How much memory is allocated to the Jenkins job / Cypress node instance? It looks like 2GB on my end. Have you tried increasing the memory to see if the issue resolves? Could you try increasing it to maybe 4GB or 8GB, capturing the debug logs then, and seeing if the issue persists? If it does persist, it could be indicative of a memory leak someplace.
@AtofStryker the instance running Cypress has 14GB. I don't see any evidence of processes running out of memory, but I do see in the log (and from other means) that the Chrome/Edge process runs out of JavaScript heap. I think this is due to a bug in Chrome that our tests are triggering. The browser says there's about a 3.5GB limit on the JavaScript heap, and we can see in the Chrome log captured by Cypress that this limit is hit.

I'm pretty sure this memory leak is a bug in Chrome/Edge's JavaScript engine - see my previous comment.
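For reference, the ~3.5GB ceiling mentioned above can be read from Chrome's non-standard `performance.memory` API in the DevTools console of the runner tab:

```js
// Chrome-only, non-standard API; values are reported in bytes
const { jsHeapSizeLimit, usedJSHeapSize } = performance.memory;
console.log(`heap limit: ${(jsHeapSizeLimit / 2 ** 30).toFixed(2)} GiB`);
console.log(`heap used:  ${(usedJSHeapSize / 2 ** 30).toFixed(2)} GiB`);
```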
@tonysimpson-sonocent interesting. Is there anything that consistently reproduces the issue, or evades it completely? Have you tried increasing the
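The end of that question was cut off; assuming it referred to raising the browser's JavaScript heap ceiling, one way to do that from Cypress is to forward V8 flags at browser launch:

```js
// cypress/plugins/index.js - raise V8's old-space heap limit (value in MB);
// --js-flags forwards flags to V8, and --max-old-space-size is a V8 flag
module.exports = (on, config) => {
  on('before:browser:launch', (browser, launchOptions) => {
    if (browser.family === 'chromium') {
      launchOptions.args.push('--js-flags=--max-old-space-size=8192');
    }
    return launchOptions;
  });
};
```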
possibly related to #22128 but cert errors might be a red herring?
one run (228) hung in
had a couple of hangers in
Looks like the hangs were legit. Downgrading to
Yeah, I thought I'd help out a bit in confirming/bulking up the sample size - it's very exciting news if we truly cannot replicate on 4.0.0! 🎉 One thing to keep in mind is that browser version might be a factor... but am I right in thinking that all of your
@alyssa-glean I have been regularly bumping the docker image when possible. I did reproduce the hang with
Well those 5 didn't stick, so I guess that's 35 total for 4.0.0. Although the output suggests it's still using 4.1.0 despite what's in
Ah, I am looking at this docker image and we do install
did see a hang with
We have also had the hanging behaviour for quite a while now. I have the feeling it was introduced with a new version of Chrome and that it might not be related to Cypress. For instance, Chrome version 90 works properly, but version 99 hangs consistently.
Ran with the custom docker image (4.0.0). One job timed out, but was still producing output steadily, so I increased the timeout to 45 minutes and am monitoring.
A bit of a frustrating one today. I created a docker image that installs Node 12, and I am wondering if we next try downgrading to 3.x.
Yeah, this has been the nature of the problem for us throughout, TBH. Since I started looking at it, I've been convinced I've found a plausible fix 3 or 4 times, only to ultimately be proven wrong. In one instance, we had no stuck builds for a couple of days before we found out my latest change hadn't resolved it 😞. Intermittent problems are such a pain to investigate! Downgrading to 3.x sounds like a reasonable next step to me.
I think these will be runs that got stuck but then managed to unstick themselves again - we've seen this as well while investigating, albeit rarely. Usually just at the point where you're ready to throw some diagnostics at it 😈.
Definitely can empathize with the frustration 😅. I just hope we get close to something soon. I did take a look at downgrading to 3.x. Ran 10 times so far and haven't seen a hang yet. Maybe a good sign?
Spoke too soon. Run 364 hung up 😢
@alyssa-glean I am getting ready to move off of rotation this week, but @rachelruderman is going to be taking over for me on this issue for the next few weeks. Would you be able to add her to the repo to contribute?
Hey, just wanted to share an update on what we've been up to over the last week or so.

I've just merged a PR that simplifies the test repo quite a bit more. There's now just a single spec, which runs the same test in a loop 300 times. The test is a "failed login flow" - all it does is enter some credentials, hit 'log in', and expect an error to be shown. One upside of this is that there's no longer any need for any backend services, so we've also added the local config file into source control (because there are no credentials needing to be wired in anymore). This should mean that any collaborator can get up and running locally straight away 🎉

I've started experimenting with hosting our site in different ways to see whether the problem remains or goes away. Currently our app is hosted in a k8s cluster behind traefik; I'm playing with a branch where we host it as a static S3 bucket served over raw HTTP instead. So far, I haven't reproduced the hang on this branch (but I would like to do more runs to be sure).
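For anyone following along, a sketch of the shape of that looped spec — the route, selectors, and error text below are stand-ins, not the repo's actual code:

```js
// cypress/integration/failed-login-loop.spec.js - the same test repeated
// 300 times; Cypress._ is the lodash instance bundled with Cypress
describe('failed login flow', () => {
  Cypress._.times(300, (i) => {
    it(`shows an error for bad credentials (iteration ${i + 1})`, () => {
      cy.visit('/login');                                    // stand-in route
      cy.get('input[name=email]').type('user@example.com');  // stand-in selector
      cy.get('input[name=password]').type('wrong-password');
      cy.contains('button', 'Log in').click();
      cy.contains('Invalid credentials').should('be.visible'); // stand-in text
    });
  });
});
```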
@alyssa-glean I'm so glad to hear that! Please keep us updated 🙏
It's been a while since I last gave an update, but I have some good news. We ran another experiment, this time to see what happened if we ran the same minimal scenario using Playwright instead of Cypress. We rewrote the simple login test and did a bunch of builds with it running in a loop, and we're now very confident that we've run into the exact same problem. It manifests slightly differently, because Playwright detects that its worker has stopped responding and spins up a new one in its place.

This means the problem must lie elsewhere, and it's some interaction between our app and the browser that's the root of the problem. The bad news (for us) is that we still don't know exactly what, but we're definitely getting closer. And it means I can close this issue and not waste any more of your time. Thanks for all the help you've given us with this, we've really appreciated it! 🏆
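Roughly, the Playwright port of that test would look like the following — again with stand-in selectors and texts rather than the actual app's:

```js
// login.spec.ts - Playwright equivalent of the looped Cypress test
import { test, expect } from '@playwright/test';

test('failed login shows an error', async ({ page }) => {
  await page.goto('/login'); // relies on baseURL in playwright.config
  await page.locator('input[name=email]').fill('user@example.com');
  await page.locator('input[name=password]').fill('wrong-password');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByText('Invalid credentials')).toBeVisible();
});
```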
@alyssa-glean that's great news that it isn't something Cypress-related, but I am bummed the issue is still present. We are always glad to help, and if you run into any additional trouble please reach out!
@alyssa-glean can you let us know if you find the root cause of the issue between your app and the browser? We have been experiencing the same issues as you.
Sure thing. To be honest, this has been plaguing us for so long now that I think I'll be shouting it from the rooftops whatever it turns out to be! 🙈
Current behavior
This is an issue we've been seeing intermittently in our CI pipeline for some time, but we had a particularly clean/minimal example on Friday, so this issue will specifically reference what we saw on that occasion.
We observed that one of our parallel runners had stopped emitting any output, suggesting that Cypress was stuck somewhere. Using VNC, we remoted into the display that Cypress uses to see what was going on, and what we saw was the browser spinning, trying to load our tests. The browser was fully responsive, and refreshing the tab got it stuck in the same place every time:
In the network tab, we could see that it was the XHR request to
$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts
which was getting stuck. Chrome listed it as 'Pending' - I've attached some screenshots of what we could see in the network tab itself. We were also able to reproduce this in the Chrome console by firing off a manual
fetch('$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts')
- the promise that was returned never completed. We did this about three times before firing off one for a different spec file to see what would happen. Immediately as we did this, the browser completely crashed and we got a JavaScript heap out of memory error in the runner logs - see crashed output.txt.

We've been trying to narrow down this issue for some time now, and have held off raising a Cypress issue as we were concerned that it might be a regression in our own app. However, in this instance the problem was occurring before our site was even loaded, leading us to believe it's a Cypress issue (specifically around the way that requests are proxied).
Desired behavior
No response
Test code to reproduce
Our
cypress.json
looks as follows, if it's of any interest (we are running Cypress in headed mode):
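A representative Cypress 9 config of this shape — the values below are illustrative placeholders, not the actual settings from this report:

```json
{
  "baseUrl": "http://localhost:3000",
  "video": true,
  "viewportWidth": 1280,
  "viewportHeight": 800,
  "retries": { "runMode": 2, "openMode": 0 }
}
```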
Cypress Version
9.5.1
Other
The browser in this case was Edge 100, but we're seeing the same issues in Chrome as well.