Multiple parallel long fetches with overridden, non-zero timeouts fail all the time #1887
Comments
Does this problem happen only with fake timers or "live" too?
I am battling something brutal with 5 longish (~20 minute) requests not working on a real, deployed machine. 1 or 2 of the shorter requests work, but the other few just fully "die" and eventually trigger the headers timeout error if that is set. Running the requests individually has no issues. I don't necessarily experience the same issue on my local machine yet, so I am still working on figuring out repro cases. So there is definitely SOMETHING happening live; I will run this test right now and edit this comment with the results of running it without fake timers.

I tried changing the timeouts to 0 and still experience the issue, so I suspect there is more to this than just this specific example, but I am only able to do one step at a time. I have had to switch back to axios for these requests, and everything is fine with axios while I try to resolve this on the side.
I am unable to get this same test to fail on my machine without fake timers... I will keep trying to find ways to repro what I am seeing on my server, but any guidance on why the test is failing that might be a clue would be great.
I have to give up and go back to axios for this. I have tried so many different things to reproduce this locally and debug it, but I can't, and I understand this is not that actionable/helpful unfortunately. I can't change/control the server I am accessing, and I can't reproduce anything like this with any API servers I spin up (using benchmarks/server as a base, on a remote machine on my LAN) with various levels of timeouts, data sizes returned, etc.

I am doing 4 parallel 5-to-25-minute requests from a single Node app on a Linux (Ubuntu) server on EC2 to 1 server that is running on Google Compute Engine. I can get a response for 1 or 2 of the 4 requests depending on what I have changed, but never all 4 at the same time. The sockets all show as established (ss command). If I do all 4 requests in series (wait for the previous to complete before starting the next), they all work fine. On my local dev machine (Ubuntu), all 4 long fetches happen fine, the code all works, no issues at all.

I suspect there is something specific to the server I am accessing, or some weird edge of an edge case here, but I have just run out of ideas and things to try unfortunately. Things I have tried on the EC2 server (after first checking that all 4 requests worked with the change locally) without luck:
The query looks like this, pretty straightforward other than the SSL pinning:
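The original snippet was not preserved in this copy of the thread. A hypothetical sketch of what such a call could look like with undici, assuming the timeouts are overridden on a dispatcher and the pinning is done through the connect TLS options (the URL, file name, and values are placeholders):

```js
import { readFileSync } from 'node:fs'
import { fetch, Agent } from 'undici'

// Placeholder CA bundle used for pinning; not the reporter's actual setup.
const pinnedCa = readFileSync('./pinned-ca.pem')

const dispatcher = new Agent({
  headersTimeout: 30 * 60 * 1000, // 30 minutes, matching the long responses
  bodyTimeout: 30 * 60 * 1000,
  connect: { ca: pinnedCa }       // TLS options are forwarded to tls.connect
})

const res = await fetch('https://api.example.com/long-report', { dispatcher })
const data = await res.json()
```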
My experience with this kind of issue tells me the fakeTimers problem is not really related. What this looks like: a firewall is truncating the connection in a way that we do not recognize, generating an error that we ignore.
I just ran a bash script from the server fetching these requests in parallel with curl: a total of 8 requests, ranging from 2 minutes to 20 minutes, all at the same time, same as I call from my Node app. All 8 finished fine with no issues.

I do think there is an error that is being ignored somehow, but I am unsure how to even debug that. The only other thing I can think to try right now is to swap in node-fetch and simply say "yep it works" or "no it doesn't". If there is anywhere I can put some more debugging, I am open to it!
Aha! node-fetch acts nearly identically to undici and some of the requests fail!

This looks very interesting to me, so I will take a look to see if it makes a difference for me: https://github.com/node-fetch/node-fetch#custom-highwatermark. Initial testing of changing from highWaterMark: 0 to highWaterMark: 5 made no difference; it doesn't even seem to be getting that far.

It kind of looks like the last request that is asked for completely overrides the earlier requests. I added quite a bit of logging and it really seems like the requests just disappear. I don't know where to go from here with my debugging :/
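For reference, the node-fetch option mentioned above is passed per request; a minimal sketch of setting it, per the linked README (URL is a placeholder):

```js
import fetch from 'node-fetch'

// highWaterMark sizes the internal buffer of the response body stream
// (node-fetch's default is 16 kB).
const res = await fetch('https://api.example.com/long-report', {
  highWaterMark: 1024 * 1024 // 1 MiB
})
const body = await res.text()
```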
Can you try undici.request?
undici.request behaves the same as fetch for this case after some testing |
I wonder what axios is doing to make the request succeed all the time. |
Yeah, I am really out of my depth here. How would I go about instrumenting the .js to really make sure keep-alives are working as intended for these long queries? Tracing around kKeepAliveTimeoutValue and trying to understand the keep-alive handling within the codebase has me a bit lost. I did already try to remove this as a variable by using "pipelining: 0", but that might not have been the right way to go about it.

It just really seems like the shortest request is causing the others (the longer ones) to be destroyed without any notice/events being thrown/shown to the user. I am also looking at the util.destroy(socket) calls; maybe I will put a log statement in destroy() and see if that's getting called.
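For what it is worth, pipelining is configured on the dispatcher rather than per request; a minimal sketch, with the caveat that, per the Client docs, pipelining: 0 also disables keep-alive connection reuse, so it removes more than one variable at once (the URL is a placeholder):

```js
import { fetch, Agent } from 'undici'

// pipelining: 0 means requests are not pipelined and, per the Client docs,
// keep-alive connection reuse is disabled as well.
const dispatcher = new Agent({ pipelining: 0 })

const res = await fetch('https://api.example.com/long-report', { dispatcher })
```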
I am surprised how readable axios is; I had never looked into it. I am doing a tiny bit of digging, but it is going to be something small. Really a shame I still can't find a local/developer-friendly repro. I can repro it EVERY time on my prod environment though, so at least it is not ephemeral or intermittent.

This is nuanced and seems to differ a bit from how undici handles it: axios/axios#1752
I think those long connections are timing out at the TCP level due to inactivity. We could think about providing an option to enable TCP keepalive: https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay
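For context, the API linked above enables TCP-level keepalive probes on an individual socket, which is distinct from HTTP keep-alive (connection reuse). A minimal sketch against a plain net socket, with placeholder host and delay:

```js
import net from 'node:net'

const socket = net.connect({ host: 'api.example.com', port: 443 })

socket.on('connect', () => {
  // Start sending TCP keepalive probes after 60 s of inactivity so an idle
  // (but still wanted) connection is not silently dropped by middleboxes.
  socket.setKeepAlive(true, 60_000)
})
```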
I will try hard-coding that somewhere when I get a chance.
Great news, thanks for sticking with me here. I added a socket.setKeepAlive call
right here: Line 1350 in 2b260c9
I tested it, and success: my requests all go through and the performance is on par with axios! I verified the keep-alive behavior before/after. Excellllllent!

Now the question is: would you be open to me opening a PR, and how should it relate to the existing ClientOptions parameters? https://undici.nodejs.org/#/docs/api/Client?id=parameter-clientoptions
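Until such an option lands, roughly the same effect can be had in userland by wrapping the default connector instead of patching undici internals. A minimal sketch, assuming an Agent accepts a connector function (the 30-second delay and URL are placeholders):

```js
import { fetch, Agent, buildConnector } from 'undici'

// Build undici's default connector, then enable TCP keepalive on every
// socket it hands back before passing it on.
const baseConnect = buildConnector({})

const dispatcher = new Agent({
  connect (opts, callback) {
    baseConnect(opts, (err, socket) => {
      if (socket) socket.setKeepAlive(true, 30_000)
      callback(err, socket)
    })
  }
})

const res = await fetch('https://api.example.com/long-report', { dispatcher })
```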
It should be a new parameter: WDYT @ronag? Should this be enabled by default? It might actually help with some ECONNRESETs.
Is there any downside to it? I don't mind having it as default.
@xconverge would you mind opening a PR? |
Actually, I think this might already work?
We just need to override the defaults in |
@ronag I made the latest changes in the PR, but yes, I now see exactly what you are saying. Let me run a test with the current version of undici with:
If this is true, then should this just be a documentation exercise, or do we still want to continue with some logic/parsing of an additional option where we set both of these by default if not defined explicitly by the user? Since it is 2 flags being controlled by 1 undici-specific parameter, it might be kind of weird but doable (i.e. a user sets keepAlive: false and tcpKeepAliveInterval to non-zero); we would just need a decent-sized validation before overriding them, but I am OK with that.
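The configuration being tested here would look roughly like this. It assumes undici spreads its connect options into the net.connect / tls.connect call, where keepAlive and keepAliveInitialDelay are recognized on Node 16.5+; whether that actually takes effect for TLS sockets is what the next comment reports on (URL and delay are placeholders):

```js
import { fetch, Agent } from 'undici'

// Assumption: these connect options are forwarded to the underlying
// net/tls connect call by undici's connector.
const dispatcher = new Agent({
  connect: {
    keepAlive: true,
    keepAliveInitialDelay: 60_000
  }
})

await fetch('https://api.example.com/long-report', { dispatcher })
```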
Setting these in socket = tls.connect({}) doesn't work because we are upgrading/assigning the socket, but manually calling socket.setKeepAlive with them after connect() works, so I think this will still require a code change.
Bug Description
When running multiple LONG-running fetches in parallel, some of them seem to just die and hit a timeout that doesn't really seem relevant.
Reproducible By
Tweaking minutesPerRequest, numParallelRequests, and timeout here makes for some pretty confusing results. When timeout is 0, everything seems to work correctly. I suspect there are a few timers that are overlapping each other or getting shared somehow.
Simply set timeout to 0 and see that the test passes, then set timeout to 30 minutes and see that it fails.
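The original test was not included in this copy of the report; a hypothetical stand-in that exercises the same shape of workload, using the variable names mentioned above (the URL is a placeholder for a sufficiently slow endpoint):

```js
import { fetch, Agent } from 'undici'

const minutesPerRequest = 20
const numParallelRequests = 4
const timeout = 30 * 60 * 1000 // fails; 0 (timeouts disabled) appears to pass

// Override the default headers/body timeouts for the long responses.
const dispatcher = new Agent({ headersTimeout: timeout, bodyTimeout: timeout })

const results = await Promise.allSettled(
  Array.from({ length: numParallelRequests }, (_, i) =>
    fetch(
      `https://slow.example.com/report?minutes=${minutesPerRequest}&id=${i}`,
      { dispatcher }
    ).then(res => res.text())
  )
)

console.log(results.map(r => r.status)) // expect all 'fulfilled'
```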
Expected Behavior
The fetches should all succeed.
Logs & Screenshots
Environment
Node v18.13.0
Additional context