HTTP and HTTPS attempt only single IP, causing intermittent failures #708
Slightly better test code (with a call added so the function actually runs):

```js
var https = require('https');

function get() {
  https.get('https://www.itunes.com', function(res) {
    console.log('statusCode: %s & ip: %s', res.statusCode, res.connection.remoteAddress);
  }).on('error', function(err) {
    console.log('Error:', err);
  });
}

get();
```
The reason for
That would be a backwards incompatible change and it would be detrimental to what is probably the common case: that the server is really down. I'm leaning towards this working as intended and that more advanced error handling is best left to the user. I am sympathetic to the 'it should just work' argument but I don't see a way of doing it that doesn't have downsides.
It's challenging to do more sophisticated error handling above the https library level because if you pass an IP address to

I would argue that libcurl is a good reference implementation for non-interactive HTTP fetchers, and io.js fetching should aim to work in the same way.

What about changing only the

For mDNS and other things that don't show up in

Alternately, I think
In terms of HTTP and HTTPS having retry built in, I'm personally not so sure, it may improve a sense of stability in a lot of uses, but it also may be unexpected behaviour, this seems more like a layered behaviour. That said, the way .NET solves this is with a delegate which could translate roughly to something like this:
This allows developers to control the behaviour a little more, you could introduce some standard behaviours too such as:
The only thing I'm not sure about is a race condition on the handler being set before the request starts and DNS lookup begins, but this seems like it would also cause issues with methods like

Thanks,
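The code from this comment was lost in formatting. As a purely illustrative reconstruction of the delegate idea, here is one possible shape: every name below (`resolveWith`, `selectAddress`, `first`, `random`) is hypothetical, not a real Node or .NET API.

```javascript
'use strict';
// Hypothetical delegate-style hook: the caller supplies a selectAddress
// function that receives every resolved A record and returns the one to
// dial. None of these names exist in Node core; this is a sketch only.
const dns = require('dns');

function resolveWith(hostname, selectAddress, callback) {
  dns.resolve4(hostname, (err, addresses) => {
    if (err) return callback(err);
    callback(null, selectAddress(addresses)); // the delegate decides
  });
}

// A couple of "standard behaviours" such a hook could ship with:
const first = (addresses) => addresses[0]; // status quo: always the first record
const random = (addresses) =>
  addresses[Math.floor(Math.random() * addresses.length)]; // spread the load
```

A caller would then pass `first`, `random`, or its own function, keeping the policy entirely in user code.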
I don't think the behavior I suggested is unexpected. In addition to the browser examples, curl, and wget, I tested Ruby's Net::HTTP and Python's Requests with Wireshark, and they also attempt simultaneous connections and use the first to complete. The list so far is: Firefox: parallel

There is also RFC 6555, "Happy Eyeballs," describing the parallel connection algorithm in the context of mixed IPv4 and IPv6 connectivity. So I think this is generally accepted correct behavior for HTTP and HTTPS.

I filed a separate issue, #736, suggesting a new method
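For illustration, the parallel connect-and-race strategy described above can be sketched in plain `net`/`dns` calls. This is a simplified model rather than any client's actual algorithm: it omits the Happy Eyeballs stagger timer, and `raceConnect` and `connectFirst` are hypothetical names.

```javascript
'use strict';
// Simplified sketch of "connect to every address, keep the first winner".
// No stagger timer is applied between attempts, unlike RFC 6555.
const net = require('net');
const dns = require('dns');

function raceConnect(addresses, port, callback) {
  let settled = false;
  let pending = addresses.length;
  const sockets = addresses.map((address) => net.connect(port, address));
  sockets.forEach((socket) => {
    socket.on('connect', () => {
      if (settled) return socket.destroy();
      settled = true;
      // Abandon the losing attempts.
      for (const s of sockets) if (s !== socket) s.destroy();
      callback(null, socket);
    });
    socket.on('error', (err) => {
      if (!settled && --pending === 0) callback(err); // every attempt failed
    });
  });
}

function connectFirst(hostname, port, callback) {
  dns.resolve4(hostname, (err, addresses) => {
    if (err) return callback(err);
    raceConnect(addresses, port, callback);
  });
}
```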
I'm +1 on implementing this; I think it makes sense and seems mature for node to at least offer this capability instead of just selecting the first IP or nothing. I see things like curl/wget/ruby as layered frameworks/apps, which is why I compared instead to the core .NET framework or Java (which I think node core correlates to better); their design is to default to the first IP but easily allow user code to customize this behaviour (as with the example API I gave above). That said, these frameworks were designed long before the web of today, and I think it could make sense to have this behaviour as default.
I wouldn't say it's an incompatible change per se, as it usually benefits the application. The masking of a potential server outage is a bit unfortunate, but I'd say the benefits outweigh the drawbacks here. I think it only makes sense to implement this in
Any more +1/-1 from the TC if this should be added? I might give it a shot, if there's enough support.
I like @meandmycode's suggestions here. I don't think this is something that belongs in core, but the fact that we can't easily support userland code to do something similar is a problem which we should address. request, hyperquest, superagent, etc. are all client libraries that people tend to reach for, and they should be the ones that implement the specific behaviour desired by their creators and the communities using them; core should just be an enabler for this and nothing more.
Is this a regression or has this been the behavior since the early days of Node? |
I built a sequential retry framework to do just this in my node.js application. Just before I was going to release this retry framework to GitHub, I decided to do some research to see why this wasn't implemented in the first place. After all, what else would you use DNS round robin for? I concluded that most if not all unix applications are doing this type of retry with DNS round robin already, deep in the socket connection level. And this includes basic tools like telnet or ftp. And web browsers are doing this already, as stated here. So if basic unix applications are doing this and web browsers are doing this, I believe io.js http should do at least sequential retry as well.
@domenic since the early days of node: nodejs/node-v0.x-archive#4169. As for leaving it up to lib authors... request/request#1551

I see that @bnoordhuis vehemently opposes this change but I really can't understand why. This is a huge gotcha that bit me for years until it all finally clicked while I was setting up some RR DNS and getting random ECONNRESETs. (This also answered a longstanding node question I had: why do random ECONNRESETs show up in my logs for valid hosts like google.com? 💥)

Only reason I'm here now is I was just about to sit down and write a patch that fixes this for my own uses and decided to google for movement on the issue first. My plan was to introduce a resolution strategy option that defaults to the status quo but allows
@nodejs/ctc ... do we want to do anything with this one? It's certainly possible to implement this in userland with a custom
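A minimal sketch of that userland route, assuming an `http.Agent` subclass that overrides `createConnection` (the seam where a custom policy would live). `MultiAddressAgent` is a hypothetical name, and this skeleton simply delegates to the default TCP connect.

```javascript
'use strict';
// Hypothetical userland agent: http.Agent calls createConnection to obtain
// the socket for each request, so a subclass can put a custom
// address-selection or retry policy there. This skeleton just delegates to
// net.connect; a real implementation would resolve all A records and race
// them or fall through on error.
const http = require('http');
const net = require('net');

class MultiAddressAgent extends http.Agent {
  createConnection(options, callback) {
    // Custom resolution/retry policy goes here.
    return net.connect(options);
  }
}
```

A caller opts in per request, e.g. `http.get({ host: host, port: port, agent: new MultiAddressAgent() }, cb)`.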
This is about Happy Eyeballs, isn't it? What do you mean "doesn't perform too well" when done from user-land? Performance wouldn't be any different when supported natively, there are no shortcuts core can take. |
@bnoordhuis: Are you talking about this issue in general? It's not about Happy Eyeballs, but about error handling in an all-IPv4 situation where multiple A records are returned, and some IPs are unavailable or return connection errors. BTW, I've gotten a correction on my earlier assertion that Chrome and FF issue connections to multiple IPs in parallel: https://twitter.com/sleevi_/status/719312986511450113. Evidently it's just Happy Eyeballs + sequential retries to different IPs within a timer. It looks like an
Where do we stand on this? This issue hasn't seen any real movement in about a year so I'm inclined to close it out. FWIW, I'm still of the opinion that 'silent retry' has no place in core.
At what level? Certainly retrying HTTP POSTs or GETs without prompting is wrong. But, for instance, dropped packets in a TCP connection will automatically get re-sent (granted, this is at the OS rather than runtime level). I think retrying

Here's another way of looking at it: The current library design encourages people to write software that works fine in nominal conditions, but fails catastrophically when a single backend server is down. I think that's not what Node should be pursuing.
The flip side is that silent retry makes for unpredictable and hard-to-debug behavior. Besides the obvious latency issues, consider for example what node should report when all connection attempts fail. Do you report the last connection error? What if the previous errors were different and are the real cause? There isn't a single good answer, so you either punt on it or move from mechanism to policy territory. That's fine for libraries; it's not fine for core because it needs to be as broadly usable and unopinionated as possible. Note that I'm not inviting debate on this particular issue, I'm just pointing out that silent retry has issues that should not be solved by core.
As an example that supports @bnoordhuis' claim: does
@indutny: See my original post on this issue:
See also my comment later in the thread: #708 (comment)
Node is the outlier here. |
Hah, nice! Sorry, the thread was very long to read fully. I think we may want to create an experimental agent option for this. Let's state up front that it may be removed at any time if it doesn't see major adoption; if it does, then let's live with it. @bnoordhuis do you agree?
If you've read through my comments, you know the answer is 'no'. :-) |
What are the obvious latency issues? This only happens with connects that would otherwise fail.
Yes, obviously.
What does that mean? Say you get "connection refused" on one IP, and "no route to host" on another IP. Both of those are a problem, and fixing either one will restore service the way you want. You seem to be thinking of the HTTP library as a low-level network debugging tool, when it's not. It's a tool for retrieving content over HTTP, which has a certain set of commonly accepted semantics, including robust treatment of multiple IPs in DNS records.
Guess it's less obvious than I thought it was so I'll spell it out. Say a host name resolves to two addresses: one is reachable, the other is not. If you try them in parallel, happy eyeballs style, you are, in my opinion, a rude netizen. I won't support such a change. If you try them sequentially, you open yourself up to unpredictability. The connection might succeed immediately but it also might not. If it's slow because the first path times out, that slowness might happen consistently but then again it might not. Now double that unpredictability for hosts with four addresses. Now double it again for hosts with eight addresses. See where this is going? How do you propose debugging such issues? If you want to see your feature request move along you are going to have to look beyond your own narrow use case and take into account that users will run into such issues and will have to troubleshoot them. |
The same way I would debug any problems with a slow host: by taking a tcpdump. There's another form of unpredictability: people running sites that have one unavailable IP address find that their site is still available using Firefox, Chrome, Ruby, Python, curl and wget, but is for some mysterious reason not available using Node. I think that is far more likely to waste people's troubleshooting time than finding that some fetches are slow. |
I think this is the way to go. With the size of today's pipes, a few extra packets won't hurt.
@bnoordhuis is being irrational.
@jsha That works but only if it's consistently reproducible or if you are a small-scale operation. It's not so great when it happens a few times per day or once every few hundred thousand connections. Again, it's a matter of taking the broader view. Look at it from the perspective of a maintainer. Would you still be fervently in favor if you had to help out every user that runs into issues for the next five years? Probably not, right?
@cmawhorter That's simply wrong. If you want a happy eyeballs http agent, there is nothing stopping you from writing one today. The discussion is about whether it belongs in core. Also, please stop the image posting. It doesn't add anything and I personally find it obnoxious. |
Ah, now we're getting somewhere. It sounds like you are alluding to the use of the HTTP / HTTPS libraries to underpin RPC layers. Is that right? I think that's definitely a valuable use case! But I think it's relatively uncommon to encounter multiple A records in internal RPC applications. For external RPCs / APIs, like the Twitter API, multiple A records are common - but they are common specifically because the host wants more reliability.
Yes. I very much disagree with you about the expected maintenance burden of fixing this bug. Like I said, it's the default in a number of other systems, and AFAIK doesn't form a significant fraction of their support requests. |
No it's not @bnoordhuis. Stop saying that. OP is posting about how http resolves dns to the first IP it finds and how it's difficult to work around that behavior. This first-ip behavior becomes even more problematic if there is a DNS cache you don't control. Then the (random) IP node uses becomes static and results in permanent failure even though the target is probably up. node's http/net/socket needs to be more flexible. That's what this issue is about.
This is technically a true statement, but is extremely misleading. DNS RR resolution strategies (sequential, first-byte wins) require a connection to be established to the target. Right now, there is no way for the winning connection to be handed off. Socket.connect uses a private function That would mean you need to create a brand new connection to the winning DNS RR ip (which could fail). This means it's not a solution. AFAIK, DNS RR is standard for all major cloud providers and DNS that resolves to a single assumed working IP is rapidly becoming (become?) a thing of the past. This is the single biggest hidden gotcha in node.js and there are zero good workarounds. And if you disagree, I'd welcome you to prove me wrong.
🌹
@jsha That's about right. One of the biggest use cases for node.js is talking to other services over HTTP.
We'll have to agree to disagree. I've been working on node for some time now and I like to think I've developed a pretty good sense for features that are low risk and features that produce a never-ending stream of unhappy users and bug reports. You can guess in which category I think this one belongs. To be clear, I'm fine with putting mechanisms in place so people can set up their own policies. What I'm not fine with is putting such policies in core, that's what the module ecosystem is for. I think we have most of the mechanisms but I'm open to discussion on if or how they can be improved. @cmawhorter I'm going to tune you out. It's not my job to educate you and your particular brand of hyperbole and smartassery doesn't make me inclined to. |
@bnoordhuis not on gitter? i'd love to take this discussion there or someplace else. i'm legit curious about your pov. |
@cmawhorter I don't use IRC/gitter/etc., there aren't enough hours in a day, but you can email me if you want. Sorry I called you a smartass but I thought you were just trying to get a rise out of me. |
Consider also that some service providers EXPECT clients to do multiple connection requests, with the (more or less reasonable) expectation that the handshake latency will be representative of the overall session latency, so the user is connecting to the most responsive server. Gone are the days of TCB exhaustion. |
Ok... so I have finally read through this whole thing and have a couple thoughts.
Adding this as an option that can be enabled but defaults to off seems to provide a pretty simple way to mitigate the problems with number one and number two. (would use # but then you get issues...) It's probably a minor use case at this point to deal with number 2, but it's probably going to end up being a bigger use case as time moves forward, especially with Electron gaining popularity. Which may actually be the biggest reason to implement something like this: Electron will pretty much always be consuming public APIs that are going to, most likely, start using multiple A names and multiple DNS providers thanks to DDoS attacks against DNS infrastructure. Passing in a function to handle the resolution seems less than optimal to me; better to pass a flag that turns on a type of resolution that is implemented internally. Just some thoughts.
Also:
Note that this doesn't just affect HTTP - this will impact ANY higher-level protocol that relies on forward name resolution. |
That is the policy-over-mechanism line of thinking that is not acceptable to me. It's okay if you are writing a framework on top of node.js but it's not okay for node.js itself because it's never going to cover every scenario. (I know someone is going to reply with "why is that important?" or "just add more flags" but that's not going to be a productive discussion.) I've mentioned up-thread that I'm sympathetic to making it easier out-of-the-box but a custom HTTP agent is five or ten lines of code so there isn't much to gain there. The whole point of exporting the agent was to make it easy to create your own. By the way, @jsha, I missed this in #708 (comment):
I don't know about ruby but python's built-in http.client and socket modules don't try anything in parallel, only sequentially. Maybe requests does but that's not a built-in module. |
@cmawhorter It's pretty easy to solve this issue in Node.js versions after the IO.js merge. Just set the ref: https://nodejs.org/api/net.html#net_socket_connect_options_connectlistener Since options from plugged into request like this: |
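The snippet referenced above didn't survive formatting. Here is a sketch under the assumption it demonstrated the `lookup` option, which `http.get` forwards through to `socket.connect`. It uses a local server and a pinned lookup (`fixedLookup`, a hypothetical name) so it runs without touching real DNS; a real lookup would call `dns.resolve4` and apply its own selection or retry policy.

```javascript
'use strict';
// Sketch of the `lookup` approach: a per-request lookup function decides
// which resolved address gets dialed. fixedLookup pins every hostname to
// 127.0.0.1 purely for demonstration.
const http = require('http');

let lastStatus; // recorded so the result can be inspected afterwards

function fixedLookup(hostname, options, callback) {
  // Honor the `all` flag that newer Node versions may pass in.
  if (options && options.all) {
    callback(null, [{ address: '127.0.0.1', family: 4 }]);
  } else {
    callback(null, '127.0.0.1', 4); // same signature as dns.lookup's callback
  }
}

const server = http.createServer((req, res) => res.end('ok'));
server.listen(0, '127.0.0.1', () => {
  http.get({
    host: 'some-name.invalid', // never hits real DNS thanks to fixedLookup
    port: server.address().port,
    lookup: fixedLookup,
    agent: false, // one-shot connection so the process can exit cleanly
  }, (res) => {
    lastStatus = res.statusCode;
    res.resume();
    res.on('end', () => server.close());
  }).on('error', (err) => {
    console.error('request failed:', err.message);
    server.close();
  });
});
```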
I've been thinking it over and I'm willing to entertain pull requests that meet the following criteria:
As an example of configurability vs. convenience: the default should be to try the addresses in the order that the DNS server returns them but an escape hatch that lets the user filter or reorder them is almost certainly needed. I'm going to close the issue. By all means continue the discussion but code > words. |
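A rough sketch of the default described above: try the addresses strictly in the order the DNS server returned them, falling through only on a connection error, with a filter hook as the escape hatch. All names here (`trySequential`, `connectSequential`, `opts.filter`) are hypothetical, not an existing Node API.

```javascript
'use strict';
// Hypothetical sequential-fallback connector: resolve every A record,
// attempt them in DNS order, and move to the next address only when a
// connection attempt errors. opts.filter lets the user filter or reorder
// the address list before any attempt is made.
const net = require('net');
const dns = require('dns');

function trySequential(addresses, port, callback) {
  let i = 0;
  (function next() {
    if (i >= addresses.length) {
      return callback(new Error('all addresses failed'));
    }
    const socket = net.connect(port, addresses[i++]);
    const onError = () => next(); // this attempt failed; try the next one
    socket.once('error', onError);
    socket.once('connect', () => {
      socket.removeListener('error', onError);
      callback(null, socket);
    });
  })();
}

function connectSequential(hostname, port, opts, callback) {
  dns.resolve4(hostname, (err, addresses) => {
    if (err) return callback(err);
    if (opts && opts.filter) addresses = opts.filter(addresses);
    trySequential(addresses, port, callback);
  });
}
```

Note the deliberate policy choice: no parallelism, so the concern about rude simultaneous connections raised earlier in the thread doesn't apply.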
@bnoordhuis - with respect, the first item there is the least optimal way of handling things. Edited because it seems to me there's a bigger issue here with respect to where in the stack a decision relating to network session establishment should be made. I like the other criteria, though :)
I'm certainly no expert, nor am I intimately familiar with the inner workings of net/http, but wouldn't this be a lot better to implement directly in Then, we might even start thinking about having TL;DR: I agree that this fix should be applied directly to |
I'm certainly not against this and it'd be an improvement, but it doesn't address the fundamental problem which is dns resolution is done in a variety of ways, and it's sometimes more desirable to resolve to an open and active connection (and not ip).
Why? DNS resolution either needs to become more flexible in net.js or be removed. I think I finally understand your perspective at this point, but net.js either needs more flexibility or to not do dns at all. |
Steps to reproduce:

Find a host that answers a given port only on some of the returned IP addresses for its name. For instance, www.itunes.com has this problem. Then run the test snippet from the first comment in this thread.

Since www.itunes.com resolves to three different IP addresses (see `dig www.itunes.com`) in random order, io.js will connect to a random host each time, sometimes failing and sometimes succeeding.

Arguably, this is broken behavior on the part of the host. However, it's reasonably common on the web. Firefox and Chrome each attempt parallel connections to all IP addresses provided in the DNS response, using the first connection to succeed. Curl attempts each IP address in sequence until it gets a successful connection or runs out of IP addresses: https://github.com/bagder/curl/blob/master/lib/connect.c#L1156

It looks like the io.js http and https libraries, through `Agent`, call into Socket.connect, which uses `dns.lookup` (returns just one IP address) instead of `dns.resolve`: https://github.com/iojs/io.js/blob/v1.x/lib/net.js#L900.

Is it practical to change the Socket.connect behavior to use `dns.resolve` and try multiple IP addresses? I'm guessing that would be a fairly API-incompatible change. Another possibility: Agent could implement failure handling code that specifically catches connection refused and timeout errors, calls `dns.resolve` itself, and calls Socket.connect again with IP address arguments. This would be less performant but also a smaller API change.

For the real-world motivating example, I ran into this when writing automated tests for HTTPS Everywhere's rulesets: https://github.com/jsha/https-everywhere/blob/rules-tester/rewriter/tester.js. As-is, I get a number of false positives from hosts that appear to fail but actually work fine in a browser.