npm install fails intermittently due to race condition in Redirect.prototype.onResponse #2807
Comments
Update: In December 2017, PNPM eliminated its reliance on the buggy request library. We switched from NPM to PNPM and no longer have to deal with this headache. Also, our installs are significantly faster with PNPM! :-)
I'm not really into the APIs here but from what I see, I would like to offer some guesses which more educated folks here may evaluate: The old request object is not cleaned up correctly before re-initiating it. In particular there is no proper cleanup before Otherwise the socket in the old
…uest#3075) This reverts commit 2f04c3c.
…uest#3075) This reverts commit ccb7783.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Summary
We spent the past 3 months investigating an issue where "npm install" would fail intermittently.
NPM 4 shows an error like this:
When we inspected the NPM cache, we found that the downloaded tarball was truncated. This mostly seems to happen in repos that have lots of multi-megabyte dependencies, and in configurations where the NPM registry redirects the HTTP request for the tarball. It was extremely difficult to investigate because it's a race condition that will stop happening if you change any number of things (the computer, the time of day, the proxy settings, etc).
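One quick way to check whether a cached tarball is incomplete is to try decompressing it. The following is a minimal sketch using Node's built-in zlib module; the helper name gunzipsCleanly and the sample data are made up for illustration and are not part of NPM or Request. Note that it flags any invalid gzip stream, not only truncated ones.

```javascript
const zlib = require('zlib');

// Hypothetical helper: true if the buffer decompresses as a complete gzip stream.
function gunzipsCleanly(buffer) {
  try {
    zlib.gunzipSync(buffer);
    return true;
  } catch (err) {
    // zlib throws when the stream ends early or is otherwise invalid.
    return false;
  }
}

// Simulate a download that was cut off halfway.
const full = zlib.gzipSync(Buffer.from('example tarball contents '.repeat(1000)));
const truncated = full.subarray(0, Math.floor(full.length / 2));

console.log(gunzipsCleanly(full));      // true
console.log(gunzipsCleanly(truncated)); // false
```

In practice the buffer would come from reading the cached .tgz file, for example with fs.readFileSync.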
The issue also occurs with the NPM 5 client, but the error message is different (although still ultimately due to a corrupted tarball):
The issue also occurs with the PNPM client, which relies on the Request library as well.
In this case, the symptom is an unhandled stream error:
Wireshark and the VSTS server logs show that the server is correctly streaming the tarball. In fact, we can repro the issue using Fiddler on the local machine, and Fiddler downloads the full tarball, even though NPM still fails. The explanation is that the NPM client is aborting the connection due to a bug in the Request library.
Simplest Example to Reproduce
We ultimately debugged this using a physical PC that consistently repros the problem. Unfortunately we cannot find any other environment that consistently encounters the errors, even though 10 or more people have reported these issues to us over the past few months. In some cases it fails consistently on the person's computer for a day or so, then suddenly starts working again. It was really difficult to pin down.
Problem Analysis
The problem happens when the request for a tarball encounters an HTTP redirect to an Azure blob storage URL. It goes like this:
1. Request.init() initializes a bunch of data structures
2. Request.start() creates a connection, obtains a socket, wires up event handlers
3. Request.onRequestResponse() is called for the server's response, which here is the redirect
At this point, the lib/redirect.js module kicks in with a kind of hacky way of handling this:
1. It calls Request.init() again on the same object
2. It calls Request.start() to create a new connection and a new socket, and wire up the same event handlers to these new objects
3. Request.onRequestResponse() is called for the new response, which starts streaming bytes to the calling application
Even though this design seems super brittle, it does work most of the time. In most cases, the tarball is successfully streamed and unzipped.
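The general hazard of wiring the same handlers onto a second underlying request can be illustrated with a toy model. This is not Request's code and does not claim to reproduce the exact failure; ToyRequest and its methods are invented for illustration, and only Node's EventEmitter is real.

```javascript
const { EventEmitter } = require('events');

// Toy stand-in for the Request object. NOT the real library code.
class ToyRequest {
  constructor() {
    this.req = null;
    this.errors = [];
  }

  // Loosely mirrors the idea of start(): create a new underlying request
  // and route its events to handlers on the shared object.
  start(name) {
    const req = new EventEmitter();
    req.name = name;
    req.socket = { destroyed: false };
    req.on('error', (err) => this.onRequestError(err));
    this.req = req;
    return req;
  }

  // The shared handler cannot tell which underlying request raised the error.
  onRequestError(err) {
    this.errors.push(err.code);
  }
}

const r = new ToyRequest();
const first = r.start('registry');      // answered with a redirect
const second = r.start('blob storage'); // restart on the same object

// Later, the first connection is torn down and reports ECONNRESET.
first.socket.destroyed = true;
first.emit('error', Object.assign(new Error('socket hang up'), { code: 'ECONNRESET' }));

console.log(r.errors);         // [ 'ECONNRESET' ]
console.log(r.req === second); // true: the active request was not the one that failed
```

In this model the stale error is indistinguishable from a failure of the active download, which is the kind of ambiguity a shared, re-initialized object invites.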
But every once in a while, onRequestError() will get called with an ECONNRESET error that crashes the install. Here are the details:
In onRequestError(), we find that self.req.socket.destroyed is true. It seems to be an old socket that was supposed to be returned to the connection pool.
Possible Solution
I believe the ideal solution would be to rethink the design of lib/redirect.js, to avoid trying to reconstruct the Request instance, and to avoid mixing together stream events from two different HTTP requests. That's probably a lot of work.
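As a rough sketch of what keeping the two requests separate could look like, each hop below gets its own handlers, and events from a hop that has been superseded are dropped. The Download class, startHop method, and example URLs are all hypothetical; this is not a proposed patch for lib/redirect.js.

```javascript
const { EventEmitter } = require('events');

// Hypothetical design sketch: one emitter per redirect hop.
class Download {
  constructor() {
    this.active = null;
    this.errors = [];
  }

  startHop(url) {
    const hop = new EventEmitter();
    hop.url = url;
    hop.on('error', (err) => {
      // A late event from a superseded hop must not fail the download.
      if (hop !== this.active) return;
      this.errors.push(err.code);
    });
    this.active = hop;
    return hop;
  }
}

const d = new Download();
const registryHop = d.startHop('https://registry.example/pkg.tgz'); // placeholder URL
const blobHop = d.startHop('https://blob.example/pkg.tgz');         // placeholder URL

// A stale error from the first hop is ignored; an error on the active hop is kept.
registryHop.emit('error', Object.assign(new Error('reset'), { code: 'ECONNRESET' }));
blobHop.emit('error', Object.assign(new Error('timeout'), { code: 'ETIMEDOUT' }));

console.log(d.errors); // [ 'ETIMEDOUT' ]
```

Because each handler closes over its own hop, there is no shared mutable request object to reconstruct between hops.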
For PNPM (which we are using now), we found a workaround which simply ignores the error and prevents it from tripping up the installation:
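A minimal sketch of that kind of guard, assuming the error handler has access to the request object and its socket. This is an illustration rather than the actual patch; guardRequestErrors and the toy objects are hypothetical.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper for illustration; not the actual change made to PNPM or Request.
function guardRequestErrors(self, onError) {
  self.req.on('error', (err) => {
    // Assumption: an error that arrives while the socket is already destroyed
    // belongs to a connection that is no longer in use, so it is dropped.
    if (self.req.socket && self.req.socket.destroyed) {
      return;
    }
    onError(err);
  });
}

// Toy objects standing in for the request and its socket.
const fakeRequest = { req: new EventEmitter() };
fakeRequest.req.socket = { destroyed: true };

const seen = [];
guardRequestErrors(fakeRequest, (err) => seen.push(err.code));

fakeRequest.req.emit('error', Object.assign(new Error('reset'), { code: 'ECONNRESET' }));
console.log(seen.length); // 0: the error was ignored

fakeRequest.req.socket.destroyed = false;
fakeRequest.req.emit('error', Object.assign(new Error('timeout'), { code: 'ETIMEDOUT' }));
console.log(seen); // [ 'ETIMEDOUT' ]
```

The trade-off is that a genuine failure reported on an already destroyed socket would be swallowed as well.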
It's interesting that a few lines above, someone else had written some logic that seems to be handling a similar issue for the ForeverAgent case. However that code was disabled forever ago.
This workaround unfortunately does not fix NPM, even though the self.req.socket.destroyed condition does get executed many times during the install.
Context
I'm not sure why the race condition suddenly became so frequent this summer. We've tried upgrading/downgrading NodeJS and NPM and request and npm-registry-client and it has no effect. 8 different experts from various teams (including Azure and Visual Studio Team Services) all worked together on this investigation.
It's been a huge headache for us. I just received another problem report from someone as I was typing up this issue. :-P It would be awesome if it could get some attention. You can find lots and lots of examples of people struggling with these symptoms. As in our experience, people figure out a black magic "solution" that merely shuffles the race condition around and seems to work, but then another person hits the same issue.
Your Environment