
npm install fails intermittently due to race condition in Redirect.prototype.onResponse #2807

Closed
octogonz opened this issue Nov 3, 2017 · 4 comments

octogonz commented Nov 3, 2017

Summary

We spent the past 3 months investigating an issue where "npm install" would fail intermittently.
NPM 4 shows an error like this:

npm ERR! tar.unpack unzip error C:\example-project\temp\npm-tmp\npm-13780-b27f9707\onedrive.pkgs.visualstudio.com\_packaging\team\npm\registry\cldrjs\-\cldrjs-0.4.8.tgz

When we inspected the NPM cache, we found that the downloaded tarball was truncated. This mostly seems to happen in repos that have lots of multi-megabyte dependencies, and in configurations where the NPM registry redirects the HTTP request for the tarball. It was extremely difficult to investigate because it's a race condition that will stop happening if you change any number of things (the computer, the time of day, the proxy settings, etc).

The issue also occurs with the NPM 5 client, but the error message is different (although still ultimately due to a corrupted tarball):

npm ERR! code EINTEGRITY
npm ERR! sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= integrity checksum failed when using sha1: wanted sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= but got sha1-ygU0cyjTuZdxezDk6ofWXCGJbqI=. (867808 bytes)

The issue also occurs with the PNPM client, which relies on the Request library as well.

UPDATE: PNPM is no longer affected by this issue. See comments.

In this case, the symptom is an unhandled stream error:

  events.js:160
        throw er; // Unhandled 'error' event
        ^

  Error: read ECONNRESET
      at exports._errnoException (util.js:1018:11)
      at TLSWrap.onread (net.js:572:26)

Wireshark and the VSTS server logs show that the server is correctly streaming the tarball. In fact, we can repro the issue using Fiddler on the local machine, and Fiddler downloads the full tarball, even though NPM still fails. The explanation is that the NPM client is aborting the connection due to a bug in the Request library.

Simplest Example to Reproduce

We ultimately debugged this using a physical PC that consistently repros the problem. Unfortunately we cannot find any other environment that consistently encounters the errors, even though 10 or more people have reported these issues to us over the past few months. In some cases it fails consistently on the person's computer for a day or so, then suddenly starts working again. It was really difficult to pin down.

Problem Analysis

The problem happens when the request for a tarball encounters an HTTP redirect to an Azure blob storage URL. It goes like this:

  • Request.init() initializes a bunch of data structures
  • Request.start() creates a connection, obtains a socket, wires up event handlers
  • Stream events are fired such as "socket" and "request", and processed by the calling application
  • An HTTP response is received and passed to Request.onRequestResponse()

At this point, the lib/redirect.js module kicks in with a kind of hacky way of handling this:

  • It tries to carefully delete all the state that was set up so far:
  delete request.src
  delete request.req
  delete request._started
  if (response.statusCode !== 401 && response.statusCode !== 307) {
    // Remove parameters from the previous response, unless this is the second request
    // for a server that requires digest authentication.
    delete request.body
    delete request._form
    if (request.headers) {
      request.removeHeader('host')
      request.removeHeader('content-type')
      request.removeHeader('content-length')
      if (request.uri.hostname !== request.originalHost.split(':')[0]) {
        // Remove authorization if changing hostnames (but not if just
        // changing ports or protocols).  This matches the behavior of curl:
        // https://github.com/bagder/curl/blob/6beb0eee/lib/http.c#L710
        request.removeHeader('authorization')
      }
    }
  }
  • ...but note that it doesn't unwire any event handlers, and it obviously cannot undo any stream events that were already processed by the caller
  • Then it calls Request.init() again on the same object
  • Which causes Request.start() to create a new connection and a new socket, and wire up the same event handlers to these new objects
  • Stream events such as "socket" and "request" are fired a second time
  • The second HTTP response is received and passed to Request.onRequestResponse(), which starts streaming bytes to the calling application

Even though this design seems super brittle, it does work most of the time. In most cases, the tarball is successfully streamed and unzipped.

But every once in a while, onRequestError() will get called with an ECONNRESET error that crashes the install. Here are the details:

  • The error occurs after the above sequence, and always on the second socket for that Request instance
  • Inside onRequestError(), we find that self.req.socket.destroyed is true. It seems to be an old socket that was supposed to be returned to the connection pool.
  • The Wireshark timings show that this ECONNRESET error is for an unused HTTP connection that has been idle for a while, and is now timing out due to keepalive
  • But we never find a socket in that state where it's first obtained -- seems like something is destroying the socket later
  • For the PNPM repro, the application's error handler is missing because it's a "once" handler that has already fired once before. We tried working around this by attaching a second handler, but it just crashes later because npm-registry-client tries to push the error back down onto the response's stream, creating an infinite loop

Possible Solution

I believe the ideal solution would be to rethink the design of lib/redirect.js, to avoid trying to reconstruct the Request instance, and to avoid mixing together stream events from two different HTTP requests. That's probably a lot of work.
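One possible shape for such a redesign, sketched with a hypothetical followRedirects helper: give each redirect hop its own fresh state instead of re-initializing one long-lived Request, so events from hop N can never leak into hop N+1:

```javascript
// Hypothetical helper, not the request library's API: the transport is
// supplied by the caller, and every hop gets a brand-new state record.
function followRedirects(fetchOnce, uri, maxHops = 10) {
  const hops = [];
  for (let i = 0; i < maxHops; i++) {
    const response = fetchOnce(uri);                  // one request, one response
    hops.push({ uri, statusCode: response.statusCode });
    if (response.statusCode >= 300 && response.statusCode < 400 &&
        response.location) {
      uri = response.location;                        // next hop, fresh state
      continue;
    }
    return { response, hops };                        // terminal response
  }
  throw new Error('too many redirects');
}
```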

For PNPM (which we are using now), we found a workaround which simply ignores the error and prevents it from tripping up the installation:

Request.prototype.onRequestError = function (error) {
  var self = this
  if (self._aborted) {
    return
  }
  if (self.req && self.req._reusedSocket && error.code === 'ECONNRESET' &&
    self.agent.addRequestNoreuse) {
    self.agent = { addRequest: self.agent.addRequestNoreuse.bind(self.agent) }
    self.start()
    self.req.end()
    return
  }
  if (self.timeout && self.timeoutTimer) {
    clearTimeout(self.timeoutTimer)
    self.timeoutTimer = null
  }

  // PROPOSED WORKAROUND:
  if (self.req && self.req.socket && self.req.socket.destroyed) {
    return;
  }

  self.emit('error', error)
}

It's interesting that a few lines above, someone had already written logic that seems to handle a similar issue for the ForeverAgent case. However, that code was disabled long ago.

This workaround unfortunately does not fix NPM, even though the self.req.socket.destroyed condition does get executed many times during the install.

Context

I'm not sure why the race condition suddenly became so frequent this summer. We've tried upgrading/downgrading NodeJS and NPM and request and npm-registry-client and it has no effect. 8 different experts from various teams (including Azure and Visual Studio Team Services) all worked together on this investigation.

It's been a huge headache for us. I just received another problem report from someone as I was typing up this issue. :-P It would be awesome if it could get some attention. You can find lots and lots of examples of people struggling with these symptoms. Like our experiences, people figure out a black magic "solution" that merely shuffles the race condition around, and seems to work, but then another person hits the same issue.

Your Environment

software          version
request           2.83.0
node              6.10.2
npm               5.5.1
Operating System  Windows 10

octogonz commented Nov 3, 2017

@nickpape-msft @iclanton @qz2017 FYI


octogonz commented Feb 15, 2018

The issue also occurs with the PNPM client, which relies on the Request library as well. In this case, the symptom is an unhandled stream error

Update: In December 2017, PNPM eliminated its reliance on the buggy request library. We switched from NPM to PNPM and no longer have to deal with this headache. Also, our installs are significantly faster with PNPM! :-)


hwanders commented Nov 4, 2018

I'm not really into the APIs here but from what I see, I would like to offer some guesses which more educated folks here may evaluate:

The old request object is not cleaned up correctly before re-initiating it. In particular there is no proper cleanup before delete request.req.
Perhaps you should call request.req.abort() and/or request.req.destroy() before delete request.req.

Otherwise the socket in the old req (of type http.ClientRequest) may survive and eventually time out or get reset by the server.
If this happens before the desired (redirected) response arrives, the error will be propagated and become visible.

reconbot pushed a commit that referenced this issue Dec 1, 2018
reconbot added a commit that referenced this issue Dec 1, 2018
reconbot added a commit that referenced this issue Dec 1, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018

stale bot commented Nov 4, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
