
npm install fails intermittently due to race condition in Redirect.prototype.onResponse #2807

Closed
octogonz opened this issue Nov 3, 2017 · 4 comments

octogonz commented Nov 3, 2017

Summary

We spent the past 3 months investigating an issue where "npm install" would fail intermittently.
NPM 4 shows an error like this:

npm ERR! tar.unpack unzip error C:\example-project\temp\npm-tmp\npm-13780-b27f9707\onedrive.pkgs.visualstudio.com\_packaging\team\npm\registry\cldrjs\-\cldrjs-0.4.8.tgz

When we inspected the NPM cache, we found that the downloaded tarball was truncated. This mostly seems to happen in repos that have lots of multi-megabyte dependencies, and in configurations where the NPM registry redirects the HTTP request for the tarball. It was extremely difficult to investigate because it's a race condition that will stop happening if you change any number of things (the computer, the time of day, the proxy settings, etc).

The issue also occurs with the NPM 5 client, but the error message is different (although still ultimately due to a corrupted tarball):

npm ERR! code EINTEGRITY
npm ERR! sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= integrity checksum failed when using sha1: wanted sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= but got sha1-ygU0cyjTuZdxezDk6ofWXCGJbqI=. (867808 bytes)

The issue also occurs with the PNPM client, which relies on the Request library as well.

UPDATE: PNPM is no longer affected by this issue. See comments.

In this case, the symptom is an unhandled stream error:

  events.js:160
        throw er; // Unhandled 'error' event
        ^

  Error: read ECONNRESET
      at exports._errnoException (util.js:1018:11)
      at TLSWrap.onread (net.js:572:26)

Wireshark and the VSTS server logs show that the server is correctly streaming the tarball. In fact, we can repro the issue using Fiddler on the local machine, and Fiddler downloads the full tarball, even though NPM still fails. The explanation is that the NPM client is aborting the connection due to a bug in the Request library.

Simplest Example to Reproduce

We ultimately debugged this using a physical PC that consistently repros the problem. Unfortunately we cannot find any other environment that consistently encounters the errors, even though 10 or more people have reported these issues to us over the past few months. In some cases it fails consistently on the person's computer for a day or so, then suddenly starts working again. It was really difficult to pin down.

Problem Analysis

The problem happens when the request for a tarball encounters an HTTP redirect to an Azure blob storage URL. It goes like this:

  • Request.init() initializes a bunch of data structures
  • Request.start() creates a connection, obtains a socket, wires up event handlers
  • Stream events are fired such as "socket" and "request", and processed by the calling application
  • An HTTP response is received and passed to Request.onRequestResponse()

At this point, the lib/redirect.js module kicks in with a kind of hacky way of handling this:

  • It tries to carefully delete all the state that was set up so far:
  delete request.src
  delete request.req
  delete request._started
  if (response.statusCode !== 401 && response.statusCode !== 307) {
    // Remove parameters from the previous response, unless this is the second request
    // for a server that requires digest authentication.
    delete request.body
    delete request._form
    if (request.headers) {
      request.removeHeader('host')
      request.removeHeader('content-type')
      request.removeHeader('content-length')
      if (request.uri.hostname !== request.originalHost.split(':')[0]) {
        // Remove authorization if changing hostnames (but not if just
        // changing ports or protocols).  This matches the behavior of curl:
        // https://github.com/bagder/curl/blob/6beb0eee/lib/http.c#L710
        request.removeHeader('authorization')
      }
    }
  }
  • ...but note that it doesn't unwire any event handlers, and it obviously cannot undo any stream events that were already processed by the caller
  • Then it calls Request.init() again on the same object
  • Which causes Request.start() to create a new connection and a new socket, and wire up the same event handlers to these new objects
  • Stream events such as "socket" and "request" are fired a second time
  • The second HTTP response is received and passed to Request.onRequestResponse(), which starts streaming bytes to the calling application

Even though this design seems super brittle, it does work most of the time. In most cases, the tarball is successfully streamed and unzipped.

But every once in a while, onRequestError() will get called with an ECONNRESET error that crashes the install. Here are the details:

  • The error occurs after the above sequence, and always on the second socket for that Request instance
  • Inside onRequestError(), we find that self.req.socket.destroyed is true. It seems to be an old socket that was supposed to be returned to the connection pool.
  • The Wireshark timings show that this ECONNRESET error is for an unused HTTP connection that has been idle for a while, and is now timing out due to keepalive
  • But we never find a socket in that state where it's first obtained -- seems like something is destroying the socket later
  • For the PNPM repro, the application's error handler is missing because it's a "once" handler that has already fired once before. We tried working around this by attaching a second handler, but it just crashes later because npm-registry-client tries to push the error back down onto the response's stream, creating an infinite loop

Possible Solution

I believe the ideal solution would be to rethink the design of lib/redirect.js, to avoid trying to reconstruct the Request instance, and to avoid mixing together stream events from two different HTTP requests. That's probably a lot of work.
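One possible shape for such a redesign, sketched with a hypothetical followRedirects helper: give each redirect hop its own fresh state instead of re-initializing one long-lived Request, so events from hop N can never leak into hop N+1:

```javascript
// Hypothetical helper, not the request library's API: the transport is
// supplied by the caller, and every hop gets a brand-new state record.
function followRedirects(fetchOnce, uri, maxHops = 10) {
  const hops = [];
  for (let i = 0; i < maxHops; i++) {
    const response = fetchOnce(uri);                  // one request, one response
    hops.push({ uri, statusCode: response.statusCode });
    if (response.statusCode >= 300 && response.statusCode < 400 &&
        response.location) {
      uri = response.location;                        // next hop, fresh state
      continue;
    }
    return { response, hops };                        // terminal response
  }
  throw new Error('too many redirects');
}
```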

For PNPM (which we are using now), we found a workaround which simply ignores the error and prevents it from tripping up the installation:

Request.prototype.onRequestError = function (error) {
  var self = this
  if (self._aborted) {
    return
  }
  if (self.req && self.req._reusedSocket && error.code === 'ECONNRESET' &&
    self.agent.addRequestNoreuse) {
    self.agent = { addRequest: self.agent.addRequestNoreuse.bind(self.agent) }
    self.start()
    self.req.end()
    return
  }
  if (self.timeout && self.timeoutTimer) {
    clearTimeout(self.timeoutTimer)
    self.timeoutTimer = null
  }

  // PROPOSED WORKAROUND:
  if (self.req && self.req.socket && self.req.socket.destroyed) {
    return;
  }

  self.emit('error', error)
}

It's interesting that a few lines above, someone had already written logic that seems to handle a similar issue for the ForeverAgent case. However, that code was disabled long ago.

This workaround unfortunately does not fix NPM, even though the self.req.socket.destroyed condition does get executed many times during the install.

Context

I'm not sure why the race condition suddenly became so frequent this summer. We've tried upgrading/downgrading NodeJS and NPM and request and npm-registry-client and it has no effect. 8 different experts from various teams (including Azure and Visual Studio Team Services) all worked together on this investigation.

It's been a huge headache for us. I just received another problem report from someone as I was typing up this issue. :-P It would be awesome if it could get some attention. You can find lots and lots of examples of people struggling with these symptoms. Like our experiences, people figure out a black magic "solution" that merely shuffles the race condition around, and seems to work, but then another person hits the same issue.

Your Environment

software          version
request           2.83.0
node              6.10.2
npm               5.5.1
Operating System  Windows 10

octogonz commented Nov 3, 2017

@nickpape-msft @iclanton @qz2017 FYI


octogonz commented Feb 15, 2018

The issue also occurs with the PNPM client, which relies on the Request library as well. In this case, the symptom is an unhandled stream error

Update: In December 2017, PNPM eliminated its reliance on the buggy request library. We switched from NPM to PNPM and no longer have to deal with this headache. Also, our installs are significantly faster with PNPM! :-)


hwanders commented Nov 4, 2018

I'm not really into the APIs here but from what I see, I would like to offer some guesses which more educated folks here may evaluate:

The old request object is not cleaned up correctly before re-initiating it. In particular there is no proper cleanup before delete request.req.
Perhaps you should call request.req.abort() and/or request.req.destroy() before delete request.req.

Otherwise the socket in the old req (of type http.ClientRequest) may survive and eventually time out or get reset by the server.
If this happens before the desired (redirected) response arrives, the error will be propagated and become visible.

reconbot pushed a commit that referenced this issue Dec 1, 2018
reconbot added a commit that referenced this issue Dec 1, 2018
reconbot added a commit that referenced this issue Dec 1, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018
shotamakino pushed a commit to shotamakino/request that referenced this issue Dec 11, 2018

stale bot commented Nov 4, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
