Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misleading or nonexistent errors in path resolving #348

Closed
kzc opened this issue Aug 25, 2020 · 19 comments
Closed

misleading or nonexistent errors in path resolving #348

kzc opened this issue Aug 25, 2020 · 19 comments

Comments

@kzc
Copy link
Contributor

kzc commented Aug 25, 2020

I stumbled upon this bug while figuring out how to run the native version of esbuild on my machine.

If your default ulimit -n max open file limit is low then esbuild error logging is misleading and not very helpful in determining the cause of the problem. In some cases it can also lead to incorrect bundling output without any error messages logged.

Scenario 1: Bundling projects with a large number of source files

Attempt to bundle react-admin/examples/simple with the default max open file limit on my machine:

$ ulimit -n
256

$ esbuild --bundle --minify --loader:.js=jsx --define:process.env.NODE_ENV='"production"' --define:global=window --outfile=dist/main.js src/index.js
node_modules/ra-language-french/tsconfig.json:2:15: warning: Cannot find base config file "../../tsconfig.json"
    "extends": "../../tsconfig.json",
               ~~~~~~~~~~~~~~~~~~~~~
node_modules/ra-ui-materialui/esm/list/DatagridHeaderCell.js:27:22: error: Could not resolve "@material-ui/core/TableCell"
import TableCell from '@material-ui/core/TableCell';
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/ra-ui-materialui/esm/list/DatagridHeaderCell.js:28:27: error: Could not resolve "@material-ui/core/TableSortLabel"
import TableSortLabel from '@material-ui/core/TableSortLabel';
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
1 warning and 10 errors reached (disable error limit with --error-limit=0)

Strangely, esbuild-wasm worked correctly with the same ulimit -n 256.

Scenario 2: Incorrect bundle output for require() statements in presence of EMFILE errors (too many open files)

Given the following source file and a dozen require'd files with console.log() side effects:

$ cat main.js
try {
    // all these required files exist and are well-formed
    require("./z1");
    require("./z2");
    require("./z3");
    require("./z4");
    require("./z5");
    require("./z6");
    require("./z7");
    require("./z8");
    require("./z9");
    require("./z10");
    require("./z11");
    require("./z12");
} catch (x) {}

It works as expected with a sufficiently high ulimit -n value:

$ esbuild main.js --bundle
(() => {
  var __commonJS = (callback, module) => () => {
    if (!module) {
      module = {exports: {}};
      callback(module.exports, module);
    }
    return module.exports;
  };

  // z1.js
  var require_z1 = __commonJS(() => {
    console.log(1);
  });

  // z2.js
  var require_z2 = __commonJS(() => {
    console.log(2);
  });

  // z3.js
  var require_z3 = __commonJS(() => {
    console.log(3);
  });

  // z4.js
  var require_z4 = __commonJS(() => {
    console.log(4);
  });

  // z5.js
  var require_z5 = __commonJS(() => {
    console.log(5);
  });

  // z6.js
  var require_z6 = __commonJS(() => {
    console.log(6);
  });

  // z7.js
  var require_z7 = __commonJS(() => {
    console.log(7);
  });

  // z8.js
  var require_z8 = __commonJS(() => {
    console.log(8);
  });

  // z9.js
  var require_z9 = __commonJS(() => {
    console.log(9);
  });

  // z10.js
  var require_z10 = __commonJS(() => {
    console.log(10);
  });

  // z11.js
  var require_z11 = __commonJS(() => {
    console.log(11);
  });

  // z12.js
  var require_z12 = __commonJS(() => {
    console.log(12);
  });

  // main.js
  try {
    require_z1();
    require_z2();
    require_z3();
    require_z4();
    require_z5();
    require_z6();
    require_z7();
    require_z8();
    require_z9();
    require_z10();
    require_z11();
    require_z12();
  } catch (x) {
  }
})();

But if the ulimit -n value is too low, then the following incorrect output is produced without error or warning - notice the requires are not inlined into the bundle:

$ ulimit -n 11

$ esbuild main.js --bundle
(() => {
  // main.js
  try {
    require("./z1");
    require("./z2");
    require("./z3");
    require("./z4");
    require("./z5");
    require("./z6");
    require("./z7");
    require("./z8");
    require("./z9");
    require("./z10");
    require("./z11");
    require("./z12");
  } catch (x) {
  }
})();

$ echo $?
0

It appears that esbuild's path resolving algorithm incorrectly treats all errors as if they were ENOENT - no such file or directory.

The following patch was useful in debugging the cause of the Scenario 1 failure and might be useful as a starting point for a fix:

--- a/internal/fs/fs.go
+++ b/internal/fs/fs.go
@@ -7,6 +7,7 @@ import (
        "path/filepath"
        "strings"
        "sync"
+       "syscall"
 )
 
 type EntryKind uint8
@@ -301,6 +302,9 @@ func (fs *realFS) ReadDirectory(dir string) map[string]*Entry {
 
 func (fs *realFS) ReadFile(path string) (string, bool) {
        buffer, err := ioutil.ReadFile(path)
+       if err != nil && err.(*os.PathError).Err.(syscall.Errno) != syscall.ENOENT {
+               panic(err)
+       }
        return string(buffer), err == nil
 }
 
@@ -339,6 +343,9 @@ func (*realFS) Rel(base string, target string) (string, bool) {
 func readdir(dirname string) ([]string, error) {
        f, err := os.Open(dirname)
        if err != nil {
+               if err.(*os.PathError).Err.(syscall.Errno) != syscall.ENOENT {
+                       panic(err)
+               }
                return nil, err
        }
        defer f.Close()

Unfortunately, this patch did not log any errors for very low ulimit -n values - such as in Scenario 2.

@evanw
Copy link
Owner

evanw commented Aug 25, 2020

Thanks for the report. I wasn't aware of this problem.

I think the WebAssembly version can't reproduce this issue because it's single-threaded, and so only opens one file at a time. This also happens with my Linux VM when I only allocate one CPU to it. I can only reproduce this on the Linux VM when it has more than one CPU.

After some investigation, I don't think I can make this work for extremely small ulimit values because the runtime may open certain files by itself outside of the control of esbuild (e.g. libpthread.so on Linux). But I can make sure esbuild doesn't open too many files simultaneously, which will work for normal ulimit values. For example, experimentation shows that limiting esbuild to a maximum of 16 concurrent file operations doesn't noticeably slow down the benchmark for me (although I do have a SSD, not a magnetic drive).

@kzc
Copy link
Contributor Author

kzc commented Aug 25, 2020

I think limiting esbuild to, say, a maximum of 64 open fds at any one time would be ideal - it's not a web server, after all. I'm not suggesting esbuild ought to work with fewer than 16 fds, rather it ought to report the nature of the file failures more accurately and should fail outright if the situation warrants it.

@evanw
Copy link
Owner

evanw commented Aug 25, 2020

Yes, I agree. That's why I left this open. I'll try to fix that too. I'd rather feed the error through to the log instead of causing a panic though. That will also give more context on the failure.

@kzc
Copy link
Contributor Author

kzc commented Aug 25, 2020

Good stuff. The panic is also not library friendly.

react-admin/examples/simple is the gift that keeps on giving.

For the record this project has motivated me to learn Go. Half second build time for esbuild itself - that's crazy. A decade in JavaScript and NodeJS has warped my expectations of what a good language + platform should be.

@kzc
Copy link
Contributor Author

kzc commented Aug 25, 2020

fwiw, 878781c did not work on my Mac with ulimit -n 256.

Just to convince myself I was using my build from latest master, I changed Could not to Couldn't globally in internal/bundler/bundler.go. Sure enough, 10 Couldn't resolve errors when building react-admin/examples/simple.

@kzc
Copy link
Contributor Author

kzc commented Aug 26, 2020

878781c is also non-deterministic. With ulimit -n 476 react-admin/examples/simple fails to build with resolve errors 30% of the time. When it does fail it can have from 1 to 10 errors. My Mac has an Intel Core i7 with 4 cores.

@evanw
Copy link
Owner

evanw commented Aug 26, 2020

That's strange. I can't reproduce that on my machine. I only get problems with react-admin/examples/simple when I set ulimit -n 14 or lower (tested out of 100 runs).

What happens when you set the concurrency lower (the fileOpenLimit variable)? Is it still a problem with 1 concurrent file operation? What about 16? When I set esbuild to 1 concurrent file operation, I only get errors with ulimit -n 4 or below.

Is it possible that https://github.com/macports/macports-legacy-support and/or cgo on an unsupported platform is causing a lot of file handles to be used and not reclaimed?

@kzc
Copy link
Contributor Author

kzc commented Aug 26, 2020

  • fileOpenLimit of 1 with ulimit -n 476 - 50 / 50 runs successful - wall time 0.29s.
  • fileOpenLimit of 1 with ulimit -n 475 - 50 / 50 runs successful - wall time 0.29s.
  • fileOpenLimit of 1 with ulimit -n 474 - 100% failure in 10 runs.
  • fileOpenLimit of 2 with ulimit -n 476 - 50 / 50 runs successful - wall time 0.26s.
  • fileOpenLimit of 4 with ulimit -n 476 - 43 / 50 runs successful - wall time of 0.25s when successful.
  • fileOpenLimit of 4 with ulimit -n 256 - 100% failure in 10 runs.

Is it possible that https://github.com/macports/macports-legacy-support is causing a lot of file handles to be used and not reclaimed?

Anything is possible. I did not review the code. The only system call I needed was _fdopendir$INODE64.

Are you certain there isn't a race condition in esbuild?

@evanw
Copy link
Owner

evanw commented Aug 26, 2020

I tried checking this by running the benchmark with dtrace after setting fileOpenLimit to 1. In that situation, esbuild should only ever open one file descriptor at a time. This should be demonstrated by the trace, which should only ever open and close file descriptor 3 (since 0, 1, and 2, are stdin, stdout, and stderr).

In my case it actually uses file descriptors 3 and 4 because it looks like Go's implementation of readdir duplicates the file descriptor, and that appears to be necessary according to the Go source code because the Darwin fdopendir syscall consumes the file descriptor passed to it. This explains why esbuild with a fileOpenLimit of 1 only fails for me when using ulimit -n 4 or lower.

But even in this case, limiting fileOpenLimit to 32 (the current limit) should only use up to 64 file descriptors in the worst case. So that doesn't explain your problem. Can you post the output of sudo dtruss [the esbuild command] (or an equivalent syscall trace)? Maybe that will help explain your problem.

@kzc
Copy link
Contributor Author

kzc commented Aug 26, 2020

Here's another data point - when esbuild-wasm is built with 9882ae3 the lowest ulimit -n that succeeds in building react-admin/examples/simple is 53.

If this patch is applied:

--- a/internal/fs/fs.go
+++ b/internal/fs/fs.go
@@ -233 +233 @@ type realFS struct {
-var fileOpenLimit = make(chan bool, 32)
+var fileOpenLimit = make(chan bool, 1)

and esbuild-wasm is rebuilt then the lowest ulimit -n that succeeds in building react-admin/examples/simple is 23. The patched version of esbuild-wasm takes 6.785s to run versus 5.795s for the unmodified 9882ae3.

So there's quite a lot of variability even for a single-threaded wasm runtime.

@kzc
Copy link
Contributor Author

kzc commented Aug 26, 2020

sudo dtruss fails with:

dyld: Symbol not found: _fdopendir$INODE64

I can't get it to respect the DYLD_INSERT_LIBRARIES setting. I'll look at it again tomorrow.

While I am the first to acknowledge that I'm running Go on an unsupported platform, it does work and did find a couple of genuine problems. Setting a file descriptor limit to 500+ is not a showstopper. But I wonder if other UNIX like platforms might have similar behavior.

@kzc
Copy link
Contributor Author

kzc commented Aug 27, 2020

I think I've isolated the file descriptor leak on my mac to File.Readdirnames(-1). It leaks one fd per invocation on my machine as verified by a small standalone test program.

Although it appears that there's a fd leak in an error condition in macports-legacy-support/src/fdopendir.c it doesn't arise in typical use as verified by an fprintf to stderr.

--- a/src/fdopendir.c
+++ b/src/fdopendir.c
@@ -58,6 +58,7 @@ DIR *fdopendir(int dirfd) {
     dir = opendir (".");
 
     if (best_fchdir(oldCWD) < 0) {
+        if (dir) PROTECT_ERRNO(closedir(dir));
         if (oldCWD != -1) PROTECT_ERRNO(close(oldCWD));
         return 0;
     }

The untested patch above would probably address that edge case in macports-legacy-support, but it's not the cause of exhausting the low ulimit -n value discussed in above posts. I don't see a fd leak in the typical non-error code path in fdopendir.

File.Readdirnames calls File.readdirnames which in turn calls FD.OpenDir which dup's the fd before it is passed to fdopendir.

I'm not sure where closedir is ultimately called on the dir returned by FD.OpenDir encapsulating the dup'd fd. If it's not called that would explain the fd leak, but then it would occur on all darwin boxes, not just my unsupported version of Mac OSX. So it's a bit of a head scratcher.

@kzc
Copy link
Contributor Author

kzc commented Aug 27, 2020

I'm closing this issue because the main issues were addressed in the latest release. The native binary file leak seems to be specific to my machine.

@kzc kzc closed this as completed Aug 27, 2020
@kzc
Copy link
Contributor Author

kzc commented Aug 27, 2020

Today I learned that Go has finalizers. This could explain how the dup'd fd encapsulated in a dir and returned by FD.OpenDir (used by File.Readdirnames) is ultimately closed by dirInfo.close with closedir. But to my knowledge it would only happen non-deterministically as result of GC pressure. But if GC is disabled in the esbuild CLI then I don't know how this would be initiated - but even then it would not be guaranteed. Enabling GC in my local esbuild binary does not remedy the fd leak.

@kzc
Copy link
Contributor Author

kzc commented Aug 27, 2020

I figured out the solution to the File.Readdirnames fd leak on my unsupported version of Mac OSX. The problem was in the macports-legacy-support fdopendir call after all. Upon success it was supposed to take control of the directory file descriptor parameter dirfd. Because of the nature of the implementation, dirfd could be closed within the same function because it was no longer needed.

--- a/src/fdopendir.c
+++ b/src/fdopendir.c
@@ -30,6 +30,18 @@ int best_fchdir(int dirfd);
 
 #define PROTECT_ERRNO(what)  ({ int __err = (errno); what; errno = __err; })
 
+/*
+  https://www.freebsd.org/cgi/man.cgi?query=fdopendir&sektion=3
+
+  Upon successful return from fdopendir(), the file descriptor is under the
+  control of the system, and if any attempt is made to close the file
+  descriptor, or to modify the state of the associated description other
+  than by means of closedir(), readdir(), readdir_r(), or rewinddir(), the
+  behavior is undefined.  Upon calling closedir() the file descriptor is
+  closed.  The FD_CLOEXEC flag is set on the file descriptor by a
+  successful call to fdopendir().
+*/
+
 DIR *fdopendir(int dirfd) {
     DIR *dir;
     struct stat st;
@@ -43,8 +55,12 @@ DIR *fdopendir(int dirfd) {
         return 0;
     }
 
-    if (dirfd == AT_FDCWD)
-        return opendir (".");
+    if (dirfd == AT_FDCWD) {
+        dir = opendir (".");
+        /* dirfd can be closed only upon success */
+        if (dir) PROTECT_ERRNO(close(dirfd));
+        return dir;
+    }
 
     oldCWD = open(".", O_RDONLY);
     if (oldCWD == -1)
@@ -58,6 +74,7 @@ DIR *fdopendir(int dirfd) {
     dir = opendir (".");
 
     if (best_fchdir(oldCWD) < 0) {
+        if (dir) PROTECT_ERRNO(closedir(dir));
         if (oldCWD != -1) PROTECT_ERRNO(close(oldCWD));
         return 0;
     }
@@ -65,6 +82,9 @@ DIR *fdopendir(int dirfd) {
     if (oldCWD != -1)
         PROTECT_ERRNO(close(oldCWD));
 
+    /* dirfd can be closed only upon success */
+    if (dir && dirfd != -1) PROTECT_ERRNO(close(dirfd));
+
     return dir;
 }

With this patch react-admin/examples/simple can be successfully bundled with native esbuild on my Mac with ulimit -n 16. I'll upstream the fix to macports-legacy-support.

@evanw
Copy link
Owner

evanw commented Aug 28, 2020

That's awesome! Congrats on figuring out the underlying issue. It sounds like that was pretty challenging.

@kzc
Copy link
Contributor Author

kzc commented Aug 28, 2020

Thanks. I certainly learned a lot about Go as result.

@airhorns

This comment has been minimized.

@kzc
Copy link
Contributor Author

kzc commented Mar 31, 2021

@airhorns In the absence of a reproducible test case, try the debugging techniques in the top post. Consider opening a new issue as few will follow a closed one.

/unsubscribing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants