Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A proposal to add fs.scandir method to FS module #15699

Closed
mrmlnc opened this issue Sep 30, 2017 · 7 comments
Closed

A proposal to add fs.scandir method to FS module #15699

mrmlnc opened this issue Sep 30, 2017 · 7 comments
Labels
blocked PRs that are blocked by other issues or PRs. feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. libuv Issues and PRs related to the libuv dependency or the uv binding.

Comments

@mrmlnc
Copy link

mrmlnc commented Sep 30, 2017

Problem

Now any interaction with files and directories in the File System is as follows:

const fs = require('fs');

const entries = fs.readdirSync('path_to_directory');
const stats = entries.map(fs.statSync);  // Unnecessary File System access

const dirs = [];
const files = [];

entries.forEach((entry, index) => {
    if (stats[index].isDirectory()) {
        dirs.push(entry);
    } else {
        files.push(entry);
    }
});

The problem here is that we are call File System a second time due to the fact that we don't know the directory in front of us, or file (or symlink).

But we can reduce twice File System calls by creating fs.scandir method that can return d_name and d_type. This information is returned from uv_dirent_t (scandir) (libuv). For example, this is implemented in the Luvit and pyuv (also use libuv).

Motivation

  • Performance – reduce twice File System calls
  • More consistency with other platforms/languages
  • Should be easy to implement StringObjectd_name + d_type
    • For fs.readdir: return converted Object to String
    • For fs.scandir: return As is
  • Early speed up for node-glob (also for each package that uses fs.readdir for traversing directories) in most cases (need fs.stat when d_type is a DT_UNKNOWN on the old FS)

Proposed solution

Add a methods fs.scandir and fs.scandirSync in a standard Fyle System module.

const fs = require('fs');

const entries = fs.scandirSync('path_to_directory');

const dirs = [];
const files = [];

entries.forEach((entry) => {
    if (entry.type === fs.constants.S_IFDIR) {
        dirs.push(entry);
    } else {
        files.push(entry);
    }
});

Where entries is an array of objects:

  • name {String} – filename (d_name in libuv)
  • type {Number} – fs.constants.S_* (d_type in libuv)

Final words

Now I solved this problem by creating C++ Addon but... But I don't speak C++ or speak but very bad (i try 😅 ) and it requires you compile when you install a package that requires additional manipulation to the end user (like https://github.com/nodejs/node-gyp#on-windows).

@addaleax addaleax added feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. labels Sep 30, 2017
@mrmlnc mrmlnc changed the title Ability to return d_name and d_type from fs.readdir A proposal to add fs.scandir method to FS module Oct 1, 2017
@mscdex
Copy link
Contributor

mscdex commented Oct 1, 2017

I think if we're going to introduce scandir*() methods, they should work more or less exactly the same as the underlying C function of the same name. One benefit of such a function over the readdir*() methods is that there would be no forced buffering of entry names, which is nice for very large directories. To improve performance, we could allow an option that dictates how many directories to buffer before calling out to the JS callbacks.

@bnoordhuis
Copy link
Member

For context: libuv/libuv#416 - stalled, and itself a continuation of an older, also stalled PR.

@Fishrock123 Fishrock123 added the libuv Issues and PRs related to the libuv dependency or the uv binding. label Oct 5, 2017
@rdeforest
Copy link

Speaking from the wild (20+ years of syseng experience): large directories are a recurring ops problem I use as an interview question because I have seen it multiple times, and as recently as 2013. Even at companies like Disney and Amazon, there are developers who don't see a problem with using a directory as a flat key-value store, or creating empty files and never removing them. Eventually an operator like me has to do something with hundreds of millions of files all in the same directory. In the past I've used Perl to deal with it, but I fell out of love with Perl years ago.

The usual tools are usually useless specifically because (as I found out with strace) they stat each entry. While stat itself isn't super expensive, it ends up driving the memory and CPU cost of a scan higher than it needs to be if there's enough information in the file name to make decisions from.

(Though while writing this I discovered 'ls -1' uses mmap and does not perform any stat()s.)

If/when y'all decide to tackle this issue, I would ask that you also offer fs.*dir* so that I can do precisely what I want to:

cleanUp = (problemDir, prefix) ->
  new Promise (resolve, reject) ->
    fs.opendir problemDir, (err, handle) ->
      reject err

      do next = ->
        handle.readdir (err, stat) ->
          switch
            when err              then reject err
            when not stat         then resolve()
            when stat.isDirectory then next()

            when (fileName = stat.name).startsWith prefix
              fullPath = path.resolve problemDir, fileName

              fs.unlink fullPath, (err) ->
                if err then reject err else next()

            else next()

cleanUp 'incoming', 'system.system'
  .then -> console.log 'Completed'
  .catch (err) -> console.log err

That said, this kind of feature probably has a small audience and third-party
modules exist which addess
he problem, so I woundn't blame you if this never became a high priority.

@YurySolovyov
Copy link

Another not mutually exclusive option is to add a new type of stream that emits scandir-like entries or full-blown stats

@caccialdo
Copy link

caccialdo commented Oct 25, 2017

Not essential but it would be nice to have an option to produce a recursive scan. I guess it wouldn’t follow symlinks though to avoid falling into a closed loop

@rektide
Copy link

rektide commented Dec 8, 2017

This would be great to have.

I made a very simple parallel find implementation to see if I could use node's async nature to keep a higher queue depth. I'm not sure, but I really think the lack of exposure to dirent- specifically that inability to see that an entry is a directory- is what keeps Node from trouncing find.

I went looking into libuv and found that libuv does expose the dirent type, &c, it's just Node's readdir that is limiting. Then I found this thread. scandir would help my immediate problem.

But also: having access to things like inodes, extents, &c would open up Node's viability for a wide array of system tools. I'd expect 95% of use cases to be recursing directory trees, but I wanted to call out that there's a lot of other good helpful systems stuff that scandir does.

Proposal: add an option to fs.readdir to get full directory entities. Scandir is useful for recursing, but there's still really good useful things that can be done with readdir(3) that libuv permits, but Node doesn't. I'd love to see dirents results available for Node's readdir!

@targos targos added the blocked PRs that are blocked by other issues or PRs. label Jul 12, 2018
bengl added a commit to bengl/node that referenced this issue Aug 13, 2018
readdir and readdirSync now have a "withFileTypes" option, which, when
enabled, provides an array of DirectoryEntry objects, similar to Stats
objects, which have the filename and the type information.

Ref: nodejs#15699
bengl added a commit that referenced this issue Aug 14, 2018
readdir and readdirSync now have a "withFileTypes" option, which, when
enabled, provides an array of DirectoryEntry objects, similar to Stats
objects, which have the filename and the type information.

Refs: #15699

PR-URL: #22020
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Roman Reiss <me@silverwind.io>
Reviewed-By: John-David Dalton <john.david.dalton@gmail.com>
targos pushed a commit that referenced this issue Aug 19, 2018
readdir and readdirSync now have a "withFileTypes" option, which, when
enabled, provides an array of DirectoryEntry objects, similar to Stats
objects, which have the filename and the type information.

Refs: #15699

PR-URL: #22020
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Roman Reiss <me@silverwind.io>
Reviewed-By: John-David Dalton <john.david.dalton@gmail.com>
targos pushed a commit that referenced this issue Sep 3, 2018
readdir and readdirSync now have a "withFileTypes" option, which, when
enabled, provides an array of DirectoryEntry objects, similar to Stats
objects, which have the filename and the type information.

Refs: #15699

PR-URL: #22020
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Roman Reiss <me@silverwind.io>
Reviewed-By: John-David Dalton <john.david.dalton@gmail.com>
@silverwind
Copy link
Contributor

This was fixed in #22020.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked PRs that are blocked by other issues or PRs. feature request Issues that request new features to be added to Node.js. fs Issues and PRs related to the fs subsystem / file system. libuv Issues and PRs related to the libuv dependency or the uv binding.
Projects
None yet
Development

No branches or pull requests