path: add path.glob
#47490
CC @anonrig WDYT? |
You have to be careful as both bash and Linux are under the GPL license |
This can't be merged as-is for the reason @targos mentions. I suggest closing this and starting over from a clean slate. |
@MoLow I think this is a better approach than the previous one. Thank you for valuing my opinion. Happy to help on the C++ side. |
@targos @bnoordhuis according to the source code this specific module is licenced under MIT as well: |
bdbd172 to 9ae0a18
207625e to f21664c
 * it against the remaining unmatched tail of str. Return false
 * on mismatch, or true after matching the trailing nul bytes.
 */
for (;;) {
Nit: There is no need to iterate each character. input.find_first_of("?*[") until the input is finished might produce better results if performance is the priority. If not, we can keep it as it is, and later improve it. Your call.
I'm not sure I understand these comments. Is there really motivation to diverge from the implementation lifted from glob rather than benefit from its updates?
If we have improvement suggestions, wouldn't it be better to upstream them to linux glob?
(Genuinely asking; it's just not really intuitive to me.)
unsigned char d = *pat++;

switch (d) {
case '?': /* Wildcard: anything but nul */
Nit: This switch case can be removed and be simplified for better performance.
Can you update documentation too?
The notable-change label has been added. Please suggest a text for the release notes if you'd like to include a more detailed summary, then proceed to update the PR description with the text or a link to the notable change suggested text comment. |
Nice! Left some comments. There are also some things about the licensing I'm unsure about.
@@ -52,6 +52,7 @@
V(mksnapshot) \
V(options) \
V(os) \
V(path) \
Not a strong opinion, but maybe we should name the binding and file glob instead of path?
@cjihrig I'm planning on moving a couple of functions to C++ (actually done the same exact change), so this change is beneficial for me too.
Usage-Guide:
  To use the MIT License put the following SPDX tag/value pair into a
  comment according to the placement guidelines in the licensing rules
  documentation:
Are we following this?
These are instructions for maintainers of the Linux kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/LICENSES/preferred/MIT?h=v6.3-rc6
I am not really sure how we should embed this, since it only really appears here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/glob.c?h=v6.3-rc6#n10
See also the comment on top of #define MODULE_LICENSE:
/*
* The following license idents are currently accepted as indicating free
* software modules
*
* "GPL" [GNU Public License v2]
* "GPL v2" [GNU Public License v2]
* "GPL and additional rights" [GNU Public License v2 rights and more]
* "Dual BSD/GPL" [GNU Public License v2
* or BSD license choice]
* "Dual MIT/GPL" [GNU Public License v2
* or MIT license choice]
* "Dual MPL/GPL" [GNU Public License v2
* or Mozilla license choice]
*
* The following other idents are available
*
* "Proprietary" [Non free products]
*
* Both "GPL v2" and "GPL" (the latter also in dual licensed strings) are
* merely stating that the module is licensed under the GPL v2, but are not
* telling whether "GPL v2 only" or "GPL v2 or later". The reason why there
* are two variants is a historic and failed attempt to convey more
* information in the MODULE_LICENSE string. For module loading the
* "only/or later" distinction is completely irrelevant and does neither
* replace the proper license identifiers in the corresponding source file
* nor amends them in any way. The sole purpose is to make the
* 'Proprietary' flagging work and to refuse to bind symbols which are
* exported with EXPORT_SYMBOL_GPL when a non free module is loaded.
*
* In the same way "BSD" is not a clear license information. It merely
* states, that the module is licensed under one of the compatible BSD
* license variants. The detailed and correct license information is again
* to be found in the corresponding source files.
*
* There are dual licensed components, but when running with Linux it is the
* GPL that is relevant so this is a non issue. Similarly LGPL linked with GPL
* is a GPL combined work.
*
* This exists for several reasons
* 1. So modinfo can show license info for users wanting to vet their setup
* is free
* 2. So the community can ignore bug reports including proprietary modules
* 3. So vendors can do likewise based on their own policies
*/
I did what I best understood from what I found in the Linux kernel and in other license examples, so if anyone can help out with what should really be used as the license, I'd appreciate it.
I have played around with this a bit, and the globbing algorithm is much more limited than I had thought. I am looking into alternatives. |
@Trott @joesepi (pinging because you're our OpenJS CPC delegates), can either of you help with getting in touch with legal from the Linux Foundation to make sure the license was understood correctly and we are in compliance? I remember that once upon a time in the Node.js foundation days that's something that wasn't too hard to do and we'd all like to avoid potential issues with this. |
I honestly personally prefer a plain JS implementation if the performance delta isn't big because it's more obviously secure and we haven't seen a benchmark to indicate this is actually faster at all. I'm also ignorant about what use cases linux glob was optimized for.
That said - this looks correct and I'm fine with it landing with green CI and we can always switch the implementation later.
Given that the original feature request (#40731) as well as the previous PR (#47486) specifically targeted
|
Do not assume that any particular glob implementation in C will be faster. The pattern expansion implementation found in bash and zsh is likely to have serious performance issues in node.

Shell expansion is optimized for intuitive shell UX and correct behavior in low-memory systems, and will likely be a footgun if anyone puts it to serious use in a node program at runtime. Node programs need to go fast, and they're memory hogs; that's the exact opposite of what Bash's glob is designed for.

I have benchmarked node-glob, fast-glob, and globby extensively, and the source of latency is almost entirely the overhead of accessing the filesystem, with GC cleanup a distant second. The bash implementation (at least last I checked) is an absolute glutton with syscalls, caching almost nothing, walking directories multiple times, etc. And that's fine, because that's what it's for. It's built to be intuitive, consistent, and efficient with memory usage, not to be fast. There's even a note in

Just to caveat this, of course, I haven't closely investigated bash's glob since bash 5.0 came out a few years ago, but I doubt that much has changed, since the intended use case is still the same, and most potential improvements to performance would involve breaking changes in behavior.

The original implementation of node-glob was a binding to Guido van Rossum's

Also, as it seems @MoLow is finding, what we colloquially think of as "glob" is actually a few different things: posix regular expression classes, globstar, extended glob patterns, brace expansion, variable string expansion, and then file path portion expansion. If you pull in just that last part, but don't support any of the rest, it's going to be very underwhelming. As I said previously:
|
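To illustrate why these are separate layers, here is a minimal sketch of brace expansion as a pure string preprocessing step that runs before any path matching. The function name `expandBraces` is hypothetical, and the sketch assumes non-nested, comma-separated groups only; it is not how bash, minimatch, or node-glob implement it.

```javascript
// Hypothetical helper: expand the first {a,b,...} group, then recurse on the
// results. Assumes groups are not nested; a group without a comma is left as-is.
function expandBraces(pattern) {
  const match = /\{([^{}]*,[^{}]*)\}/.exec(pattern);
  if (match === null) return [pattern];

  const before = pattern.slice(0, match.index);
  const after = pattern.slice(match.index + match[0].length);

  // Each alternative yields a new pattern that may contain further groups.
  return match[1]
    .split(',')
    .flatMap((alternative) => expandBraces(before + alternative + after));
}
```

Each resulting string would then be handed to whatever matcher handles `*`, `?`, and character classes, which is why a matcher alone does not give users what they usually expect from "glob".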
@isaacs Thank you very much for your detailed comment!
yes, I tried adding more tests from the
the reason why we want glob in the core is for features in core such as |
If you really want to have globbing built into core, the reasonable thing to do is to pull in either fast-glob or node-glob. They're both "big" complex modules that do a lot of stuff and have a bunch of optimizations and moving parts, around 5-10 kLoC each, but they're extremely well tested, complete, and performant. The choice depends ultimately on what kind of features/behavior/performance are desired.

Writing a brand new clean-room implementation is an option, but probably the costliest option possible, and given that fast-glob and node-glob evolved to be so similar to one another just by following the path of optimization, it's reasonable to assume that you'll end up with a very similar implementation anyway. Skip the effort, get the good result.

Also: expanding globs to a set of filesystem entries should really live in

Along the way, you'll need a comprehensive file-system walking implementation. (Bash's glob does not have one. It caches nothing and just keeps recursing until done. It's the slowest possible way to tackle this problem, but also the most straightforward and memory efficient, and the approach node-glob used to use.) Fast-glob uses |
Why not pull it in as a dep and also expose it? (Not challenging, genuinely curious.)

```js
const { minimatch } = require('node:internal/lib/minimatch.js')
path.match = (pattern, pathLike) => minimatch.match(pattern, pathLike)
const { glob, globSync, globIterate, globIterateSync } = require('node:internal/lib/glob.js')
fs.glob = (pattern, options, cb) => {
  glob(pattern, options).then(res => cb(null, res), cb)
}
fs.globSync = globSync
fs.promises.glob = glob
// ...
```

Like, ok, it's a lot of code, but you'll end up writing more or less the same code anyway to implement it (or wish you had, if you ship bash's globs lol)

edit: Oh, I see, something to do with not using Primordials. I'm not sure what those are 😅 |
there were concerns raised about that #47486 (comment) |
@isaacs primordials are basically frozen references to JS built-ins, kept so that they can't be tampered with |
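As a rough illustration of the idea (a sketch, not Node core's actual primordials module): a reference to a built-in is captured once, before user code runs, and core code calls the captured reference instead of looking the method up on the prototype. The name `ArrayPrototypeMap` mirrors the naming style used in core; the tampering below is simulated for the demo.

```javascript
// Capture the built-in up front. The receiver becomes the first argument,
// so the call no longer depends on a lookup through Array.prototype.
const ArrayPrototypeMap = Function.prototype.call.bind(Array.prototype.map);

// Simulate user code replacing the built-in afterwards.
const originalMap = Array.prototype.map;
Array.prototype.map = function tampered() {
  return ['tampered'];
};

// A normal method call goes through the patched prototype...
const viaPrototype = [1, 2, 3].map((x) => x * 2);
// ...while the captured reference still reaches the original built-in.
const viaCaptured = ArrayPrototypeMap([1, 2, 3], (x) => x * 2);

// Restore the prototype so the demo has no lasting side effects.
Array.prototype.map = originalMap;
```

This is why a userland module such as glob cannot simply be dropped into core unchanged: every `arr.map(...)`-style call would have to be rewritten to go through captured references.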
Also, take a look at the benchmark scripts in https://github.com/isaacs/node-glob. I don't think a node core implementation needs to necessarily be faster than fast-glob or node-glob, but it should be within that range to not be a hazard if it's exposed at all. |
Ah, nice. Yeah, porting a module as involved as glob (+path-scurry, +minimatch, etc.) would be quite a bit of work, and likely make it painful to pull future patches. Maybe it'd be possible to "mimic" some of that hardening in the library itself (or in a standalone userland module), and then just swap out |
@isaacs If I understand correctly there is an effort to do something like this: #41439 |
Oh, neat. That really needs a streaming API (ideally not a Stream per se, but a low level "get the next thing" kinda like the opendir/scandir stuff) as well as a way to filter whether a directory gets descended into or not, or it won't be of much use for a glob walker. Eg: https://github.com/isaacs/node-glob/blob/main/src/walker.ts#L238-L309 |
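A minimal sketch of the shape described above, assuming a pull-based generator built on `fs.opendirSync` and a caller-supplied predicate that decides whether a directory is descended into. The names `walkSync` and `shouldDescend` are hypothetical and are not taken from #41439 or node-glob.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Yields one path at a time ("get the next thing"); nothing is read from
// disk until the consumer asks for the next entry.
function* walkSync(root, shouldDescend = () => true) {
  const dir = fs.opendirSync(root);
  try {
    let entry;
    while ((entry = dir.readSync()) !== null) {
      const full = path.join(root, entry.name);
      yield full;
      // The filter runs before recursing, so a pruned subtree is never opened.
      if (entry.isDirectory() && shouldDescend(full, entry)) {
        yield* walkSync(full, shouldDescend);
      }
    }
  } finally {
    dir.closeSync();
  }
}
```

A glob walker could pass a predicate such as `(full, entry) => entry.name !== 'node_modules'`, or one derived from the pattern itself, so that directories which cannot contain a match are skipped entirely.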
PR-URL: nodejs#47499
Refs: nodejs#47490
Refs: nodejs#47486
Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com>
Reviewed-By: Robert Nagy <ronagy@icloud.com>
an alternative for #47486 (suggested at #40731 (comment))
as my experience in cpp is much more limited than in js - this code probably has many issues, so feel free to comment on anything.
the glob implementation is taken from the linux kernel; it complies with https://man7.org/linux/man-pages/man7/glob.7.html and is licensed under MIT
this implementation lacks some features that minimatch has, such as brace expansion, but according to my tests Python comes with a similar implementation: https://docs.python.org/3/library/glob.html
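For a sense of what this kind of matcher covers, below is a rough JavaScript sketch of a wildcard matcher that keeps a single backtrack point, in the spirit of the kernel's `glob_match`. It is an illustration only, not the code in this PR: the name `globMatch` is hypothetical, and it handles just `?` and `*` (no `[...]` classes, no backslash escapes, no brace expansion). In this sketch `*` also matches `/`.

```javascript
// Hypothetical sketch: match `str` against `pattern` using only ? and *.
function globMatch(pattern, str) {
  let p = 0; // index into pattern
  let s = 0; // index into str
  let backP = -1; // pattern position just after the most recent '*'
  let backS = -1; // string position that '*' is currently assumed to end at

  for (;;) {
    const d = pattern[p]; // undefined once the pattern is exhausted
    const c = str[s]; // undefined once the string is exhausted

    if (d === '*') {
      p++;
      if (p === pattern.length) return true; // trailing '*' matches the rest
      backP = p;
      backS = s;
      continue;
    }
    if (c !== undefined && (d === '?' || d === c)) {
      p++;
      s++;
      continue;
    }
    if (c === undefined && d === undefined) return true; // both exhausted

    // Mismatch: retry from the last '*', letting it absorb one more character.
    if (c === undefined || backP === -1) return false;
    p = backP;
    s = ++backS;
  }
}
```

For example, `globMatch('*.js', 'index.js')` matches while `globMatch('*.js', 'index.ts')` does not, and a pattern such as `*.{js,ts}` is treated as literal braces because no expansion step exists.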