-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: Buffer.lastIndexOf #4846
Conversation
cc @trevnorris :) |
Oh wow, I've been waiting years for this :) This should make writing parsers a lot nicer :) |
@mikeal cool. I think I'm more or less done! performanceI've refactored so that both To do this with a minimum of messiness and without any code duplication, I expanded the This works because testingI added some additional test cases, to make sure that all of the search algorithms were being exercised. (Background: Under the hood, the existing Surprisingly, the existing test cases never seem to exercise the Boyer-Moore fallback at all! I added a test case that does go there. If you want to reproduce the output below, compile with
(Above: existing test cases for (Below: new test cases, added in this PR, for
|
Also, this PR uses Unfortunately
|
Thanks for tackling this PR, @dcposch! |
@@ -847,31 +882,25 @@ void IndexOfString(const FunctionCallbackInfo<Value>& args) { | |||
SPREAD_ARG(args[0], ts_obj); | |||
|
|||
Local<String> needle = args[1].As<String>(); | |||
int64_t offset_i64 = args[2]->IntegerValue(); | |||
bool is_forward = args[4]->BooleanValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faster to do IsTrue()
Started to review, but have to step away. Will finish this tomorrow. @dcposch Nice job on the inline comments. Makes the patch easier to follow. Especially for one this size. |
@trevnorris thx. Fixed all the things you pointed out so far |
} else if (byteOffset < -0x80000000) { | ||
byteOffset = -0x80000000; | ||
} | ||
if (typeof byteOffset !== 'number' || isNaN(byteOffset)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary. To follow string.lastIndexOf()
the offset should be coerced to a primitive. So for example 'abcde'.lastIndexOf('c', [1]) === -1
. Basically the byteOffset >>= 0
below should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need this to match the behavior of Buffer.indexOf
buf.indexOf('foo')
searches the whole buffer, as doesbuf.indexOf('foo', null)
,buf.indexOf('foo', 'foo')
, etcbuf.indexOf('foo', 0)
searches starting from index 0, which also searches the whole bufferbuf.lastIndexOf('foo')
should def search the whole buffer, butbuf.lastIndexOf('foo', 0)
does a reverse search starting from index 0, so it only checks for a match at index 0
So a minimum, we have to special-case undefined
I think it's best if buf.lastIndexOf('foo', null)
, buf.lastIndexOf('foo', NaN)
etc match buf.lastIndexOf('foo')
-- in other words, they should search the whole buffer. That means they're NOT equivalent to buf.lastIndexOf('foo', 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The operation is as simple as offset = +offset
. This will coerce all isNaN()
values to NaN
, which can then be checked by Number.isNaN()
. It will also coerce values like [2]
to 2
, which is also how String#lastIndexOf()
operates.
So any value that returns true for Number.isNaN()
after the coercion is set to the default value. Though note this does exclude null
. Which is the same way strings work. e.g. 'abc'.lastIndexOf('b', null) === -1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. i added a test case to ensure 'abc'.lastIndexOf('b', null) === -1
@trevnorris notice me senpai |
@dcposch some of us who work on this more do take the weekends off. ;) |
#define DEBUG_TRACE(s) printf("%s search %s\n", \ | ||
subject.forward() ? "forward" : "reverse", s); | ||
#else | ||
#define DEBUG_TRACE(s) // no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis have any comments on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd leave this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. Note though that this was the only way I noticed some serious missing test coverage: the whole Boyer-Moore algorithm (nontrivial code, rarely exercised) was never reached by the unit tests before. (The unit tests previously only cover Boyer-Moore-Horspool, Linear, and Single-Char. After this PR they cover Boyer-Moore as well.)
@dcposch Left few more comments. Now that the weekend is over will be more attentive. :) |
@trevnorris fixed. Thanks for checking it out! |
Excellent work. CI: https://ci.nodejs.org/job/node-test-pull-request/1515/ |
@trevnorris I clicked Authorize, but it says
|
Sorry @dcposch, we have CI in lockdown until we get our security releases out, you'll have to rely on collaborators to get you info on how the jobs have gone until it's opened back up again next week. / #4857 There are compile errors on OSX:
Windows
smartos
the test also failed to run on armv8 but that was a jenkins problem as far as I can tell. Looks like there's still quite a bit of work to do on cross-platform compat here. |
@rvagg @trevnorris thanks! Yeah this goes back to my first question at the top of the PR, about whether node builds can use I added a fallback for those systems. LMK if the CI is happier now! |
Whatever makes it in before I start working on it on Monday ;)
|
Seems there's a minor performance hit with this PR, but I believe it's within an acceptable level. Anyone have any objections? If not then let's land it. |
🚢 |
Works for me. |
@trevnorris ... there's still time to get this in. Is it ready to go? |
* Remove unnecessary templating from SearchString SearchString used to have separate PatternChar and SubjectChar template type arguments, apparently to support things like searching for an 8-bit string inside a 16-bit string or vice versa. However, SearchString is only used from node_buffer.cc, where PatternChar and SubjectChar are always the same. Since this is extra complexity that's unused and untested (simplifying to a single Char template argument still compiles and didn't break any unit tests), I removed it. * Use Boyer-Hoore[-Horspool] for both indexOf and lastIndexOf Add test cases for lastIndexOf. Test the fallback from BMH to Boyer-Moore, which looks like it was totally untested before. * Extra bounds checks in node_buffer.cc * Extra asserts in string_search.h * Buffer.lastIndexOf: clean up, enforce consistency w/ String.lastIndexOf * Polyfill memrchr(3) for non-GNU systems
@jasnell fixed |
One final CI before landing: https://ci.nodejs.org/job/node-test-pull-request/2371/ |
@jasnell looks like Failing build: https://ci.nodejs.org/job/node-test-commit-freebsd/2167/ |
Sweet, everything worked that time including FreeBSD |
* Remove unnecessary templating from SearchString SearchString used to have separate PatternChar and SubjectChar template type arguments, apparently to support things like searching for an 8-bit string inside a 16-bit string or vice versa. However, SearchString is only used from node_buffer.cc, where PatternChar and SubjectChar are always the same. Since this is extra complexity that's unused and untested (simplifying to a single Char template argument still compiles and didn't break any unit tests), I removed it. * Use Boyer-Hoore[-Horspool] for both indexOf and lastIndexOf Add test cases for lastIndexOf. Test the fallback from BMH to Boyer-Moore, which looks like it was totally untested before. * Extra bounds checks in node_buffer.cc * Extra asserts in string_search.h * Buffer.lastIndexOf: clean up, enforce consistency w/ String.lastIndexOf * Polyfill memrchr(3) for non-GNU systems PR-URL: #4846 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Landed in 6c1e5ad |
* Remove unnecessary templating from SearchString SearchString used to have separate PatternChar and SubjectChar template type arguments, apparently to support things like searching for an 8-bit string inside a 16-bit string or vice versa. However, SearchString is only used from node_buffer.cc, where PatternChar and SubjectChar are always the same. Since this is extra complexity that's unused and untested (simplifying to a single Char template argument still compiles and didn't break any unit tests), I removed it. * Use Boyer-Hoore[-Horspool] for both indexOf and lastIndexOf Add test cases for lastIndexOf. Test the fallback from BMH to Boyer-Moore, which looks like it was totally untested before. * Extra bounds checks in node_buffer.cc * Extra asserts in string_search.h * Buffer.lastIndexOf: clean up, enforce consistency w/ String.lastIndexOf * Polyfill memrchr(3) for non-GNU systems PR-URL: nodejs#4846 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Trevor Norris <trev.norris@gmail.com>
@jasnell Was the squash/merge button used? I'm trying to figure out why the |
No, I squashed like normal. Didn't notice that the author changed :-/
|
Strange. Oh well. Nothing serious. |
* Remove unnecessary templating from SearchString SearchString used to have separate PatternChar and SubjectChar template type arguments, apparently to support things like searching for an 8-bit string inside a 16-bit string or vice versa. However, SearchString is only used from node_buffer.cc, where PatternChar and SubjectChar are always the same. Since this is extra complexity that's unused and untested (simplifying to a single Char template argument still compiles and didn't break any unit tests), I removed it. * Use Boyer-Hoore[-Horspool] for both indexOf and lastIndexOf Add test cases for lastIndexOf. Test the fallback from BMH to Boyer-Moore, which looks like it was totally untested before. * Extra bounds checks in node_buffer.cc * Extra asserts in string_search.h * Buffer.lastIndexOf: clean up, enforce consistency w/ String.lastIndexOf * Polyfill memrchr(3) for non-GNU systems PR-URL: #4846 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Fixes #4604
work done
Buffer.lastIndexOf
to matchBuffer.indexOf
Buffer.indexOf
Uint8Array.lastIndexOf
, which is now shadowed byBuffer.lastIndexOf
, so existing code should continue to workwork left to do
string_search.cc
is naive and just uses a double for loop. Ideally we'd adaptBoyerMooreSearch
to support reverse search, so thatlastIndexOf
will be equally fast asindexOf