-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: replace naive search with naive + BMH in Buffer::IndexOf #2539
Conversation
+1 |
Would this make it into 4.0 ? |
/cc @trevnorris |
void IndexOfString(const FunctionCallbackInfo<Value>& args) { | ||
ASSERT(args[1]->IsString()); | ||
ASSERT(args[2]->IsNumber()); | ||
|
||
enum encoding enc = ParseEncoding(args.GetIsolate(), | ||
args[3].As<String>(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From JS you'll need to do if (!encoding) encoding = 'utf8';
. Then from here add CHECK(args[3]->IsString());
just above this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least, if args[3].IsEmpty()
then .As<String>()
will abort in debug mode. Need to make sure it's a string first.
Not sure how I feel about Probably not for this PR, but I think a useful addition (especially now that performance will make this more usable) is the addition of a Did a quick initial review. Also want @bnoordhuis to take a look. @skomski Thanks much for the PR and for fixing my initial naive implementation. |
0aa8558
to
decd947
Compare
@trevnorris |
return kLatin1AlphabetSize; | ||
} else { | ||
// UC16 needle. | ||
return kUC16AlphabetSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a static_assert(sizeof(PatternChar) == sizeof(uint16_t), "sizeof(PatternChar) == sizeof(uint16_t)")
here?
Oh yup. You're right. Missed the As for changing |
node::Utf8Value str(args.GetIsolate(), args[1]); | ||
int32_t offset_i32 = args[2]->Int32Value(); | ||
uint32_t offset; | ||
Local<String> needle = args[1]->ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we've confirmed what the argument type is, use args[1].As<String>();
instead. Skips some internal type checking V8 performs.
e59db22
to
abb35e9
Compare
Always specifying an encoding like below is too much of a performance drop.
|
abb35e9
to
07bc1e5
Compare
@skomski From your diff it looks like a default string is always sent. So there's no need to check Though I also have a hard time taking that at face value. The performance drop from that patch should be proportional to the number of time |
if (encoding === undefined || | ||
encoding === 'ucs2' || | ||
encoding === 'utf16le' || | ||
encoding === 'utf8') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the encoding is case insensitive. look at Buffer.isEncoding()
to see how we do it now to speed up the common case.
9369e24
to
b2088ce
Compare
} | ||
|
||
result = SearchString(reinterpret_cast<const uint16_t*>(haystack), | ||
haystack_length / 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a concern if user does this on an odd length buffer?
Looks great. Amazing set of tests. If CI is happy then LGTM. |
Hmm.. previous test run on this failed but it's not obvious if the failure was related. Running another CI run just to make sure. If it passes, will land in master and on v4.x |
Some lint issues (https://ci.nodejs.org/job/node-linter/893/). I'll fix those manually when I land |
Adds the string search implementation from v8 which uses naive search if pattern length < 8 or to a specific badness then uses Boyer-Moore-Horspool Added benchmark shows the expected improvements Added option to use ucs2 encoding with Buffer::IndexOf Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Trevor Norris <trev.norris@gmail.com> PR-URL: #2539
Landed in a18dd7b |
Adds the string search implementation from v8 which uses naive search if pattern length < 8 or to a specific badness then uses Boyer-Moore-Horspool Added benchmark shows the expected improvements Added option to use ucs2 encoding with Buffer::IndexOf Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Trevor Norris <trev.norris@gmail.com> PR-URL: #2539
This change has introduced a regression on Big-Endian machines. Buffer.indexOf with a string parameter is now broken. This is blocking #3258 now and needs a fix before we can progress. Also, given that this adds a parameter to buffer.indexOf, it really ought to have been marked @skomski, I believe that the issue has to do with the byte ordering of the string value being passed in to Buffer.indexOf but haven't yet been able to test my theory. A bit later today I'll be able to get into the BE machine and test the theory but if you happen to get in there sooner, please let me know. |
Ok thank you! I'll be able to look shortly.
|
@mhdawson ... Can you look at this and verify that it fixes the problem you
|
Adds the string search implementation from v8
which uses naive search if pattern length < 8
or to a specific badness then uses Boyer-Moore-Horspool
Added benchmark shows the expected improvements
Added option to use ucs2 encoding with Buffer::IndexOf
Since v8 is BSD licensed it should be no problem to use the code if credited.