Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: replace naive search with naive + BMH in Buffer::IndexOf #2539

Closed
wants to merge 1 commit into from

Conversation

skomski
Copy link
Contributor

@skomski skomski commented Aug 25, 2015

Adds the string search implementation from v8
which uses naive search if pattern length < 8
or to a specific badness then uses Boyer-Moore-Horspool

Added benchmark shows the expected improvements
Added option to use ucs2 encoding with Buffer::IndexOf

Since v8 is BSD licensed it should be no problem to use the code if credited.

buffers/buffer-indexof.js search=@ iter=1: ./iojs: 3523000 ./iojs_naive: 423220 .............................................................. 732.42%
buffers/buffer-indexof.js search=SQ iter=1: ./iojs: 822020 ./iojs_naive: 67540 .............................................................. 1117.08%
buffers/buffer-indexof.js search=10x iter=1: ./iojs: 3168900 ./iojs_naive: 664590 ............................................................ 376.82%
buffers/buffer-indexof.js search=--l iter=1: ./iojs: 167290 ./iojs_naive: 14779 ............................................................. 1031.95%
buffers/buffer-indexof.js search=Alice iter=1: ./iojs: 4228800 ./iojs_naive: 4458200 .......................................................... -5.15%
buffers/buffer-indexof.js search=Gryphon iter=1: ./iojs: 269360 ./iojs_naive: 14296 ......................................................... 1784.19%
buffers/buffer-indexof.js search=Panther iter=1: ./iojs: 218090 ./iojs_naive: 12590 ......................................................... 1632.31%
buffers/buffer-indexof.js search=Ou est ma chatte? iter=1: ./iojs: 99297 ./iojs_naive: 50987 .................................................. 94.75%
buffers/buffer-indexof.js search=found it very iter=1: ./iojs: 18249 ./iojs_naive: 11225 ...................................................... 62.57%
buffers/buffer-indexof.js search=among mad people iter=1: ./iojs: 31875 ./iojs_naive: 8616 ................................................... 269.95%
buffers/buffer-indexof.js search=neighbouring pool iter=1: ./iojs: 17349 ./iojs_naive: 4754.1 ................................................ 264.93%
buffers/buffer-indexof.js search=Soo--oop iter=1: ./iojs: 13625 ./iojs_naive: 12154 ........................................................... 12.10%
buffers/buffer-indexof.js search=aaaaaaaaaaaaaaaaa iter=1: ./iojs: 23441 ./iojs_naive: 4263.1 ................................................ 449.86%
buffers/buffer-indexof.js search=venture to go near the house till she had brought herself down to iter=1: ./iojs: 75611 ./iojs_naive: 20240 . 273.58%
buffers/buffer-indexof.js search=</i> to the Caterpillar iter=1: ./iojs: 26146 ./iojs_naive: 9589.9 .......................................... 172.64%

@Fishrock123 Fishrock123 added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. labels Aug 25, 2015
@mscdex
Copy link
Contributor

mscdex commented Aug 25, 2015

+1

@YurySolovyov
Copy link

Would this make it into 4.0 ?

@rvagg
Copy link
Member

rvagg commented Aug 26, 2015

/cc @trevnorris

void IndexOfString(const FunctionCallbackInfo<Value>& args) {
ASSERT(args[1]->IsString());
ASSERT(args[2]->IsNumber());

enum encoding enc = ParseEncoding(args.GetIsolate(),
args[3].As<String>(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From JS you'll need to do if (!encoding) encoding = 'utf8';. Then from here add CHECK(args[3]->IsString()); just above this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least, if args[3].IsEmpty() then .As<String>() will abort in debug mode. Need to make sure it's a string first.

@trevnorris
Copy link
Contributor

Not sure how I feel about benchmark/fixtures/alice.html, but also don't have a better idea.

Probably not for this PR, but I think a useful addition (especially now that performance will make this more usable) is the addition of a fromIndex argument. The signature may possibly be
indexOf(value[, fromIndex = 0][, encoding = 'utf8']).

Did a quick initial review. Also want @bnoordhuis to take a look.

@skomski Thanks much for the PR and for fixing my initial naive implementation.

@skomski skomski force-pushed the add-fast-string-search branch 2 times, most recently from 0aa8558 to decd947 Compare August 26, 2015 07:45
@skomski
Copy link
Contributor Author

skomski commented Aug 26, 2015

@trevnorris Buffer.indexOf already takes the argument byteOffset. It would make sense to rename the current argument byteOffset to fromIndex.

return kLatin1AlphabetSize;
} else {
// UC16 needle.
return kUC16AlphabetSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a static_assert(sizeof(PatternChar) == sizeof(uint16_t), "sizeof(PatternChar) == sizeof(uint16_t)") here?

@trevnorris
Copy link
Contributor

Buffer.indexOf already takes the argument byteOffset. It would make sense to rename the current argument byteOffset to fromIndex.

Oh yup. You're right. Missed the int32_t offset = args[2]->Int32Value(); since it was moved down further. Thanks for pointing it out.

As for changing fromIndex to byteOffset, do you mean just changing the name?

node::Utf8Value str(args.GetIsolate(), args[1]);
int32_t offset_i32 = args[2]->Int32Value();
uint32_t offset;
Local<String> needle = args[1]->ToString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we've confirmed what the argument type is, use args[1].As<String>(); instead. Skips some internal type checking V8 performs.

@skomski skomski force-pushed the add-fast-string-search branch 2 times, most recently from e59db22 to abb35e9 Compare August 27, 2015 14:01
@skomski
Copy link
Contributor Author

skomski commented Aug 27, 2015

Always specifying an encoding like below is too much of a performance drop.

+++ lib/buffer.js
@@ -399,6 +399,9 @@ Buffer.prototype.indexOf = function indexOf(val, byteOffset, encoding) {
     byteOffset = -0x80000000;
   byteOffset >>= 0;

+  if (encoding === undefined)
+    encoding = 'utf8';
+
   if (typeof val === 'string') {
     if (encoding === undefined ||
         encoding === 'ucs2' ||
diff --git src/node_buffer.cc src/node_buffer.cc
index ab9d69e..2650d82 100644
--- src/node_buffer.cc
+++ src/node_buffer.cc
@@ -785,9 +785,7 @@ void IndexOfString(const FunctionCallbackInfo<Value>& args) {
   ASSERT(args[1]->IsString());
   ASSERT(args[2]->IsNumber());

-  enum encoding enc =
-      args[3].IsEmpty() ? UTF8 : ParseEncoding(args.GetIsolate(),
-                                               args[3].As<String>(), UTF8);
+  enum encoding enc = ParseEncoding(args.GetIsolate(), args[3].As<String>(), UTF8);

   THROW_AND_RETURN_UNLESS_BUFFER(Environment::GetCurrent(args), args[0]);
   SPREAD_ARG(args[0], ts_obj);
buffers/buffer-indexof.js search=@ iter=1: ./iojs_fastenc: 3203100 ./iojs: 2550500 ............................................................. 25.59%
buffers/buffer-indexof.js search=SQ iter=1: ./iojs_fastenc: 735410 ./iojs: 698660 ............................................................... 5.26%
buffers/buffer-indexof.js search=10x iter=1: ./iojs_fastenc: 2731700 ./iojs: 2237400 ........................................................... 22.10%
buffers/buffer-indexof.js search=--l iter=1: ./iojs_fastenc: 162590 ./iojs: 155260 .............................................................. 4.72%
buffers/buffer-indexof.js search=Alice iter=1: ./iojs_fastenc: 3695700 ./iojs: 2179300 ......................................................... 69.58%
buffers/buffer-indexof.js search=Gryphon iter=1: ./iojs_fastenc: 242620 ./iojs: 245240 ......................................................... -1.07%
buffers/buffer-indexof.js search=Panther iter=1: ./iojs_fastenc: 179150 ./iojs: 154900 ......................................................... 15.66%
buffers/buffer-indexof.js search=Ou est ma chatte? iter=1: ./iojs_fastenc: 106870 ./iojs: 105720 ................................................ 1.08%
buffers/buffer-indexof.js search=found it very iter=1: ./iojs_fastenc: 19512 ./iojs: 19658 ..................................................... -0.74%
buffers/buffer-indexof.js search=among mad people iter=1: ./iojs_fastenc: 33341 ./iojs: 33526 .................................................. -0.55%
buffers/buffer-indexof.js search=neighbouring pool iter=1: ./iojs_fastenc: 18422 ./iojs: 18294 .................................................. 0.70%
buffers/buffer-indexof.js search=Soo--oop iter=1: ./iojs_fastenc: 14654 ./iojs: 14619 ........................................................... 0.24%
buffers/buffer-indexof.js search=aaaaaaaaaaaaaaaaa iter=1: ./iojs_fastenc: 24591 ./iojs: 24638 ................................................. -0.19%
buffers/buffer-indexof.js search=venture to go near the house till she had brought herself down to iter=1: ./iojs_fastenc: 79716 ./iojs: 79876 . -0.20%
buffers/buffer-indexof.js search=</i> to the Caterpillar iter=1: ./iojs_fastenc: 27657 ./iojs: 28041 ........................................... -1.37%

@skomski skomski force-pushed the add-fast-string-search branch from abb35e9 to 07bc1e5 Compare August 27, 2015 14:30
@trevnorris
Copy link
Contributor

@skomski From your diff it looks like a default string is always sent. So there's no need to check args[3].IsEmpty().

Though I also have a hard time taking that at face value. The performance drop from that patch should be proportional to the number of time indexOf() is called. Which is not the case.

if (encoding === undefined ||
encoding === 'ucs2' ||
encoding === 'utf16le' ||
encoding === 'utf8') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the encoding is case insensitive. look at Buffer.isEncoding() to see how we do it now to speed up the common case.

@skomski skomski force-pushed the add-fast-string-search branch from 9369e24 to b2088ce Compare September 21, 2015 19:30
}

result = SearchString(reinterpret_cast<const uint16_t*>(haystack),
haystack_length / 2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a concern if user does this on an odd length buffer?

@trevnorris
Copy link
Contributor

Looks great. Amazing set of tests. If CI is happy then LGTM.

CI: https://ci.nodejs.org/job/node-test-pull-request/374/

@jasnell
Copy link
Member

jasnell commented Oct 8, 2015

Hmm.. previous test run on this failed but it's not obvious if the failure was related. Running another CI run just to make sure. If it passes, will land in master and on v4.x

https://ci.nodejs.org/job/node-test-pull-request/448/

@jasnell
Copy link
Member

jasnell commented Oct 8, 2015

Some lint issues (https://ci.nodejs.org/job/node-linter/893/). I'll fix those manually when I land

jasnell pushed a commit that referenced this pull request Oct 8, 2015
Adds the string search implementation from v8
which uses naive search if pattern length < 8
or to a specific badness then uses Boyer-Moore-Horspool

Added benchmark shows the expected improvements
Added option to use ucs2 encoding with Buffer::IndexOf

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
PR-URL: #2539
@jasnell
Copy link
Member

jasnell commented Oct 8, 2015

Landed in a18dd7b

@jasnell jasnell closed this Oct 8, 2015
jasnell pushed a commit that referenced this pull request Oct 8, 2015
Adds the string search implementation from v8
which uses naive search if pattern length < 8
or to a specific badness then uses Boyer-Moore-Horspool

Added benchmark shows the expected improvements
Added option to use ucs2 encoding with Buffer::IndexOf

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
PR-URL: #2539
@jasnell
Copy link
Member

jasnell commented Oct 9, 2015

This change has introduced a regression on Big-Endian machines. Buffer.indexOf with a string parameter is now broken. This is blocking #3258 now and needs a fix before we can progress.

Also, given that this adds a parameter to buffer.indexOf, it really ought to have been marked semver-minor. My bad as I missed that in my review.

@skomski, I believe that the issue has to do with the byte ordering of the string value being passed in to Buffer.indexOf but haven't yet been able to test my theory. A bit later today I'll be able to get into the BE machine and test the theory but if you happen to get in there sooner, please let me know.

@jasnell jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 9, 2015
@skomski
Copy link
Contributor Author

skomski commented Oct 9, 2015

@jasnell Pushed a fix at #3295

@jasnell
Copy link
Member

jasnell commented Oct 9, 2015

Ok thank you! I'll be able to look shortly.
On Oct 9, 2015 9:30 AM, "Karl Skomski" notifications@github.com wrote:

@jasnell /~https://github.com/jasnell Pushed a fix at #3295
#3295


Reply to this email directly or view it on GitHub
#2539 (comment).

@jasnell
Copy link
Member

jasnell commented Oct 9, 2015

@mhdawson ... Can you look at this and verify that it fixes the problem you
were seeing. Looks like it should after a quick review
On Oct 9, 2015 9:31 AM, "James M Snell" jasnell@gmail.com wrote:

Ok thank you! I'll be able to look shortly.
On Oct 9, 2015 9:30 AM, "Karl Skomski" notifications@github.com wrote:

@jasnell /~https://github.com/jasnell Pushed a fix at #3295
#3295


Reply to this email directly or view it on GitHub
#2539 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants