perf(parser): use faster string parser methods #8227
This makes use of memchr for parsing strings. Unfortunately, it introduces one use of `unsafe` to create a string that is valid to pass into `u32::from_str_radix`, because I could not find an alternative that avoids `unsafe` without requiring far more code.
CodSpeed Performance Report: Merging #8227 will improve performance by 22.78%.
PR Check Results: Ecosystem ✅ ecosystem check detected no changes.
Wow, that's amazing. We had it on our bucket list to rewrite the String parsing to use our I hope to find some time soon to review this PR.
This is really cool, thank you for putting this together.
multi-byte UTF-8 characters
Thank you, I also noticed another panic while looking through the code so hopefully there won't be any more panics in here :)
Excellent work! And it's good to see how much potential there still is to improve our parser.
I would prefer if we could split the `memchr` usage out of this PR and submit it as its own PR to better assess whether replacing `find` with `memchr` is worth it.
It would be nice if we could explore using `Cursor` for `StringParser` as part of another PR. `Cursor` is what we use in the `Lexer` and other places where we need to parse text. Using `Cursor` everywhere has the benefit that maintainers are familiar with it, simplifying code reviews and code maintenance.
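For readers unfamiliar with the trade-off discussed here, the following is a minimal sketch of the general idea only, not the code from this PR: locate the next backslash in one search and copy the literal run before it in bulk, rather than pushing one character at a time. The function names are hypothetical and escape handling is deliberately simplified. It uses only `str::find` from the standard library; with the `memchr` crate, the same search could be written as `memchr::memchr(b'\\', source.as_bytes())`.

```rust
/// Splits `source` at the first backslash, returning the literal run before
/// it and, if a backslash was found, the remainder starting at the backslash.
/// (Hypothetical helper, for illustration only.)
fn split_at_escape(source: &str) -> (&str, Option<&str>) {
    // A backslash is a single ASCII byte, so slicing at its index always
    // falls on a character boundary, even with multi-byte UTF-8 around it.
    match source.find('\\') {
        Some(index) => (&source[..index], Some(&source[index..])),
        None => (source, None),
    }
}

/// Copies literal runs in bulk. Simplification: the character following a
/// backslash is kept verbatim instead of being interpreted as an escape.
fn strip_backslashes(mut source: &str) -> String {
    let mut out = String::with_capacity(source.len());
    loop {
        let (literal, rest) = split_at_escape(source);
        out.push_str(literal);
        let rest = match rest {
            Some(rest) => rest,
            None => break,
        };
        // Skip the backslash itself (one byte), then copy the next character.
        let mut chars = rest[1..].chars();
        if let Some(c) = chars.next() {
            out.push(c);
        }
        source = chars.as_str();
    }
    out
}
```

Whether swapping `find` for `memchr` in a loop like this pays off depends on the input, which is why benchmarking it separately is reasonable.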
if name.len() > MAX_UNICODE_NAME {
Could you explain why this check is no longer necessary? Is it because the optimisation (if it ever was one) is no longer needed now that the operation above is so fast and `unicode_names2::character` handles it for us?
It seemed like a code smell to me. I did not understand why we should optimize for a fail state as obscure as a unicode escape name > 80 characters.
For reference, the relevant issue: RustPython/RustPython#3798
The constant value is now publicly available: https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/generator/src/lib.rs#L340, so the following should work.
use unicode_names2::MAX_NAME_LENGTH;
I'd suggest we re-add -- costs us very little (nothing?) and gives us an error rather than a panic, if I understand this conversation correctly.
I have validated that the error does not exist by testing the previous reproduction. The issue was fixed in the crate here: progval/unicode_names2@9404fb6. (Note that it is included in the 1.2.0 tag that we are using.)
I did not realize that the motivation was to fix a previous panic in the crate and not a performance trick. Therefore, should we be fine not adding in the magic constants again?
Excellent -- thank you for testing this.
Agree, I believe we could use this technique fairly cleanly in both the lexer and parser with something like
Wow, this is pretty neat. Thanks for doing this!
Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
@dhruvmanila The Would you like for me to re-copy the constant into our source? (The reply box is not underneath your response for some reason.)
Thanks @sno2, great to have you contributing!
While the usage looks correct, the use of `unsafe` here does not seem justified to me. Namely, it's already doing integer parsing. And perhaps most importantly, this is for parsing an octal literal which are likely to be rare enough to not have a major impact on perf. (And it's not like UTF-8 validation is slow.) This was originally introduced in #8227 and it doesn't look like unchecked string conversion was the main point there.
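To illustrate the point about avoiding `unsafe`, here is a minimal sketch, assuming the octal escape digits are available as a byte slice. The function name and signature are hypothetical and not taken from the parser. It uses the checked `std::str::from_utf8` before calling `u32::from_str_radix`; for at most three ASCII bytes the validation cost should be negligible.

```rust
/// Parses up to three leading octal digits from `bytes`, returning the
/// parsed value and the number of bytes consumed, or `None` if `bytes`
/// does not start with an octal digit. (Hypothetical helper.)
fn parse_octal_escape(bytes: &[u8]) -> Option<(u32, usize)> {
    // Count the leading octal digits, capped at three.
    let len = bytes
        .iter()
        .take(3)
        .take_while(|b| matches!(**b, b'0'..=b'7'))
        .count();
    if len == 0 {
        return None;
    }
    // Checked conversion instead of `from_utf8_unchecked`: the slice holds
    // only ASCII digits, so this cannot fail, but no `unsafe` is needed.
    let digits = std::str::from_utf8(&bytes[..len]).ok()?;
    let value = u32::from_str_radix(digits, 8).ok()?;
    Some((value, len))
}
```

For example, `parse_octal_escape(b"101rest")` yields the value 65 (the code point of `A`) with three bytes consumed.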
Summary
This makes use of memchr and other methods to parse the strings (hopefully) faster. It might also be worth converting the `parse_fstring_middle` helper to use similar techniques, but I did not implement it in this PR.
Test Plan
This was tested using the existing tests and passed all of them.