Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(parser): use faster string parser methods #8227

Merged
merged 6 commits into from
Oct 28, 2023

Conversation

sno2
Copy link
Contributor

@sno2 sno2 commented Oct 25, 2023

Summary

This makes use of memchr and other methods to parse the strings (hopefully) faster. It might also be worth converting the parse_fstring_middle helper to use similar techniques, but I did not implement it in this PR.

Test Plan

This was tested using the existing tests and passed all of them.

This makes use of memchr for parsing strings. It sadly does introduce
one use of `unsafe` to create a string that is valid to pass into
`u32::from_str_radix` because I was unable to find another method that
does not require far more code than required with `unsafe`.
@codspeed-hq
Copy link

codspeed-hq bot commented Oct 25, 2023

CodSpeed Performance Report

Merging #8227 will improve performances by 22.78%

Comparing sno2:perf/parser-string (28b823f) with main (9792b15)

Summary

⚡ 5 improvements
✅ 20 untouched benchmarks

Benchmarks breakdown

Benchmark main sno2:perf/parser-string Change
linter/all-rules[numpy/globals.py] 4.1 ms 3.9 ms +6.14%
parser[unicode/pypinyin.py] 4.1 ms 3.9 ms +4.83%
parser[numpy/ctypeslib.py] 12 ms 11.2 ms +7.36%
parser[numpy/globals.py] 1.3 ms 1.1 ms +22.78%
linter/default-rules[numpy/globals.py] 2 ms 1.7 ms +15.48%

@sno2 sno2 marked this pull request as ready for review October 25, 2023 21:10
@github-actions
Copy link
Contributor

github-actions bot commented Oct 25, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

@MichaReiser
Copy link
Member

Wow, that's amazing. We had it on our bucket list to rewrite the String parsing to use our Cursor implementation that is also used by the Lexer and should be easier to optimize by the compiler.

I hope to find some time soon to review this PR.

@charliermarsh
Copy link
Member

This is really cool, thank you for putting this together.

multi-byte UTF-8 characters
@sno2
Copy link
Contributor Author

sno2 commented Oct 26, 2023

Thank you, I also noticed another panic while looking through the code so hopefully there won't be any more panics in here :)

@MichaReiser MichaReiser added the parser Related to the parser label Oct 27, 2023
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! And it's good to see how much potential there still is to improve our parser.

I would prefer if we could split the memchr usage out of this PR and submit it as its own PR to better assess whether replacing find with memchr is worth it.

It would be nice if we could explore using Cursor for StringParser as part of another PR. Cursor is what we use in the Lexer and other places where we need to parse text. Using Cursor everywhere has the benefit that maintainers are familiar with it, simplifying code reviews and code maintenance.

crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved

if name.len() > MAX_UNICODE_NAME {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this check is no longer necessary? Is it because the optimisation (never was) is no longer necessary because the operation above is so fast and unicode_names2::character handles it for us?

Copy link
Contributor Author

@sno2 sno2 Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed like a code smell to me- I did not understand why we should optimize for a fail state as obscure as a unicode escape name > 80 characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the relevant issue: RustPython/RustPython#3798

The constant value is now publicly available: /~https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/generator/src/lib.rs#L340, so the following should work.

use unicode_names2::MAX_NAME_LENGTH;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest we re-add -- costs us very little (nothing?) and gives us an error rather than a panic, if I understand this conversation correctly.

Copy link
Contributor Author

@sno2 sno2 Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have validated that the error does not exist by testing the previous reproduction. The issue was fixed in the crate here progval/unicode_names2@9404fb6 (Note that it is included in the 1.2.0 tag that we are using)

I did not realize that the motivation was to fix a previous panic in the crate and not a performance trick. Therefore, should we be fine not adding in the magic constants again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent -- thank you for testing this.

crates/ruff_python_parser/src/string.rs Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Show resolved Hide resolved
@sno2
Copy link
Contributor Author

sno2 commented Oct 27, 2023

It would be nice if we could explore using Cursor for StringParser as part of another PR. Cursor is what we use in the Lexer and other places where we need to parse text. Using Cursor everywhere has the benefit that maintainers are familiar with it, simplifying code reviews and code maintenance.

Agree, I believe we could use this technique fairly cleanly in both the lexer and parser with something like Cursor::eat_until_byte{1,2} which returns an Option<&str>.

Copy link
Member

@dhruvmanila dhruvmanila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is pretty neat. Thanks for doing this!

crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Show resolved Hide resolved

if name.len() > MAX_UNICODE_NAME {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the relevant issue: RustPython/RustPython#3798

The constant value is now publicly available: /~https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/generator/src/lib.rs#L340, so the following should work.

use unicode_names2::MAX_NAME_LENGTH;

@dhruvmanila dhruvmanila added the performance Potential performance improvement label Oct 27, 2023
Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
@sno2
Copy link
Contributor Author

sno2 commented Oct 27, 2023

@dhruvmanila The MAX_NAME_LENGTH is public in the generated file. But, the file that uses the constants does not mark them as public:

/~https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/src/lib.rs#L70-L72

Would you like for me to re-copy the constant into our source?

(The reply box is not underneath your response for some reason.)

@charliermarsh charliermarsh merged commit 2f5734d into astral-sh:main Oct 28, 2023
17 checks passed
@charliermarsh
Copy link
Member

Thanks @sno2, great to have you contributing!

@sno2 sno2 deleted the perf/parser-string branch October 28, 2023 23:01
BurntSushi added a commit that referenced this pull request Nov 10, 2023
While the usage looks correct, the use of `unsafe` here does not seem
justified to me. Namely, it's already doing integer parsing. And perhaps
most importantly, this is for parsing an octal literal which are likely
to be rare enough to not have a major impact on perf. (And it's not like
UTF-8 validation is slow.)

This was originally introduced in #8227 and it doesn't look like
unchecked string conversion was the main point there.
BurntSushi added a commit that referenced this pull request Nov 10, 2023
While the usage looks correct, the use of `unsafe` here does not seem
justified to me. Namely, it's already doing integer parsing. And perhaps
most importantly, this is for parsing an octal literal which are likely
to be rare enough to not have a major impact on perf. (And it's not like
UTF-8 validation is slow.)

This was originally introduced in #8227 and it doesn't look like
unchecked string conversion was the main point there.
BurntSushi added a commit that referenced this pull request Nov 10, 2023
While the usage looks correct, the use of `unsafe` here does not seem
justified to me. Namely, it's already doing integer parsing. And perhaps
most importantly, this is for parsing an octal literal which are likely
to be rare enough to not have a major impact on perf. (And it's not like
UTF-8 validation is slow.)

This was originally introduced in #8227 and it doesn't look like
unchecked string conversion was the main point there.
BurntSushi added a commit that referenced this pull request Nov 27, 2023
While the usage looks correct, the use of `unsafe` here does not seem
justified to me. Namely, it's already doing integer parsing. And perhaps
most importantly, this is for parsing an octal literal which are likely
to be rare enough to not have a major impact on perf. (And it's not like
UTF-8 validation is slow.)

This was originally introduced in #8227 and it doesn't look like
unchecked string conversion was the main point there.
BurntSushi added a commit that referenced this pull request Nov 28, 2023
While the usage looks correct, the use of `unsafe` here does not seem
justified to me. Namely, it's already doing integer parsing. And perhaps
most importantly, this is for parsing an octal literal which are likely
to be rare enough to not have a major impact on perf. (And it's not like
UTF-8 validation is slow.)

This was originally introduced in #8227 and it doesn't look like
unchecked string conversion was the main point there.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants