-
Notifications
You must be signed in to change notification settings - Fork 451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
$ doesn't match CRLF #244
Comments
Could you please provide a test case that doesn't act as you expect? |
Sure, here is a program that I would expect to print "Matched: true", but it prints "Matched: false":
|
That is indeed expected behavior. |
@BurntSushi Would treating |
@BatmanAoD That sounds feasible from an implementation perspective, but I'm not a fan of implementing something that's wrong. (Essentially no systems use |
But the existing implementation is more wrong, so I'm not sure I understand that as an objection. |
(Also, just last week I actually did run into something that uses |
I just ran into this when trying to use it to parse some data coming in via HTTP. This is incredibly confusing at the very leasts as this is the only regex engine I know with this behavior. |
I believe RE2 also has this behavior. `\r?$` may be a not-ideal work around.
…On Mar 22, 2017 3:03 PM, "Armin Ronacher" ***@***.***> wrote:
I just ran into this when trying to use it to parse some data coming in
via HTTP. This is incredibly confusing at the very leasts as this is the
only regex engine I know with this behavior.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#244 (comment)>,
or mute the thread
</~https://github.com/notifications/unsubscribe-auth/AAb34k3E9W0z4yc8ouT2YOg23NVVsbWpks5roXCIgaJpZM4IwcKx>
.
|
I'm going to re-open this, but I don't have any immediate plans to work on it. |
Just wanted to throw in that I was wrong. It's indeed the same behavior in Python and Go as well. I did never notice in the former because a similar code I was working with stripped the EOL chars whereas the API I used in rust only split on \n and left an empty \r hanging around. So when parts of the data were recombined into a string later the \r were left in there. |
So the regex behavior here does match other engines dispite what I said earlier. |
@mitsuhiko Oh interesting. I should have known for Go, but it's interesting to see that Python doesn't do it either:
So I guess we're in good company? |
@BurntSushi That's odd. I've used |
@BatmanAoD In Python, I believe if you open your files with |
|
Java's regex also matches
prints |
For Javascript and Java, does
Is there any objection to simply treating every Unicode line terminator character sequence as a match for |
JavaScript matches them alone: > '1\n2\r3\r\n4'.split(/$/m)
[ '1', '\n2', '\r3', '\r', '\n4' ] |
@mitsuhiko Hmm. If the interpreter here is correct, JavaScript also returns a four-element array here: |
@BatmanAoD which browser are you using? Chrome, Firefox and Safari gives me 5 elements. Same with JavaScript Core and V8 in node. |
Sorry, I meant 5 elements, splitting on each of the There should only be 3 elements (2 matches). |
I'm not sure. The above behavior makes perfect sense if you go by unicodes classification of newline characters. I find that behavior quite good because it means that matching with |
I don't believe you're interpreting that list of Unicode newline character-sequences correctly, because they list A common pattern is to use |
Java does exactly the right thing:
This prints:
I.e., the first |
(To be more precise about what I mean by "the right thing": Java's behavior is, as far as I can tell, exactly equivalent to the behavior we'd get from implementing my proposal of treating |
Unicode does not define control characters. Unicode only has recommendations on newline handling and the recommendations and those would be "convert from platform newline characters to LS or PS" and then back which I think nobody does. So I think unicode in itself is unclear on it. However unicode has guidelines on regular expressions: These two things apply:
as well as
WRT to Java behavior from above it yields this (in pseudocode): > '1\n2\r3\r\n4'.split(/$/m)
['1', '\n2', '\r3', '\r\n4'] |
@BurntSushi Have you read what we were planning to do in .NET? I was assigned to go code it but didn't get to it yet... we decided it would be best to follow the TR18: dotnet/runtime#25598 (comment) |
@jzabroski Yes. .NET is a backtracking based implementation, right? In that context, the implementation is far more flexible. UTS#18 is a bit of a tortured document, where the writers of it were clearly aware that some of their recommendations would be quite difficult to satisfy in the context of finite automata based regex engines, which is the case here. In particular, I will definitely not be doing this:
Otherwise, treating |
I think it's entirely reasonable to unconditionally consider
That sounds right to me. As a user, I can't imagine a scenario where
I assume that's exactly why they phrased it the way they did, i.e., they're talking about implementing a "multiline" mode (mentioned above the list), such as this library's |
Ah I see. Yeah, in that case, |
OK, I'm happy to report that this feature should land once #656 is complete. I have it working. Would folks like to review the docs for it? /// Enable or disable the "CRLF mode" flag by default.
///
/// By default this is disabled. It may alternatively be selectively
/// enabled in the regular expression itself via the `R` flag.
///
/// When CRLF mode is enabled, the following happens:
///
/// * Unless `dot_matches_new_line` is enabled, `.` will match any character
/// except for `\r` and `\n`.
/// * When `multi_line` mode is enabled, `^` and `$` will treat `\r\n`,
/// `\r` and `\n` as line terminators. And in particular, neither will
/// match between a `\r` and a `\n`.
The key things to note here are:
For a deeper dive, here's a smattering of tests showing CRLF mode semantics: # This is a basic test that checks ^ and $ treat \r\n as a single line
# terminator. If ^ and $ only treated \n as a line terminator, then this would
# only match 'xyz' at the end of the haystack.
[[test]]
name = "basic"
regex = '(?mR)^[a-z]+$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 3], [5, 8], [10, 13]]
# Tests that a CRLF-aware '^$' assertion does not match between CR and LF.
[[test]]
name = "start-end-non-empty"
regex = '(?mR)^$'
haystack = "abc\r\ndef\r\nxyz"
matches = []
# Tests that a CRLF-aware '^$' assertion matches the empty string, just like
# a non-CRLF-aware '^$' assertion.
[[test]]
name = "start-end-empty"
regex = '(?mR)^$'
haystack = ""
matches = [[0, 0]]
# Tests that a CRLF-aware '^$' assertion matches the empty string preceding
# and following a line terminator.
[[test]]
name = "start-end-before-after"
regex = '(?mR)^$'
haystack = "\r\n"
matches = [[0, 0], [2, 2]]
# Tests that a CRLF-aware '^' assertion does not split a line terminator.
[[test]]
name = "start-no-split"
regex = '(?mR)^'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 0], [5, 5], [10, 10]]
# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "start-no-split-adjacent"
regex = '(?mR)^'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]
# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "start-no-split-adjacent-cr"
regex = '(?mR)^'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]
# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "start-no-split-adjacent-lf"
regex = '(?mR)^'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]
# Tests that a CRLF-aware '$' assertion does not split a line terminator.
[[test]]
name = "end-no-split"
regex = '(?mR)$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[3, 3], [8, 8], [13, 13]]
# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "end-no-split-adjacent"
regex = '(?mR)$'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]
# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "end-no-split-adjacent-cr"
regex = '(?mR)$'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]
# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "end-no-split-adjacent-lf"
regex = '(?mR)$'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]
# Tests that '.' does not match either \r or \n when CRLF mode is enabled. Note
# that this doesn't require multi-line mode to be enabled.
[[test]]
name = "dot-no-crlf"
regex = '(?R).'
haystack = "\r\n\r\n\r\n"
matches = [] It was quite gnarly to add, and in so doing, I actually uncovered a bug in the lazy DFA (present in both the status quo and my rewrite): # A variant of the problem described here:
# /~https://github.com/google/re2/blob/89567f5de5b23bb5ad0c26cbafc10bdc7389d1fa/re2/dfa.cc#L658-L667
[[test]]
name = "alt-with-assertion-repetition"
regex = '(?:\b|%)+'
haystack = "z%"
bounds = [1, 2]
anchored = true
# This was reporting [1, 2], which is
# not the correct leftmost-first match.
matches = [[1, 1]] |
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](/~https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](/~https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #​1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](/~https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #​244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #​259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #​476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #​644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #​675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #​824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #​961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #​1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #​68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #​510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@​\w+` will scan for `@`. - [PERF #​787](rust-lang/regex#787), [PERF #​891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #​429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #​517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #​579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #​779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #​850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #​921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #​976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #​1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](/~https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9--> Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
I created a regex with
multi_line
set totrue
, but after debugging why the regex was matching in a unittest but not in a file, I found out that$
isn't matching the end of a line in the file. I'm using Windows so the newlines are\r\n
.The text was updated successfully, but these errors were encountered: