$ doesn't match CRLF #244

Closed
jminer opened this issue Jun 7, 2016 · 67 comments

Comments

@jminer

jminer commented Jun 7, 2016

I created a regex with multi_line set to true, but after debugging why the regex was matching in a unit test but not in a file, I found out that $ isn't matching the end of a line in the file. I'm using Windows, so the newlines are \r\n.

@BurntSushi
Member

Could you please provide a test case that doesn't act as you expect?

@jminer
Author

jminer commented Jun 7, 2016

Sure, here is a program that I would expect to print "Matched: true", but it prints "Matched: false":

extern crate regex;

use regex::RegexBuilder;

fn main() {
    let regex = RegexBuilder::new("^apple$").multi_line(true).compile().unwrap();
    let text = "\r\napple\r\nbanana";
    let mut matched = false;
    for _ in regex.captures_iter(text) {
        matched = true;
    }
    println!("Matched: {}", matched);
}

@BurntSushi
Member

That is indeed expected behavior. $ will only match \n in multi line mode. It's not clear to me whether supporting \r\n is feasible unfortunately.
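
As a rough illustration of that rule, here is a minimal hand-written sketch of what multi-line `^` and `$` check when only `\n` counts as a line terminator. It is not the crate's implementation, and the helper names (`is_start_lf`, `is_end_lf`, `find_whole_line_lf`) are made up for this example.

```rust
/// Multi-line `^` when only `\n` terminates a line (illustrative helper,
/// not part of the regex crate).
fn is_start_lf(haystack: &[u8], at: usize) -> bool {
    at == 0 || haystack[at - 1] == b'\n'
}

/// Multi-line `$` when only `\n` terminates a line.
fn is_end_lf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len() || haystack[at] == b'\n'
}

/// Returns the offset of the first occurrence of `word` that spans a whole
/// line, roughly what `(?m)^word$` would report under the LF-only rule.
fn find_whole_line_lf(haystack: &str, word: &str) -> Option<usize> {
    let bytes = haystack.as_bytes();
    haystack
        .match_indices(word)
        .map(|(start, _)| start)
        .find(|&start| {
            is_start_lf(bytes, start) && is_end_lf(bytes, start + word.len())
        })
}
```

With the haystack from the report, `find_whole_line_lf("\r\napple\r\nbanana", "apple")` returns `None`: the `^` check passes at offset 2, but the `$` check fails at offset 7 because the next byte is `\r`, not `\n`.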

@BatmanAoD
Member

@BurntSushi Would treating \r as an end-of-line character, and \n as non-EOL if preceded by \r, be an acceptable change? This should be doable in a DFA, I think, though it does mean two EOL states. It might be somewhat surprising behavior when \r is embedded in a line, but that seems like a much rarer case than EOL \r, and it's not even clear to me that treating carriage return as EOL is actually wrong.

@BurntSushi
Member

@BatmanAoD That sounds feasible from an implementation perspective, but I'm not a fan of implementing something that's wrong. (Essentially no systems use \r for line endings any more, and on Windows, it's \r\n, not \r.)

@BatmanAoD
Member

But the existing implementation is more wrong, so I'm not sure I understand that as an objection.

@BatmanAoD
Member

(Also, just last week I actually did run into something that uses \r on its own as EOL by default: Putty in serial mode! I was shocked.)

@mitsuhiko

I just ran into this when trying to use it to parse some data coming in via HTTP. This is incredibly confusing at the very least, as this is the only regex engine I know with this behavior.

@BurntSushi
Member

BurntSushi commented Mar 22, 2017 via email

@BurntSushi BurntSushi reopened this Mar 22, 2017
@BurntSushi
Member

I'm going to re-open this, but I don't have any immediate plans to work on it.

@mitsuhiko

Just wanted to throw in that I was wrong. It's indeed the same behavior in Python and Go as well. I never noticed in the former because similar code I was working with stripped the EOL chars, whereas the API I used in Rust only split on \n and left a stray \r hanging around.

So when parts of the data were recombined into a string later, the \r characters were left in there.
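
The difference described here can be reproduced with the standard library alone. The sketch below contrasts splitting on `\n` with `str::lines`; the two wrapper function names are made up for illustration.

```rust
/// Splits on `\n` only. A `\r` from a CRLF pair stays attached to the end
/// of each line.
fn split_on_lf(text: &str) -> Vec<&str> {
    text.split('\n').collect()
}

/// Uses `str::lines`, which treats both `\n` and `\r\n` as line endings and
/// strips them. A lone `\r` in the middle of the text is not a line ending.
fn split_lines(text: &str) -> Vec<&str> {
    text.lines().collect()
}
```

For `"apple\r\nbanana"`, the first function yields `"apple\r"` and `"banana"`, while the second yields `"apple"` and `"banana"`. Searching the first result with `$` under the LF-only rule is where the stray `\r` gets in the way.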

@mitsuhiko

So the regex behavior here does match other engines, despite what I said earlier.

@BurntSushi
Member

BurntSushi commented Mar 22, 2017

@mitsuhiko Oh interesting. I should have known for Go, but it's interesting to see that Python doesn't do it either:

>>> import re
>>> re.match('foo$', 'foo\n', re.MULTILINE) is not None
True
>>> re.match('foo$', 'foo\r\n', re.MULTILINE) is not None
False

So I guess we're in good company?

@BatmanAoD
Member

@BurntSushi That's odd. I've used re in Python 2.7 on Windows pretty extensively, and I'm sure I've used $, and I know the files I was working with had carriage returns in them. I would have thought I'd notice this! I suppose I must have stripped all my input lines before searching for patterns.

@BurntSushi
Member

@BatmanAoD In Python, I believe if you open your files with U (universal line mode?), then Python will do something clever automatically. Splitting by line and then searching will probably also do it.

@BurntSushi
Member

/foo$/m in Javascript does match foo\n and foo\r\n.

@jminer
Author

jminer commented Mar 23, 2017

Java's regex also matches \r\n by default:

Pattern p = Pattern.compile("foo$", Pattern.MULTILINE);
System.out.println(p.matcher("foo\r\n").find());

prints true.

@BatmanAoD
Member

For Javascript and Java, does \r on its own match as a newline anchor?

\r on its own is apparently considered a newline character by Unicode.

Is there any objection to simply treating every Unicode line terminator character sequence as a match for $? @BurntSushi, I know you said earlier that treating \r on its own as a line terminator would be "wrong", but I'm still not quite sure I see why you'd consider that to be incorrect behavior, even if it's not the norm for regex engines.

@mitsuhiko

JavaScript matches them alone:

> '1\n2\r3\r\n4'.split(/$/m)
[ '1', '\n2', '\r3', '\r', '\n4' ]

@BatmanAoD
Member

BatmanAoD commented Mar 23, 2017

@mitsuhiko Hmm. If the interpreter here is correct, JavaScript also returns a four-element array here: '1\r\n\r\n4'.split(/$/m) This is obviously not correct on Windows (there are only two line-endings there).

@mitsuhiko

@BatmanAoD which browser are you using? Chrome, Firefox and Safari give me 5 elements. Same with JavaScript Core and V8 in node.

@BatmanAoD
Member

Sorry, I meant 5 elements, splitting on each of the \r's and \n's (so 4 matches, which is what I was thinking when I typed that).

There should only be 3 elements (2 matches).

@mitsuhiko

I'm not sure. The above behavior makes perfect sense if you go by Unicode's classification of newline characters. I find that behavior quite good because it means that matching with $ works in all newline environments. I know some people say files ending with \r are not common any more, but anyone who has ever worked with OS X knows that \r newlines are a thing of the present. I come across such files regularly.

@BatmanAoD
Member

BatmanAoD commented Mar 23, 2017

I don't believe you're interpreting that list of Unicode newline character-sequences correctly, because they list \r\n as a separate entry from \r. I.e., Unicode considers \r\n to be one single newline (as they must, since Windows is so widely used).

A common pattern is to use ^$ to find blank lines; this would give 100% false positives on Windows using the JavaScript behavior.
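
To see where the false positive comes from, here is a small sketch of `^$` when `\r` and `\n` are each treated as an independent line terminator. It only models those two characters, and the function name is hypothetical.

```rust
/// Offsets where `^$` matches if `\r` and `\n` are each treated as an
/// independent line terminator (the per-character behavior discussed above,
/// restricted to these two characters for brevity).
fn blank_lines_per_char(haystack: &[u8]) -> Vec<usize> {
    let is_term = |b: u8| b == b'\r' || b == b'\n';
    (0..=haystack.len())
        .filter(|&at| {
            let at_start = at == 0 || is_term(haystack[at - 1]);
            let at_end = at == haystack.len() || is_term(haystack[at]);
            at_start && at_end
        })
        .collect()
}
```

For `b"a\r\nb"`, which contains no blank line, this reports offset 2: the position between `\r` and `\n` looks like both a line start and a line end. The same text with `\n` endings reports nothing.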

@BatmanAoD
Member

Java does exactly the right thing:

Pattern p = Pattern.compile("^$", Pattern.MULTILINE);
Matcher m = p.matcher("1\r\r\n2\r\n\r\n3");
while (m.find()) { System.out.println("match at " + m.start() + ": " + m.group()); } 

This prints:

match at 2: 
match at 7: 

I.e., the first \r is treated as a newline, after which each group of \r\n together is a single newline.

@BatmanAoD
Member

BatmanAoD commented Mar 23, 2017

(To be more precise about what I mean by "the right thing": Java's behavior is, as far as I can tell, exactly equivalent to the behavior we'd get from implementing my proposal of treating \r like a newline all the time and \n like a newline except when preceded by \r. This behavior also matches what I would expect a regex engine to do, though as I've learned today, clearly many do not behave this way!)
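
One way to read that proposal is as a scanner over line terminators. The sketch below is an assumption about how it could look, with made-up function names. It collects terminator spans and then reports the empty lines that sit between two adjacent terminators.

```rust
/// Scans `haystack` and returns the byte span `(start, end)` of every line
/// terminator, following the proposal above: `\r` always starts a terminator,
/// and a `\n` directly after `\r` is absorbed into that same terminator.
fn line_terminators(haystack: &[u8]) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut i = 0;
    while i < haystack.len() {
        match haystack[i] {
            b'\r' if haystack.get(i + 1) == Some(&b'\n') => {
                spans.push((i, i + 2));
                i += 2;
            }
            b'\r' | b'\n' => {
                spans.push((i, i + 1));
                i += 1;
            }
            _ => i += 1,
        }
    }
    spans
}

/// Offsets of empty lines that sit between two adjacent terminators.
fn empty_lines_between(haystack: &[u8]) -> Vec<usize> {
    let spans = line_terminators(haystack);
    spans
        .windows(2)
        .filter(|pair| pair[0].1 == pair[1].0)
        .map(|pair| pair[0].1)
        .collect()
}
```

On the Java example's input `"1\r\r\n2\r\n\r\n3"`, the terminator spans are `(1, 2)`, `(2, 4)`, `(5, 7)` and `(7, 9)`, so the empty lines are at offsets 2 and 7, the same offsets the Java program prints. This sketch deliberately ignores empty lines at the very start or end of the input.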

@mitsuhiko

Unicode does not define control characters. It only has recommendations on newline handling, and those would be "convert from platform newline characters to LS or PS" and then back, which I think nobody does. So I think Unicode in itself is unclear on it. However, Unicode has guidelines on regular expressions:

These two things apply:

Line Boundaries
To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).

as well as

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

Note: For some implementations, there may be a performance impact in recognizing CRLF as a single entity, such as with an arbitrary pattern character ("."). To account for that, an implementation may also satisfy R1.6 if there is a mechanism available for converting the sequence CRLF to a single line boundary character before regex processing.
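
For readers who find the expression hard to parse, here is a sketch of the same check written by hand: given a position, return the byte length of the newline sequence that starts there, treating CRLF as one unit. The function name `newline_len` is made up, and this only illustrates the quoted expression; it is not an API of any regex library.

```rust
/// Returns the byte length of the newline sequence starting at byte offset
/// `at`, or `None` if no newline sequence starts there (or if `at` is out of
/// bounds or not on a character boundary). CRLF counts as a single sequence.
fn newline_len(haystack: &str, at: usize) -> Option<usize> {
    let rest = haystack.get(at..)?;
    if rest.starts_with("\r\n") {
        return Some(2);
    }
    let c = rest.chars().next()?;
    match c {
        // U+000A..U+000D (LF, VT, FF, CR), NEL, LINE SEPARATOR and
        // PARAGRAPH SEPARATOR, as in the character class quoted above.
        '\u{A}'..='\u{D}' | '\u{85}' | '\u{2028}' | '\u{2029}' => Some(c.len_utf8()),
        _ => None,
    }
}
```

Checking for `\r\n` first plays the role of the `(?!\u{D A})` lookahead: a `\r` that is followed by `\n` is never reported as a one-byte newline.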

With respect to the Java behavior from above, it yields this (in pseudocode):

> '1\n2\r3\r\n4'.split(/$/m)
['1', '\n2', '\r3', '\r\n4']

@jzabroski

@BurntSushi Have you read what we were planning to do in .NET? I was assigned to go code it but didn't get to it yet... we decided it would be best to follow the TR18: dotnet/runtime#25598 (comment)

@BurntSushi
Member

@jzabroski Yes. .NET is a backtracking based implementation, right? In that context, the implementation is far more flexible. UTS#18 is a bit of a tortured document, where the writers of it were clearly aware that some of their recommendations would be quite difficult to satisfy in the context of finite automata based regex engines, which is the case here.

In particular, I will definitely not be doing this:

  • Making $ Unicode-aware. If it was easy to do this, I'd do it. But it's not. Combined with the fact that I don't think I have ever seen anyone actually want a Unicode-aware $, I'm not really motivated by this. CRLF is different, because of Windows.
  • If I do add CRLF, it sounds like . needs to become [^\r\n]. I can't do anything more complex than that. And I note that the wording of UTS#18 around . is really really weird: "Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences." Like, huh? . is usually specified to not match a newline sequence. And the writing goes on to say that treating \r\n as a single unit is not necessary for conformance.

Otherwise, treating \r as a line-ending and \r\n as a single line ending seems like a strict subset of UTS#18.

@BatmanAoD
Member

BatmanAoD commented May 10, 2021

I think it's entirely reasonable to unconditionally consider \r to be the start of a line-ending, and handle the "\n following \r" situation as the actual special case. (That's basically my suggestion from back in 2017. As I mentioned then, Putty actually does use \r in isolation by default, so it's not an entirely obsolete form of newline.)

> it sounds like . needs to become [^\r\n].

That sounds right to me. As a user, I can't imagine a scenario where . failing to match \r would be more surprising than matching would be (outside of dot_matches_new_line mode, of course).

> "Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences." Like, huh? . is usually specified to not match a newline sequence.

I assume that's exactly why they phrased it the way they did, i.e., they're talking about implementing a "multiline" mode (mentioned above the list), such as this library's dot_matches_new_line. The requirement (as I read it) is that in this mode, . matches exactly the same things that $ would match.

@BurntSushi
Member

> I assume that's exactly why they phrased it the way they did, i.e., they're talking about implementing a "multiline" mode (mentioned above the list), such as this library's dot_matches_new_line. The requirement (as I read it) is that in this mode, . matches exactly the same things that $ would match.

Ah I see. Yeah, in that case, . wouldn't match \r\n, which is what UTS#18 seems to want.

@BurntSushi
Member

OK, I'm happy to report that this feature should land once #656 is complete. I have it working. Would folks like to review the docs for it?

    /// Enable or disable the "CRLF mode" flag by default.
    ///
    /// By default this is disabled. It may alternatively be selectively
    /// enabled in the regular expression itself via the `R` flag.
    ///
    /// When CRLF mode is enabled, the following happens:
    ///
    /// * Unless `dot_matches_new_line` is enabled, `.` will match any character
    /// except for `\r` and `\n`.
    /// * When `multi_line` mode is enabled, `^` and `$` will treat `\r\n`,
    /// `\r` and `\n` as line terminators. And in particular, neither will
    /// match between a `\r` and a `\n`.

The key things to note here are:

  • CRLF mode is opt-in.
  • You'll be able to opt-in via RegexBuilder::crlf, or by using the new R inline flag.
  • \r on its own is treated as a line terminator, as suggested by @BatmanAoD above. The key trick is that ^ and $ won't match between a \r and a \n. (I would ideally rather not treat \r as a line terminator unto itself, but it's just not feasible to do.)
  • In CRLF mode, . becomes [^\r\n] instead of [^\n].
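
The effect on `.` can be summarized as a small truth table. The sketch below models it as a plain function; the function names are hypothetical, and the real engine compiles `.` into a character class rather than calling anything like this.

```rust
/// Whether `.` matches `c` for a given combination of flags. `crlf`
/// corresponds to the `R` flag and `dot_matches_new_line` to the `s` flag.
fn dot_matches(c: char, crlf: bool, dot_matches_new_line: bool) -> bool {
    if dot_matches_new_line {
        return true;
    }
    if crlf {
        c != '\r' && c != '\n'
    } else {
        c != '\n'
    }
}

/// Counts how many characters of `haystack` a lone `.` would match.
fn count_dot_matches(haystack: &str, crlf: bool, dot_matches_new_line: bool) -> usize {
    haystack
        .chars()
        .filter(|&c| dot_matches(c, crlf, dot_matches_new_line))
        .count()
}
```

For the haystack `"\r\n\r\n\r\n"`, this gives 3 matches without CRLF mode (each `\r`), 0 matches with CRLF mode, and 6 matches once `dot_matches_new_line` is enabled.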

For a deeper dive, here's a smattering of tests showing CRLF mode semantics:

# This is a basic test that checks ^ and $ treat \r\n as a single line
# terminator. If ^ and $ only treated \n as a line terminator, then this would
# only match 'xyz' at the end of the haystack.
[[test]]
name = "basic"
regex = '(?mR)^[a-z]+$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 3], [5, 8], [10, 13]]

# Tests that a CRLF-aware '^$' assertion does not match between CR and LF.
[[test]]
name = "start-end-non-empty"
regex = '(?mR)^$'
haystack = "abc\r\ndef\r\nxyz"
matches = []

# Tests that a CRLF-aware '^$' assertion matches the empty string, just like
# a non-CRLF-aware '^$' assertion.
[[test]]
name = "start-end-empty"
regex = '(?mR)^$'
haystack = ""
matches = [[0, 0]]

# Tests that a CRLF-aware '^$' assertion matches the empty string preceding
# and following a line terminator.
[[test]]
name = "start-end-before-after"
regex = '(?mR)^$'
haystack = "\r\n"
matches = [[0, 0], [2, 2]]

# Tests that a CRLF-aware '^' assertion does not split a line terminator.
[[test]]
name = "start-no-split"
regex = '(?mR)^'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 0], [5, 5], [10, 10]]

# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "start-no-split-adjacent"
regex = '(?mR)^'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]

# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "start-no-split-adjacent-cr"
regex = '(?mR)^'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "start-no-split-adjacent-lf"
regex = '(?mR)^'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Tests that a CRLF-aware '$' assertion does not split a line terminator.
[[test]]
name = "end-no-split"
regex = '(?mR)$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[3, 3], [8, 8], [13, 13]]

# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "end-no-split-adjacent"
regex = '(?mR)$'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]

# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "end-no-split-adjacent-cr"
regex = '(?mR)$'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "end-no-split-adjacent-lf"
regex = '(?mR)$'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Tests that '.' does not match either \r or \n when CRLF mode is enabled. Note
# that this doesn't require multi-line mode to be enabled.
[[test]]
name = "dot-no-crlf"
regex = '(?R).'
haystack = "\r\n\r\n\r\n"
matches = []
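
The `^` and `$` rows above can be checked against a hand-written model of the two assertions. This is a sketch under the semantics stated in the docs above (`\r\n`, `\r` and `\n` are terminators, and neither assertion matches between `\r` and `\n`). The function names are made up and this is not presented as the crate's code.

```rust
/// `(?mR)^`: true at the start of the haystack, after `\n`, or after a `\r`
/// that is not immediately followed by `\n`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r'
            && (at == haystack.len() || haystack[at] != b'\n'))
}

/// `(?mR)$`: true at the end of the haystack, before `\r`, or before a `\n`
/// that is not immediately preceded by `\r`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}

/// All offsets in `haystack` at which `pred` holds.
fn offsets(haystack: &[u8], pred: fn(&[u8], usize) -> bool) -> Vec<usize> {
    (0..=haystack.len()).filter(|&at| pred(haystack, at)).collect()
}
```

For example, `offsets(b"\r\n\r\n\r\n", is_start_crlf)` gives `[0, 2, 4, 6]` and `offsets(b"abc\r\ndef\r\nxyz", is_end_crlf)` gives `[3, 8, 13]`, in line with the `start-no-split-adjacent` and `end-no-split` rows.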

It was quite gnarly to add, and in so doing, I actually uncovered a bug in the lazy DFA (present in both the status quo and my rewrite):

# A variant of the problem described here:
# https://github.com/google/re2/blob/89567f5de5b23bb5ad0c26cbafc10bdc7389d1fa/re2/dfa.cc#L658-L667
[[test]]
name = "alt-with-assertion-repetition"
regex = '(?:\b|%)+'
haystack = "z%"
bounds = [1, 2]
anchored = true
# This was reporting [1, 2], which is
# not the correct leftmost-first match.
matches = [[1, 1]]

BurntSushi added a commit that referenced this issue Mar 2, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added 14 more commits with this same message that referenced this issue between Mar 4, 2023 and Apr 17, 2023
BurntSushi added a commit that referenced this issue Jul 5, 2023
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan
    for `@`.
-   [PERF #787](rust-lang/regex#787),
    [PERF #891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>