Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

^$ matches a phantom line at end of file #441

Closed
BurntSushi opened this issue Apr 8, 2017 · 15 comments · Fixed by #1017
Closed

^$ matches a phantom line at end of file #441

BurntSushi opened this issue Apr 8, 2017 · 15 comments · Fixed by #1017
Labels
bug A bug. help wanted Others are encouraged to work on this issue.
Milestone

Comments

@BurntSushi
Copy link
Owner

This is simple to demonstrate. Note the behavior difference between GNU grep and ripgrep:

$ cat scratch 
a
b

c


d
$ rg '^$' scratch
3:
5:
6:
8:
$ egrep -n '^$' scratch
3:
5:
6:

The ^$ pattern is a useful idiom for detecting empty lines in a file. ripgrep's behavior in particular outputs a line numbered 8, even though there are only 7 lines in the file:

$ wc -l scratch
7 scratch

More generally, ripgrep should never emit a line with a number greater than the number of lines in the file, even if the file ends with a trailing \n.

See also #416 and rust-lang/regex#355 (comment) for the root cause

@BurntSushi
Copy link
Owner Author

cc @BatmanAoD

@BatmanAoD
Copy link

Thanks for filing this. I agree that it's reasonable to keep the regex crate behavior.

(Also, I know you asked for example use-cases for this, and unfortunately, although I know I've used something like ack '^$' in the past, I can't actually remember the specifics. I did find some StackOverflow questions with numerous upvotes about how to search for empty lines, but I didn't see anyone discussing specifically why they wanted to do that. This may not be a common enough use-case to be a high-priority bug.)

@BurntSushi
Copy link
Owner Author

@BatmanAoD Right. When I asked that, I hadn't thought of "match empty lines." Once I spent a couple of minutes thinking, I remembered it. That's a strong enough use case as far as I'm concerned. :-)

@BurntSushi
Copy link
Owner Author

(I wouldn't expect someone to care about matching empty lines necessarily, but I could see someone being interested in counting them.)

@blankname
Copy link

blankname commented Apr 10, 2017

I use matching empty lines in conjunction with the --invert-match option to filter out empty lines when piping commands.

@BatmanAoD
Copy link

@blankname That actually shouldn't be a problem! You'll "erroneously" filter out a phantom line--which should have no effect.

@BurntSushi Counting seems like an interesting application indeed, but as soon as I thought about that I thought "isn't that just a sort of crumby, broken loc?"

@BurntSushi
Copy link
Owner Author

I've never heard of, or used, loc. :-)

@BatmanAoD
Copy link

loc and tokei are Rust clones of cloc, which counts the lines of code in a project and produces a nice little table where the rows are different languages (autodetected) and the columns are things like blank lines, comments, and actual code. I've used cloc a lot more than loc (it's a lot older), but I figured loc would be the right one to mention here!

@BurntSushi
Copy link
Owner Author

BurntSushi commented May 8, 2017

Here are a pair of regression tests:

// See: /~https://github.com/BurntSushi/ripgrep/issues/441
clean!(regression_441_mmap, "^$", "test", |wd: WorkDir, mut cmd: Command| {
    wd.create("test", "a\nb\n\nc\n\n\nd\n");
    cmd.arg("--mmap");

    let lines: String = wd.stdout(&mut cmd);
    assert_eq!(lines, "\n\n\n");
});

// See: /~https://github.com/BurntSushi/ripgrep/issues/441
clean!(regression_441_no_mmap, "^$", "test", |wd: WorkDir, mut cmd: Command| {
    wd.create("test", "a\nb\n\nc\n\n\nd\n");
    cmd.arg("--no-mmap");

    let lines: String = wd.stdout(&mut cmd);
    assert_eq!(lines, "\n\n\n");
});

Interestingly, the output of the mmap test is \n\n\n\n and the output of the no_mmap test is \n\n\n\n\n.

@BurntSushi BurntSushi added the help wanted Others are encouraged to work on this issue. label May 8, 2017
@BurntSushi
Copy link
Owner Author

And also a more precise comparison with grep:

$ cat scratch
a
b

c


d
$ xxd scratch
00000000: 610a 620a 0a63 0a0a 0a64 0a              a.b..c...d.
$ grep '^$' scratch | xxd
00000000: 0a0a 0a                                  ...
$ rg --mmap '^$' scratch | xxd
00000000: 0a0a 0a0a                                ....
$ rg --no-mmap '^$' scratch | xxd
00000000: 0a0a 0a0a 0a                             .....

BurntSushi added a commit that referenced this issue Oct 22, 2017
This fixes a bug where a "match" color escape was erroneously emitted
after the new line character. This is because `^` is actually allowed to
match after the end of a trailing new line, which means `^$` matches both
before and after the trailing new line when multiline mode is enabled.
The trailing match was causing the phantom escape sequence to appear,
which we don't want.

Incidentally, this is the root cause of #441 as well, although this commit
doesn't fix that issue, since the line itself is printed before we detect
the phantom match.

Fixes #599
deg4uss3r pushed a commit to deg4uss3r/ripgrep that referenced this issue Oct 23, 2017
@BurntSushi BurntSushi added this to the libripgrep milestone Jun 3, 2018
@BurntSushi
Copy link
Owner Author

I believe this bug should be fixed in libripgrep.

@BatmanAoD
Copy link

"Should" meaning "ought to be"? Since the libripgrep issue itself states that libripgrep won't be a proper "thing unto itself", do you have an idea of which specific subcomponent should handle this?

@BurntSushi
Copy link
Owner Author

libripgrep is a complete rewrite of the search components of ripgrep. As I progress through that rewrite, I'm looking at bugs in this space that I can fix.

The libripgrep milestone is tracked here: /~https://github.com/BurntSushi/ripgrep/issues?q=is%3Aopen+is%3Aissue+milestone%3Alibripgrep

Since the libripgrep issue itself states that libripgrep won't be a proper "thing unto itself"

This just means that there isn't going to be a singular ripgrep library. There will be many. The grep crate will be the focal point.

@BatmanAoD
Copy link

Ah, okay, I thought it was more of a refactoring/splitting than a complete rewrite.

Might #416 (or even a partial solution like #929) be another candidate for fixing in the rewrite?

@BurntSushi
Copy link
Owner Author

Unlikely. The fix for #416 is to fix the regex engine.

Ah, okay, I thought it was more of a refactoring/splitting than a complete rewrite.

They are both true. I don't think the distinction matters much.

BurntSushi added a commit that referenced this issue Aug 19, 2018
This commit updates the CHANGELOG to reflect all the work done to make
libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)

* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)
BurntSushi added a commit that referenced this issue Aug 20, 2018
This commit updates the CHANGELOG to reflect all the work done to make
libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)

* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. help wanted Others are encouraged to work on this issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants