Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

Closed
wernerk-bbd opened this issue Jan 9, 2025 · 5 comments
Labels
bug Something isn't working patch A patch to fix an issue

Comments

@wernerk-bbd
Copy link

echo "The Quick Brown Fox asdf\nasdf TQBF asdf\nasdf TWBF" | ug -i 'the.quick.brown|TWBF|TQBF'

 1: The Quick Brown Fox asdf

Seems it happens when the first character is the same and -i is used

echo "The Quick Brown Fox asdf\nasdf TQBF asdf\nasdf TWBF" | ug -i 'the.quick.brown|WBF|QBF'

1: The Quick Brown Fox asdf
2: asdf TQBF asdf
3: asdf TWBF
@genivia-inc
Copy link
Member

genivia-inc commented Jan 9, 2025

Thank you for your feedback.

This is strange and should never happen, given our extensive unit and regression test set.

Looking at the regex pattern DFA internals using RE/flex tools, there is something odd with the graph for the regex pattern the.quick.brown|TWBF|TQBF with option -i:

image

However, it is a problem that happens under very specific circumstances. If we add spaces to the disjunctions like so 'the.quick.brown| TWBF| TQBF' then matching is fine. Also with case-insensitive matching using the (?i) modifier is fine (?i)the.quick.brown|TWBF|TQBF. That should be the same as using option -i. So the only explanation is that the application of -i affects the pattern negatively, before we match the pattern. There is some issue with parsing the regex pattern in specific cases like this where case-insensitive option -i or (?i) are not applied properly to the first character T of the TWBF and TQBF parts, i.e. it is matched as a lower case t. This only happens when strings are used like TWBF and not regex like TWB.. For example, the.quick.brown|TWB.|TQB. matches fine with -i.

I'll get to the bottom of this and fix it asap.

PS. Edited for clarity.

@genivia-inc
Copy link
Member

OK. found the problem in internal DFA-merging of "string-like" regex patterns (e.g. TWBF that have no regex operators) and when the initial character is the same (e.g. t and T in this case) and when case-insensitive matching is used.

@genivia-inc
Copy link
Member

genivia-inc commented Jan 9, 2025

Looks like I goofed up. For case-insensitive patterns in some rare special cases when regex patterns are combined with "string-like" patterns that overlap at or from the start with the regex pattern, then we may encounter this problem.

It's right here in ugrep/lib/patterns.cpp in the ! orange loop:

*** 1756,1773 ****
        {
          // combine the tree DFA transitions with the regex DFA transition moves
          Chars chars;
-         for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
-           chars.add(t->first);
          if (opt_.i)
          {
!           for (DFA::State::Edges::iterator t = state->tnode->edges.find('a'); t != state->tnode->edges.end(); ++t)
            {
              Char c = t->first;
!             if (c > 'z')
!               break;
!             chars.add(uppercase(c));
            }
          }
          Moves::iterator i = moves.begin();
          Positions pos;
          while (i != moves.end())

The loop starts with state->tnode->edges.find('a') which assumes that transitions are on a but this can be any lower case letter like a t. This means that a T is not included as valid for the "string-like" patterns. This wasn't flagged by our existing tests that all have an a or an A in these specific pattern test cases, sigh.

The sanitized patch is:

--- 1759,1779 ----
        {
          // combine the tree DFA transitions with the regex DFA transition moves
          Chars chars;
          if (opt_.i)
          {
!           for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
            {
              Char c = t->first;
!             chars.add(c);
!             if (c >= 'a' && c <= 'z')
!               chars.add(uppercase(c));
            }
          }
+         else
+         {
+           for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
+             chars.add(t->first);
+         }
          Moves::iterator i = moves.begin();
          Positions pos;
          while (i != moves.end())
***************

A lot of automated and also manual testing goes into verifying ugrep. We actually have unit tests to test this combination (and more) in RE/flex (the ugrep regex engine), but these unit tests use sub-patterns that start with a or A and those letters do not trigger the bug! The bug may occur when case-insensitive pattern matching is used when one ore more "string-like" sub-patterns are specified that overlap from the start of their sub-string-pattern with one of the "pure" regex sub-patterns, which causes an optimization to misfire so that the additional "string-like" sub-patterns do not match the upper case occurrences in the input.

@genivia-inc genivia-inc added bug Something isn't working patch A patch to fix an issue labels Jan 9, 2025
@genivia-inc
Copy link
Member

I've committed patch 5be38e8 and will release an update soon after our battery of tests completed.

genivia-inc added a commit that referenced this issue Jan 9, 2025
minor update for #451 and year 2025
@wernerk-bbd
Copy link
Author

Great, thanks Doctor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working patch A patch to fix an issue
Projects
None yet
Development

No branches or pull requests

2 participants