ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

wernerk-bbd · 2025-01-09T07:20:54Z

echo "The Quick Brown Fox asdf\nasdf TQBF asdf\nasdf TWBF" | ug -i 'the.quick.brown|TWBF|TQBF'

 1: The Quick Brown Fox asdf

Seems it happens when the first character is the same and -i is used

echo "The Quick Brown Fox asdf\nasdf TQBF asdf\nasdf TWBF" | ug -i 'the.quick.brown|WBF|QBF'

1: The Quick Brown Fox asdf
2: asdf TQBF asdf
3: asdf TWBF

The text was updated successfully, but these errors were encountered:

genivia-inc · 2025-01-09T13:55:35Z

Thank you for your feedback.

This is strange and should never happen, given our extensive unit and regression test set.

Looking at the regex pattern DFA internals using RE/flex tools, there is something odd with the graph for the regex pattern the.quick.brown|TWBF|TQBF with option -i:

However, it is a problem that happens under very specific circumstances. If we add spaces to the disjunctions like so 'the.quick.brown| TWBF| TQBF' then matching is fine. Also with case-insensitive matching using the (?i) modifier is fine (?i)the.quick.brown|TWBF|TQBF. That should be the same as using option -i. So the only explanation is that the application of -i affects the pattern negatively, before we match the pattern. There is some issue with parsing the regex pattern in specific cases like this where case-insensitive option -i or (?i) are not applied properly to the first character T of the TWBF and TQBF parts, i.e. it is matched as a lower case t. This only happens when strings are used like TWBF and not regex like TWB.. For example, the.quick.brown|TWB.|TQB. matches fine with -i.

I'll get to the bottom of this and fix it asap.

PS. Edited for clarity.

genivia-inc · 2025-01-09T15:34:19Z

OK. found the problem in internal DFA-merging of "string-like" regex patterns (e.g. TWBF that have no regex operators) and when the initial character is the same (e.g. t and T in this case) and when case-insensitive matching is used.

genivia-inc · 2025-01-09T17:43:30Z

Looks like I goofed up. For case-insensitive patterns in some rare special cases when regex patterns are combined with "string-like" patterns that overlap at or from the start with the regex pattern, then we may encounter this problem.

It's right here in ugrep/lib/patterns.cpp in the ! orange loop:

*** 1756,1773 ****
        {
          // combine the tree DFA transitions with the regex DFA transition moves
          Chars chars;
-         for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
-           chars.add(t->first);
          if (opt_.i)
          {
!           for (DFA::State::Edges::iterator t = state->tnode->edges.find('a'); t != state->tnode->edges.end(); ++t)
            {
              Char c = t->first;
!             if (c > 'z')
!               break;
!             chars.add(uppercase(c));
            }
          }
          Moves::iterator i = moves.begin();
          Positions pos;
          while (i != moves.end())

The loop starts with state->tnode->edges.find('a') which assumes that transitions are on a but this can be any lower case letter like a t. This means that a T is not included as valid for the "string-like" patterns. This wasn't flagged by our existing tests that all have an a or an A in these specific pattern test cases, sigh.

The sanitized patch is:

--- 1759,1779 ----
        {
          // combine the tree DFA transitions with the regex DFA transition moves
          Chars chars;
          if (opt_.i)
          {
!           for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
            {
              Char c = t->first;
!             chars.add(c);
!             if (c >= 'a' && c <= 'z')
!               chars.add(uppercase(c));
            }
          }
+         else
+         {
+           for (DFA::State::Edges::iterator t = state->tnode->edges.begin(); t != state->tnode->edges.end(); ++t)
+             chars.add(t->first);
+         }
          Moves::iterator i = moves.begin();
          Positions pos;
          while (i != moves.end())
***************

A lot of automated and also manual testing goes into verifying ugrep. We actually have unit tests to test this combination (and more) in RE/flex (the ugrep regex engine), but these unit tests use sub-patterns that start with a or A and those letters do not trigger the bug! The bug may occur when case-insensitive pattern matching is used when one ore more "string-like" sub-patterns are specified that overlap from the start of their sub-string-pattern with one of the "pure" regex sub-patterns, which causes an optimization to misfire so that the additional "string-like" sub-patterns do not match the upper case occurrences in the input.

genivia-inc · 2025-01-09T17:55:12Z

I've committed patch 5be38e8 and will release an update soon after our battery of tests completed.

minor update for #451 and year 2025

wernerk-bbd · 2025-01-10T08:53:55Z

Great, thanks Doctor

genivia-inc added bug Something isn't working patch A patch to fix an issue labels Jan 9, 2025

genivia-inc closed this as completed in 5be38e8 Jan 9, 2025

genivia-inc added a commit that referenced this issue Jan 9, 2025

released 7.1.3

f07d4a6

minor update for #451 and year 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

wernerk-bbd commented Jan 9, 2025

genivia-inc commented Jan 9, 2025 •

edited

Loading

genivia-inc commented Jan 9, 2025

genivia-inc commented Jan 9, 2025 •

edited

Loading

genivia-inc commented Jan 9, 2025

wernerk-bbd commented Jan 10, 2025

ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

ugrep doesn't match a disdunction when a wildcard is present in one component, but not the other in some circumstances #451

Comments

wernerk-bbd commented Jan 9, 2025

genivia-inc commented Jan 9, 2025 • edited Loading

genivia-inc commented Jan 9, 2025

genivia-inc commented Jan 9, 2025 • edited Loading

genivia-inc commented Jan 9, 2025

wernerk-bbd commented Jan 10, 2025

genivia-inc commented Jan 9, 2025 •

edited

Loading

genivia-inc commented Jan 9, 2025 •

edited

Loading