Allow :contains() to have lists of text to match (#119)
facelessuser authored Feb 25, 2019
1 parent fbf98e4 commit 7774795
Showing 10 changed files with 113 additions and 23 deletions.
1 change: 1 addition & 0 deletions docs/src/markdown/_snippets/links.txt
@@ -1,5 +1,6 @@
 [aspell]: https://github.com/GNUAspell/aspell
 [bs4]: https://beautiful-soup-4.readthedocs.io/en/latest/#
+[contains-draft]: https://www.w3.org/TR/2001/CR-css3-selectors-20011113/#content-selectors
 [custom-extensions-1]: https://drafts.csswg.org/css-extensions-1/
 [html5lib]: https://github.com/html5lib/html5lib-python
 [lxml]: https://github.com/lxml/lxml
4 changes: 3 additions & 1 deletion docs/src/markdown/about/changelog.md
@@ -1,8 +1,10 @@
 # Changelog

-## Latest
+## 1.9.0

+- **NEW**: Allow `:contains()` to accept a list of text to search for.
 - **FIX**: Don't install test files when installing the `soupsieve` package.
+- **FIX**: Improve efficiency of `:contains()` comparison.

 ## 1.8.0

16 changes: 15 additions & 1 deletion docs/src/markdown/about/development.md
@@ -239,7 +239,7 @@ Attribute | Description
 `selectors` | Contains a tuple of `SelectorList` objects for each pseudo-class selector part of the compound selector: `#!css :is()`, `#!css :not()`, `#!css :has()`, etc.
 `relation` | This will contain a `SelectorList` object with one `Selector` object, which could in turn chain an additional relation depending on the complexity of the compound selector. For instance, `div > p + a` would be a `Selector` for `a` that contains a `relation` for `p` (another `SelectorList` object) which also contains a relation of `div`. When matching, we would match that the tag is `a`, and then walk its relation chain verifying that they all match. In this case, the relation chain would be a direct, previous sibling of `p`, which has a direct parent of `div`. A `:has()` pseudo-class would walk this in the opposite order. `div:has(> p + a)` would verify `div`, and then check for a child of `p` with a sibling of `a`.
 `rel_type` | `rel_type` is attached to relational selectors. In the case of `#!css div > p + a`, the relational selectors of `div` and `p` would get a relational type of `>` and `+` respectively. `:has()` relational `rel_type` are preceded with `:` to signify a forward looking relation.
-`contains` | Contains a tuple of strings of content to match in an element.
+`contains` | Contains a tuple of [`SelectorContains`](#selectorcontains) objects. Each object contains the list of text to match an element's content against.
 `lang` | Contains a tuple of [`SelectorLang`](#selectorlang) objects.
 `flags` | Selector flags that are used to signal that a type of selector is present.

@@ -288,6 +288,20 @@ Attribute | Description
 `pattern` | Contains a `re` regular expression object that matches the desired attribute value.
 `xml_type_pattern` | As the default `type` pattern is case insensitive, when the attribute value is `type` and a case sensitivity has not been explicitly defined, a secondary case sensitive `type` pattern is compiled for use with XML documents when detected.

+### `SelectorContains`
+
+```py3
+class SelectorContains:
+    """Selector contains rule."""
+
+    def __init__(self, text):
+        """Initialize."""
+```
+
+Attribute | Description
+------------------- | -----------
+`text` | A tuple of acceptable text that an element should match. An element only needs to match at least one.
+
 ### `SelectorNth`

 ```py3
7 changes: 6 additions & 1 deletion docs/src/markdown/selectors.md
@@ -412,10 +412,15 @@ Selects any `#!html <input type="radio"/>`, `#!html <input type="checkbox"/>`, o
 input:checked
 ```

-### `:contains`<span class="star badge"></span> {:#:contains}
+### `:contains()`<span class="star badge"></span> {:#:contains}

 Selects elements that contain the provided text. The text can be found in either the element itself or its descendants.

+`:contains()` was originally included in an [early CSS draft][contains-draft], but was ultimately dropped from it.
+Soup Sieve implements it as originally proposed in the draft, with the addition that `:contains()` can accept
+either a single value or a comma-separated list of values. An element needs to match only one of the items
+in the comma-separated list to be considered matching.
+
 !!! warning "Contains"
     `:contains()` is an expensive operation as it scans all the text nodes of an element under consideration, which
     includes all descendants. Using highly specific selectors can reduce how often it is evaluated.
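
The rule documented in this hunk (text may come from the element or any of its descendants, and one matching item in the list is enough) can be illustrated without Soup Sieve. The following is a minimal sketch using the standard library's `xml.etree.ElementTree` as a stand-in document model; `contains_any` is a hypothetical helper, not part of Soup Sieve's API, and it ignores CSS escapes.

```python
import xml.etree.ElementTree as ET


def contains_any(element, candidates):
    """Return True if the element's own or descendant text includes any candidate."""
    # itertext() walks the element and all of its descendants in document order,
    # which is why the check gets more expensive on large subtrees.
    content = "".join(element.itertext())
    return any(text in content for text in candidates)


root = ET.fromstring(
    "<div><p>Testing <span>that</span> text</p><p>Other content</p></div>"
)
first, second = root.findall("p")

# Roughly analogous to `p:contains("does not exist", "that")`.
print(contains_any(first, ("does not exist", "that")))
print(contains_any(second, ("does not exist", "that")))
```

Because descendant text counts, the enclosing `div` in this sketch would also match anything its children match, which is the behavior the warning about cost refers to.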
2 changes: 1 addition & 1 deletion soupsieve/__meta__.py
@@ -186,5 +186,5 @@ def parse_version(ver, pre=False):
     return Version(major, minor, micro, release, pre, post, dev)


-__version_info__ = Version(1, 8, 1, ".dev")
+__version_info__ = Version(1, 9, 0, ".dev")
 __version__ = __version_info__._get_canonical()
13 changes: 10 additions & 3 deletions soupsieve/css_match.py
@@ -811,10 +811,17 @@ def match_contains(self, el, contains):
         """Match element if it contains text."""

         match = True
-        for c in contains:
-            if c not in self.get_text(el):
+        content = None
+        for contain_list in contains:
+            if content is None:
+                content = self.get_text(el)
+            found = False
+            for text in contain_list.text:
+                if text in content:
+                    found = True
+                    break
+            if not found:
                 match = False
-                break
         return match

     def match_default(self, el):
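
The logic in this hunk can be read as: every `:contains()` group must be satisfied, a group is satisfied when any one of its texts is found, and the element text is fetched at most once. Here is a standalone sketch of that idea. The `SelectorContains` namedtuple and the `get_text` callable are hypothetical stand-ins for the real class and for `self.get_text(el)`.

```python
from collections import namedtuple

# Hypothetical stand-in for the real `SelectorContains` class.
SelectorContains = namedtuple("SelectorContains", ["text"])


def match_contains(get_text, contains):
    """Return True if every group has at least one text found in the content."""
    content = None
    for contain_list in contains:
        if content is None:
            # Fetch the text lazily, and only once per element.
            content = get_text()
        if not any(text in content for text in contain_list.text):
            # Simplification: this sketch returns at the first failing group.
            return False
    return True


calls = []


def get_text():
    """Record each call so the single fetch can be observed."""
    calls.append(1)
    return "Testing that"


# Roughly `:contains("th"):contains("does not exist", "at")`.
groups = (SelectorContains(("th",)), SelectorContains(("does not exist", "at")))
result = match_contains(get_text, groups)
print(result, len(calls))
```

Caching the text in a local variable is what avoids re-scanning the element's text nodes for each group.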
30 changes: 18 additions & 12 deletions soupsieve/css_parser.py
@@ -149,13 +149,13 @@
 \({ws}*(?P<nth_type>{nth}|even|odd)){ws}*\)
 '''.format(ws=WSC, nth=NTH)
 # Pseudo class language (`:lang("*-de", en)`)
-PAT_PSEUDO_LANG = r':lang\({ws}*(?P<lang>{value}(?:{ws}*,{ws}*{value})*){ws}*\)'.format(ws=WSC, value=VALUE)
+PAT_PSEUDO_LANG = r':lang\({ws}*(?P<values>{value}(?:{ws}*,{ws}*{value})*){ws}*\)'.format(ws=WSC, value=VALUE)
 # Pseudo class direction (`:dir(ltr)`)
 PAT_PSEUDO_DIR = r':dir\({ws}*(?P<dir>ltr|rtl){ws}*\)'.format(ws=WSC)
 # Combining characters (`>`, `~`, ` `, `+`, `,`)
 PAT_COMBINE = r'{wsc}*?(?P<relation>[,+>~]|{ws}(?![,+>~])){wsc}*'.format(ws=WS, wsc=WSC)
 # Extra: Contains (`:contains(text)`)
-PAT_PSEUDO_CONTAINS = r':contains\({ws}*(?P<value>{value}){ws}*\)'.format(ws=WSC, value=VALUE)
+PAT_PSEUDO_CONTAINS = r':contains\({ws}*(?P<values>{value}(?:{ws}*,{ws}*{value})*){ws}*\)'.format(ws=WSC, value=VALUE)

 # Regular expressions
 # CSS escape pattern
@@ -166,8 +166,8 @@
     r'(?P<s1>[-+])?(?P<a>[0-9]+n?|n)(?:(?<=n){ws}*(?P<s2>[-+]){ws}*(?P<b>[0-9]+))?'.format(ws=WSC),
     re.I
 )
-# Pattern to iterate multiple languages.
-RE_LANG = re.compile(r'(?:(?P<value>{value})|(?P<split>{ws}*,{ws}*))'.format(ws=WSC, value=VALUE), re.X)
+# Pattern to iterate multiple values.
+RE_VALUES = re.compile(r'(?:(?P<value>{value})|(?P<split>{ws}*,{ws}*))'.format(ws=WSC, value=VALUE), re.X)
 # Whitespace checks
 RE_WS = re.compile(WS)
 RE_WS_BEGIN = re.compile('^{}*'.format(WSC))
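
To see what the widened `PAT_PSEUDO_CONTAINS` accepts, the sketch below rebuilds the pattern with simplified stand-ins for `WSC` and `VALUE`. Those two definitions are not part of this diff, so the versions here are assumptions: whitespace only (no comments), and a value that is either a quoted string or a run of word characters and hyphens. The helper `contains_values` is hypothetical.

```python
import re

# Assumed, simplified building blocks (the real ones live elsewhere in css_parser.py).
WSC = r"[ \t\r\n\f]"
DOUBLE_QUOTED = r'"(?:\\.|[^\\"])*"'
SINGLE_QUOTED = r"'(?:\\.|[^\\'])*'"
IDENT = r"[-\w]+"
VALUE = "(?:{}|{}|{})".format(DOUBLE_QUOTED, SINGLE_QUOTED, IDENT)

# Same shape as the new pattern: one value, then zero or more `, value` repeats.
PAT_PSEUDO_CONTAINS = (
    r':contains\({ws}*(?P<values>{value}(?:{ws}*,{ws}*{value})*){ws}*\)'
).format(ws=WSC, value=VALUE)


def contains_values(selector):
    """Return the raw `values` group, or None if the selector does not match."""
    m = re.fullmatch(PAT_PSEUDO_CONTAINS, selector)
    return m.group("values") if m else None


print(contains_values(':contains("does not exist", "that")'))
print(contains_values(":contains(text)"))
print(contains_values(':contains("a" "b")'))
```

The single-value form still matches because the repeated `, value` group is optional, while two values without a separating comma are rejected.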
@@ -751,21 +751,27 @@ def parse_class_id(self, sel, m, has_selector):
     def parse_pseudo_contains(self, sel, m, has_selector):
         """Parse contains."""

-        content = m.group('value')
-        if content.startswith(("'", '"')):
-            content = css_unescape(content[1:-1], True)
-        else:
-            content = css_unescape(content)
-        sel.contains.append(content)
+        values = m.group('values')
+        patterns = []
+        for token in RE_VALUES.finditer(values):
+            if token.group('split'):
+                continue
+            value = token.group('value')
+            if value.startswith(("'", '"')):
+                value = css_unescape(value[1:-1], True)
+            else:
+                value = css_unescape(value)
+            patterns.append(value)
+        sel.contains.append(ct.SelectorContains(tuple(patterns)))
         has_selector = True
         return has_selector

     def parse_pseudo_lang(self, sel, m, has_selector):
         """Parse pseudo language."""

-        lang = m.group('lang')
+        values = m.group('values')
         patterns = []
-        for token in RE_LANG.finditer(lang):
+        for token in RE_VALUES.finditer(values):
             if token.group('split'):
                 continue
             value = token.group('value')
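
The tokenizing loop shared by `parse_pseudo_contains` and `parse_pseudo_lang` can be tried in isolation. This sketch again assumes simplified `WSC` and `VALUE` definitions, and it only strips the surrounding quotes where the real parser calls `css_unescape`, so CSS escapes are not resolved here. `split_values` is a hypothetical name.

```python
import re

# Assumed, simplified building blocks (not the real definitions).
WSC = r"[ \t\r\n\f]"
VALUE = "(?:{}|{}|{})".format(
    r'"(?:\\.|[^\\"])*"',   # double-quoted string
    r"'(?:\\.|[^\\'])*'",   # single-quoted string
    r"[-\w]+",              # bare identifier
)
RE_VALUES = re.compile(
    r'(?:(?P<value>{value})|(?P<split>{ws}*,{ws}*))'.format(ws=WSC, value=VALUE)
)


def split_values(values):
    """Split a comma-separated value list into a tuple of unquoted strings."""
    patterns = []
    for token in RE_VALUES.finditer(values):
        if token.group("split"):
            # Separator token: nothing to collect.
            continue
        value = token.group("value")
        if value.startswith(("'", '"')):
            # Simplification: drop the quotes only; escapes are left untouched.
            value = value[1:-1]
        patterns.append(value)
    return tuple(patterns)


print(split_values('"does not exist", "that"'))
print(split_values("one , 'two, three'"))
```

Because a quoted string is matched as a single value token, a comma inside quotes does not split the list.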
15 changes: 15 additions & 0 deletions soupsieve/css_types.py
@@ -7,6 +7,7 @@
     'SelectorNull',
     'SelectorTag',
     'SelectorAttribute',
+    'SelectorContains',
     'SelectorNth',
     'SelectorLang',
     'SelectorList',
@@ -234,6 +235,19 @@ def __init__(self, attribute, prefix, pattern, xml_type_pattern):
         )


+class SelectorContains(Immutable):
+    """Selector contains rule."""
+
+    __slots__ = ("text", "_hash")
+
+    def __init__(self, text):
+        """Initialize."""
+
+        super(SelectorContains, self).__init__(
+            text=text
+        )
+
+
 class SelectorNth(Immutable):
     """Selector nth type."""

@@ -324,6 +338,7 @@ def pickle_register(obj):
 pickle_register(SelectorNull)
 pickle_register(SelectorTag)
 pickle_register(SelectorAttribute)
+pickle_register(SelectorContains)
 pickle_register(SelectorNth)
 pickle_register(SelectorLang)
 pickle_register(SelectorList)
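
As a rough picture of the data this class carries, the sketch below models `SelectorContains` with a `namedtuple`. That is an assumption made for illustration only: the real class derives from `Immutable`, whose implementation is not shown in this diff. The `Immutable` base and the `_hash` slot suggest a read-only, hashable object, which is the behavior the stand-in imitates.

```python
from collections import namedtuple

# Hypothetical stand-in; not the real `Immutable`-based class.
SelectorContains = namedtuple("SelectorContains", ["text"])

# Assumed representation of `:contains("text", "other text"):contains("more")`:
# one object per `:contains()`, each holding a tuple of alternatives.
contains = (
    SelectorContains(text=("text", "other text")),
    SelectorContains(text=("more",)),
)

# Equal content gives equal objects and equal hashes.
same = SelectorContains(text=("text", "other text"))
is_equal = contains[0] == same
same_hash = hash(contains[0]) == hash(same)

# Attributes cannot be reassigned after construction.
try:
    contains[0].text = ("changed",)
    mutable = True
except AttributeError:
    mutable = False

print(is_equal, same_hash, mutable)
```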
8 changes: 4 additions & 4 deletions tests/test_api.py
@@ -457,9 +457,9 @@ def test_copy_pickle(self):
         # We force a pattern that contains all custom types:
         # `Selector`, `NullSelector`, `SelectorTag`, `SelectorAttribute`,
         # `SelectorNth`, `SelectorLang`, `SelectorList`, `Namespaces`,
-        # and `CustomSelectors`.
+        # `SelectorContains`, and `CustomSelectors`.
         p1 = sv.compile(
-            'p.class#id[id]:nth-child(2):lang(en):focus',
+            'p.class#id[id]:nth-child(2):lang(en):focus:contains("text", "other text")',
             {'html': 'http://www.w3.org/TR/html4/'},
             custom={':--header': 'h1, h2, h3, h4, h5, h6'}
         )
@@ -469,15 +469,15 @@ def test_copy_pickle(self):

         # Test that we pull the same one from cache
         p2 = sv.compile(
-            'p.class#id[id]:nth-child(2):lang(en):focus',
+            'p.class#id[id]:nth-child(2):lang(en):focus:contains("text", "other text")',
             {'html': 'http://www.w3.org/TR/html4/'},
             custom={':--header': 'h1, h2, h3, h4, h5, h6'}
         )
         self.assertTrue(p1 is p2)

         # Test that we compile a new one when providing different flags
         p3 = sv.compile(
-            'p.class#id[id]:nth-child(2):lang(en):focus',
+            'p.class#id[id]:nth-child(2):lang(en):focus:contains("text", "other text")',
             {'html': 'http://www.w3.org/TR/html4/'},
             custom={':--header': 'h1, h2, h3, h4, h5, h6'},
             flags=0x10
40 changes: 40 additions & 0 deletions tests/test_extra/test_contains.py
@@ -66,6 +66,46 @@ def test_contains_quoted_with_escaped_newline_with_carriage_return(self):
             flags=util.HTML
         )

+    def test_contains_list(self):
+        """Test contains list."""
+
+        self.assert_selector(
+            self.MARKUP,
+            'body span:contains("does not exist", "that")',
+            ['2'],
+            flags=util.HTML
+        )
+
+    def test_contains_multiple(self):
+        """Test contains multiple."""
+
+        self.assert_selector(
+            self.MARKUP,
+            'body span:contains("th"):contains("at")',
+            ['2'],
+            flags=util.HTML
+        )
+
+    def test_contains_multiple_not_match(self):
+        """Test contains multiple with "not" and with a match."""
+
+        self.assert_selector(
+            self.MARKUP,
+            'body span:not(:contains("does not exist")):contains("that")',
+            ['2'],
+            flags=util.HTML
+        )
+
+    def test_contains_multiple_not_no_match(self):
+        """Test contains multiple with "not" and no match."""
+
+        self.assert_selector(
+            self.MARKUP,
+            'body span:not(:contains("that")):contains("that")',
+            [],
+            flags=util.HTML
+        )
+
     def test_contains_with_descendants(self):
         """Test that contains returns descendants as well as the top level that contain."""
