Add regex matching for Snippets #2525

Open
wants to merge 18 commits into
base: main
4 changes: 4 additions & 0 deletions docs/src/markdown/about/changelog.md
@@ -1,5 +1,9 @@
# Changelog

## 10.13

- **NEW**: Snippets: Snippets can now extract lines that match a regex string.

## 10.12

- **NEW**: Blocks: Blocks extensions no longer considered in beta.
53 changes: 41 additions & 12 deletions docs/src/markdown/extensions/snippets.md
@@ -126,6 +126,34 @@ include.md::3
;--8<--
```

/// new | New 10.13
You can also use regex syntax to match lines, or parts of lines.
///

Regex strings must start and end with a `/`.

Assuming a file like this:
```
# this is the username
username = alice
# this is the group
groups = cool,fun,smart
```

- To extract the line that contains `username =`, use `file.md:/username =/`. This returns `username = alice`.
- To extract parts of a line, use match groups: `file.md:/username = (.*)/`. This returns `alice`.
- If you use multiple groups, they are joined with a single space: `file.md:/groups = ([a-z]+),([a-z]+),([a-z]+)/` returns `cool fun smart`.
- The regex can match multiple lines in a file: `file.md:/=/` returns both lines that contain a `=`, but not the comments in between.
- If the `regex_flags` option includes `'DOTALL'` and/or `'MULTILINE'`, the regex is applied to the whole file instead of line by line, so matches can span multiple lines.
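
The rules in the list above can be sketched in plain Python. This is only an illustration of the documented per-line behavior, not the extension's actual code; `extract` and `SAMPLE` are hypothetical names introduced for this example.

```python
import re

# The example file from above, as a string.
SAMPLE = """\
# this is the username
username = alice
# this is the group
groups = cool,fun,smart
"""


def extract(pattern, text):
    """Hypothetical helper that mimics the documented per-line matching."""
    results = []
    for line in text.splitlines():
        m = re.search(pattern, line)
        if m is None:
            continue  # This line does not match; skip it.
        if m.groups():
            # Match groups are joined with a single space.
            results.append(" ".join(m.groups()))
        else:
            # No groups: keep the whole matching line.
            results.append(line)
    return results


print(extract(r"username =", SAMPLE))      # ['username = alice']
print(extract(r"username = (.*)", SAMPLE)) # ['alice']
print(extract(r"groups = ([a-z]+),([a-z]+),([a-z]+)", SAMPLE))  # ['cool fun smart']
print(extract(r"=", SAMPLE))  # ['username = alice', 'groups = cool,fun,smart']
```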

/// tip
The regex matching uses [the Python `re` library](https://docs.python.org/3/library/re.html) with [the `search` function](https://docs.python.org/3/library/re.html#re.search).
This means the regex may match anywhere in a line. If a line matches, the entire line is returned, unless the regex contains match groups, in which case only the groups are returned.

Please [refer to the documentation for details on how the regex flags work](https://docs.python.org/3/library/re.html#flags).
Make sure to validate and debug your regex using tools like regex101.com.
///
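
As a rough illustration of what the two flags change, the following uses only the standard `re` module on the example file from above. It shows the flag semantics themselves and makes no claim about how the extension applies them internally; the variable names are chosen for this example.

```python
import re

# The example file, joined into one string.
text = "\n".join([
    "# this is the username",
    "username = alice",
    "# this is the group",
    "groups = cool,fun,smart",
])

# Without DOTALL, "." stops at a newline, so this pattern cannot cross lines.
no_flag = re.search(r"alice.*groups", text)
print(no_flag)  # None

# With DOTALL, "." also matches "\n", so a single match can span several lines.
m = re.search(r"alice.*groups", text, re.DOTALL)
print(m[0].splitlines())  # ['alice', '# this is the group', 'groups']

# With MULTILINE, "^" and "$" anchor at every line, not only at the ends of the string.
keys = re.findall(r"^\w+(?= =)", text, re.MULTILINE)
print(keys)  # ['username', 'groups']
```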

### Snippet Sections

/// new | New 9.7
@@ -254,15 +282,16 @@ appended to every to Markdown content. Each entry in the list searched for relat

## Options

Option | Type | Default | Description
---------------------- | --------------- | ---------------- |------------
`base_path` | \[string\] | `#!py3 ['.']` | A list of strings indicating the base paths used to resolve snippet locations. For legacy purposes, a single string is also accepted. Base paths are resolved in the order they are specified. When resolving a file name, the first match wins. If a file name is specified, the base name will be matched.
`encoding` | string | `#!py3 'utf-8'` | Encoding to use when reading in the snippets.
`check_paths` | bool | `#!py3 False` | Make the build fail if a snippet can't be found.
`auto_append` | \[string\] | `#!py3 []` | A list of snippets (relative to the `base_path`) to auto append to the Markdown content.
`url_download` | bool | `#!py3 False` | Allows URLs to be specified as file snippets. URLs will be downloaded and inserted accordingly.
`url_max_size` | int | `#!py3 33554432` | Sets an arbitrary max content size. If the content length is reported to be larger, an exception will be thrown. Default is ~32 MiB.
`url_timeout` | float | `#!py3 10.0` | Passes an arbitrary timeout in seconds to URL requestor. By default this is set to 10 seconds.
`url_request_headers` | {string:string} | `#!py3 {}` | Passes arbitrary headers to URL requestor. By default this is set to empty map.
`dedent_subsections` | bool | `#!py3 False` | Remove any common leading whitespace from every line in text of a subsection that is inserted via "sections" or by "lines".
`restrict_base_path` | bool | `#!py True` | Ensure that the specified snippets are children of the specified base path(s). This prevents a path relative to the base path, but not explicitly a child of the base path.
Option | Type | Default | Description
---------------------- | --------------- | ----------------- |------------
`base_path` | \[string\] | `#!py3 ['.']` | A list of strings indicating the base paths used to resolve snippet locations. For legacy purposes, a single string is also accepted. Base paths are resolved in the order they are specified. When resolving a file name, the first match wins. If a file name is specified, the base name will be matched.
`encoding` | string | `#!py3 'utf-8'` | Encoding to use when reading in the snippets.
`check_paths` | bool | `#!py3 False` | Make the build fail if a snippet can't be found.
`auto_append` | \[string\] | `#!py3 []` | A list of snippets (relative to the `base_path`) to auto append to the Markdown content.
`url_download` | bool | `#!py3 False` | Allows URLs to be specified as file snippets. URLs will be downloaded and inserted accordingly.
`url_max_size` | int | `#!py3 33554432` | Sets an arbitrary max content size. If the content length is reported to be larger, an exception will be thrown. Default is ~32 MiB.
`url_timeout` | float | `#!py3 10.0` | Passes an arbitrary timeout in seconds to URL requestor. By default this is set to 10 seconds.
`url_request_headers` | {string:string} | `#!py3 {}` | Passes arbitrary headers to URL requestor. By default this is set to empty map.
`dedent_subsections` | bool | `#!py3 False` | Remove any common leading whitespace from every line in text of a subsection that is inserted via "sections" or by "lines".
`restrict_base_path` | bool | `#!py True` | Ensure that the specified snippets are children of the specified base path(s). This prevents a path relative to the base path, but not explicitly a child of the base path.
`regex_flags` | \[string\] | `#!py ['NOFLAG']` | A list of flags to pass to re.search (such as `DOTALL`, `MULTILINE` and/or `IGNORECASE`).
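
The way a list of flag names such as the `regex_flags` value maps onto `re` flags can be sketched as follows. This is a minimal sketch under stated assumptions: `combine_flags` is a hypothetical helper, not part of the extension, and it skips `'NOFLAG'` explicitly because `re.NOFLAG` only exists on Python 3.11 and later.

```python
import re


def combine_flags(names):
    """Hypothetical helper: OR together `re` flags given by name."""
    if not isinstance(names, list):
        raise TypeError(f"expected a list of flag names, got {type(names).__name__}")
    flags = 0
    for name in names:
        if name == "NOFLAG":
            # re.NOFLAG equals 0 and is only defined on Python 3.11+, so skip it.
            continue
        # Flags are combined with bitwise OR, as the re documentation describes.
        flags |= getattr(re, name)
    return flags


flags = combine_flags(["IGNORECASE", "MULTILINE"])
print(flags == (re.IGNORECASE | re.MULTILINE))  # True
print(combine_flags(["NOFLAG"]) == 0)           # True

# IGNORECASE makes the match case-insensitive; MULTILINE lets "^" match at line 2.
found = re.search(r"^USERNAME", "# comment\nusername = alice", flags)
print(found is not None)  # True
```

An unknown flag name raises `AttributeError` from `getattr`, which surfaces configuration typos early.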
40 changes: 39 additions & 1 deletion pymdownx/snippets.py
@@ -92,11 +92,44 @@ def __init__(self, config, md):
        self.url_timeout = config['url_timeout']
        self.url_request_headers = config['url_request_headers']
        self.dedent_subsections = config['dedent_subsections']
        self.regex_flags = config['regex_flags']
        self.tab_length = md.tab_length
        super().__init__()

        self.download.cache_clear()

    def extract_regex(self, regex, lines):
        """Extract the lines matching the regex. If the regex contains groups, they are joined with spaces."""

        new_lines = []
        regex = regex[1:-1]  # The pattern arrives wrapped in slashes; strip them.
        flags = 0
        if not isinstance(self.regex_flags, list):
            raise TypeError(
                f"regex_flags must be a list, not a {type(self.regex_flags).__name__}. Got: {self.regex_flags}"
            )

        for flag in self.regex_flags:
            flags |= getattr(re, flag)  # Combine flags with bitwise OR, as per the re module documentation.
        if "MULTILINE" in self.regex_flags or "DOTALL" in self.regex_flags:
            # Matches may span lines, so search the joined text instead of each line.
            for match in re.finditer(regex, "\n".join(lines), flags):
                if match.groups():
                    new_lines.append(" ".join(match.groups()))
                else:
                    new_lines.append(match[0])
        else:
            for line in lines:
                m = re.search(regex, line, flags)
                if m and m.groups():
                    new_lines.append(" ".join(m.groups()))  # Join the groups together.
                elif m:
                    new_lines.append(line)  # Return the entire line, as documented (m[0] is only the matched part).

        if not new_lines and self.check_paths:
            # If flags is 0 (re.NOFLAG == 0), leave the flags out of the message.
            plural = 's' if len(self.regex_flags) > 1 else ''
            flagstring = f" with flag{plural} {self.regex_flags}" if flags else ""
            raise SnippetMissingError(f"No line matched the regex /{regex}/{flagstring}")

        return self.dedent(new_lines) if self.dedent_subsections else new_lines

def extract_section(self, section, lines):
"""Extract the specified section from the lines."""

@@ -328,6 +361,8 @@ def parse_snippets(self, lines, file_name=None, is_url=False):
if start is not None or end is not None:
s = slice(start, end)
s_lines = self.dedent(s_lines[s]) if self.dedent_subsections else s_lines[s]
elif section and section.startswith("/") and section.endswith("/"): # if section is a regex
s_lines = self.extract_regex(section, s_lines)
elif section:
s_lines = self.extract_section(section, s_lines)
else:
@@ -337,6 +372,8 @@ def parse_snippets(self, lines, file_name=None, is_url=False):
if start is not None or end is not None:
s = slice(start, end)
s_lines = self.dedent(s_lines[s]) if self.dedent_subsections else s_lines[s]
elif section and section.startswith("/") and section.endswith("/"): # if section is a regex
s_lines = self.extract_regex(section, s_lines)
elif section:
s_lines = self.extract_section(section, s_lines)
except SnippetMissingError:
@@ -396,7 +433,8 @@ def __init__(self, *args, **kwargs):
'url_max_size': [DEFAULT_URL_SIZE, "External URL max size (0 means no limit)- Default: 32 MiB"],
'url_timeout': [DEFAULT_URL_TIMEOUT, 'Default URL timeout (0 means no timeout) - Default: 10 sec'],
'url_request_headers': [DEFAULT_URL_REQUEST_HEADERS, "Extra request Headers - Default: {}"],
'dedent_subsections': [False, "Dedent subsection extractions e.g. 'sections' and/or 'lines'."]
'dedent_subsections': [False, "Dedent subsection extractions e.g. 'sections' and/or 'lines'."],
'regex_flags': [['NOFLAG'], "Flags to pass to re.search (such as DOTALL, MULTILINE and/or IGNORECASE) - Default: ['NOFLAG']"]
}

super().__init__(*args, **kwargs)