facelessuser · simonfelding · Nov 13, 2024
diff --git a/docs/src/markdown/about/changelog.md b/docs/src/markdown/about/changelog.md
@@ -1,5 +1,9 @@
 # Changelog
 
+## 10.13
+
+-   **NEW**: Snippets: Snippets can now extract lines that match a regex string.
+
 ## 10.12
 
 -   **NEW**: Blocks: Blocks extensions no longer considered in beta.

diff --git a/docs/src/markdown/extensions/snippets.md b/docs/src/markdown/extensions/snippets.md
@@ -126,6 +126,34 @@ include.md::3
 ;--8<--
 ```
 
+/// new | New 10.13
+You can also use regex syntax to match lines, or parts of lines.
+///
+
+Regex strings must start and end with a `/`.
+
+Assuming a file like this:
+```
+# this is the username
+username = alice
+# this is the group
+groups = cool,fun,smart
+```
+
+- To extract the line that contains `username =`, use `file.md:/username =/`. This returns `username = alice`.
+- To extract parts of a line, use match groups: `file.md:/username = (.*)/`. This returns `alice`.
+- If you use multiple groups, they are joined with a single space: `file.md:/groups = ([a-z]+),([a-z]+),([a-z]+)/` returns `cool fun smart`.
+- The regex can match multiple lines in a file: `file.md:/=/` returns both lines that contain a `=`, but not the comments in between.
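+- If `regex_flags` includes `DOTALL` and/or `MULTILINE`, the regex is applied to the whole file rather than line by line, so a match can span multiple lines. In that case only the matched text (or its groups) is returned.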
+
+/// tip
+The regex matching uses [Python's `re` module](https://docs.python.org/3/library/re.html) with [the `search` function](https://docs.python.org/3/library/re.html#re.search).
+This means that if a line matches, the entire line is returned (unless there is a match group).
+
+Please [refer to the documentation for details on how the regex flags work](https://docs.python.org/3/library/re.html#flags).
+Make sure to validate and debug your regex using tools like regex101.com.
+///
+
 ### Snippet Sections
 
 /// new | New 9.7
@@ -254,15 +282,16 @@ appended to every to Markdown content. Each entry in the list searched for relat
 
 ## Options
 
-Option                 | Type            | Default          | Description
----------------------- | --------------- | ---------------- |------------
-`base_path`            | \[string\]      | `#!py3 ['.']`    | A list of strings indicating base paths to be used resolve snippet locations. For legacy purposes, a single string will also be accepted as well. Base paths will be resolved in the order they are specified. When resolving a file name, the first match wins. If a file name is specified, the base name will be matched.
-`encoding`             | string          | `#!py3 'utf-8'`  | Encoding to use when reading in the snippets.
-`check_paths`          | bool            | `#!py3 False`    | Make the build fail if a snippet can't be found.
-`auto_append`          | \[string\]      | `#!py3 []`       | A list of snippets (relative to the `base_path`) to auto append to the Markdown content.
-`url_download`         | bool            | `#!py3 False`    | Allows URLs to be specified as file snippets. URLs will be downloaded and inserted accordingly.
-`url_max_size`         | int             | `#!py3 33554432` | Sets an arbitrary max content size. If content length is reported to be larger, and exception will be thrown. Default is ~32 MiB.
-`url_timeout`          | float           | `#!py3 10.0`     | Passes an arbitrary timeout in seconds to URL requestor. By default this is set to 10 seconds.
-`url_request_headers`  | {string:string} | `#!py3 {}`       | Passes arbitrary headers to URL requestor. By default this is set to empty map.
-`dedent_subsections`   | bool            | `#!py3 False`    | Remove any common leading whitespace from every line in text of a subsection that is inserted via "sections" or by "lines".
-`restrict_base_path`   | bool            | `#!py True`      | Ensure that the specified snippets are children of the specified base path(s). This prevents a path relative to the base path, but not explicitly a child of the base path.
+Option                 | Type            | Default           | Description
+---------------------- | --------------- | ----------------- |------------
+`base_path`            | \[string\]      | `#!py3 ['.']`     | A list of strings indicating base paths to be used to resolve snippet locations. For legacy purposes, a single string is also accepted. Base paths will be resolved in the order they are specified. When resolving a file name, the first match wins. If a file name is specified, the base name will be matched.
+`encoding`             | string          | `#!py3 'utf-8'`   | Encoding to use when reading in the snippets.
+`check_paths`          | bool            | `#!py3 False`     | Make the build fail if a snippet can't be found.
+`auto_append`          | \[string\]      | `#!py3 []`        | A list of snippets (relative to the `base_path`) to auto append to the Markdown content.
+`url_download`         | bool            | `#!py3 False`     | Allows URLs to be specified as file snippets. URLs will be downloaded and inserted accordingly.
+`url_max_size`         | int             | `#!py3 33554432`  | Sets an arbitrary max content size. If content length is reported to be larger, an exception will be thrown. Default is ~32 MiB.
+`url_timeout`          | float           | `#!py3 10.0`      | Passes an arbitrary timeout in seconds to URL requestor. By default this is set to 10 seconds.
+`url_request_headers`  | {string:string} | `#!py3 {}`        | Passes arbitrary headers to URL requestor. By default this is set to empty map.
+`dedent_subsections`   | bool            | `#!py3 False`     | Remove any common leading whitespace from every line in text of a subsection that is inserted via "sections" or by "lines".
+`restrict_base_path`   | bool            | `#!py True`       | Ensure that the specified snippets are children of the specified base path(s). This prevents a path relative to the base path, but not explicitly a child of the base path.
+`regex_flags`          | \[string\]      | `#!py ['NOFLAG']` | A list of `re` flag names (such as `DOTALL`, `MULTILINE`, and/or `IGNORECASE`) applied when matching regex snippets.
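
The documented line-by-line behavior can be illustrated with a small standalone sketch. This is not the extension's code: `extract_by_line` and `SAMPLE` are hypothetical names used only for illustration, and the sketch assumes the default case where no flags are set. It returns the whole line when the pattern has no groups and the space-joined groups otherwise, which is what the examples in the docs describe.

```python
import re

# The sample file from the documentation, one entry per line.
SAMPLE = [
    "# this is the username",
    "username = alice",
    "# this is the group",
    "groups = cool,fun,smart",
]


def extract_by_line(pattern, lines):
    """Return whole matching lines, or space-joined groups if the pattern has groups.

    Hypothetical helper mirroring the documented behavior, not the extension's API.
    """
    results = []
    for line in lines:
        m = re.search(pattern, line)
        if m is None:
            continue  # This line does not match; skip it.
        groups = m.groups()
        results.append(" ".join(groups) if groups else line)
    return results


print(extract_by_line(r"username =", SAMPLE))      # ['username = alice']
print(extract_by_line(r"username = (.*)", SAMPLE))  # ['alice']
print(extract_by_line(r"groups = ([a-z]+),([a-z]+),([a-z]+)", SAMPLE))  # ['cool fun smart']
print(extract_by_line(r"=", SAMPLE))  # ['username = alice', 'groups = cool,fun,smart']
```

Because `re.search` is used rather than `re.match`, the pattern may match anywhere in a line, so anchors such as `^` are only needed when the position matters.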
diff --git a/pymdownx/snippets.py b/pymdownx/snippets.py
@@ -92,11 +92,44 @@ def __init__(self, config, md):
         self.url_timeout = config['url_timeout']
         self.url_request_headers = config['url_request_headers']
         self.dedent_subsections = config['dedent_subsections']
+        self.regex_flags = config['regex_flags']
         self.tab_length = md.tab_length
         super().__init__()
 
         self.download.cache_clear()
 
+    def extract_regex(self, regex, lines):
+        """Extract the specified regex from the lines. If the regex contains groups, they will be joined together."""
+
+        new_lines = []
+        regex = regex[1:-1]  # The regex arrives wrapped in slashes; strip them.
+        flags = 0
+        if not isinstance(self.regex_flags, list):
+            raise TypeError(f"regex_flags must be a list, not {type(self.regex_flags).__name__}. Got: {self.regex_flags!r}")
+
+        for flag in self.regex_flags:
+            flags |= 0 if flag == 'NOFLAG' else getattr(re, flag)  # Combine flags with bitwise OR; `re.NOFLAG` only exists on Python 3.11+, so map it to 0 directly.
+        if "MULTILINE" in self.regex_flags or "DOTALL" in self.regex_flags:
+            # Match against the joined text so that a single match can span lines.
+            for match in re.finditer(regex, "\n".join(lines), flags):
+                if match.groups():
+                    new_lines.append(" ".join(match.groups()))
+                else:
+                    new_lines.append(match[0])
+        else:
+            for line in lines:
+                m = re.search(regex, line, flags)
+                if m and m.groups():
+                    new_lines.append(" ".join(m.groups()))  # Join the groups with a space.
+                elif m:
+                    new_lines.append(line)  # No groups: return the entire line, as documented.
+
+        if not new_lines and self.check_paths:
+            flagstring = f" with flag{'s' if len(self.regex_flags) > 1 else ''} {self.regex_flags}" if flags else ""  # Omit the flags when none are set (re.NOFLAG == 0).
+            raise SnippetMissingError(f"No line matched the regex /{regex}/{flagstring}")
+
+        return self.dedent(new_lines) if self.dedent_subsections else new_lines
+
     def extract_section(self, section, lines):
         """Extract the specified section from the lines."""
 
@@ -328,6 +361,8 @@ def parse_snippets(self, lines, file_name=None, is_url=False):
                             if start is not None or end is not None:
                                 s = slice(start, end)
                                 s_lines = self.dedent(s_lines[s]) if self.dedent_subsections else s_lines[s]
+                            elif section and section.startswith("/") and section.endswith("/"): # if section is a regex
+                                s_lines = self.extract_regex(section, s_lines)
                             elif section:
                                 s_lines = self.extract_section(section, s_lines)
                     else:
@@ -337,6 +372,8 @@ def parse_snippets(self, lines, file_name=None, is_url=False):
                             if start is not None or end is not None:
                                 s = slice(start, end)
                                 s_lines = self.dedent(s_lines[s]) if self.dedent_subsections else s_lines[s]
+                            elif section and section.startswith("/") and section.endswith("/"): # if section is a regex
+                                s_lines = self.extract_regex(section, s_lines)
                             elif section:
                                 s_lines = self.extract_section(section, s_lines)
                         except SnippetMissingError:
@@ -396,7 +433,8 @@ def __init__(self, *args, **kwargs):
             'url_max_size': [DEFAULT_URL_SIZE, "External URL max size (0 means no limit)- Default: 32 MiB"],
             'url_timeout': [DEFAULT_URL_TIMEOUT, 'Defualt URL timeout (0 means no timeout) - Default: 10 sec'],
             'url_request_headers': [DEFAULT_URL_REQUEST_HEADERS, "Extra request Headers - Default: {}"],
-            'dedent_subsections': [False, "Dedent subsection extractions e.g. 'sections' and/or 'lines'."]
+            'dedent_subsections': [False, "Dedent subsection extractions e.g. 'sections' and/or 'lines'."],
+            'regex_flags': [['NOFLAG'], "Flags to pass to re.search (such as DOTALL, MULTILINE and/or IGNORECASE) - Default: ['NOFLAG']"]
         }
 
         super().__init__(*args, **kwargs)
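
The flag handling and the whole-file matching mode in `extract_regex` can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `combine_flags`, `extract_spanning`, and `SAMPLE` are hypothetical names, unknown flag names are rejected with a `ValueError` (an assumption; the PR lets `getattr` raise), and `NOFLAG` is treated as 0 because `re.NOFLAG` was only added in Python 3.11.

```python
import re

# The sample file from the documentation, one entry per line.
SAMPLE = [
    "# this is the username",
    "username = alice",
    "# this is the group",
    "groups = cool,fun,smart",
]


def combine_flags(names):
    """OR together `re` flags given by name (hypothetical helper)."""
    flags = 0
    for name in names:
        if name == "NOFLAG":
            continue  # Equivalent to 0; avoids needing re.NOFLAG (Python 3.11+).
        flag = getattr(re, name, None)
        if not isinstance(flag, re.RegexFlag):
            raise ValueError(f"Unknown regex flag: {name!r}")
        flags |= flag
    return flags


def extract_spanning(pattern, lines, flag_names):
    """Search the joined text so that one match may span several lines."""
    text = "\n".join(lines)
    results = []
    for m in re.finditer(pattern, text, combine_flags(flag_names)):
        groups = m.groups()
        if groups:
            # Skip groups that did not participate in the match (they are None).
            results.append(" ".join(g for g in groups if g is not None))
        else:
            results.append(m[0])
    return results


pattern = r"username = (\w+).*groups = (\w+)"
print(extract_spanning(pattern, SAMPLE, ["DOTALL"]))  # ['alice cool']
print(extract_spanning(pattern, SAMPLE, ["NOFLAG"]))  # []
```

With `DOTALL`, `.` also matches newlines, so `.*` can bridge the comment line between the two assignments; without it the same pattern finds nothing. Filtering out `None` groups is a design choice in this sketch so that optional groups do not break the join.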