Optimize raw HTML post-processor #1510

pawamoy · 2025-02-21T15:34:10Z

pawamoy · 2025-02-21T15:53:35Z

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

pawamoy · 2025-02-22T22:54:14Z

markdown/postprocessors.py

+            else:
+                key = m.group(2)
+                wrapped = False
+            if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):


Could use html_counter instead:

Suggested change

if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):

if (key := int(key)) >= self.md.htmlStash.html_counter:

I don't see any reason to recompute the length when we already have it stored.

What's not clear to me is why we are reassigning key here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?

Oh, just an oversight, we can definitely coerce to int when first assigning.

Ah I remember now, that's because I'm doing if key := m.group(1), and casting to int here would fail when m.group(1) is None.

markdown/postprocessors.py

waylan

This looks good generally. However, I haven't tested it or done any performance comparisons of my own. From a quick reading of the code only the few things you pointed out stand out to me. I'll need to take a closer look when I have more time.

markdown/postprocessors.py

waylan · 2025-02-26T16:39:52Z

markdown/postprocessors.py

+            else:
+                key = m.group(2)
+                wrapped = False
+            if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):


I don't see any reason to recompute the length when we already have it stored.

What's not clear to me is why we are reassigning key here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?

markdown/postprocessors.py

facelessuser · 2025-02-26T16:53:07Z

I'll likely try this branch on some projects to inspect for regressions, just to be sure

waylan · 2025-02-26T16:59:57Z

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

This is a valid concern. I'm not sure what to think about it.

facelessuser · 2025-02-26T17:09:43Z

Yes, I saw the change of blocks using a list to set and meant to comment on it. I'm pretty sure it'll break some of my plugins, so keeping the original but deprecating it in favor of a new structure is probably the way to go.

pawamoy · 2025-02-26T17:19:22Z

Thanks for confirming @facelessuser! I'll change it.

pawamoy · 2025-02-26T17:27:30Z

I just added a new private _block_level_elements attribute for now, but we should probably deprecate the old one and make the new one public. I'll wait for your input on this @waylan.

facelessuser · 2025-02-26T17:39:09Z

markdown/core.py

@@ -114,6 +114,7 @@ def __init__(self, **kwargs):
        """ List of characters which get the backslash escape treatment. """

        self.block_level_elements: list[str] = BLOCK_LEVEL_ELEMENTS.copy()
+        self._block_level_elements: set[str] = set(BLOCK_LEVEL_ELEMENTS)


I would prefer a non-private version of the sets as I imagine we can and should deprecate the old one and eventually remove it in the future. I'm likely to begin referencing the new one as soon as I can, I just don't want it to break things when it releases.

I agree. Lets make it public.

There is one other concern. Some extensions are likely to alter the list (by adding or removing elements). Therefore, the new attribute could get out-of-sync with the pre-existing attribute. How should we address that?

One possible solution is to create a custom class which subclasses set and also provides the common list methods (which should issue a deprecation warning when called). Then there are not two sequences to keep in sync and both the old and new APIs continue to work. This also removes the need to come up with a new attribute name on the Markdown class. At some future point the list methods could instead raise an error when called. Then at a later point, the custom class could be removed.

Sounds good! list has the following methods in common with set:

copy: ~~nothing to do~~ make sure to return a copy of the current instance (not a raw set)

pop: the list method accepts an index argument, and raises IndexError instead of KeyError. We can catch KeyError and re-raise IndexError (and emit a warning?) during the deprecation period. As for index, I think we should call set's pop but emit a warning that it pops an arbitrary item.

remove: both accept a single positional argument, but list raises ValueError instead of KeyError. We can catch and re-raise again (and emit a warning?) during the deprecation period.

clear: nothing to do

...and the following additional methods:

append: easy, call self.add

sort: doesn't make sense for a set, so just pass

reverse: doesn't make sense for a set, so just pass

insert: easy, call self.add

index: doesn't make sense for a set, return 0?

extend: easy, call self.update

count: 1 if value in self else 0

I'll proceed with this, let me know if you'd like to change a few things 🙂

That looks good with two exceptions.

A common action would be to remove items. I imagine code exists which finds the index of an object and removes or pops by index. set.remove and set.pop accept the object as an argument rather than the index. We could return the object from index so that it works when passed into remove or pop, but that seems weird in all other respects. Not sure how best to approach this.

Presumably count could return self.__len__().

So, something like a plain class with two hidden structs (list and set), with all accompanying methods?

class _BlockLevelElements: def __init__(self, elements: list[str]) -> None: self._list = elements.copy() self._set = set(self.list) def __contains__(self, value): return value in self._set def append(...): ... def add(...): ... ...

I pushed something in d4813dc, let me know what you think.

@pawamoy I think something like that may work, it needs to be tested well though. I saw in your link that things like __add__ only add to the list but don't add to the set.

Never mind, it is creating a new object and you are returning a pure list but providing a warning. Things may break if they set it back, but I understand what's going on.

Yeah I figured new lists/sets created this way would likely not be shared across multiple extensions, but assigning it back would create such a scenario. That's also something @waylan remarked, so to be more robust we can just create a new _BlockLevelElements instance 👍

Using a set allows for better performances when checking for membership of a tag within block level elements. Issue-1507: Python-Markdown#1507

Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash. It would then apply a regular expression substitution using these replacements. Finally, if the text changed, it would recurse, and do all that again. This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway. This change moves the recursion into the regular expression substitution, so that: 1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames); 2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text. The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag). Issue-1507: Python-Markdown#1507

pawamoy · 2025-02-27T12:33:11Z

markdown/util.py

+        warnings.warn(
+            "Using block level elements as a list is deprecated, use it as a set instead.",
+            DeprecationWarning,
+        )


I emit the same warning every time, so that users only get it once. The message goes straight to the point: don't use it as a list.

pawamoy · 2025-02-27T12:35:11Z

markdown/util.py

+            DeprecationWarning,
+        )
+        # Using `+` means user expects a list back.
+        return self._list + other


Methods with no side-effects return a list or set directly, not a copy of our hybrid instance. The reason is we assume that as soon as a user explicitly uses it as a list (or set), they will only use it as a list (or set), so no need to continue providing both interfaces.

Not necessarily. The same object is likely to be accessed by multiple extensions. It is reasonable to assume that some newly updated extensions will assume a set and some older (not yet updated) extensions will assume a list. Whatever solution we chose needs to continue to work with both throughout the life of the Markdown class instance.

Yeah I thought about that, though since these methods return a new object (that's what I meant with "no side-effect", which wasn't super clear), I figured there are very few chances multiple extensions would actually share references to such newly created lists/sets (each new list/set would probably be isolated to the extension that created it). Anyway, I can definitely change that to return a _BlockLevelElements instance, to be more robust 🙂

Sorry, I missed that this comment was in a specific method which returns a modified object. Yeah, in this case, returning the same object type makes sense.

No worries, I wasn't very clear about it. @facelessuser expressed the same concern, so we might want to return a new instance of _BlockLevelElements, just to be safe. Let me know what you prefer.

markdown/util.py

pawamoy · 2025-02-27T12:37:50Z

markdown/util.py

+
+
+# Type it as `set[str]` to express our intent for it to be used as such.
+BLOCK_LEVEL_ELEMENTS: set[str] = _BlockLevelElements([


I can't guarantee I didn't make subtle mistakes, so it should probably need proper testing.

If this all seems too complicated, happy to try something simpler (it was implemented quickly, no problem with discarding all this 😄).

It should definitely get tested. Eventually it'll go away as well. There's probably ways in which people can still mess things up, but I think this tries to provide a reasonable attempt to keep stuff from breaking. I'm open to other ideas if someone has them as well.

Alright, I'll start writing tests 🙂 Just wanted initial feedback before tackling them 😊

waylan · 2025-02-27T20:30:00Z

markdown/util.py

-BLOCK_LEVEL_ELEMENTS: list[str] = [
+# TODO: Raise errors from list methods in the future.
+# Later, remove this class entirely and use a regular set.
+class _BlockLevelElements:


Elsewhere the type is specified as set. However, as this is not a subclass of set, that would be incorrect. Also, later when we remove this, then we will have a set. Perhaps this class should properly reflect that now. Unfortunately, there are no specialized container datatypes for sets in collections. And sets are not a Sequence Type either. Not really sure how best to address this.

My feeling was that it was OK to lie about the type, so that users actually get type warnings when they use it as a list instead of a set. That's one more (very effective) way of communicating the change. And since we implement all set functionality, the lie is not too big 🙂

We could do that, but some users run type validation on their code and this could result in bug reports. Either we need to fully document our reasoning for the inconsistency in the code comments or we need to make a change to the code.

pawamoy

First batch of tests added. It's pretty basic. A second batch should be added to test lists with duplicate elements within them. I updated the code to try and reduce the possibility of duplicates but it's not thorough.

pawamoy · 2025-02-28T14:49:13Z

markdown/util.py

+        self._list.clear()
+        self._list.extend(sorted(self._set))


I took a shortcut, but we should probably maintain the list order, while preventing duplicates.

pawamoy · 2025-02-28T14:49:51Z

markdown/util.py

+        self._list.clear()
+        self._list.extend(sorted(self._set))


Same shortcut, though we should maintain list order by first removing then adding elements, preventing duplicates, probably.

pawamoy · 2025-02-28T14:50:42Z

markdown/util.py

+        self._list.clear()
+        self._list.extend(sorted(self._set))


Shortcut again, we should maintain order and only append elements, preventing duplicates.

waylan · 2025-02-28T18:30:24Z

markdown/extensions/md_in_html.py

@@ -42,7 +42,7 @@ class HTMLExtractorExtra(HTMLExtractor):

    def __init__(self, md: Markdown, *args, **kwargs):
        # All block-level tags.
-        self.block_level_tags = set(md.block_level_elements.copy())
+        self.block_level_tags = md.block_level_elements.copy()


I just noticed this, but this was already being converted to a set in this extension. Maybe we should keep that behavior rather than leave it as our custom type.

waylan · 2025-02-28T18:33:36Z

tests/test_block_level_elements.py

+Tests for block level elements.
+===============================
+
+Tests specific to the hybrid list/set container for block level elements.


Let's add a comment here outlining the deprecation path for this file, specifically noting when it can be removed. We should add a similar comment on the markdown.util._BlockLevelElements class as well which reminds us to also remove these tests when the class is removed from the code.

pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 6113aad to fc9acc0 Compare February 21, 2025 15:38

This comment was marked as resolved.

Sign in to view

pawamoy commented Feb 22, 2025

View reviewed changes

pawamoy commented Feb 23, 2025

View reviewed changes

markdown/postprocessors.py Outdated Show resolved Hide resolved

waylan requested changes Feb 26, 2025

View reviewed changes

waylan added needs-review Needs to be reviewed and/or approved. requires-changes Awaiting updates after a review. labels Feb 26, 2025

pawamoy force-pushed the optimize-rawhtml-postprocessor branch from ce408be to 0841396 Compare February 26, 2025 17:17

pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 0841396 to e05cce7 Compare February 26, 2025 17:26

facelessuser reviewed Feb 26, 2025

View reviewed changes

pawamoy added 3 commits February 26, 2025 21:20

Use set instead of list for block level elements

4334dee

Using a set allows for better performances when checking for membership of a tag within block level elements. Issue-1507: Python-Markdown#1507

Add changelog entry for improved raw HTML post-processor perfs

1bc8c54

pawamoy force-pushed the optimize-rawhtml-postprocessor branch from e05cce7 to 1bc8c54 Compare February 26, 2025 20:20

fixup! Use set instead of list for block level elements

d4813dc

pawamoy commented Feb 27, 2025

View reviewed changes

waylan reviewed Feb 27, 2025

View reviewed changes

fixup! Use set instead of list for block level elements

6a4bfcb

pawamoy commented Feb 28, 2025

View reviewed changes

waylan reviewed Feb 28, 2025

View reviewed changes

	if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
	if (key := int(key)) >= self.md.htmlStash.html_counter:



		# Type it as `set[str]` to express our intent for it to be used as such.
		BLOCK_LEVEL_ELEMENTS: set[str] = _BlockLevelElements([

Optimize raw HTML post-processor #1510

Are you sure you want to change the base?

Optimize raw HTML post-processor #1510

Conversation

pawamoy commented Feb 21, 2025

pawamoy commented Feb 21, 2025

This comment was marked as resolved.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

waylan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facelessuser commented Feb 26, 2025

waylan commented Feb 26, 2025

facelessuser commented Feb 26, 2025

pawamoy commented Feb 26, 2025

pawamoy commented Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

waylan Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

pawamoy Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facelessuser Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy left a comment

Choose a reason for hiding this comment

pawamoy Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

pawamoy Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pawamoy Feb 26, 2025 •

edited

Loading

pawamoy commented Feb 26, 2025 •

edited

Loading

waylan Feb 26, 2025 •

edited

Loading

pawamoy Feb 26, 2025 •

edited

Loading

pawamoy Feb 27, 2025 •

edited

Loading

pawamoy Feb 27, 2025 •

edited

Loading

facelessuser Feb 27, 2025 •

edited

Loading

pawamoy Feb 27, 2025 •

edited

Loading

pawamoy Feb 28, 2025 •

edited

Loading

pawamoy Feb 28, 2025 •

edited

Loading