Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize raw HTML post-processor #1510

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

pawamoy
Copy link
Contributor

@pawamoy pawamoy commented Feb 21, 2025

Closes #1507

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 6113aad to fc9acc0 Compare February 21, 2025 15:38
@pawamoy
Copy link
Contributor Author

pawamoy commented Feb 21, 2025

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

@pawamoy

This comment was marked as resolved.

else:
key = m.group(2)
wrapped = False
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use html_counter instead:

Suggested change
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
if (key := int(key)) >= self.md.htmlStash.html_counter:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason to recompute the length when we already have it stored.

What's not clear to me is why we are reassigning key here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just an oversight, we can definitely coerce to int when first assigning.

Copy link
Contributor Author

@pawamoy pawamoy Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I remember now, that's because I'm doing if key := m.group(1), and casting to int here would fail when m.group(1) is None.

Copy link
Member

@waylan waylan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good generally. However, I haven't tested it or done any performance comparisons of my own. From a quick reading of the code only the few things you pointed out stand out to me. I'll need to take a closer look when I have more time.

else:
key = m.group(2)
wrapped = False
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason to recompute the length when we already have it stored.

What's not clear to me is why we are reassigning key here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?

@facelessuser
Copy link
Collaborator

I'll likely try this branch on some projects to inspect for regressions, just to be sure

@waylan
Copy link
Member

waylan commented Feb 26, 2025

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

This is a valid concern. I'm not sure what to think about it.

@waylan waylan added needs-review Needs to be reviewed and/or approved. requires-changes Awaiting updates after a review. labels Feb 26, 2025
@facelessuser
Copy link
Collaborator

Yes, I saw the change of blocks using a list to set and meant to comment on it. I'm pretty sure it'll break some of my plugins, so keeping the original but deprecating it in favor of a new structure is probably the way to go.

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from ce408be to 0841396 Compare February 26, 2025 17:17
@pawamoy
Copy link
Contributor Author

pawamoy commented Feb 26, 2025

Thanks for confirming @facelessuser! I'll change it.

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 0841396 to e05cce7 Compare February 26, 2025 17:26
@pawamoy
Copy link
Contributor Author

pawamoy commented Feb 26, 2025

I just added a new private _block_level_elements attribute for now, but we should probably deprecate the old one and make the new one public. I'll wait for your input on this @waylan.

markdown/core.py Outdated
@@ -114,6 +114,7 @@ def __init__(self, **kwargs):
""" List of characters which get the backslash escape treatment. """

self.block_level_elements: list[str] = BLOCK_LEVEL_ELEMENTS.copy()
self._block_level_elements: set[str] = set(BLOCK_LEVEL_ELEMENTS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a non-private version of the sets as I imagine we can and should deprecate the old one and eventually remove it in the future. I'm likely to begin referencing the new one as soon as I can, I just don't want it to break things when it releases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Lets make it public.

Copy link
Member

@waylan waylan Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one other concern. Some extensions are likely to alter the list (by adding or removing elements). Therefore, the new attribute could get out-of-sync with the pre-existing attribute. How should we address that?

One possible solution is to create a custom class which subclasses set and also provides the common list methods (which should issue a deprecation warning when called). Then there are not two sequences to keep in sync and both the old and new APIs continue to work. This also removes the need to come up with a new attribute name on the Markdown class. At some future point the list methods could instead raise an error when called. Then at a later point, the custom class could be removed.

Copy link
Contributor Author

@pawamoy pawamoy Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! list has the following methods in common with set:

  • copy: nothing to do make sure to return a copy of the current instance (not a raw set)
  • pop: the list method accepts an index argument, and raises IndexError instead of KeyError. We can catch KeyError and re-raise IndexError (and emit a warning?) during the deprecation period. As for index, I think we should call set's pop but emit a warning that it pops an arbitrary item.
  • remove: both accept a single positional argument, but list raises ValueError instead of KeyError. We can catch and re-raise again (and emit a warning?) during the deprecation period.
  • clear: nothing to do

...and the following additional methods:

  • append: easy, call self.add
  • sort: doesn't make sense for a set, so just pass
  • reverse: doesn't make sense for a set, so just pass
  • insert: easy, call self.add
  • index: doesn't make sense for a set, return 0?
  • extend: easy, call self.update
  • count: 1 if value in self else 0

I'll proceed with this, let me know if you'd like to change a few things 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good with two exceptions.

  1. A common action would be to remove items. I imagine code exists which finds the index of an object and removes or pops by index. set.remove and set.pop accept the object as an argument rather than the index. We could return the object from index so that it works when passed into remove or pop, but that seems weird in all other respects. Not sure how best to approach this.
  2. Presumably count could return self.__len__().

Copy link
Contributor Author

@pawamoy pawamoy Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, something like a plain class with two hidden structs (list and set), with all accompanying methods?

class _BlockLevelElements:
    def __init__(self, elements: list[str]) -> None:
        self._list = elements.copy()
        self._set = set(self.list)

    def __contains__(self, value):
        return value in self._set
    def append(...): ...
    def add(...): ...
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed something in d4813dc, let me know what you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pawamoy I think something like that may work, it needs to be tested well though. I saw in your link that things like __add__ only add to the list but don't add to the set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, it is creating a new object and you are returning a pure list but providing a warning. Things may break if they set it back, but I understand what's going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I figured new lists/sets created this way would likely not be shared across multiple extensions, but assigning it back would create such a scenario. That's also something @waylan remarked, so to be more robust we can just create a new _BlockLevelElements instance 👍

Using a set allows for better performances when checking for membership of a tag within block level elements.

Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash.

It would then apply a regular expression substitution using these replacements.

Finally, if the text changed, it would recurse, and do all that again.

This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway.

This change moves the recursion into the regular expression substitution, so that:

1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames);
2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text.

The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag).

Issue-1507: Python-Markdown#1507
@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from e05cce7 to 1bc8c54 Compare February 26, 2025 20:20
Comment on lines +55 to +58
warnings.warn(
"Using block level elements as a list is deprecated, use it as a set instead.",
DeprecationWarning,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I emit the same warning every time, so that users only get it once. The message goes straight to the point: don't use it as a list.

DeprecationWarning,
)
# Using `+` means user expects a list back.
return self._list + other
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Methods with no side-effects return a list or set directly, not a copy of our hybrid instance. The reason is we assume that as soon as a user explicitly uses it as a list (or set), they will only use it as a list (or set), so no need to continue providing both interfaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. The same object is likely to be accessed by multiple extensions. It is reasonable to assume that some newly updated extensions will assume a set and some older (not yet updated) extensions will assume a list. Whatever solution we chose needs to continue to work with both throughout the life of the Markdown class instance.

Copy link
Contributor Author

@pawamoy pawamoy Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about that, though since these methods return a new object (that's what I meant with "no side-effect", which wasn't super clear), I figured there are very few chances multiple extensions would actually share references to such newly created lists/sets (each new list/set would probably be isolated to the extension that created it). Anyway, I can definitely change that to return a _BlockLevelElements instance, to be more robust 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed that this comment was in a specific method which returns a modified object. Yeah, in this case, returning the same object type makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I wasn't very clear about it. @facelessuser expressed the same concern, so we might want to return a new instance of _BlockLevelElements, just to be safe. Let me know what you prefer.



# Type it as `set[str]` to express our intent for it to be used as such.
BLOCK_LEVEL_ELEMENTS: set[str] = _BlockLevelElements([
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't guarantee I didn't make subtle mistakes, so it should probably need proper testing.

If this all seems too complicated, happy to try something simpler (it was implemented quickly, no problem with discarding all this 😄).

Copy link
Collaborator

@facelessuser facelessuser Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should definitely get tested. Eventually it'll go away as well. There's probably ways in which people can still mess things up, but I think this tries to provide a reasonable attempt to keep stuff from breaking. I'm open to other ideas if someone has them as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'll start writing tests 🙂 Just wanted initial feedback before tackling them 😊

BLOCK_LEVEL_ELEMENTS: list[str] = [
# TODO: Raise errors from list methods in the future.
# Later, remove this class entirely and use a regular set.
class _BlockLevelElements:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere the type is specified as set. However, as this is not a subclass of set, that would be incorrect. Also, later when we remove this, then we will have a set. Perhaps this class should properly reflect that now. Unfortunately, there are no specialized container datatypes for sets in collections. And sets are not a Sequence Type either. Not really sure how best to address this.

Copy link
Contributor Author

@pawamoy pawamoy Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling was that it was OK to lie about the type, so that users actually get type warnings when they use it as a list instead of a set. That's one more (very effective) way of communicating the change. And since we implement all set functionality, the lie is not too big 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but some users run type validation on their code and this could result in bug reports. Either we need to fully document our reasoning for the inconsistency in the code comments or we need to make a change to the code.

Copy link
Contributor Author

@pawamoy pawamoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First batch of tests added. It's pretty basic. A second batch should be added to test lists with duplicate elements within them. I updated the code to try and reduce the possibility of duplicates but it's not thorough.

Comment on lines +250 to +251
self._list.clear()
self._list.extend(sorted(self._set))
Copy link
Contributor Author

@pawamoy pawamoy Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a shortcut, but we should probably maintain the list order, while preventing duplicates.

Comment on lines +294 to +295
self._list.clear()
self._list.extend(sorted(self._set))
Copy link
Contributor Author

@pawamoy pawamoy Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same shortcut, though we should maintain list order by first removing then adding elements, preventing duplicates, probably.

Comment on lines +304 to +305
self._list.clear()
self._list.extend(sorted(self._set))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shortcut again, we should maintain order and only append elements, preventing duplicates.

@@ -42,7 +42,7 @@ class HTMLExtractorExtra(HTMLExtractor):

def __init__(self, md: Markdown, *args, **kwargs):
# All block-level tags.
self.block_level_tags = set(md.block_level_elements.copy())
self.block_level_tags = md.block_level_elements.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed this, but this was already being converted to a set in this extension. Maybe we should keep that behavior rather than leave it as our custom type.

Tests for block level elements.
===============================

Tests specific to the hybrid list/set container for block level elements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment here outlining the deprecation path for this file, specifically noting when it can be removed. We should add a similar comment on the markdown.util._BlockLevelElements class as well which reminds us to also remove these tests when the class is removed from the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-review Needs to be reviewed and/or approved. requires-changes Awaiting updates after a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimizing the raw HTML post-processor
3 participants