-
Notifications
You must be signed in to change notification settings - Fork 870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize raw HTML post-processor #1510
base: master
Are you sure you want to change the base?
Optimize raw HTML post-processor #1510
Conversation
6113aad
to
fc9acc0
Compare
Hmm, the list->set change could be seen as breaking. We can instead create a new |
This comment was marked as resolved.
This comment was marked as resolved.
markdown/postprocessors.py
Outdated
else: | ||
key = m.group(2) | ||
wrapped = False | ||
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use html_counter
instead:
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks): | |
if (key := int(key)) >= self.md.htmlStash.html_counter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason to recompute the length when we already have it stored.
What's not clear to me is why we are reassigning key
here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, just an oversight, we can definitely coerce to int when first assigning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I remember now, that's because I'm doing if key := m.group(1)
, and casting to int
here would fail when m.group(1)
is None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good generally. However, I haven't tested it or done any performance comparisons of my own. From a quick reading of the code only the few things you pointed out stand out to me. I'll need to take a closer look when I have more time.
markdown/postprocessors.py
Outdated
else: | ||
key = m.group(2) | ||
wrapped = False | ||
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason to recompute the length when we already have it stored.
What's not clear to me is why we are reassigning key
here. I understand the need to convert a string to an integer, but why do that here? Why not when it is initially assigned on lines 76 and/or 79?
I'll likely try this branch on some projects to inspect for regressions, just to be sure |
This is a valid concern. I'm not sure what to think about it. |
Yes, I saw the change of blocks using a list to set and meant to comment on it. I'm pretty sure it'll break some of my plugins, so keeping the original but deprecating it in favor of a new structure is probably the way to go. |
ce408be
to
0841396
Compare
Thanks for confirming @facelessuser! I'll change it. |
0841396
to
e05cce7
Compare
I just added a new private |
markdown/core.py
Outdated
@@ -114,6 +114,7 @@ def __init__(self, **kwargs): | |||
""" List of characters which get the backslash escape treatment. """ | |||
|
|||
self.block_level_elements: list[str] = BLOCK_LEVEL_ELEMENTS.copy() | |||
self._block_level_elements: set[str] = set(BLOCK_LEVEL_ELEMENTS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a non-private version of the sets as I imagine we can and should deprecate the old one and eventually remove it in the future. I'm likely to begin referencing the new one as soon as I can, I just don't want it to break things when it releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Lets make it public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one other concern. Some extensions are likely to alter the list (by adding or removing elements). Therefore, the new attribute could get out-of-sync with the pre-existing attribute. How should we address that?
One possible solution is to create a custom class which subclasses set
and also provides the common list
methods (which should issue a deprecation warning when called). Then there are not two sequences to keep in sync and both the old and new APIs continue to work. This also removes the need to come up with a new attribute name on the Markdown
class. At some future point the list
methods could instead raise an error when called. Then at a later point, the custom class could be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! list
has the following methods in common with set
:
copy
:nothing to domake sure to return a copy of the current instance (not a raw set)pop
: thelist
method accepts anindex
argument, and raisesIndexError
instead ofKeyError
. We can catch KeyError and re-raise IndexError (and emit a warning?) during the deprecation period. As forindex
, I think we should call set'spop
but emit a warning that it pops an arbitrary item.remove
: both accept a single positional argument, butlist
raisesValueError
instead ofKeyError
. We can catch and re-raise again (and emit a warning?) during the deprecation period.clear
: nothing to do
...and the following additional methods:
append
: easy, callself.add
sort
: doesn't make sense for a set, so justpass
reverse
: doesn't make sense for a set, so justpass
insert
: easy, callself.add
index
: doesn't make sense for a set, return0
?extend
: easy, callself.update
count
: 1 if value in self else 0
I'll proceed with this, let me know if you'd like to change a few things 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks good with two exceptions.
- A common action would be to remove items. I imagine code exists which finds the index of an object and removes or pops by index.
set.remove
andset.pop
accept the object as an argument rather than the index. We could return the object fromindex
so that it works when passed intoremove
orpop
, but that seems weird in all other respects. Not sure how best to approach this. - Presumably
count
could returnself.__len__()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, something like a plain class with two hidden structs (list and set), with all accompanying methods?
class _BlockLevelElements:
def __init__(self, elements: list[str]) -> None:
self._list = elements.copy()
self._set = set(self.list)
def __contains__(self, value):
return value in self._set
def append(...): ...
def add(...): ...
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed something in d4813dc, let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pawamoy I think something like that may work, it needs to be tested well though. I saw in your link that things like __add__
only add to the list but don't add to the set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, it is creating a new object and you are returning a pure list but providing a warning. Things may break if they set it back, but I understand what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a set allows for better performances when checking for membership of a tag within block level elements. Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash. It would then apply a regular expression substitution using these replacements. Finally, if the text changed, it would recurse, and do all that again. This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway. This change moves the recursion into the regular expression substitution, so that: 1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames); 2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text. The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag). Issue-1507: Python-Markdown#1507
e05cce7
to
1bc8c54
Compare
warnings.warn( | ||
"Using block level elements as a list is deprecated, use it as a set instead.", | ||
DeprecationWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I emit the same warning every time, so that users only get it once. The message goes straight to the point: don't use it as a list.
DeprecationWarning, | ||
) | ||
# Using `+` means user expects a list back. | ||
return self._list + other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methods with no side-effects return a list
or set
directly, not a copy of our hybrid instance. The reason is we assume that as soon as a user explicitly uses it as a list (or set), they will only use it as a list (or set), so no need to continue providing both interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily. The same object is likely to be accessed by multiple extensions. It is reasonable to assume that some newly updated extensions will assume a set and some older (not yet updated) extensions will assume a list. Whatever solution we chose needs to continue to work with both throughout the life of the Markdown
class instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I thought about that, though since these methods return a new object (that's what I meant with "no side-effect", which wasn't super clear), I figured there are very few chances multiple extensions would actually share references to such newly created lists/sets (each new list/set would probably be isolated to the extension that created it). Anyway, I can definitely change that to return a _BlockLevelElements
instance, to be more robust 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed that this comment was in a specific method which returns a modified object. Yeah, in this case, returning the same object type makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I wasn't very clear about it. @facelessuser expressed the same concern, so we might want to return a new instance of _BlockLevelElements
, just to be safe. Let me know what you prefer.
|
||
|
||
# Type it as `set[str]` to express our intent for it to be used as such. | ||
BLOCK_LEVEL_ELEMENTS: set[str] = _BlockLevelElements([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't guarantee I didn't make subtle mistakes, so it should probably need proper testing.
If this all seems too complicated, happy to try something simpler (it was implemented quickly, no problem with discarding all this 😄).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should definitely get tested. Eventually it'll go away as well. There's probably ways in which people can still mess things up, but I think this tries to provide a reasonable attempt to keep stuff from breaking. I'm open to other ideas if someone has them as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'll start writing tests 🙂 Just wanted initial feedback before tackling them 😊
BLOCK_LEVEL_ELEMENTS: list[str] = [ | ||
# TODO: Raise errors from list methods in the future. | ||
# Later, remove this class entirely and use a regular set. | ||
class _BlockLevelElements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere the type is specified as set
. However, as this is not a subclass of set, that would be incorrect. Also, later when we remove this, then we will have a set
. Perhaps this class should properly reflect that now. Unfortunately, there are no specialized container datatypes for sets in collections. And sets are not a Sequence Type either. Not really sure how best to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling was that it was OK to lie about the type, so that users actually get type warnings when they use it as a list instead of a set. That's one more (very effective) way of communicating the change. And since we implement all set functionality, the lie is not too big 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but some users run type validation on their code and this could result in bug reports. Either we need to fully document our reasoning for the inconsistency in the code comments or we need to make a change to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First batch of tests added. It's pretty basic. A second batch should be added to test lists with duplicate elements within them. I updated the code to try and reduce the possibility of duplicates but it's not thorough.
self._list.clear() | ||
self._list.extend(sorted(self._set)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a shortcut, but we should probably maintain the list order, while preventing duplicates.
self._list.clear() | ||
self._list.extend(sorted(self._set)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same shortcut, though we should maintain list order by first removing then adding elements, preventing duplicates, probably.
self._list.clear() | ||
self._list.extend(sorted(self._set)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shortcut again, we should maintain order and only append elements, preventing duplicates.
@@ -42,7 +42,7 @@ class HTMLExtractorExtra(HTMLExtractor): | |||
|
|||
def __init__(self, md: Markdown, *args, **kwargs): | |||
# All block-level tags. | |||
self.block_level_tags = set(md.block_level_elements.copy()) | |||
self.block_level_tags = md.block_level_elements.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed this, but this was already being converted to a set in this extension. Maybe we should keep that behavior rather than leave it as our custom type.
Tests for block level elements. | ||
=============================== | ||
|
||
Tests specific to the hybrid list/set container for block level elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment here outlining the deprecation path for this file, specifically noting when it can be removed. We should add a similar comment on the markdown.util._BlockLevelElements
class as well which reminds us to also remove these tests when the class is removed from the code.
Closes #1507