Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make inline parsing extensible #113

Closed
robinst opened this issue Nov 10, 2017 · 8 comments
Closed

Make inline parsing extensible #113

robinst opened this issue Nov 10, 2017 · 8 comments

Comments

@robinst
Copy link
Collaborator

robinst commented Nov 10, 2017

Similar to how custom block parser can be registered today, it would be good to make inline parsing extensible as well. One use would be to improve autolinking, see #112.

The way this could work:

  • When we match a character during inline parsing, the last fallback is to treat it as text. Before we do that, if there are custom inline parsers, we could call each one (with some priority maybe?) and give them the opportunity to handle that.
  • In a custom inline parser, it could parse the input and return/add Nodes. It could also advance the current parsing position. So inline parsing needs to expose some API (or the custom inline parser would return instructions).

To support autolinking, a custom inline parser would also need to be able to get the text before the matching character, probably all the plain text characters up to that point. It would also need a way to detect if any brackets are active or not ("are we inside a link?").

@robinst
Copy link
Collaborator Author

robinst commented Nov 10, 2017

I'm pretty sure I'm not gonna pick this up in the near future. But if someone else wants to have a go at it, I'm willing to mentor. Help wanted :).

@HiuKwok
Copy link

HiuKwok commented Apr 20, 2018

Hi @robinst, I am willing to work on it. However as a newbie of both codebase and Commonmark spec. Just wanna clarify several points first to make sure I m on the same page as you guys.

  • By make inline parsing extendable, do you mean first refactor inlineParserImpl one central class into multiples inlineParserImpl for all inline components that listed on spec. So each of the components impl can have chance to parser the content in a whole line lv instead of how it handle at the moment where InlinePraserImpl.parseInline() ?
  • Also by this mean, interface BlockParser.parserInlines() would have a new signature which take a list of InlineParser as a input?

@HiuKwok
Copy link

HiuKwok commented Apr 25, 2018

Hi everyone, several commit have been made for let Both parser and doc parser class to accept multiple instances of InlineParser.
01f4149
9b28f86
e145fc5

My next thought would be alter InlineParser interface to make it return a parsing progress object which is similar to BlockContinue class, so when block parser attempt to parser it content via lists of InlineParser. Every inlineParser would be able to pick where last InlineParser left up to.

Am I on the right track? Plz advise

@robinst
Copy link
Collaborator Author

robinst commented Apr 26, 2018

Every inlineParser would be able to pick where last InlineParser left up to.

Yes, that's the rough idea (similar to block parsers).

By make inline parsing extendable, do you mean first refactor inlineParserImpl one central class into multiples inlineParserImpl for all inline components that listed on spec. So each of the components impl can have chance to parser the content in a whole line lv instead of how it handle at the moment where InlinePraserImpl.parseInline() ?

Ideally yes.

Also by this mean, interface BlockParser.parserInlines() would have a new signature which take a list of InlineParser as a input?

Not sure about this bit, I think it would still be useful to have a single class for parsing all the inlines. Not sure what to call that though, if we use InlineParser for the thing that knows how to parse a single inline element syntax.. Maybe InlineContentParser?

@HiuKwok
Copy link

HiuKwok commented Apr 26, 2018

@robinst First of all, thx for the reply.

I think it would still be useful to have a single class for parsing all the inlines.

So more like a new parser InlineContentParser created which only know how to parse a specific inline object type?

Let say having a CodeSpanInlineContentParser which implement from InlineContentParser and only responsible to parse Code Span item and nothing more than that. Then return a InlineContinue Object which would let the next InlineContentParser know where to resume the parsing.

While all of above is happening under InlineParser.parse( ) so in one way InlineParser method signature stay unchanged. And the parsing process which be able to split into multiple class which implements InlineContentParser.

@HiuKwok
Copy link

HiuKwok commented Apr 27, 2018

Hi everyone,

I drafted a sample of how a InlineContentParser can be implements (32478ca).

Basically what have done are first created a new InlineContentParse which have a parse() method. Which then implemented by AbstractInlineContentParse.

Then from this point onward a specific inlineContentParser class can be created to parse specific type of inlineItem. For this example, created a class call NewLineBreakParser to handle soft & hard LineBreak parsing which originally handled by InlineParserImpl internally.

@noties
Copy link
Contributor

noties commented Dec 11, 2019

Hello everyone who might be interested in this!

I've been able to achieve the customization of inline parsing in my library recently. I won't say it is perfect but it is fairly simple and implemented in a way so it's relatively easy to keep it upstream. The implementation keeps all current logic and the way inline parsing is done in general but introduces InlineProcessor abstraction. Each InlineProcessor handles unique char and is a direct copy of what can be found in specific methods of InlineParserImpl. You can see the code here

At this point I don't think commonmark-java should jump at having one inline parsing implementation, extensible or not. But having an alternative is always a nice thing. Unfortunately InlineParserImpl is using some classes that are in the internal package (although they are public and can be accessed) which makes it questionable for replication. They are:

import org.commonmark.internal.Bracket;
import org.commonmark.internal.Delimiter;
import org.commonmark.internal.ReferenceParser;
import org.commonmark.internal.util.Escaping;
import org.commonmark.internal.util.Html5Entities;
import org.commonmark.internal.util.Parsing;
import org.commonmark.internal.inline.AsteriskDelimiterProcessor;
import org.commonmark.internal.inline.UnderscoreDelimiterProcessor;

There is also the StaggeredDelimiterProcessor that is not available outside of its package which forces possible replicators of InlineParserImpl to copy the source code.

So, I think that it would be a good thing to move mentioned above classes to public surface. At least it would make possible custom implementations of InlineParser

Any feedback on my solution is welcome and if someone finds it helpful I can convert it to a Java module (currently it is published as an Android library)

@robinst
Copy link
Collaborator Author

robinst commented Sep 18, 2024

Support for custom inline content parsers has been released in 0.23.0 now, see customInlineContentParserFactory in Parser.Builder. Full changelog: /~https://github.com/commonmark/commonmark-java/blob/main/CHANGELOG.md#0230---2024-09-16

@robinst robinst closed this as completed Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants