Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow lexers to use custom source types #222

Closed
Kixiron opened this issue Sep 1, 2021 · 13 comments
Closed

Allow lexers to use custom source types #222

Kixiron opened this issue Sep 1, 2021 · 13 comments

Comments

@Kixiron
Copy link

Kixiron commented Sep 1, 2021

I'm currently using ropey in my LSP where logos is the backing lexer, and I'd like to be able to use a Rope natively within my generated lexer. The Source trait looks perfect, except there's no way I can find to be able to use arbitrary implements with logos::Lexer

@maciejhirsz
Copy link
Owner

Fixed in #225.

@kmicklas
Copy link
Contributor

Unless I'm misunderstanding the Source API, this still isn't possible for non-contiguous types like ropes.

#225 (comment) notes that it might be possible using Ropey's chunks API, but you don't get to choose the size of those. Source::read lets the caller request any sized contiguous Chunk. The docs seem to imply that read should return None only when hitting EOF. Is that not accurate? Can we return None until a small enough Chunk is requested?

@jeertmans
Copy link
Collaborator

@kmicklas a more recent issue #324 discussed the problem of reader by chunks, which I assume Rope to be the same. Maybe it is worth moving your question there to centralize things?

I don’t know how much related it is

@kmicklas
Copy link
Contributor

@jeertmans I think the concern there is a bit different since rope types should generally have no problems with random access.

@kmicklas
Copy link
Contributor

I put together a basic crate demoing the possibility of using Ropey and Logos together: /~https://github.com/kmicklas/logos-ropey

However, I can't publish it on crates.io until #332 and #333 are released.

@jeertmans
Copy link
Collaborator

Oh that's pretty nice @kmicklas! I think Ropey's Rope is so important that it might be worth to implement that directly within the logos crate, to avoid creating an intermediate struct, and hide this behind some feature like ropey.

What do you think?

@kmicklas
Copy link
Contributor

Oh that's pretty nice @kmicklas! I think Ropey's Rope is so important that it might be worth to implement that directly within the logos crate, to avoid creating an intermediate struct, and hide this behind some feature like ropey.

That would be nice! I'm happy to make a PR for that if you think it's a better approach.

@jeertmans
Copy link
Collaborator

Super! I just merged the two PRs :)

@kmicklas
Copy link
Contributor

Okay, I did some testing, and #222 (comment) is definitely a problem. (It's not actually totally trivial to trigger a chunk size greater than 1.)

To support non-contiguous sources like ropes, I can think of three possibilities:

  • We can alter the generated algorithm to try smaller chunk sizes when a larger one doesn't work. This would likely have some performance impact on contiguous sources, although maybe we can get around it by not ever using None from read to indicate EOF, and instead check EOF ahead of time so that None always means we're just handling non-contiguity.
  • We can alter the read/Chunk API so that Chunk is an owned value (i.e. [u8; N]) instead of a view into (assumed) contiguous bytes from the input (&'src [u8; N] as it is now). My guess is that N is usually small enough for this to be okay. For example the tests don't trigger an N bigger than 8, which would probably be more efficient to put directly in a word with no indirection anyway. However given a pathological grammar it could involve an arbitrary amount of copying.
  • We can invert the control of chunk reading to take a closure which gets a local reference to a contiguous chunk. This would allow a rope source to stack allocate a contiguous chunk to copy data into if necessary, but use the underlying data without copying in the common case. This seems like the most general solution that doesn't harm the string case, but I don't understand the generated code well enough to understand if this will easily work.

@kmicklas
Copy link
Contributor

kmicklas commented Aug 19, 2023

I thought of another option that might work well and be easy to implement. Instead of Chunk being a trait provided by the caller, it can be a GAT defined by the source:

type Chunk<'a, const N: usize>: Into<&'a [u8; N]>
    where Self: 'a

The contiguous string sources would just define this as &'a [u8; N]. Rope-like sources could use something like Either<&'a [u8; N], [u8; N]>. (It's like Cow but that won't work here because of ToOwned.)

This avoids any branching or copying in the string case and lets the rope case use borrowed data when possible.

@jeertmans
Copy link
Collaborator

Nice finding @kmicklas! I think the best is to test that out, and include some benchmarks if possible.

If we can change the current traits without degrading performances on &str and &[u8], that would be super!

@flyingmutant
Copy link

Is there any particular direction your decided to pursue to make logos work with non-contiguous types, any area where it is possible to help? I am interested in using rope or gap buffer source as well.

@jeertmans
Copy link
Collaborator

Hello @flyingmutant, well almost nothing was done on that topic, but you can surely help :-)

There is also a list of "future work" that @maciejhirsz mentioned in the last release: /~https://github.com/maciejhirsz/logos/releases/tag/v0.13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants