-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON rewrite #1029
JSON rewrite #1029
Conversation
miparnisari
commented
Nov 3, 2018
Fixes #819 |
I hope one of the members of this project finds the time to review this |
@miparnisari I'm sorry this has been outstanding for so long :( I'm working my way through the backlog and hope to get to this soon! |
For the json lexer in particular, performance on large inputs is important. Would you mind testing with larger data? I believe if you search the issues for "json" you will find a few issues that prompted us to make the lexer extremely simple. Also @pyrmont for avoiding future performance issues I wonder if it might be useful to have some kind of |
I had a look through the issues and while I think I found the ones that @jneen is referring to (see #41, #114, #244), they don't actually include an example of the JSON being tested. I made one using Mockaroo and have put it in a gist. It's not as large as the 200KB one referenced in #41 but you could always just double it if a larger file is desired. I'm personally curious what the speed difference is with newer versions of Ruby and will try to give it a spin a bit later. |
@jneen Yeah, a |
Now that I look closely, I think #525 is relevant (the actual issue there was with the JSON lexer) |
the offending case was a string that contained a large number of backslashes: /~https://github.com/jneen/rouge/files/410135/bad_note.txt |
Can you please edit the README to include instructions on how to measure performance of two lexers? Or even in the PR template. |
@miparnisari wrote:
I don't think generally we'd expect submissions to include speedtests; this felt like a special case because of: (1) the prevalence of JSON; and (2) the fact JSON files might realistically be quite large. That said, providing some 'official' support for benchmarking doesn't sound like a bad idea. I was planning on using the Shootout library mentioned in #114 and will try to see how simple it can be made to test different versions of Rouge (it's really designed to benchmark Rouge against different syntax highlighting libraries which isn't quite what I'm concerned with). To compare versions of Rouge, I was anticipating switching between git branches and then running the tests. Oh, and as a bonus, I realised the 200KB JSON file is in there. |
Agree. I know my company has this logic in place - if the file is too big, it bails out and renders it as plain text.
Agree. It could be an optional input of the PR.
I tried the 200 kb file. When I use |
I got Shootout working and ran it against both this PR and the current master.
There's definitely a slow down: terminal is 14% slower and HTML is 17% slower. That's not great but I'm inclined to think this is acceptable given the fact this has been requested multiple times and does seem like an overall usability win. What do you think @jneen? |
The code for rouge.jneen.net (in this same org) contains a class that can be used to run different versions of rouge in the same runtime, which might be helpful. |
@miparnisari Thanks for keeping this up to date. I plan to have a look tonight at the case @jneen linked to earlier. I never did actually check that. |
@miparnisari Sorry, I didn't mean to submit two commits in quick succession but as I was writing a message to explain why the code didn't support highlighting what I'm calling 'bare' key-value pairs for performance reasons, I realised there was a way to do it that didn't involve excess backtracking. In summary, these commits makes several changes to the proposed lexer:
Hopefully that all makes sense. I'll try to add some more examples tonight to the visual sample to demonstrate how this works. |
@miparnisari My changes aren't working for subsequent bare key-value pairs. Need to have more of a think about this one. |
This commit makes several changes to the proposed lexer. First, it changes the tokens used in JavaScript objects. It was proposed that keys be tokenised as `Str::Double` and that string values be tokenised as `Name::Tag`. The token `Name::Label` seems a more appropriate token for keys while `Str::Double` should be used for string values. Second, it remoeves the tokenisation of escape sequences inside object keys. While this is correct insofar as the keys are merely strings, it fits awkwardly with the tokenisation of keys with a non-string token. This commit tokenises all characters within the key as `Name::Label`. Third, it adds support for 'bare' key-value pairs, that is text of the form <string>: <value>. These 'bare' key-value pairs are not tokenised the same way as they would be within objects. Rather, keys are tokenised as ordinary strings. This preserves backwards compatability with how the JSON lexer previously worked.
This commit provides support for tokenising bare keys as `Name::Label` without introducing backtracking for non-key strings.
@miparnisari Sorry for overwriting your repository. It was the only way I could see to get the branch caught up to the latest version. I needed to do that because the HOCON lexer (which did not exist at the time you created this branch) subclasses the JSON lexer. Since we're pretty radically changing how it works, that broke that lexer. It's all fixed now and looks to be making all those keys look nice and pretty :) |