From df73f3aef24da597d708d764e79b99f428e2658d Mon Sep 17 00:00:00 2001 From: Michael Camilleri Date: Sat, 22 Jun 2019 05:32:52 +0900 Subject: [PATCH] Correct lexer development guide (#1145) This commit fixes mistakes in the lexer development guide. --- .yardopts | 2 +- docs/LexerDevelopment.md | 221 +++++++++++++++++++++++++-------------- 2 files changed, 143 insertions(+), 80 deletions(-) diff --git a/.yardopts b/.yardopts index 8b38633e6c..248dec3afb 100644 --- a/.yardopts +++ b/.yardopts @@ -2,4 +2,4 @@ --protected --markup-provider=redcarpet --markup=markdown -- docs/LexerDevelopment.md +- docs/*.* diff --git a/docs/LexerDevelopment.md b/docs/LexerDevelopment.md index 4891ee7a0d..4ad9e79351 100644 --- a/docs/LexerDevelopment.md +++ b/docs/LexerDevelopment.md @@ -22,25 +22,31 @@ The remainder of this document explains how to develop a lexer for Rouge. ## Getting Started -This guide assumes a familiarity with git. If you're new to git, GitHub has -[documentation][gh-git] that will help get you started. +### Development Environment -[gh-git]: https://help.github.com/en/articles/git-and-github-learning-resources +To develop a lexer, you need to have set up a development environment. If you +haven't done that yet, we've got {file:docs/DevEnvironment.md a guide} that can +help. -Rouge automatically loads lexers saved in the `lib/rouge/lexers/` directory and -so if you're submitting a new lexer, that's the right place to put it. +The rest of this guide assumes that you have set up such an environment and, +importantly, that you have installed the gems on which Rouge depends to a +directory within the repository (we recommend `vendor/`). -Your lexer needs to be a subclass of the {Rouge::Lexer Lexer} abstract class. -Most lexers are in fact subclassed from {Rouge::RegexLexer RegexLexer} as the -simplest way to define the states of a lexer is to use rules consisting of -regular expressions. 
The remainder of this guide assumes your lexer is -subclassed from {Rouge::RegexLexer RegexLexer}. +### File Location -You can learn a lot by reading through some of the existing lexers. A good -example that's not too long is [the JSON lexer][json-lexer]. +Rouge automatically loads lexers saved in the `lib/rouge/lexers/` directory and +so if you're submitting a new lexer, that's the right place to put it. The +filename should match the name of your lexer, with the Ruby filename extension +`.rb` appended. If the name of your language is `Example`, the lexer would be +saved as `lib/rouge/lexers/example.rb`. -[json-lexer]: -/~https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/json.rb +### Subclassing `RegexLexer` + +Your lexer needs to be a subclass of the {Rouge::Lexer} abstract class. Most +lexers are in fact subclassed from {Rouge::RegexLexer} as the simplest way to +define the states of a lexer is to use rules consisting of regular expressions. +The remainder of this guide assumes your lexer is subclassed from +{Rouge::RegexLexer}. ## How to Structure @@ -53,8 +59,12 @@ Basically, a lexer consists of two parts: There are some additional features that a lexer can implement and we'll cover those at the end. -Now, using [the JSON lexer][json-lexer] as an example, let's look at each of -these parts in turn. +For the remainder of this guide, we'll use [the JSON lexer][json-lexer] as an +example. The lexer is relatively simple and is for a language with which many +people will at least have some level of familiarity. + +[json-lexer]: +/~https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/json.rb ### Lexer Properties @@ -68,13 +78,12 @@ To be usable by Rouge, a lexer should declare a **title**, a **description**, a title "JSON" ``` -The title of the lexer. It is declared using the {Rouge::Lexer.title -Lexer.title} method. +The title of the lexer. It is declared using the {Rouge::Lexer.title} method. 
-Note: As a subclass of {Rouge::RegexLexer RegexLexer}, the JSON lexer inherits -this method (and its inherited methods) into its namespace and can call those -methods without needing to prefix each with `Rouge::Lexer`. This is the case -with all of the property defining methods. +Note: As a subclass of {Rouge::RegexLexer}, the JSON lexer inherits this method +(and its inherited methods) into its namespace and can call those methods +without needing to prefix each with `Rouge::Lexer`. This is the case with all +of the property defining methods. #### Description @@ -82,8 +91,8 @@ with all of the property defining methods. desc "JavaScript Object Notation (json.org)" ``` -The description of the lexer. It is declared using the {Rouge::Lexer.desc -Lexer.desc} method. +The description of the lexer. It is declared using the {Rouge::Lexer.desc} +method. #### Tag @@ -91,8 +100,8 @@ Lexer.desc} method. tag "json" ``` -The tag associated with the lexer. It is declared using the {Rouge::Lexer.tag -Lexer.tag} method. +The tag associated with the lexer. It is declared using the {Rouge::Lexer.tag} +method. A tag provides a way to specify the lexer that should apply to text within a given code block. In various flavours of Markdown, it's used after the opening @@ -110,8 +119,8 @@ /~https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/ruby.rb #### Aliases The aliases associated with a lexer. These are declared using the -{Rouge::Lexer.aliases Lexer.aliases} method. Aliases are alternative ways that -the lexer can be identified. +{Rouge::Lexer.aliases} method. Aliases are alternative ways that the lexer can +be identified. The JSON lexer does not define any aliases but [the Ruby one][ruby-lexer] does. We can see how it could be used by looking at another example in Markdown. This @@ -129,7 +138,7 @@ filenames "*.json" ``` The filename(s) associated with a lexer. These are declared using the -{Rouge::Lexer.filenames Lexer.filenames} method. 
+{Rouge::Lexer.filenames} method. Filenames are declared as "globs" that will match a particular pattern. A "glob" may be merely the specific name of a file (eg. `Rakefile`) or it could @@ -142,25 +151,25 @@ mimetypes "application/json", "application/vnd.api+json", "application/hal+json" ``` The mimetype(s) associated with a lexer. These are declared using the -{Rouge::Lexer.mimetypes Lexer.mimetypes} method. +{Rouge::Lexer.mimetypes} method. ### Lexer States The other major element of a lexer is the collection of one or more states. -For lexers that subclass {Rouge::RegexLexer RegexLexer}, a state will consist +For lexers that subclass {Rouge::RegexLexer}, a state will consist of one or more rules with a rule consisting of a regular expression and an action. The action yields tokens and manipulates the _state stack_. #### The State Stack -The state stack represents the series of states through which the lexer has -passed. States are added and removed from the "top" of the stack. The oldest -state is on the bottom of the stack and the newest state is on the top. +The state stack represents an ordered sequence of states the lexer is currently +processing. States are added and removed from the "top" of the stack. The +oldest state is on the bottom of the stack and the newest state is on the top. The initial (and therefore bottommost) state is the `:root` state. The lexer works by looking at the rules that are in the state that is on top of the stack. These are tried _in order_ until a match is found. At this point, the -action defined in the rule is run, the match is removed from the input stream +action defined in the rule is run, the head of the input stream is advanced and the process is repeated with the state that is now on top of the stack. Now that we've explained the concepts, let's look at how you actually define @@ -174,14 +183,14 @@ state :root do end ``` -A state is defined using the {Rouge::RegexLexer.state RegexLexer.state} method. 
+A state is defined using the {Rouge::RegexLexer.state} method. The method consists of the name of the state as a `Symbol` and a block specifying the rules that Rouge will try to match as it parses the text. #### Rules -A rule is defined using the {Rouge::RegexLexer::StateDSL#rule StateDSL#rule} -method. The `rule` method can define either "simple" rules or "complex" rules. +A rule is defined using the {Rouge::RegexLexer::StateDSL#rule} method. The +`rule` method can define either "simple" rules or "complex" rules. *Simple Rules* @@ -232,9 +241,9 @@ The block called can take one argument, usually written as `m`, that contains the regular expression match object. These kind of rules allow for more fine-grained control of the state stack. -Inside a complex rule's block, it's possible to {Rouge::RegexLexer#push push}, -{Rouge::RegexLexer#pop! pop}, {Rouge::RegexLexer#token yield a token} and -{Rouge::RegexLexer#delegate delegate to another lexer}. +Inside a complex rule's block, it's possible to call {Rouge::RegexLexer#push}, +{Rouge::RegexLexer#pop!}, {Rouge::RegexLexer#token} and +{Rouge::RegexLexer#delegate}. You can see an example of these more complex rules in [the Ruby lexer][ruby-lexer]. @@ -254,21 +263,24 @@ end Rouge will attempt to guess the appropriate lexer if it is not otherwise clear. If Rouge is unable to do this on the basis of any tag, associated filename or -associated mimetype, it will try to detect the appopriate lexer on the basis of +associated mimetype, it will try to detect the appropriate lexer on the basis of the text itself (the source). This is done by calling `self.detect?` on the -possible lexer (a default `self.detect?` method is defined in {Rouge::Lexer -Lexer} and simply returns `false`). +possible lexer (a default `self.detect?` method is defined in {Rouge::Lexer} +and simply returns `false`). + +A lexer can implement its own `self.detect?` method that takes a +{Rouge::TextAnalyzer} object as a parameter. 
If the `self.detect?` method +returns true, the lexer will be selected as the appropriate lexer. -A lexer can implement its own `self.detect?` method that takes as a parameter a -{Rouge::TextAnalyzer TextAnalyzer} object. If the `self.detect?` method returns -true, the lexer will be selected as the appropriate lexer. +It is important to note that `self.detect?` should _only_ return `true` if it +is 100% sure that the language is detected. The most common ways for source +code to identify the language it's written in are with a shebang or a doctype +and Rouge provides the {Rouge::TextAnalyzer#shebang} method and the +{Rouge::TextAnalyzer#doctype} method specifically for use with `self.detect?` +to make these checks easy to perform. -The `self.detect?` method is intended to work by looking at the shebang or -doctype that identifies a piece of text. To make this easier, Rouge provides -the {Rouge::TextAnalyzer#shebang TextAnalyzer#shebang} method and the -{Rouge::TextAnalyzer#doctype TextAnalyzer#doctype} method. For more general -disambiguation between different lexers, see [Conflicting Filename -Globs][conflict-globs] below. +For more general disambiguation between different lexers, see [Conflicting +Filename Globs][conflict-globs] below. [conflict-globs]: #Conflicting_Filename_Globs @@ -280,7 +292,7 @@ for these words easier, many lexers will put the applicable keywords in an array and make them available in a particular way (be it as a local variable, an instance variable or what have you). -We recommend lexers use a class method: +For performance and safety, we strongly recommend lexers use a class method: ```rb module Rouge @@ -289,7 +301,7 @@ module Rouge ... def self.keywords - @keywords ||= %w(key words used in this language) + @keywords ||= Set.new %w(key words used in this language) end ... 
@@ -297,10 +309,29 @@ module Rouge end ``` -These keywords can then be included in a regular expression like so: +These keywords can then be used like so: ```rb -rule /(#{keywords.join('|')})\b/, Keyword +rule /\w+/ do |m| + if self.class.keywords.include?(m[0]) + token Keyword + else + token Name + end +end +``` + +In some cases, you may want to interpolate your keywords into a regular +expression. **We strongly recommend you avoid doing this.** Having a large +number of rules that are searching for particular words is not as performant as +a rule with a generic pattern with a block that checks whether the match is a +member of a predefined set and assigns tokens, pushes new states, etc. + +If you do need to use interpolation, be careful to use the `\b` anchor to avoid +inadvertently matching part of a longer word (eg. `if` matching `iff`): + +```rb +rule /\b(#{keywords.join('|')})\b/, Keyword ``` #### Startup @@ -312,19 +343,17 @@ start do end ``` -The {Rouge::RegexLexer.start RegexLexer.start} method can take a block that -will be called when the lexer commences lexing. This provides a way to enter -into a special state "before" entering into the `:root` state (the `:root` -state is still the bottommost state in the state stack; the state pushed by -`start` sits "on top" but is the state in which the lexer begins. +The {Rouge::RegexLexer.start} method can take a block that will be called when +the lexer commences lexing. This provides a way to enter into a special state +"before" entering into the `:root` state (the `:root` state is still the +bottommost state in the state stack; the state pushed by `start` sits "on top" +but is the state in which the lexer begins). Why would you want to do this? In some languages, there may be language -structures that can appear at the beginning of a file. {Rouge::RegexLexer.start -RegexLexer.start} provides a way to parse these structures. An example is a -preprocessor directive in C. 
You can see how these are lexed in [the C -lexer][c-lexer]. - -[c-lexer]: /~https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/c.rb +structures that can appear at the beginning of a file. +{Rouge::RegexLexer.start} provides a way to parse these structures without +needing a special rule in your `:root` state that has to keep track of whether +you are processing things for the first time. ### Subclassing @@ -340,13 +369,12 @@ lexer][cpp-lexer] and [the JSX lexer][jsx-lexer] for examples. #### Conflicting Filename Globs If two or more lexers define the same filename glob, this will cause an -{Rouge::Guesser::Ambiguous Ambiguous} error to be raised by certain guessing -methods (including the one used by the `assert_guess` method used in your -spec). +{Rouge::Guesser::Ambiguous} error to be raised by certain guessing methods +(including the one used by the `assert_guess` method used in your spec). The solution to this is to define a disambiguation procedure in the -{Rouge::Guessers::Disambiguation Disambiguation} class. Here's the procedure -for the `*.pl` filename glob as an example: +{Rouge::Guessers::Disambiguation} class. Here's the procedure for the `*.pl` +filename glob as an example: ```rb disambiguate "*.pl" do @@ -369,13 +397,17 @@ end ``` ## How to Test -When submitting a lexer, it is important to include files to test it. There are -three files that should be included: +When developing a lexer, it is important to have ways to test it. Rouge provides +support for three types of test files: 1. a **spec** that will run as part of Rouge's test suite; 2. a **demo** that will be tested as part of Rouge's test suite; and; 3. a **visual sample** of the various language constructs. +When you submit a lexer, you must also include these test files. + +Before we look at how to run these tests, let's look at the files themselves. + ### Specs A spec is a list of expectations that are tested as part of the test suite. @@ -391,6 +423,8 @@ guessing algorithm. 
In particular, you should check: * the associated mimetypes; and * the associated sources (if any). +Your spec must be saved to `spec/lexers/_spec.rb`. + #### Filenames ```rb @@ -431,18 +465,43 @@ returns true should be tested. The demo file is tested automatically as part of Rouge's test suite. The file should be able to be parsed without producing any `Error` tokens. +The demo is also used on [rouge.jneen.net][hp] as the default text to display +when a lexer is chosen. It should be short (fewer than 20 lines if possible). + +[hp]: http://rouge.jneen.net/ + +Your demo must be saved to `lib/rouge/demos/`. Please note +that there is no file extension. + ### Visual Samples -While the visual sample is tested by the testing suite to ensure that it does -not raise any errors, the primary purpose is to allow the user to quickly scan -through a large sample of text in a particular language and make sure that the -highlighting looks correct. +A visual sample is a file that includes a representative sample of the syntax of +your language. The sample should be long enough to reasonably demonstrate the +correct lexing of the language but does not need to offer complete coverage. +While it can be tempting to copy and paste code found online, please refrain +from doing this. If you need to copy code, indicate in a comment (using the +appropriate syntax for your lexer's language) the source of the code. Avoid +including code that is duplicative of the other code in the sample. If you are adding or fixing rules in the lexer, please add some examples of the expressions that will be highlighted differently to the visual sample if they're not already present. This greatly assists in reviewing your lexer submission. +Your visual sample must be saved to `spec/visual/sample/`. +As with the demo file, there is no file extension. + +### Running the Tests + +The spec and the demo can be run using the `rake` command. You can run this by +typing `bundle exec rake` at the command line. 
If everything works, you should +see a series of dots. If you have an error, this will appear here, too. + +To see your visual sample, launch Rouge's visual test app by running `bundle +exec rackup`. You can then choose your sample from the complete list that the +app displays in your browser. + ## How to Submit So you've developed a lexer (or fixed an existing one)—that's great! The basic @@ -463,3 +522,7 @@ documentation][gh-pr] that will help you get accustomed to the workflow. [gh-pr]: https://help.github.com/en/articles/about-pull-requests We're looking forward to seeing your code! + +You can learn a lot by reading through some of the existing lexers. A good +example that's not too long is [the JSON lexer][json-lexer]. +
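The keyword recommendation in the patch above (match one generic word pattern, then check membership in a `Set`) can be sketched in plain Ruby without Rouge. This is only an illustration under stated assumptions: `KEYWORDS`, `tokenize` and the token type symbols are hypothetical names, and `StringScanner` stands in for Rouge's rule matching rather than reproducing it.

```rb
require 'set'
require 'strscan'

# Hypothetical keyword list for an imaginary language; not taken from Rouge.
KEYWORDS = Set.new(%w(if else while return)).freeze

# Scan +source+ and return [type, text] pairs. A single generic word pattern
# is matched and then classified by Set membership, instead of interpolating
# a keyword alternation into the regular expression.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if (word = scanner.scan(/\w+/))
      type = KEYWORDS.include?(word) ? :keyword : :name
      tokens << [type, word]
    elsif (space = scanner.scan(/\s+/))
      tokens << [:text, space]
    else
      # Anything else is consumed one character at a time as an error token.
      tokens << [:error, scanner.getch]
    end
  end
  tokens
end

p tokenize("if iff")
# prints [[:keyword, "if"], [:text, " "], [:name, "iff"]]
```

Because `iff` is matched whole by `\w+` before the lookup, it cannot be mistaken for the keyword `if`, which is the pitfall the `\b` anchor guards against in the interpolated form.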