A parser generator for parsing expression grammar (PEG) that use inline macros to specify PEG operations.
This is how it looks like:
use peggen::*;
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
}
fn main() {
let json = Parser::<Json>::parse("{x: 1, y: 2}").unwrap();
println!("{json:?}");
}
Help needed! There is so much to do! Contact Me
- Optimizations:
- Rule dispatch: filter rules by the first symbol, instead of trying each of them.
- Thinner tag: currently each tag in internal representation is 3-pointers wide, I want to make them thinner.
- Error Handling:
- Custom final error handlers when custom error capturing fails.
- Documentation:
- Demonstrate features like precedence climbing, error handling, repetition, custom
FromStr
, arean allocation, and left recursion handling.
- Demonstrate features like precedence climbing, error handling, repetition, custom
/ | Conceptual | User Experience | Performance & Code Quality | Error Handling |
---|---|---|---|---|
PEST | PEST only annotates text. Peggen generates AST directly from your text. |
In most cases, you still want rust enum s for your AST, which is directly provided by Peggen, but you have to manually create enums from PEST rules. |
PEST use an optimizer to memorize your grammar rules, and use memorization for better performance; Peggen doesn't use memorization, arguably this gives better performance over memorization for most grammars. | / |
Chumsky | Chumsky provides parser combinators. Peggen is a parser generator. | Both Chumsky and Peggen provides ast directly. However, Peggen supports arena allocation. | Chumsky deallocates successful sub-rules when a rule fails; Peggen uses a internal representation to eliminate deallocation. Besides, Peggen handles left recursion, while in Chumsky left recursion causes in stack overflow. | / |
LALRPOP | Peggen is PEG-based; LALRPOP uses LR(1) grammar. | Peggen is more intuitive to use than LALRPOP; LR(1) grammar is hard to extend and debug. | LALRPOP has better performance over Peggen. | LR(1) grammar can report errors far away from normally percepted cause; Peggen allows you to capture errors from customary cause. |
I roughly tested the peggen on a sample json file against chumsky.
CPU Model: Intel(R) Core(TM) i7-14700HX
Suprisingly, Peggen is faster than Chumsky.
Here are some numbers:
- Peggen : 867913 ns/iter
- Chumsky: 1464737 ns/iter
Before we start this tutorial, let's look at how it looks like after your first try.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
#[rule(r#"\{ [*0: "{0:`[^"]*`}" : {1} , ][?0: "{0:`[^"]*`}" : {1} ] \}"#)]
Obj(RVec<(String, Json)>),
#[rule(r"\[ [*0: {0} , ][?0: {0} ] \]")]
Arr(RVec<Json>)
}
We have a Parser
type that help you use annotated enum
.
use peggen::*;
fn main() {
let json = Parser::<Json>::parse("{x: 1, y: 2}").unwrap();
println!("{json:?}");
}
A rule
attribute looks like the following:
#[rule("...")]
The string ensembles rust formatting string that you use in println!()
. Rust formatting string represents a sequence of chars/structures printed one after another. Peggen formatting string represents a sequence of chars/parsers that eat the input string one after another.
For example, the following statement prints:
- A boolean
false
as the first argument - A token
and
- An integer
19
as the second argument
println!("{0} and {1}", false, 19);
You can write a parser that parses <bool> and <int>
using a rule
attribute. However, Peggen needs to know what kind of string can be recognized as bool
and what kind of string can be recognized as i64
. For this purpose, we can write regular expressions.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
#[rule("{0:`false|true`} and {1:`[0-9]+`}")]
struct BoolAndInt(bool, i64);
Question
What will the following statement print?
println!("{:?}", Parser::<BoolAndInt>::parse("false and 19").unwrap());
Answer
BoolAndInt(false, 19);
An enum
is a collection of rules, during parsing, the rules declared in an enum
is tried one by one until one of them matches.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
}
Question
- How to parse a string with
"
escaped to\"
? - For example:
"\"a string\""
.
Answer
#[rule(r#""{0:`([^"]|\\")*`}""#)]
Str(String)
Question
Given that you can have multiple rules on the same enum variant, what is the alternative way of writing the Bool(bool)
operation?
Answer
#[rule(r#""{0:`false`}"#)]
#[rule(r#""{0:`true`}"#)]
Bool(bool)
TODO