Add support for case insensitivity #198

Merged
merged 6 commits into from
Feb 1, 2021
47 changes: 35 additions & 12 deletions logos-derive/src/lib.rs
@@ -132,14 +132,35 @@ pub fn logos(input: TokenStream) -> TokenStream {
}
};

let bytes = definition.literal.to_bytes();
let then = graph.push(
leaf(definition.literal.span())
.priority(definition.priority.unwrap_or(bytes.len() * 2))
.callback(definition.callback),
);

ropes.push(Rope::new(bytes, then));
if definition.ignore_flags.is_empty() {
let bytes = definition.literal.to_bytes();
let then = graph.push(
leaf(definition.literal.span())
.priority(definition.priority.unwrap_or(bytes.len() * 2))
.callback(definition.callback),
);

ropes.push(Rope::new(bytes, then));
} else {
let mir = definition
.literal
.escape_regex()
Owner

Is this necessary?

Contributor Author

@gymore-io, Feb 1, 2021

Since I'm calling .to_mir right after, we need to make sure the literal is matched as a plain string and is not interpreted with regex semantics.

For example, #[token("[Some Header]", ignore(case))] needs to be escaped so that it becomes Mir::Concat('[', 'S', ... , 'r', ']') rather than Mir::Alternation('S' ... 'r').

Originally, I had written a function that converts any literal directly into a case-insensitive MIR, but I later found the case_insensitive function on HIR's ParserBuilder. Using it seemed clearer and less error-prone than writing the conversion myself, which is why I combined to_mir with escape_regex.
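To illustrate the escaping step described in this comment, here is a minimal, std-only sketch. `escape_literal` is a hypothetical stand-in for `regex_syntax::escape`, not the PR's code, and the metacharacter list is assumed to match the set that crate escapes.

```rust
/// Hypothetical stand-in for `regex_syntax::escape`: prefixes every regex
/// metacharacter with a backslash so the text is matched literally.
fn escape_literal(text: &str) -> String {
    // Assumed metacharacter set; the real crate may differ in detail.
    const META: &str = r"\.+*?()|[]{}^$#&-~";
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if META.contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

With this sketch, `escape_literal("[Some Header]")` yields `\[Some Header\]`, so a regex parser reads the brackets as literal characters instead of a character class.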

.to_mir(
&Default::default(),
definition.ignore_flags,
&mut parser.errors,
)
.expect("The literal should be perfectly valid regex");

let then = graph.push(
leaf(definition.literal.span())
.priority(definition.priority.unwrap_or_else(|| mir.priority()))
.callback(definition.callback),
);
let id = graph.regex(mir, then);

regex_ids.push(id);
}
}
"regex" => {
let definition = match parser.parse_definition(attr) {
@@ -149,16 +170,18 @@ pub fn logos(input: TokenStream) -> TokenStream {
continue;
}
};
let mir = match definition
.literal
.to_mir(&parser.subpatterns, &mut parser.errors)
{
let mir = match definition.literal.to_mir(
&parser.subpatterns,
definition.ignore_flags,
&mut parser.errors,
) {
Ok(mir) => mir,
Err(err) => {
parser.err(err, definition.literal.span());
continue;
}
};

let then = graph.push(
leaf(definition.literal.span())
.priority(definition.priority.unwrap_or_else(|| mir.priority()))
20 changes: 20 additions & 0 deletions logos-derive/src/mir.rs
@@ -27,6 +27,15 @@ impl Mir {
Mir::try_from(ParserBuilder::new().build().parse(source)?)
}

pub fn utf8_ignore_case(source: &str) -> Result<Mir> {
Mir::try_from(
ParserBuilder::new()
.case_insensitive(true)
.build()
.parse(source)?,
)
}

pub fn binary(source: &str) -> Result<Mir> {
Mir::try_from(
ParserBuilder::new()
@@ -37,6 +46,17 @@ impl Mir {
)
}

pub fn binary_ignore_case(source: &str) -> Result<Mir> {
Mir::try_from(
ParserBuilder::new()
.allow_invalid_utf8(true)
.unicode(false)
.case_insensitive(true)
.build()
.parse(source)?,
)
}

pub fn priority(&self) -> usize {
match self {
Mir::Empty | Mir::Loop(_) | Mir::Maybe(_) => 0,
53 changes: 48 additions & 5 deletions logos-derive/src/parser/definition.rs
@@ -5,12 +5,15 @@ use crate::error::{Errors, Result};
use crate::leaf::Callback;
use crate::mir::Mir;
use crate::parser::nested::NestedValue;
use crate::parser::{Parser, Subpatterns};
use crate::parser::{IgnoreFlags, Parser, Subpatterns};

use super::ignore_flags::ascii_case::MakeAsciiCaseInsensitive;

pub struct Definition {
pub literal: Literal,
pub priority: Option<usize>,
pub callback: Option<Callback>,
pub ignore_flags: IgnoreFlags,
}

pub enum Literal {
@@ -24,6 +27,7 @@ impl Definition {
literal,
priority: None,
callback: None,
ignore_flags: IgnoreFlags::Empty,
}
}

@@ -67,6 +71,12 @@ impl Definition {
("callback", _) => {
parser.err("Expected: callback = ...", name.span());
}
("ignore", NestedValue::Group(tokens)) => {
self.ignore_flags.parse_group(name, tokens, parser);
}
("ignore", _) => {
parser.err("Expected: ignore(<flag>, ...)", name.span());
}
(unknown, _) => {
parser.err(
format!(
@@ -92,11 +102,44 @@ impl Literal {
}
}

pub fn to_mir(&self, subpatterns: &Subpatterns, errors: &mut Errors) -> Result<Mir> {
let value = subpatterns.fix(self, errors);
pub fn escape_regex(&self) -> Literal {
match self {
Literal::Utf8(_) => Mir::utf8(&value),
Literal::Bytes(_) => Mir::binary(&value),
Literal::Utf8(string) => Literal::Utf8(LitStr::new(
regex_syntax::escape(&string.value()).as_str(),
self.span(),
)),
Literal::Bytes(bytes) => Literal::Bytes(LitByteStr::new(
regex_syntax::escape(&bytes_to_regex_string(bytes.value())).as_bytes(),
self.span(),
)),
}
}

pub fn to_mir(
&self,
subpatterns: &Subpatterns,
ignore_flags: IgnoreFlags,
errors: &mut Errors,
) -> Result<Mir> {
let value = subpatterns.fix(self, errors);

if ignore_flags.contains(IgnoreFlags::IgnoreAsciiCase) {
match self {
Literal::Utf8(_) => {
Mir::utf8(&value).map(MakeAsciiCaseInsensitive::make_ascii_case_insensitive)
}
Literal::Bytes(_) => Mir::binary_ignore_case(&value),
}
} else if ignore_flags.contains(IgnoreFlags::IgnoreCase) {
match self {
Literal::Utf8(_) => Mir::utf8_ignore_case(&value),
Literal::Bytes(_) => Mir::binary_ignore_case(&value),
}
} else {
match self {
Literal::Utf8(_) => Mir::utf8(&value),
Literal::Bytes(_) => Mir::binary(&value),
}
}
}
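
The flag dispatch in `to_mir` above can be sketched without the derive crate's types. This is a simplified illustration, not the PR's implementation: `CaseMode`, `select_mode`, and `literal_matches` are hypothetical names, two booleans stand in for the `IgnoreFlags` checks, and lowercasing only approximates the Unicode case folding a regex engine performs.

```rust
/// How a pattern's letter case is treated; a hypothetical stand-in for the
/// outcome of the `IgnoreFlags` checks.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CaseMode {
    Exact,
    AsciiOnly,
    Unicode,
}

/// Mirrors the order of the checks in `to_mir`: the ASCII flag is tested
/// first, so it would win if both flags were set.
fn select_mode(ignore_ascii_case: bool, ignore_case: bool) -> CaseMode {
    if ignore_ascii_case {
        CaseMode::AsciiOnly
    } else if ignore_case {
        CaseMode::Unicode
    } else {
        CaseMode::Exact
    }
}

/// Whole-string comparison under the chosen mode.
fn literal_matches(mode: CaseMode, pattern: &str, input: &str) -> bool {
    match mode {
        CaseMode::Exact => pattern == input,
        // Only the letters A-Z and a-z are folded.
        CaseMode::AsciiOnly => pattern.eq_ignore_ascii_case(input),
        // Approximation: lowercase both sides before comparing.
        CaseMode::Unicode => pattern.to_lowercase() == input.to_lowercase(),
    }
}
```

Under these assumptions, an accented pattern such as "été" matches "ÉTÉ" only in `CaseMode::Unicode`, while plain ASCII keywords match in either case-insensitive mode.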
