A simple pattern matching library for fluently extracting patterned data from text.
I frequently find myself trying to pull a bit of structured data out of a stream of text. Regex is an amazingly powerful tool, but I have always struggled to get it to return structured results, and frequently just end up writing my own little parser to extract the information I want. Patter is a simple library to describe seeking and grabbing the chunks of data you want without the complexity of Regex.
- Is it more powerful than Regex? Absolutely not. If Regex is your jam, use Regex.
- Is it easier to read and get data out? In many cases (and in my honest opinion) yes.
You use PatternBuilder to define a Pattern, which knows how to extract data from a string.
For example:
var pattern = new PatternBuilder<string>()
.SeekPast("<foo>")
.CaptureUntil("</foo>")
.Build();
var results = pattern.Matches("Show <foo>one</foo> and <foo>two</foo>");
This returns a list of strings:
["one","two"]
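To make the seek-then-capture idea concrete, here is a minimal hand-written sketch of what the pattern above describes, using only `string.IndexOf`. It is not Patter's implementation; `SeekCaptureSketch` and `ExtractBetween` are hypothetical names used for illustration only.

```csharp
using System;
using System.Collections.Generic;

public static class SeekCaptureSketch
{
    // Hypothetical helper, not part of Patter: repeatedly seeks past `open`
    // and captures everything up to (but not including) `close`.
    public static List<string> ExtractBetween(string text, string open, string close)
    {
        var results = new List<string>();
        int pos = 0;
        while (true)
        {
            // "Seek past": find the next opening marker and move just past it.
            int start = text.IndexOf(open, pos, StringComparison.Ordinal);
            if (start < 0) break;
            start += open.Length;

            // "Capture until": take the characters up to the closing marker.
            int end = text.IndexOf(close, start, StringComparison.Ordinal);
            if (end < 0) break;
            results.Add(text.Substring(start, end - start));

            // Continue scanning after the closing marker.
            pos = end + close.Length;
        }
        return results;
    }
}
```

Calling `SeekCaptureSketch.ExtractBetween("Show <foo>one</foo> and <foo>two</foo>", "<foo>", "</foo>")` yields the same two strings, `one` and `two`. The builder expresses the same steps declaratively, without the index bookkeeping.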
Let's say you want to extract anchors from a blob of text into an object (ALink):
public class ALink
{
    public string Text { get; set; }
    public Uri Url { get; set; }
}
And then define a pattern using PatternBuilder with ALink as the result type:
// Define a pattern that returns an enumeration of ALink objects.
var pattern = new PatternBuilder<ALink>()
// seek to <a
.Seek("<a")
// seek past href attribute
.SeekPast("href=")
// skip quotes if there are any
.Skip(Chars.Quotes)
// capture everything up to the closing bracket or end quote, convert it to a Uri, and store it in ALink.Url
.CaptureUntil(">'\"".ToCharArray(), (context) => context.Match.Url = new Uri(context.MatchText))
// skip quotes if there are any
.Skip(Chars.Quotes)
// seek past end of opening tag
.SeekPast(">")
// capture everything up to the closing </a tag and put it into ALink.Text
.CaptureUntil("</a", (context) => context.Match.Text = context.MatchText.Trim())
.Build();
var matches = pattern.Matches("this is a <a href=\"http://foo.com\">link1</a> <a href=http://bar.com>link2</a>").ToList();
Debug.WriteLine(JsonConvert.SerializeObject(matches));
This extracts the text and URLs from the tags. The result is an enumerable, so you can use LINQ to further manipulate it. The serialized output is:
[
  {
    "Text":"link1",
    "Url":"http://foo.com"
  },
  {
    "Text":"link2",
    "Url":"http://bar.com"
  }
]
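As an illustration of the LINQ point, the sketch below post-processes a sequence of ALink objects. It does not call Patter at all; `LinkQueries` and `DistinctHosts` are hypothetical names, and the ALink class is repeated only so the snippet stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class ALink
{
    public string Text { get; set; }
    public Uri Url { get; set; }
}

public static class LinkQueries
{
    // Hypothetical helper: returns the distinct host names of the links,
    // sorted with an ordinal comparison.
    public static List<string> DistinctHosts(IEnumerable<ALink> links)
    {
        return links
            .Where(link => link.Url != null)        // ignore links without a URL
            .Select(link => link.Url.Host)          // e.g. "foo.com"
            .Distinct()
            .OrderBy(host => host, StringComparer.Ordinal)
            .ToList();
    }
}
```

Applied to the two matches above, `LinkQueries.DistinctHosts(matches)` would contain `bar.com` and `foo.com`, in that order.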
Method | Description |
---|---|
Seek(text) | Move the cursor to the next instance of text |
Seek(char[]) | Move the cursor to the next instance of one of the chars |
SeekPast(text) | Move the cursor to just past the next instance of text |
SeekPast(char[]) | Move the cursor to the first instance of one of the chars, then to the first char that is not in the set |
Skip(char[]) | Move the cursor to the first char that is not in the set of chars |
Capture(char[], func) | Capture chars while they are in the set of chars, then call func(context) so you can extract info from context.MatchText and put it into context.Match |
CaptureUntil(text, func) | Capture chars until text is found, then call func(context) so you can extract info from context.MatchText and put it into context.Match |
CaptureUntil(char[], func) | Capture chars until one of the chars is found, then call func(context) so you can extract info from context.MatchText and put it into context.Match |
CaptureUntilPast(text, func) | Capture chars until text is found, including text itself, then call func(context) so you can extract info from context.MatchText and put it into context.Match |
CaptureUntilPast(char[], func) | Capture chars until one of the chars is found, including all of the matching chars, then call func(context) so you can extract info from context.MatchText and put it into context.Match |
Call(func) | Lets you write a custom pattern operation. You are responsible for changing the context properties directly (Pos, MatchText, Match, HasMatch, etc.) |
The PatternContext object represents the current state of the parser and is passed to the func callbacks of the operations above. It has the following properties of interest:
Property | Description |
---|---|
Pos | The current index into the string. It will be -1 when you are past the end of the string. |
Text | The full text of the string that is being worked on |
MatchText | The current matched text for a **CaptureXXX()** method |
HasMatch | Indicates that there is a match to be returned in the enumeration. If HasMatch is set once all operations have run, context.Match is yielded to the caller. |
Match | The object of type T that is yielded to the caller. You modify this object to build up the result that is returned as a match. |
CurrentChar | Shortcut for the char at the current Pos. If Pos == -1, it is (char)0. |
Memory | A property bag scoped to all matches. This is useful for custom actions that track data across all matches. |
MatchMemory | A property bag scoped to each match. It is reset when a sequence of operations completes and a match is returned to the caller. |
The Chars class defines classes of useful characters for matching:
Name | Description |
---|---|
Chars.Digits | Digits - 0..9 |
Chars.Letters | Alphabetical ASCII letters |
Chars.LettersOrDigits | Digits and Letters combined |
Chars.Quotes | Single and Double quotes |
Chars.SingleQuote | Single Quotes |
Chars.DoubleQuote | Double Quotes |
Chars.Whitespace | Whitespace chars (tab, space, EOL, etc.) |
Chars.EOL | End of line chars (\r, \n) |
Example:
var pattern = new PatternBuilder<string>()
.SeekPast("Name:")
.Skip(Chars.Whitespace)
.Capture(Chars.LettersOrDigits)
.SeekPast(Chars.EOL)
.Build();
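For comparison, this is roughly what the same extraction looks like when written by hand against character sets. It is a sketch, not Patter's code: `CharClassSketch` and `ExtractNames` are hypothetical, and the contents of the whitespace and letter-or-digit sets are assumptions standing in for `Chars.Whitespace` and `Chars.LettersOrDigits`.

```csharp
using System;
using System.Collections.Generic;

public static class CharClassSketch
{
    // Assumed stand-in for Chars.Whitespace.
    static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Assumed stand-in for Chars.LettersOrDigits (ASCII only).
    static bool IsLetterOrDigit(char c) =>
        (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

    public static List<string> ExtractNames(string text)
    {
        const string label = "Name:";
        var names = new List<string>();
        int pos = 0;
        while ((pos = text.IndexOf(label, pos, StringComparison.Ordinal)) >= 0)
        {
            // Seek past the label.
            pos += label.Length;

            // Skip: advance while the current char is in the whitespace set.
            while (pos < text.Length && Array.IndexOf(Whitespace, text[pos]) >= 0) pos++;

            // Capture: collect chars while they are letters or digits.
            int start = pos;
            while (pos < text.Length && IsLetterOrDigit(text[pos])) pos++;
            if (pos > start) names.Add(text.Substring(start, pos - start));
        }
        return names;
    }
}
```

The sketch omits the final seek past the end of line because the next search for the label moves the cursor forward anyway.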
Patterns are fully reusable and thread safe: multiple threads can safely evaluate the same Patter pattern against strings at the same time.
The version was bumped to major 2.x per semantic versioning rules: it has breaking changes that clean up the usage of the character matching methods.
- Switched to PatternBuilder().Build() => Pattern(), which makes it clearer when you are defining the pattern versus using the pattern. Only Pattern&lt;T&gt; has a Matches() method.
- Functions were simplified to use char[] in the signature to indicate a character-based pattern, renaming methods like SeekChars() => Seek(char[]).
- char[] methods use the params keyword where appropriate, so you can write .Skip('x','y','z')
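
To show what the `params` keyword buys the caller, here is a small standalone sketch. `ParamsSketch.Skip` is a hypothetical function written for this illustration, not Patter's method; it only mirrors the described behavior of moving to the first char that is not in the set.

```csharp
using System;

public static class ParamsSketch
{
    // Hypothetical stand-in for a char[]-based operation: starting at `pos`,
    // returns the index of the first char that is not in `chars`.
    // `params` lets callers pass the chars inline or as an existing array.
    public static int Skip(string text, int pos, params char[] chars)
    {
        while (pos < text.Length && Array.IndexOf(chars, text[pos]) >= 0) pos++;
        return pos;
    }
}
```

Both call styles compile against the same signature: `ParamsSketch.Skip("xxyzabc", 0, 'x', 'y', 'z')` returns 4, and `ParamsSketch.Skip("xxyzabc", 0, new[] { 'x', 'y' })` returns 3.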