SequenceMatcher

2015-12-19 --> 2021-03-05

Find/replace a pattern in a sequence of things.

SequenceMatcher is part of the universe framework.

Install

Using the planet installer via light-cli:

lt install Ling.SequenceMatcher

Using the uni command:

uni import Ling/SequenceMatcher

Sometimes, you have a sequence of things and you need to find a pattern of those things in the sequence.

If those things were characters, then the sequence of things would be a string, and you could use the PHP regex functions.

Regex only applies to strings though, so what if you need to parse a more abstract sequence of things?

That's where the SequenceMatcher comes in handy.

Although it doesn't have all the bells and whistles of a regex engine, it has two or three useful features that allow basic searching through a sequence of things.

In other words, the SequenceMatcher is a simplified abstraction of the regex engine, which works with things, not only characters.

Concepts

A sequence contains an arbitrary number of things.

A thing can be anything: a number, a character, an array, an object, ...

When you want to match a particular combination of things in the sequence, you first create what's called a model.

A model is like a pattern in the regex world: it's basically a blueprint used to match against the things in the sequence.

A model is composed of elements.

There are three types of elements:

entity
group (of elements)
alternate (group of elements)

The entity is the smallest and most fundamental element.

It's an intelligent object which has a match method (i.e. it is able to tell whether or not it matches against a given thing).
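To make the idea concrete, here is a minimal sketch of an entity. The names SketchEntity and EvenNumberEntity are hypothetical and are not the library's own EntityInterface; the sketch only assumes that an entity exposes a match method that receives one thing and says yes or no.

```php
<?php

// Hypothetical interface, for illustration only; the real EntityInterface may differ.
interface SketchEntity
{
    public function match($thing): bool;
}

// An entity that accepts any even integer and rejects everything else.
final class EvenNumberEntity implements SketchEntity
{
    public function match($thing): bool
    {
        return is_int($thing) && $thing % 2 === 0;
    }
}

$entity = new EvenNumberEntity();
$sequence = [1, 4, 'a', 6];

// Ask the entity about each thing of the sequence in turn.
$flags = array_map(function ($thing) use ($entity) {
    return $entity->match($thing);
}, $sequence);
// $flags is [false, true, false, true]
```

Because the entity decides by itself what a match is, the things in the sequence can be of any type.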

The group and the alternate group are syntactic elements.

The group is a container for other elements, and the alternate group holds several alternatives side by side (useful when you want to match one of several possibilities).

The model is then given to the SequenceMatcher.

The SequenceMatcher can only work with one model and one sequence at a time.

The SequenceMatcher parses the sequence of things, and, using the different elements, finds whether or not the model matches the sequence of things.

The model may be found zero, one or more times.

You can add listeners to the SequenceMatcher, in order to do something useful when a match is found.

There are two main things you can do:

access the matched entities; this is useful if you want to extract some information out of a sequence (think preg_match)
replace the matched entities with a subsequence of your choice (think preg_replace); this part is not implemented yet (see Replace things)
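For comparison, these are the two string-based operations that the list above alludes to, using PHP's built-in regex functions. The sample string and the pattern are made up for the illustration; preg_match_all is used instead of preg_match because, like a model, the pattern may be found several times.

```php
<?php

$text = 'order 12, order 345';

// Extraction: preg_match_all collects every match of the pattern.
preg_match_all('/\d+/', $text, $matches);
$numbers = $matches[0]; // ['12', '345']

// Replacement: preg_replace substitutes every match with the given string.
$masked = preg_replace('/\d+/', '#', $text); // 'order #, order #'
```

The SequenceMatcher aims at the same two operations, except that the sequence is an array of arbitrary things instead of a string of characters.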

Modificators

Besides the elements, the SequenceMatcher also provides the following modificators (inspired by regex):

? modificator
* modificator
+ modificator

Modificators can be applied to all elements.
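As a rough illustration of what such modificators do on a sequence of things, here is a small standalone sketch. It assumes the usual regex meaning (? for zero or one, * for zero or more, + for one or more) and uses greedy matching with backtracking; the function sketchMatchAt and the [predicate, modificator] model format are hypothetical and are not how this library represents its models.

```php
<?php

/**
 * Tries to match $model against $sequence (a 0-indexed array) starting at $pos.
 * Returns the position just after the match, or null if there is no match.
 * Each model step is [callable $predicate, ?string $modificator].
 */
function sketchMatchAt(array $sequence, int $pos, array $model, int $step = 0): ?int
{
    if ($step === count($model)) {
        return $pos; // every step has been satisfied
    }
    [$predicate, $modificator] = $model[$step];

    // Translate the modificator into a [min, max] repetition range.
    $min = ($modificator === '?' || $modificator === '*') ? 0 : 1;
    $max = ($modificator === '*' || $modificator === '+') ? PHP_INT_MAX : 1;

    // Greedy phase: count how many consecutive things the predicate accepts.
    $count = 0;
    $n = count($sequence);
    while ($count < $max && $pos + $count < $n && $predicate($sequence[$pos + $count])) {
        $count++;
    }

    // Backtracking phase: give things back until the rest of the model fits.
    for (; $count >= $min; $count--) {
        $end = sketchMatchAt($sequence, $pos + $count, $model, $step + 1);
        if ($end !== null) {
            return $end;
        }
    }
    return null;
}

// Builds a predicate that accepts exactly one value.
$is = function ($expected) {
    return function ($thing) use ($expected) {
        return $thing === $expected;
    };
};

// Similar in spirit to the regex /ab?c+/, but over array items.
$model = [[$is('a'), null], [$is('b'), '?'], [$is('c'), '+']];

$end1 = sketchMatchAt(['a', 'c', 'c', 'x'], 0, $model); // 3
$end2 = sketchMatchAt(['a', 'b', 'x'], 0, $model);      // null
```

In the first call the optional 'b' is simply skipped and both 'c' items are consumed; in the second call the mandatory 'c' is missing, so there is no match.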

Some notes about the implementation

An element has a __toString method, because it eases the testing phase (and coding this object was hard for me...).

I have to say, the implementation is quite weak, and since I was not confident in the code, I compensated with some tests.

Basically, I built up the tool until I could use it, but it might not work for your cases.

The way it's implemented, I believe the best way to extend it is to read the tests and use the "test first" method (writing your tests before you implement a new functionality or when you find a bug).

The tests are in the btests directory of this repository.

Good luck.

Where do I find some example code?

I wrote this code for the sole purpose of extracting translation information out of files.

I could have used regex, but I remembered a bad experience (a long time ago) trying to cope with the escaping that can occur inside both double-quoted and single-quoted strings, in calls like the ones below:

<?php

echo __("This is an example of string I want to extract", "optionalContext");
echo __("This is an example of string I want to extract");
echo __("This is an {example} of string I want to extract", null, [
    'example' => $something,
]);

So in a way, I found the approach of parsing tokens more elegant, although in the end I'm not a big fan of my implementation, as I said earlier.

But anyway, this example is used in the nullos admin's Linguist module.

Here is the code that extracts those translations, for your convenience:

// app-nullos/class-modules/Linguist/Util/LinguistScanner.php

public static function scanTranslationsByFile($file)
{
    $ret = [];

    // Tokenize the PHP source; the token list is the sequence to search.
    $tokens = token_get_all(file_get_contents($file));

    // Model for a call such as __("id") or __("id", "context", ...).
    $model = Model::create()
        ->addElement(TokenEntity::create(T_STRING, '__'))
        ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
        ->addElement(TokenEntity::create(null, '('))
        ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
        // First argument: the message id, captured under the 'id' marker.
        ->addElement(TokenEntity::create(T_CONSTANT_ENCAPSED_STRING, null), null, 'id')
        ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
        // Optional second argument: the context, captured under the 'context' marker.
        ->addElement(Group::create(null)
                ->addElement(TokenEntity::create(null, ','))
                ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
                ->addElement(TokenEntity::create(T_CONSTANT_ENCAPSED_STRING, null), null, 'context')
                ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
            , '?'
        )
        // Skip whatever remains, up to the closing parenthesis.
        ->addElement(TokenGreedyEntity::create(null, ')'), '*')
        ->addElement(TokenEntity::create(null, ')'));

    $sequence = $tokens;

    // Collect the markers of every match found in the sequence.
    $markers = [];
    SequenceMatcher::create()
        ->match($sequence, $model, function (array $matchedElements, array $matchedThings, array $_markers = null) use (&$markers) {
            $markers[] = TokensSequenceMatcherUtil::detokenizeMarkers($_markers);
        });

    // Each marker maps to an array of values; array_shift takes the first one.
    foreach ($markers as $info) {
        $arr = [
            'id' => array_shift($info['id']),
        ];
        if (array_key_exists('context', $info)) {
            $arr['context'] = array_shift($info['context']);
        }
        $ret[] = $arr;
    }
    return $ret;
}

Note that the code above uses an intermediary TokenEntity wrapper, which basically allows matching tokens.
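The null arguments in the model above make more sense once you look at what token_get_all returns: each token is either a three-item array (token id, text, line number) or, for single-character punctuation such as ( and ) and ;, a plain string with no id. The sketch below only uses PHP's built-in tokenizer; how TokenEntity interprets its two arguments internally is an assumption based on the calls shown above.

```php
<?php

$tokens = token_get_all('<?php echo __("hello");');

$described = [];
foreach ($tokens as $token) {
    if (is_array($token)) {
        // [id, text, line]: token_name() turns the id into a readable name.
        $described[] = token_name($token[0]) . ':' . $token[1];
    } else {
        // Single-character tokens such as "(" or ";" are plain strings.
        $described[] = 'CHAR:' . $token;
    }
}
// $described contains, among others:
// 'T_STRING:__', 'CHAR:(', 'T_CONSTANT_ENCAPSED_STRING:"hello"', 'CHAR:)'
```

This is presumably why the model gives a token id with a null value for whitespace and strings, and a null id with a literal value for parentheses and commas.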

The SequenceMatcher code is totally abstract and can match any kind of thing; the TokenEntity is just one specialization of the EntityInterface.

The TokenEntity itself is found in the Tokens library.

I also created a very useful TokenGreedyEntity, which basically matches any token EXCEPT the one given.
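The "anything except" idea can be sketched without the library as a loop that keeps consuming tokens until it meets the excluded one. The helper name sketchConsumeUntil is hypothetical, and the real TokenGreedyEntity may well be implemented differently.

```php
<?php

/**
 * Concatenates the text of every token from $start up to (not including)
 * the first token whose text equals $stop.
 */
function sketchConsumeUntil(array $tokens, int $start, string $stop): string
{
    $text = '';
    for ($i = $start, $n = count($tokens); $i < $n; $i++) {
        // Array tokens carry their text at index 1; character tokens are the text itself.
        $piece = is_array($tokens[$i]) ? $tokens[$i][1] : $tokens[$i];
        if ($piece === $stop) {
            break;
        }
        $text .= $piece;
    }
    return $text;
}

$tokens = token_get_all('<?php use Foo\Bar;');

// Skip T_OPEN_TAG, T_USE and T_WHITESPACE (offsets 0 to 2), then read until ";".
$name = sketchConsumeUntil($tokens, 3, ';');
// $name is 'Foo\Bar'
```

Combined with the * modificator, an entity of this kind lets a model skip over an unknown number of tokens and stop right before a known delimiter.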

Another example: fetch the use statements in a file

Here is another useful example of how to use the SequenceMatcher.

Again, it involves PHP tokens, so the model is built from the token entities provided by the Tokens planet.

<?php


use Ling\SequenceMatcher\Model;
use Ling\SequenceMatcher\SequenceMatcher;
use Ling\Tokens\SequenceMatcher\Element\TokenAlternateEntity;
use Ling\Tokens\SequenceMatcher\Element\TokenEntity as TokenEntity;
use Ling\Tokens\SequenceMatcher\Element\TokenGreedyEntity;
use Ling\Tokens\Tokens;

require_once "bigbang.php";


$file = __FILE__;
$tokens = token_get_all(file_get_contents($file));
//az(Tokens::explicitTokenNames($tokens));

$model = Model::create()
    ->addElement(TokenEntity::create(T_USE, null))
    ->addElement(TokenEntity::create(T_WHITESPACE, null), '?')
    ->addElement(TokenAlternateEntity::create([T_STRING, T_NS_SEPARATOR]), null, 'a')
    ->addElement(TokenGreedyEntity::create(null, ';'), '*', 'a')
    ->addElement(TokenEntity::create(null, ';'));

$sequence = $tokens;

$markers = [];
SequenceMatcher::create()
    ->match($sequence, $model, function (array $matchedElements, array $matchedThings, array $_markers = null) use (&$markers) {
        $markers[] = Tokens::concatenate($_markers['a']);
    });

a($markers);
/**
 * array (size=6)
 * 0 => string 'SequenceMatcher\Model' (length=21)
 * 1 => string '\SequenceMatcher\SequenceMatcher' (length=32)
 * 2 => string 'Tokens\SequenceMatcher\Element\TokenAlternateEntity' (length=51)
 * 3 => string 'Tokens\SequenceMatcher\Element\TokenEntity as TokenEntity' (length=57)
 * 4 => string 'Tokens\SequenceMatcher\Element\TokenGreedyEntity' (length=48)
 * 5 => string 'Tokens\Tokens' (length=13)
 */

Replace things

This is not implemented yet, because I didn't need that functionality at the time I wrote this tool, and I was in a hurry.

Still, this planet is meant to provide the replace functionality eventually, maybe some day.
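In the meantime, once you know where a match starts and how many things it spans, a replacement on a plain PHP array can be done by hand with array_splice. This is only a sketch of the idea: the offset and length below are made-up values, and nothing here is provided by the SequenceMatcher itself.

```php
<?php

$sequence = ['a', 'b', 'c', 'd'];

// Assume a match was found starting at offset 1 and spanning 2 things ('b', 'c').
$offset = 1;
$length = 2;

// array_splice replaces that slice in place with the new subsequence.
array_splice($sequence, $offset, $length, ['X', 'Y', 'Z']);
// $sequence is now ['a', 'X', 'Y', 'Z', 'd']
```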

History Log

1.0.4 -- 2021-05-31
- Removing trailing plus in lpi-deps file (to work with Light_PlanetInstaller:2.0.0 api)
1.0.3 -- 2021-03-05
- update README.md, add install alternative
1.0.2 -- 2020-12-08
- Fix lpi-deps not using natsort.
1.0.1 -- 2020-12-04
- Add lpi-deps.byml file
1.0.0 -- 2016-12-19
- initial commit
