The purpose of this repo is to store all the code for my current Medium project.
This project, currently titled Compiler Generator, is a series of projects and articles covering my attempt at building a metacompiler from scratch in Go, following along with the book Engineering a Compiler.
- Compiler Structure
- Progress so Far
- The Scanner
- Spec File Format
- IMPORTANT: Controlling spec syntax tokens clashing with target language
- TODO
- A short demo
- How to run it
- Changes to Execution on the way
This project is divided into three parts:
- The Frontend - The part of the compiler responsible for ingesting raw source code, extracting valid tokens, and verifying the semantics of the language.
- The Middleware - Intermediate code generation that converts source code into an intermediate, pseudocode-like form. Optimization happens here.
- The Backend - The part of the compiler responsible for converting the intermediate representation into assembly for the target system's architecture.
The aim is to create a metacompiler that, given a language's specification alongside valid source file(s) written in that language, successfully compiles those sources into the target architecture of a specified system.
Currently, an automated scanner generator is implemented. Similar to the lex/flex utilities, it runs on a spec file based mainly on regex pattern matching.
A C language spec file (see the full file in specfiles/c.spec):
# Classifier
alphabet [_a-zA-Z]
digit [0-9]
number {digit}+
newline %NEWLINE
word {alphabet}({alphabet}|{digit})*
symbols [-+/\*&!\|\{\}=><:^;,]
equ ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left (<)?<
right (>)?>
brackets [\[\]\(\)]
comment //.*{newline}
mcomment /\*.*\*/
float [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex 0[xX][a-fA-F0-9]+
string ".*"
char '[(\')(\t)(\n)]|(.*)'
arrow ->
%%
# Delim
' {char}
" {string}
// {comment}
/\* {mcomment}
%%
# TokenType
{string} STRING
{number} INTEGER
{word} ID
{char} CH
char CHAR
int INT
long LONG
void VOID
unsigned UNSIGNED
* STAR
...
The spec file is akin to a modified version of the .l format used to build lex parsers. Unlike lex, which generates .c source files that compile into the parser themselves, spec is the first part of the metacompiler and directly outputs a Go list of structs containing each token. With a spec file set, the scanner reads input files and tokenizes them going off the rules in the spec file.
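For illustration, a token in that list might look something like the following Go sketch. The field names here are assumptions made for the sake of the example, not the actual struct this repo emits:
package scanner

// Token pairs a lexeme from the source file with the TokenType name
// that matched it in the spec file.
type Token struct {
    Type   string // e.g. "ID", "INTEGER", "STRING" from the TokenType section
    Lexeme string // the raw text matched in the source file
    Line   int    // line where the lexeme starts, handy for error reporting
}

// Tokens is the list produced for a whole source file.
type Tokens []Token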
Similar to the lex file format, the spec format has three sections, separated by %% lines. The spec design is based on a table-driven scanner design.
- Note: a table-driven scanner makes use of three tables: one holding the collection of DFAs accepted by the language, a second for classifying input characters, and a third that holds the valid tokens produced by accepting DFA states. The spec file format is designed to mimic this behaviour with a few modifications to simplify the programming process.
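As a rough illustration of the idea (my own toy example, not this project's code), here are the three tables in Go for a scanner that only recognizes unsigned integers:
package scanner

type state int

const (
    sStart state = iota // initial state
    sDigit              // accepting state: one or more digits seen
    sError              // dead state
)

// Table 1: character classifier -- maps an input byte to a character class.
func classify(c byte) int {
    if c >= '0' && c <= '9' {
        return 0 // class "digit"
    }
    return 1 // class "other"
}

// Table 2: transition table -- delta[state][class] gives the next state.
var delta = [3][2]state{
    sStart: {sDigit, sError},
    sDigit: {sDigit, sError},
    sError: {sError, sError},
}

// Table 3: token types produced by accepting states.
var tokenType = map[state]string{
    sDigit: "INTEGER",
}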
Each section of the spec file's data is laid out as follows:
- Classifier list - This has a list of all the language's regexes. These, alongside helper functions inside the scanner, simulate DFA state traversal, the saving of accepting states, and rollbacks when a state fails (see the sketch after this list).
- Delim list - This table is for all language constructs that rely on delimiting tokens to end their parsing. Tokens that go here are those for strings and comments.
- TokenType list - These are all the valid tokens that are accepted by the specified language.
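The sketch below is my own illustration (not code from this repo) of how a classifier list can drive tokenization with Go's regexp package: every classifier is tried at the current offset and the longest accepted match is kept, which plays the role of the saved accepting state the scanner would otherwise roll back to.
package main

import (
    "fmt"
    "regexp"
)

// classifier pairs a spec-file name with a regex anchored to the current position.
type classifier struct {
    name string
    re   *regexp.Regexp
}

var classifiers = []classifier{
    {"number", regexp.MustCompile(`^[0-9]+`)},
    {"word", regexp.MustCompile(`^[_a-zA-Z][_a-zA-Z0-9]*`)},
    {"symbols", regexp.MustCompile(`^[-+/*&!|{}=><:^;,]`)},
}

func main() {
    src := "int x = 42;"
    for pos := 0; pos < len(src); {
        best, bestName := "", ""
        for _, c := range classifiers {
            if m := c.re.FindString(src[pos:]); len(m) > len(best) {
                best, bestName = m, c.name // keep the longest match seen so far
            }
        }
        if best == "" {
            pos++ // nothing matched here (e.g. whitespace); advance and retry
            continue
        }
        fmt.Printf("%-8s %q\n", bestName, best)
        pos += len(best)
    }
}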
A slight drawback with this format's design is the heavy reliance on regex knowledge to simulate the DFAs needed for the target language. An optional add-on will be an init file that auto-generates common regexes for common language constructs to speed up spec file development.
A glaring issue I ran into was differentiating tokens like { and } that are meant to be parsed from the ones that are part of the spec file's syntax. This is handled by substitution variables written in all caps and prefixed with %. The ones currently set by default are:
%NEWLINE in place of \n
%TAB in place of \t
%PC in place of %
%LBC in place of {
%RBC in place of }
%HSH in place of #
Should you need to define \n or \t as part of your language's spec, use these variables in place of their literals to avoid your tokens being treated as spec syntax.
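For example, an illustrative (not complete) spec fragment that defines brace classifiers and a comment rule ending on a newline would be written as:
# Classifier
newline %NEWLINE
lbrace %LBC
rbrace %RBC
comment //.*{newline}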
- Add functionality for error logging and reporting. This will be quite monolithic, as it must span the entire compiler.
- Complete unit testing
- More reading for the Automated Parsing phase
- Resolve clashing tokens -- IN TESTING
Note: this demo's slow because I didn't realize, until I'd done the work of extracting and saving it, that my mouse scrolling wasn't tracked (I forgot how session recording works :'| ), so all the still/empty space is me scrolling... I'd recommend you give it a go if it looks interesting, but this gives you an idea of how it works anyway :).
Start by compiling the project
go build .
Then, set your spec file in the config file. The current spec (for demo purposes) can be found in ./specfiles.
Inside config:
config:[specfile path]
And then run the scanner against a target source file
./compiler [source file]
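Putting the steps together, a full run against the bundled C spec might look like this (main.c is just a placeholder source file, and the config file is written as the single line described above):
go build .
echo "config:./specfiles/c.spec" > config
./compiler main.c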
- The option to pass in a config spec file as a command-line parameter
Example:
./compiler --config [spec file]
- The option to output a token file with all your tokenized values via a command-line parameter
Example:
./compiler --outfile [filename]
- A default init option for spec file creation that will initialize an empty spec file with all common regex patterns already set.
Example:
$>./compiler --init-spec
# Classifier
alphabet [_a-zA-Z]
digit [0-9]
number {digit}+
newline %NEWLINE
word {alphabet}({alphabet}|{digit})*
symbols [-+/\*&!\|=><:^;,]
lbrace %LBRC
rbrace %RBRC
equ ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left (<)?<
right (>)?>
brackets [\[\]\(\)]
float [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex 0[xX][a-fA-F0-9]+
string ".*"
char '[(\')(\t)(\n)]|(.*)'
%%
# Delims
' {char}
" {string}
0[xX] {hex}
%%
# TokenType
# tokens go here
$>