Format of Tables
The .automaton files could be of interest to someone who wants to:
implement alternative rule-following strategies.
write specialized post-processing tools.
write a driver in another host language than Python.
be curious for curiosity’s own sake.
What’s here describes the most current version of these files.
General Structure
An .automaton
file is a JSON (JavaScript Object Notation) object
with keys as follows:
description: A string. Currently always
"MacroParse Automaton"
, but in principle could be taken from the grammar definition.parser: Another JSON object described below.
scanner: Another JSON object described below.
source: The base-name of the file from which the grammar definition came.
version: An array of three small integers meant to obey the conventions of semantic versioning.
Scanner Table
Overview
Booze-Tools scanners are based on deterministic finite state machines. To scan a lexeme, it begins in the initial state for its current scan condition, then follows a delta function of (state, character) until no further progress is possible. At that point, the scanner follows the rule corresponding to the leftmost-longest-match. Rather, the states are annotated such that the leftmost-longest-match is selected in this process.
The source code for the algorithm is pretty short and illustrative.
Start with the function scan_one_raw_lexeme
in module boozetools.scanning.recognition
.
Top-Level Fields
action: a table of information telling the scan algorithm what to do once it’s found a match.
alphabet: encodes a function from character code-point to character class.
dfa: encodes the deterministic finite state automaton which the scanner follows.
Fields of the DFA
delta: encodes the transition function from <state, a character class> to <subsequent state>.
final: a list of accepting states.
rule: a corresponding list of rule numbers as accepted by those states.
initial: a mapping from scan-condition string (such as
"INITIAL"
) to a pair initial states.
The exact form of the delta
table is based on a lot of experimentation to do a good job compacting otherwise-large tables.
Probably a future version will allow to build a much simpler format for tables that are inherently fairly small to begin with.
Parser Table
Booze-Tools parsers are based on pushdown automata.
A simplified version of the algorithm is the function trial_parse
in the module boozetools.parsing.shift_reduce
. That version handles
neither semantic values nor error recovery.
When you feel ready for the rest, check out the parse
function in the same file.
Top-Level Fields
- action
Tells the parser how to respond to a terminal symbol, and when the right-hand side of a production rule is recognized.
- breadcrumbs
Used in error reporting; maps states to the symbols that reach those states.
- goto
Tells how to respond once a non-terminal symbol has been synthesized.
- initial
Tells where to start for each supported language.
- nonterminals
A list of the non-terminal symbols used in defining the grammar. This list is useful principally for error-reporting.
- rule
Tells how to synthesize a non-terminal symbol from a sequence of symbols from the stack. These are described in greater detail below.
- terminals
A list of the string representations of the terminals from the grammar. Useful in error reporting, but you might use it to prepare a reserved-word table.
Encoding of Rules
The format of the parser’s rule
object is probably sub-optimal, but is also:
- constructor
- List of distinct possible messages from the ends of production rules.
null
means to just create a tuple of the (non-void) right-hand-side symbols.Strings that begin with:
are considered mid-rule actions.All other strings are the names of ordinary constructor messages. - line_number
List of the grammar definition source line number for each rule.
- rules
List of 4-tuples for each right-hand side.
The 4-tuples for a parser rule are:
- Non-terminal index.
This can index into the
nonterminals
list, mentioned earlier.- Length of the rule’s right-hand side.
- This number of symbols get popped before pushing the non-terminal symbol.Zero is a valid size: Epsilon rules and mid-rule actions both use it.
- Constructor index:
- If negative: the stack offset of the semantic value.If zero or positive: an index into the
constructor
list, mentioned earlier. - Capture list.
This list of integer offsets from top-of-stack (before any popping) describe where to find the arguments for the constructor. In case of a bracketing rule (negative constructor index), the capture-list is meaningless.
Note
Note that all stack-offsets are negative, with -1 being the top-of-stack.