Welcome to booze-tools documentation!¶
Booze-Tools aims to be the ultimate Swiss-Army Knife of language-processing tasks. It does quite a bit so far, and has some features not found in other such tools. It happens to be written in Python.
- Project Maturity Level:
The deep algorithmic stuff works like a champ, but there are some rough edges. Some APIs may change, but the general concept to keep the grammar definition independent of the target-language is not going anywhere. (In concept, this should allow a compiler, a pretty-printer, and an IDE to all run from the same grammar definition.) It’s used in a few different projects and generates feedback.
Introduction¶
Booze-Tools aims to be the ultimate Swiss-Army Knife of language-processing tasks. It does quite a bit so far. and has some features not found in other such tools. It happens to be written in Python.
Documentation Migration in Progress¶
At this very moment, most of what documentation there is lives on the wiki page. In time, more should migrate over.
Why and Wherefore?¶
Yes, there are other language tools out there, but this is a journey and I plan to enjoy it.
The original pipe-dream was for this library to become the complete programming-language development workbench of choice, but written in a high-level language (not C or C++, like GCC or LLVM). Therefore, the Booze-Tools will grow whatever additional bits make that happen. It does not stop with lexical analysis and context-free parsing.
The next goal was to explore solutions to annoying features
of other popular language-processing subsystems. That’s why,
for instance, macroparse
supports macro-expansion of context-free
production rules. (This turns out to be more expressive than EBNF.)
I wanted a single system that could in principle be used to generate scanners/parsers for other host-languages than just Python. That’s why there’s a JSON output format and a system of messages rather than embedded code: current scripting languages can read JSON directly, and independent tools can turn it into (for instance) C code. By implementing a small driver, you can re-use a language definition that was originally made for some other purpose. This feature may keep an ecosystem of tools in sync with each other.
A once-and-future goal is that the project source code should end up clear and instructive. Therefore, there are heavy comments in parts that deal in the classical algorithms and data structures peculiar to the field of compilers. If you see something unclear, please speak up. Opacity is a bug.
To that same end, I may eventually add on some alternative constructions,
as (perhaps) for SLR and LR(0) as ends in themselves, with heavy commentary.
At one point I was digging deep into the issues involved with ambiguous grammars,
which is why the parsing.general
submodule exists. The experience
contributed greatly to the error-recovery mechanisms in the main parsing algorithm.
Finally, it seems that many data analysis tasks are greatly simplified if you can treat them as parsing with a bit extra. I want to make it absolutely convenient to develop and deploy such intermixed applications.
Invoking the Booze-Tools Metacompiler¶
Let’s say you looked at the GitHub repository at the examples, and decided the concept was pretty cool. How shall we proceed?
All of the examples are coded to build their grammar into a parser at run-time. That’s alright for a demonstration, but how would you deal with this in a real project?
You could choose to copy a line of code, but actually there are a number of other handy features available on the command-line.
Extremely Short Version¶
On the command line:
D:\GitHub\booze-tools>py -m boozetools example\pascal.md
Wrote automaton in JSON format to:
example\pascal.automaton
Then later, in Python code, something like:
import json
from boozetools.macroparse import runtime
tables = json.load('example/pascal.automaton')
class MyParser(runtime.TypicalApplication):
def __init__(self):
super().__init__(tables)
def scan_this(self, yy):
yy.token('this', yy.match())
def parse_that(self, left, middle, right):
return That(left, middle, right)
syntax_tree = MyParser().parse("some big long text")
A Bit More Detail¶
You have lots of options about how you invoke this:
D:\GitHub\booze-tools>py -m boozetools -h
usage: py -m boozetools [-h] [-f] [-o OUTPUT] [-i] [--pretty] [--csv] [--dev] [--dot] [-m {LALR,CLR,LR1}] [-v]
source_path
Compile a macroparse grammar/scanner definition from a markdown document into a set of parsing and scanning tables in
JSON format. The resulting tables are suitable for use with the included runtime modules. Pypi:
https://pypi.org/project/booze-tools/ GitHub: https://github.com/kjosib/booze-tools/wiki
ReadTheDocs: https://boozetools.readthedocs.io/en/latest/
positional arguments:
source_path path to input file
optional arguments:
-h, --help show this help message and exit
-f, --force allow to write over existing file
-o OUTPUT, --output OUTPUT
path to output file
-i, --indent indent the JSON output for easier reading.
--pretty Display uncompressed tables in attractive grid format on STDOUT.
--csv Generate CSV versions of uncompressed tables, suitable for inspection.
--dev Operate in "development mode" -- which changes from time to time.
--dot Create a .dot file for visualizing the parser via the Graphviz package.
-m {LALR,CLR,LR1}, --method {LALR,CLR,LR1}
Which parser table construction method to use.
-v, --verbose Squawk, mainly about the table compression stats.
Error Handling?¶
There is support for that.
- Error Rules:
You can write error production-rules using the metatoken
$error$
. The machinery surrounding this takes pains to do well. It may not be the fastest concept, but it’s smarter than the average parser.- On-Error Call-Backs:
For everything else, there are error call-backs. If you look in
boozetools/macroparse/runtime.py
(here) you’ll findclass AbstractTypical
which defines default behavior for situations in which (a) the parser’s error-rule mechanism was unable to resolve, or (b) a stuck scanner.
Format of Tables¶
The .automaton files could be of interest to someone who wants to:
implement alternative rule-following strategies.
write specialized post-processing tools.
write a driver in another host language than Python.
be curious for curiosity’s own sake.
What’s here describes the most current version of these files.
General Structure¶
An .automaton
file is a JSON (JavaScript Object Notation) object
with keys as follows:
description: A string. Currently always
"MacroParse Automaton"
, but in principle could be taken from the grammar definition.parser: Another JSON object described below.
scanner: Another JSON object described below.
source: The base-name of the file from which the grammar definition came.
version: An array of three small integers meant to obey the conventions of semantic versioning.
Scanner Table¶
Overview¶
Booze-Tools scanners are based on deterministic finite state machines. To scan a lexeme, it begins in the initial state for its current scan condition, then follows a delta function of (state, character) until no further progress is possible. At that point, the scanner follows the rule corresponding to the leftmost-longest-match. Rather, the states are annotated such that the leftmost-longest-match is selected in this process.
The source code for the algorithm is pretty short and illustrative.
Start with the function scan_one_raw_lexeme
in module boozetools.scanning.recognition
.
Top-Level Fields¶
action: a table of information telling the scan algorithm what to do once it’s found a match.
alphabet: encodes a function from character code-point to character class.
dfa: encodes the deterministic finite state automaton which the scanner follows.
Fields of the DFA¶
delta: encodes the transition function from <state, a character class> to <subsequent state>.
final: a list of accepting states.
rule: a corresponding list of rule numbers as accepted by those states.
initial: a mapping from scan-condition string (such as
"INITIAL"
) to a pair initial states.
The exact form of the delta
table is based on a lot of experimentation to do a good job compacting otherwise-large tables.
Probably a future version will allow to build a much simpler format for tables that are inherently fairly small to begin with.
Parser Table¶
Booze-Tools parsers are based on pushdown automata.
A simplified version of the algorithm is the function trial_parse
in the module boozetools.parsing.shift_reduce
. That version handles
neither semantic values nor error recovery.
When you feel ready for the rest, check out the parse
function in the same file.
Top-Level Fields¶
- action
Tells the parser how to respond to a terminal symbol, and when the right-hand side of a production rule is recognized.
- breadcrumbs
Used in error reporting; maps states to the symbols that reach those states.
- goto
Tells how to respond once a non-terminal symbol has been synthesized.
- initial
Tells where to start for each supported language.
- nonterminals
A list of the non-terminal symbols used in defining the grammar. This list is useful principally for error-reporting.
- rule
Tells how to synthesize a non-terminal symbol from a sequence of symbols from the stack. These are described in greater detail below.
- terminals
A list of the string representations of the terminals from the grammar. Useful in error reporting, but you might use it to prepare a reserved-word table.
Encoding of Rules¶
The format of the parser’s rule
object is probably sub-optimal, but is also:
- constructor
- List of distinct possible messages from the ends of production rules.
null
means to just create a tuple of the (non-void) right-hand-side symbols.Strings that begin with:
are considered mid-rule actions.All other strings are the names of ordinary constructor messages. - line_number
List of the grammar definition source line number for each rule.
- rules
List of 4-tuples for each right-hand side.
The 4-tuples for a parser rule are:
- Non-terminal index.
This can index into the
nonterminals
list, mentioned earlier.- Length of the rule’s right-hand side.
- This number of symbols get popped before pushing the non-terminal symbol.Zero is a valid size: Epsilon rules and mid-rule actions both use it.
- Constructor index:
- If negative: the stack offset of the semantic value.If zero or positive: an index into the
constructor
list, mentioned earlier. - Capture list.
This list of integer offsets from top-of-stack (before any popping) describe where to find the arguments for the constructor. In case of a bracketing rule (negative constructor index), the capture-list is meaningless.
Note
Note that all stack-offsets are negative, with -1 being the top-of-stack.
The Allegedly Minimal LR(1)¶
Knuth described a simple and correct way to generate an LR(1) parse table, or indeed an LR(k) table for any k. You just do a subset construction on parse-items augmented with a follower-terminal (or k of them). Well and good, but he wrote this down in the 1960s when computer was still an honorable profession and the electronic ones were in no danger of carrying out Knuth’s algorithm for nontrivial grammars.
Shortly afterwards, the LALR method was discovered. LALR(1) is a clever extension of LR(0) to resolve many shift-reduce conflicts by working out follower-sets. By itself LALR is still a bit clunky, but with a dose of operator-precedence and associativity declarations to resolve remaining ambiguities, the resulting systems were generally quite pleasant. For decades, the pragmatic language hacker relied on LALR because it was fast, acceptably-accurate most-of-the-time, and produced compact tables.
But LALR is not perfect. A cottage industry sprang up for debugging grammars subjected to the LALR treatment.
Then in 2010 someone published the IELR(1) algorithm, which made it practical to use full LR(1) grammars augmented with the same precedence and associativity declarations. There’s only one problem: IELR is downright painful to try to understand.
What “Minimal” even means¶
Smallest feasible number of states consistent with reasonable generator performance.
I seem to recall reading somewhere that optimizing LR parse tables is hard, in the sense that it’s difficult to prove there’s not some alternative table with fewer states that recognizes the same language.
Therefore, we shall have to be content with an argument that, if such a thing does indeed exist, then it is infeasible to find.
My General Concept¶
My approach to building LR(1) tables without the size blow-up associated with canonical (Knuth-style) LR(1) stems from a simple observation:
In practice with LALR, conflicts that would be resolved in LR(1) are generally confined to small parts of a grammar and relatively few parse states.
Furthermore, with typical grammars most parse-states have no conflicts even in the LR(0) view.
Even with LR(1)-style parse-items, there are equivalence-classes of terminals with similar behavior.
If we could somehow perform LR(0) everywhere that it was good enough, and switch to a more powerful method for those parts of the grammar where strictly necessary, then we should see a nice compromise of speed and size while achieving the full recognition power of the stronger method.
Case in Point¶
Consider this grammar:
S -> a X a | b X b | a Y b | b Y a
X -> c X'
Y -> c Y'
X' -> c
Y' -> c
It has the following LR(0) automaton:

Note
Booze-Tools does not currently support symbols named like X'
or Y'
but this
is how the example came to me, graphics and all. So this is what I’ll run with.
A Tainted Approach¶
Have a look at state 13: It’s conflicted. It happens to be a reduce/reduce conflict, but this is beside the point.
It happens also to be conflicted on a particular set of LR(0) parse items.
Specifically, those of X' -> c.
and Y' -> c.
With a moment’s thought, we can trace back through the table and realize that the head parse-items for these conflicts are introduced in LR(0) state 6.
- Head Parse-Item
A parse-item where the dot is in the leftmost position.
So we can call state 6 the head state for this conflict. In general there can be more than one head-state for a given conflict, because LR(0) states can have multiple predecessors.
So let us say that symbols X'
and Y'
are now officially
tainted specifically in the context of state 6.
Somewhere during the process of elaborating from core-item sets to item-closure,
we ran across the need for these particular non-terminals.
In fact it’s easy to determine if you have the entire item-closure handy:
Any item where the dot is just before a tainted symbol (within the same LR(0) state)
is also tainted.
- Transparent Parse Item:
Parse items for which the portion of the rule after the dot can possibly produce the empty string. (Transparently, a state with a transparent parse-item has a reduction, and conversely.)
If the successor-item to a parse-item tainted in state Q is transparent, then the head of that parse-item (looking back from state Q) acquires the same taint.
If we take these rules to their logical conclusion (i.e. least fixed-point) then in time we determine that the core-items of states 2 and 3 are all tainted. However, the contagion spreads no further, because the successors to these items are not transparent.
Another Round¶
We have no further use for the idea of a tainted symbol but the tainted head parse-items associated with LR(0) states are of particular interest.
Let us have another go at the LR-style subset construction. But this time, we have a special rule:
When considering a core item-set for a parse state, we must first determine the LR(0) iso-state and from there the set of tainted head parse-items.
- Iso-State
Consider the set of iso-items to the core-items in an LR(1) state. This is guaranteed to correspond with some particular LR(0) state’s core item set.
- Iso-Item
Given a parse-item in this round may have a look-ahead token, the Iso-Item is that but without the look-ahead, if any.
Now, the rule is that if an LR(0) head-item in the iso-state is tainted, then we have to construct a set of LR(1)-style head parse-items with a token of look-ahead for this LR(1) state. But if the corresponding head-item in the corresponding LR(0) state is not tainted, then evidently this particular head-item does not need the extra resolving power that LR(1)-style items provide.
Finally, we may come to states that are still conflicted. That’s OK: We delegate these to the precedence-and-associativity doodad in the usual manner.
Working Out the Example¶
Forward-Correctness¶
Let us assume for the moment that the tainting mechanism described above actually does the right thing. (It needs a proof, but this section should clarify in detail what the proof-obligation exactly is.)
We end up with LR(1) states 2 and 3 like:
State 2:
---------
S -> a . X a
S -> a . Y b
X -> . c X' / a
Y -> . c Y' / b
State 3:
---------
S -> b . X b
S -> b . Y a
X -> . c X' / b
Y -> . c Y' / a
Note
I use the /
(slash) symbol to set off the look-ahead associated with a parse-item,
consistent with Grune and Jacobs.
And clearly this splits LR(0) state 6 into two LR(1) states:
State 6a:
----------
X -> c . X' / a
Y -> c . Y' / b
X' -> . c / a
Y' -> . c / b
State 6b:
----------
X -> c . X' / b
Y -> c . Y' / a
X' -> . c / b
Y' -> . c / a
It should be clear that if we continue to proceed in this manner, then we end up with a sufficient portion of the handle-finding automaton built according to LR(1) principles, so that the resulting parser can do no wrong.
An Interesting Property¶
LR(1) parsing decides to shift or reduce based solely on a sequence of terminal symbols. But goto actions, for non-terminal symbols, do not consider look-ahead at all.
So, let’s consider the story around LR(0) states 11 and 12. These states are adequate reducing states in the LR(0) graph. What should become of them in LR(1)?
If we proceed as described above, sooner or later we end up with:
State 11a:
-----------
X -> x X' . / a
State 11b:
-----------
X -> x X' . / b
And similar for state 12.
Because the LR(0) iso-state is both adequate and reducing, a parser need not consider look-ahead tokens in this state. Such a parser eagerly reduces out of this kind of state before seeing the look-ahead token. This is desirable for good interactive properties and smaller tables. Eventually, an error token lurking in the input stream will get caught in some less-special state.
So, an adequate-reducing LR(0)-state need not be split. If an LR(1) core’s iso-state has this property, then avoid creating redundant LR(1) states.
There is probably a way to build on this observation to find a larger class of states – or perhaps parse-items – where it becomes safe to drop the look-ahead. But that’s not today’s problem.
Tying Up Loose Ends¶
At this point, I think it’s clear what I need to prove for correctness:
- CLAIM:
The set of “tainted” parse-items in the LR(0) automaton are necessary and sufficient to make this monstrosity work.
Trivially, you could taint everything and you’d get canonical LR(1). Somewhere there’s a line. I’ll take “necessary” on faith for a couple paragraphs, because a miss there still yields a correct parse table, even if a larger one. The “sufficient” part is more interesting. I approach the proof by proving the converse: untainted items have some valid reason why they’re fine.
I clarify this below, but mainly:
If a parse-item’s end-state is LR(0)-adequate, there are no problems.
If a parse-item’s end-state is not adequate, then its head-item is created with proper relevant look-ahead tokens.
If the head-item has a set of look-ahead tokens, these carry through to the item’s LR(1) end-state.
- CLAIM:
It’s safe to use a precedence-and-associativity strategy to disambiguate shift-reduce conflicts. Such a system would do the right thing in all circumstances.
By the logic above, any shift-reduce conflict (and then some) will get worked out with precise look-ahead detail. If the conflict still exists in the resulting automaton, then yes of course you can use P&A on it. But if the strategy uses LR(0) items leading up to some state, then there was not a conflict on that state.
- CLAIM:
It will be possible to conjure up the correct set of followers for LR(1)-mode head-items.
The algorithm for that part has almost nothing to do with the followers of the core item-sets. The only relevance is when the successor-to-what’s-followed is transparent, in which case canonical LR(1) includes that item’s followers as followers of the new head-item. By construction, this algorithm will have those followers at hand, because the corresponding core-item’s head-item will have been tainted, so the core-item will be dragging a follower-set along.
Beyond LR(1)¶
It seems possible to try a variant of this idea to expand beyond LR(1). Basically, conflicts still present in the resulting LR(1) handle-finding automaton would result in a second cycle of tainting and then recomputing where specific LR(1) head-items must get LR(2) treatment – and so forth.
A parse engine for LR(k) would be rather more complicated. Again presumably most states don’t require the full k tokens of look-ahead. And now the look-ahead is sort of a circular buffer that you alternately examine and pull from.
We don’t want giant enormous wide parse tables just because there are n^k
possible k-token look-aheads.
Most likely you’d do a prefix tree.
An Actual Algorithm¶
The algorithm overall has three high-level phases:
Compute the LR(0) graph, and also keep the item-sets (both core and closure) for each LR(0) state.
Perform the tainting step.
Build the final automaton using the careful mix of LR(0) and LR(1) items as explained.
That tainting step in the middle is a bit tricky. So I propose a multi-step process for it:
Associate each LR(0) state with each of its predecessors, thus to save a lot of time later.
Starting from the reducing parse-items in each LR(0)-inadequate state, propagate the taint.
Treat the propagation step as a transitive-closure problem.
It will be handy to have a quick way to tell which rule (or its left-hand side) is associated with any given parse-item, and the offset from a parse-item to the beginning of its rule.
For any further detail than this, see the code.
Results and Next Steps¶
I have implemented the algorithm above instrumented my parser generator to compare the size of parse tables generated with LALR, Canonical LR(1), and this “Minimal” LR(1). Current results are that this algorithm produces significantly fewer states than Canonical, but often a good number more than LALR even for grammars that are LALR-compatible:
Decaf:
Canonical: 550 States
This Thing: 360 States
LALR: 197 States
Pascal:
Canonical: 1485 States
This Thing: 702 States
LALR: 275 States
I see two major approaches to improving on these results. Both try to be more precise about where canonical parse items are necessary.
- Propagate follow-sets more precisely.
Right now if a head parse-item is tainted, effectively its entire body is tainted. But the taint contagion thing can afflict a parse-item in its middle, which perhaps might mean carrying follow-sets into portions of the automaton where it is not strictly needed.
There is a one-off hack specific to LR(0)-adequate reducing states, but some of their predecessors might be splitting needlessly in the present scheme.
- Taint fewer items: Perhaps only those where LALR shows a potential conflict.
Presently any LR(0) parse conflict results in a taint. If there is no LALR conflict on any particular token, then such a state need not be considered tainted. The difference is how the parser behaves in the event of error.
A LALR table table can sometimes reduce a few rules before discovering that it really cannot shift the next token after all. In other words, it lacks the property of detecting errors immediately. This may not be that big a deal, but it’s something to think about if you want to handle errors nicely.
Finally, there are techniques like unit-rule elimination and shift-reduce instructions. These optimizations can apply regardless of how you construct your table, so I don’t consider them specially in this document. They have not been applied in the statistics above.
- Why the odd name?
A few programming languages were named for beverages.
Booze is a beverage.
I could find no programming language called Booze.
If anything is unclear, please feel free to file an issue at GitHub or contact me via e-mail. Feedback is always welcome. My username in most places is “kjosib” and I have a gmail account.