# A Pint-sized Earley Parser

Many people have said that convenient parser generation is a
game-changing technology. In his talk *To Trap a Better Mouse*,
Ian Piumarta suggested
that the Earley algorithm is a good place to start because it handles
full context-free grammars and is fairly trivial to implement. But some
parts of the algorithm (particularly the construction of the parse
forest) don't seem to have good descriptions which are easily accessible
to non-experts.

So this is an attempt to fill that gap: a trivial realization of the algorithm, suitable for implementation in an afternoon, with an annotated version for easy understanding and porting. I'm deliberately being fairly concise: for more information, try Loup Vaillant-David's Earley Parsing Explained, which goes into far more detail.

This is annotated pseudocode for compactness and as an attempt at language independence. If you want to look at real code, the JavaScript source is available at the git repository for this project.

## The Recognizer

Note: This part of the algorithm is well-covered: the presentation in Aycock and Horspool's Practical Earley Parsing is excellent, and Wikipedia's entry isn't too bad. But I am using a modified formulation which builds only one set at a time, adds additional items which will be useful for the parser, and does not make a distinction between terminal and non-terminal symbols, so you probably want to read this section anyway.

### Data Structures

Let's start with some basic data structures. A **Rule**
has a symbol on the left hand side, and an array of symbols on the
right. A **Grammar** has a start symbol and a set of rules
indexed by symbol. Note that there may be multiple rules for the same
symbol: this is the only way that we support alternatives in rules.

```
Rule : { String symbol, Array production }

Grammar : {
  String start,
  Dictionary([Rule]) rules
  // { "S": [S ::= X Y, S ::= A A b], ... }
}
```
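
For concreteness, here is roughly what the example grammar looks like as plain JavaScript objects (a sketch using the field names above; the repository's actual code may differ):

```
// The example grammar from the comment above, as plain objects.
// Alternatives are just multiple rules under the same symbol.
const rules = {
  S: [
    { symbol: 'S', production: ['X', 'Y'] },
    { symbol: 'S', production: ['A', 'A', 'b'] },
  ],
};
const grammar = { start: 'S', rules };
```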

An **LR(0) item** represents a partially matched rule
and is generally written with a dot showing the end of the match, so

`S ::= Expression "+" • Term`

would indicate that an `Expression` and `"+"` have been matched.

```
LR0 : { Rule rule, Integer dot }

next_symbol(lr0) =
  if(lr0.dot < lr0.rule.production.length) lr0.rule.production[lr0.dot]
```

In this implementation we never use complete LR(0) items: we always promote them to a symbol so that all parse trees for a symbol will be combined under one Earley item no matter which production they derive from.

```
advance(lr0) =
  if(lr0.dot + 1 == lr0.rule.production.length) lr0.rule.symbol
  else new LR0(lr0.rule, lr0.dot + 1)
```
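
In JavaScript these helpers come out almost verbatim (a sketch; the class and the interning trick are mine, but interning one LR(0) object per rule and dot lets later code use them directly as dictionary keys):

```
// LR(0) items: a rule plus how many of its symbols have been matched.
class LR0 {
  constructor(rule, dot) {
    this.rule = rule;
    this.dot = dot;
    this.is_lr0 = true;
  }
}

// Intern items so there is exactly one object per (rule, dot).
function lr0(rule, dot) {
  if (!rule.lr0s) rule.lr0s = [];
  if (!rule.lr0s[dot]) rule.lr0s[dot] = new LR0(rule, dot);
  return rule.lr0s[dot];
}

// Symbol after the dot, or undefined when the item is complete.
function next_symbol(item) {
  return item.rule.production[item.dot];
}

// Advance the dot, promoting to the bare symbol on completion so that
// all parse trees for a symbol end up under one Earley item.
function advance(item) {
  if (item.dot + 1 === item.rule.production.length) return item.rule.symbol;
  return lr0(item.rule, item.dot + 1);
}
```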

Earley's algorithm works by keeping a table of all possible matches,
both partial and complete. These are organized into **Earley
sets** by the input position where the match ends. My sets also
contain a couple of indices and a reference to the grammar so we can
quickly find the items and rules we need to look at.

An **Earley item** consists of a symbol (for complete
matches) or an LR(0) item (for partial matches) plus references to the
sets (and hence input positions) where the match starts and ends.

```
Item : { Symbol|LR0 tag, Set start, Set end }

Set : {
  Grammar grammar,
  Integer position, // index into the input sequence
  Array items,      // items for iteration/processing
  Dictionary idx,   // items by tag and start, for uniqueness
  Dictionary wants, // incomplete items by next symbol
}

append_item(tag, start, end) =
  item = new Item(tag, start, end)
  end.items.push(item)
  end.idx[tag][start] = item
  if(tag.is_lr0) end.wants[next_symbol(tag)].push(item)
  item // return the new item
```

To preserve uniqueness we only call `append_item` if the item doesn't already exist.

```
add_item(tag, start, end) =
  end.idx[tag][start] || append_item(tag, start, end)
```
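
The pseudocode glosses over creating the nested dictionaries. Here is one way to spell it out in JavaScript, using `Map`s keyed by tag and by start set (a sketch; the layout is my assumption):

```
// Earley sets with the two indices from the pseudocode above.
class EarleySet {
  constructor(grammar, position) {
    this.grammar = grammar;
    this.position = position;
    this.items = [];        // insertion order, for processing
    this.idx = new Map();   // tag -> (start set -> item), for uniqueness
    this.wants = new Map(); // next symbol -> incomplete items
  }
}

function get_item(end, tag, start) {
  const byStart = end.idx.get(tag);
  return byStart && byStart.get(start);
}

function append_item(tag, start, end) {
  const item = { tag, start, end, left: null, right: null };
  end.items.push(item);
  if (!end.idx.has(tag)) end.idx.set(tag, new Map());
  end.idx.get(tag).set(start, item);
  if (tag.is_lr0) { // index incomplete items by the symbol they await
    const sym = next_symbol(tag);
    if (!end.wants.has(sym)) end.wants.set(sym, []);
    end.wants.get(sym).push(item);
  }
  return item;
}

function add_item(tag, start, end) {
  return get_item(end, tag, start) || append_item(tag, start, end);
}
```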

### The Algorithm

We start with an initial set containing items for the start symbol's rules and anything derived from those. Then we build a set for each successive input symbol. When we reach the end of the input, if we have a complete item for the start symbol spanning the entire input, the parse has succeeded.

```
parse(grammar, input) =
  s0 = process(predict(new Set(grammar, 0), grammar.start))
  sN = foldl(parse_symbol, s0, input)
  sN.idx[grammar.start][s0] // complete item spanning the whole input, if any

parse_symbol(set, sym) = process(scan(set, sym))
```
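
In JavaScript, `foldl` is just `reduce`. Using the sketches above (and assuming similar translations of `predict`, `scan`, and `process`), the driver might look like:

```
// Fold the input through scan/process, then look for a complete parse.
function parse(grammar, input) {
  const s0 = new EarleySet(grammar, 0);
  predict(s0, grammar.start);
  process(s0);
  const sN = [...input].reduce((set, sym) => process(scan(set, sym)), s0);
  return get_item(sN, grammar.start, s0); // undefined if the parse failed
}
```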

Like most parsers, Earley's works from both ends. It
*predicts* which rules to match from the top down, but
*completes* matches from the bottom up.

Prediction makes new LR(0) items (with the dot at the beginning) from each of a symbol's rules.

```
predict(set, sym) =
  for rule in set.grammar.rules[sym]
    add_item(new LR0(rule, 0), set, set)
```

When an item is complete, we look back to the set where it started and advance any partial items there which were looking for the completed symbol.

```
complete(c) =
  for item in c.start.wants[c.tag]
    add_item(advance(item.tag), item.start, c.end)
```

To initialize a new set, we *scan* an input symbol, adding an
item for that symbol. Note that the traditional Earley algorithm does
*not* have items for terminals, and the scan operation advances
all items which are looking for the input symbol. But it's convenient
to have an item for the input symbol when creating the parse forest.
And since we don't distinguish between terminal and non-terminal
symbols, our scan can just add a single item for the input symbol and
let the completer do all the advancing.

```
scan(set1, sym) =
  set2 = new Set(set1.grammar, set1.position + 1)
  add_item(sym, set1, set2)
  set2 // return the new set
```

Then we process items recursively with `predict()` and `complete()`: we predict rules for the next symbol of any partial match, and look back and advance items which match any newly completed item.

```
process_once(set) =
  for(item in set.items) // including newly-added ones
    if(item.tag.is_lr0) predict(set, next_symbol(item.tag))
    else complete(item)
```

If there are rules which produce the empty string, a single pass may fail to produce all the necessary items, because a nullable symbol can complete before the items which want it exist. For example, with `S ::= A A` and an empty rule `A ::=`, the item `S ::= A • A` is only created after the completion of `A` has already been processed, so nothing advances it past its second `A` (see Aycock and Horspool's Practical Earley Parsing for more info). There are better ways to deal with this, but we're just going to keep trying again until a pass fails to add any items.

```
process(set) =
  do
    old = set.items.length
    process_once(set)
  while(set.items.length > old)
  set // return the set for foldl's convenience
```

## The Parser

To extend the recognizer to a parser, we add a shared packed parse forest (SPPF) node to each Earley item, as described in Elizabeth Scott's paper SPPF-style Parsing from Earley Recognizers. She uses a pointer to a separate object, but I am using the Earley items themselves as parse forest nodes.

Scanning produces input symbol nodes which are their own derivation.
Predictions start with no derivations. The completer appends the
derivation of the newly completed symbol to the derivation of the item
it is advancing. So we just need to augment the Earley items with two
pointers: a *left* pointer to the previous Earley item (if any)
and a *right* pointer to the new item.

The `scan` and `predict` routines are unchanged, since they add nodes with no derivation. But `complete` needs to (potentially) add a new derivation, regardless of whether it adds a new item.

```
complete(c) =
  for item in c.start.wants[c.tag]
    a = add_item(advance(item.tag), item.start, c.end)
    add_derivation(a, item, c)
```

The contents of the parse pointers fall into three classes:

| Links | Meaning |
| --- | --- |
| null/null | no derivations |
| null/item, item/item | single derivation |
| list/null | multiple derivations |

Items with ambiguous parses have a linked list of objects, each of which represents one derivation.

```
Derivation : { Item left, Item right, Derivation next }
```

`add_derivation` needs to handle each of the three classes. It also removes some unnecessary nodes on the left edge of the tree: if the left child is an LR(0) item with one or no children, we skip it and use its child (if any) directly.

```
add_derivation(item, left, right) =
  if(!(left || right)) return // empty derivation
  // remove trivial nodes on the left edge
  if(left && left.tag.is_lr0 && left.tag.dot <= 1) left = left.right
  if(!(item.left || item.right)) set_derivation(item, left, right)
  else if(item.right) add_second_derivation(item, left, right)
  else add_another_derivation(item, left, right)
```

If there aren't any derivations yet, we just set the new one. But if
there *are* existing derivations we'll need to check whether the
new one is already present.

```
set_derivation(item, left, right) =
  item.left = left
  item.right = right

same_derivation(item, left, right) =
  item.left == left && item.right == right
```

If there is one existing derivation, we need to turn it into a list of one item and then add the new one.

```
add_second_derivation(item, left, right) =
  if(!same_derivation(item, left, right))
    old = new Derivation(item.left, item.right, null)
    item.left = new Derivation(left, right, old)
    item.right = null
```

Finally, if there are multiple existing derivations, we need to search the list to see if the derivation is present.

```
add_another_derivation(item, left, right) =
  d = item.left
  while(d)
    if(same_derivation(d, left, right)) return
    d = d.next
  item.left = new Derivation(left, right, item.left)
```

## Walking the Forest

It's a little bit awkward to walk the forest, since the links go back
to the left. But collecting the children into their proper order isn't
*too* bad. You basically just do a post-order traversal,
accumulating the children before performing the node's action, and for
intermediate nodes (those tagged with an LR(0) item) you do nothing
*but* accumulate the children.

So we can provide a forest walker which takes a set of handler functions: one to process a derivation (the array of children matched by a rule), one for a set of choices (an array of alternative derivations), and one for a completed symbol (with its processed derivation).

```
WalkerFns : {
  Function(Array) derivation,
  Function(Array) choices,
  Function(Symbol, Array) symbol
}

process_forest(f, fns) =
  if(!f) return
  result = null
  if(f is ambiguous) // f.left is a linked list of Derivations
    c = map(fns.derivation(collect_children(_, fns)), f.left)
    result = fns.choices(c)
  else if(f is intermediate)
    result = collect_children(f, fns)
  else
    d = fns.derivation(collect_children(f, fns))
    result = fns.symbol(f.tag, d)
  result
```

The helper function pretty much just puts the children into an array, except that if the left child is an intermediate node, it uses that array directly rather than creating a new level of nesting.

```
collect_children(f, fns) =
  left = process_forest(f.left, fns)
  right = process_forest(f.right, fns)
  if(left)
    result = (f.left is intermediate) ? left : [left]
    if(right) result.push(right)
  else if(right) result = [right]
  else result = null
  result
```
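
For a simpler example than the HTML one below, here are handlers that just rebuild the forest as nested arrays (a sketch; the names are mine, and `choices` wraps the alternatives so ambiguity stays visible):

```
// Turn a parse forest into nested arrays.
const sexpr_fns = {
  derivation: (children) => children,
  choices: (alternatives) => ({ ambiguous: alternatives }),
  symbol: (tag, children) => (children ? [tag, children] : tag),
};
// Usage: process_forest(parse(g, 'abbc'), sexpr_fns)
```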

Then we can easily visualize a parse forest as nested HTML tables.

```
html_derivation(d) = tr(map(td, d))
html_choices(c) = table(map(tr . td, c))

html_symbol(name, derivation) =
  if(derivation)
    symbol = tr(td(name, colSpan = derivation.length))
    table(symbol, derivation)
  else name
```

And here it is in action.

```
g = new Grammar('S',
  S ::= a X X c
  X ::= X b
  X ::=
)
parse(g, 'abbc')
```
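
If everything is wired up right, this parse should be ambiguous: the two b's can be split between the two X's in three ways (nothing/bb, b/b, bb/nothing), so the completed `S` item carries three derivations and `html_choices` renders them as three stacked rows.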

## Choosing a Parse Tree

It's also fairly easy to walk the forest and reduce it to a single tree. I'm using rule priorities (earlier rules have higher priority) to decide. This gives you the nice ordered-choice behavior of PEGs without the strange behaviors PEGs can exhibit (e.g. “S ::= a S a / a a” matching only strings of a's whose length is a power of two, instead of all even-length ones).

I think this ordered choice behavior gives you enough control for most practical use. For instance, applying it to our example above gives you longest-match behavior, while reversing the order of the rules for X would give you shortest-match behavior instead.

To implement this, we need to modify several things. Obviously we need to modify our grammar initialization code to assign priority numbers.
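
For example, the constructor might just number the rules as they are declared, so that earlier rules compare as higher priority under `higher_priority` below (a sketch; the real initialization code may differ):

```
// Earlier rules get higher priority: number them downwards as declared.
function assign_priorities(rulesInDeclarationOrder) {
  rulesInDeclarationOrder.forEach((rule, i) => { rule.priority = -i; });
}
```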

We need to make sure we can tell which rule an item derives from. This isn't directly accessible for a complete item: it just has the symbol. You could probably look back to the left to find an LR(0) item, or look the rule up in the grammar for null derivations. But I have chosen to add a rule reference to each item and derivation, and to update `add_second_derivation` so that it moves the rule from the (now ambiguous) item to the derivation.

Then we simply walk the tree, choosing one option for each ambiguous derivation.

```
disambiguate(f, gt) =
  if(f && is_ambiguous(f))
    best = f.left
    d = f.left
    while(d = d.next)
      best = (cmp(d, best, gt) > 0) ? d : best
    // collapse to the chosen derivation, restoring the rule that
    // add_second_derivation moved onto it
    f.left = best.left; f.right = best.right; f.rule = best.rule
  // handle our children's derivations
  if(f.left) disambiguate(f.left, gt)
  if(f.right) disambiguate(f.right, gt)

is_ambiguous(f) = f && f.left && !f.right
```

The tricky stuff is in the actual comparison function. We make the decision based on the rule; only if the rules are equal do we need to recurse deeper to make a decision. Note that we only ever compare two items of the same kind (partial, complete, or input symbol). So:

- If they're both null, then they're equal.
- Otherwise we disambiguate them, so we're dealing with a single parse tree on each side. Then:
  - If their rules are equal, we recurse and compare their children. (In JavaScript this also handles input symbols: their rules are both `undefined`, so we fall into this case and their null children compare as equal.)
  - If their rules are not equal, we pass them to a user-defined rule comparison function which tells us which one is preferred.

```
cmp(a, b, gt) =
  if(!(a || b)) result = 0
  else
    disambiguate(a, gt)
    disambiguate(b, gt)
    if(a.rule == b.rule)
      result = cmp(a.left, b.left, gt)
            || cmp(a.right, b.right, gt)
    else if(gt(a.rule, b.rule)) result = 1
    else result = -1 // must be less - we know they're not equal
  result
```

Then disambiguating by priority is trivial.

```
higher_priority(a, b) = a.priority > b.priority
prioritized_tree(f) = disambiguate(f, higher_priority)
```

## Adding Parse Actions

Then adding parse actions turns out to be trivial: just add an action to each rule and call them in the tree walker.

```
eval_fns = {
  choices: error, // after disambiguation we expect no ambiguous nodes
  derivation: function(d) = d,
  symbol: function(item, derivation) =
    if(item.rule) item.rule.action(derivation)
    else if(derivation) derivation
    else item.tag // input symbol; just return it
}
```

Let's try it.

(Note the way the priority is "backwards": since all the rules are going to be part of the parse, the priority just chooses which ones end up closer to the top of the tree. But operator precedence works the other way: higher-precedence operators bind more tightly and so sit closer to the leaves.)

```
expr = new Grammar('E',
  E ::= E + E → d[0] + d[2]
  E ::= E * E → d[0] * d[2]
  E ::= 2 → 2
  E ::= 3 → 3
  E ::= 5 → 5
  E ::= 7 → 7
)
tree = prioritized_tree(parse(expr, '2*3+5*7'))
process_forest(tree, eval_fns)
```
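
Since `E ::= E + E` is declared first, it has the highest priority and is chosen at the top of the tree, so the string parses as `(2*3) + (5*7)` and the evaluator should return 41.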

## Future Work

I'm excited about finally getting this algorithm figured out and running, so I'm writing it up and posting it right away. Since I first posted it I've been working on it quite a bit. I think that to be useful it mostly just needs a way to enter grammars in a convenient notation. I got bogged down deciding how far to go with a BNF-style syntax, so I've set it aside until I have some use cases and know what I want...