A Pint-sized Earley Parser

Many people have said that convenient parser generation is a game-changing technology. In his talk To Trap a Better Mouse, Ian Piumarta suggested that the Earley algorithm is a good place to start because it handles full context-free grammars and is fairly trivial to implement. But some parts of the algorithm (particularly the construction of the parse forest) don't seem to have good descriptions which are easily accessible to non-experts.

So this is an attempt to fill that gap: a trivial realization of the algorithm, suitable for implementation in an afternoon, and with an annotated version for easy understanding and porting. I'm deliberately being fairly concise: for more information try Loup Vaillant-David's Earley Parsing Explained which goes into far more detail.

This is annotated pseudocode for compactness and as an attempt at language independence. If you want to look at real code, the JavaScript source is available at the git repository for this project.

The Recognizer

Note: This part of the algorithm is well-covered: the presentation in Aycock and Horspool's Practical Earley Parsing is excellent, and Wikipedia's entry isn't too bad. But I am using a modified formulation which builds only one set at a time, adds additional items which will be useful for the parser, and does not make a distinction between terminal and non-terminal symbols, so you probably want to read this section anyway.

Data Structures

Let's start with some basic data structures. A Rule has a symbol on the left hand side, and an array of symbols on the right. A Grammar has a start symbol and a set of rules indexed by symbol. Note that there may be multiple rules for the same symbol: this is the only way that we support alternatives in rules.

Rule : { String symbol, Array production }

Grammar : {
    String start,
    Dictionary([Rule]) rules
        // { "S": [S ::= X Y, S ::= A A b], ...}
}

An LR(0) item represents a partially matched rule and is generally written with a dot showing the end of the match, so S ::= Expression "+" • Term would indicate that an Expression and "+" have been matched.

LR0 : { Rule rule, Integer dot }
next_symbol(lr0) =
    if(lr0.dot < lr0.rule.production.length) lr0.rule.production[lr0.dot]

In this implementation we never use complete LR(0) items: we always promote them to a symbol so that all parse trees for a symbol will be combined under one Earley item no matter which production they derive from.

advance(lr0) =
    if(lr0.dot+1 == lr0.rule.production.length) lr0.rule.symbol
    else new LR0(lr0.rule, lr0.dot+1)
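In JavaScript these structures might look like the following sketch (the names are mine, not necessarily those in the project's source). Note how advance promotes an item straight to its symbol when the dot would reach the end of the production:

```javascript
// Plain-object versions of Rule and LR0, plus the two helpers.
function Rule(symbol, production) { return { symbol, production }; }
function LR0(rule, dot) { return { rule, dot, isLR0: true }; }

// The symbol just after the dot, or null if the item is complete.
function nextSymbol(lr0) {
  return lr0.dot < lr0.rule.production.length
    ? lr0.rule.production[lr0.dot]
    : null;
}

// Advance the dot, promoting to the bare symbol on completion.
function advance(lr0) {
  return lr0.dot + 1 === lr0.rule.production.length
    ? lr0.rule.symbol                // complete: promote to the symbol
    : LR0(lr0.rule, lr0.dot + 1);
}

const r = Rule('S', ['a', 'S', 'b']);
console.log(nextSymbol(LR0(r, 1)));                // 'S'
console.log(advance(advance(advance(LR0(r, 0))))); // 'S' (fully matched)
```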

Earley's algorithm works by keeping a table of all possible matches, both partial and complete. These are organized into Earley sets by the input position where the match ends. My sets also contain a couple of indices and a reference to the grammar so we can quickly find the items and rules we need to look at.

An Earley item consists of a symbol (for complete matches) or an LR(0) item (for partial matches) plus references to the sets (and hence input positions) where the match starts and ends.

Item : { Symbol|LR0 tag, Set start, Set end }

Set : {
    Grammar grammar,
    Integer position,  // index into the input sequence
    Array items,       // items for iteration/processing
    Dictionary idx,    // items by tag and start for uniqueness
    Dictionary wants   // incomplete items by next symbol
}

append_item(tag, start, end) =
    item = new Item(tag, start, end)
    end.items.push(item)
    end.idx[tag][start] = item
    if(tag.is_lr0) end.wants[next_symbol(tag)].push(item)
    item  // return item

To preserve uniqueness we only call append_item if the item doesn't already exist.

add_item(tag, start, end) =
    end.idx[tag][start] || append_item(tag, start, end)
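As a concrete sketch, here is the set bookkeeping in JavaScript. I flatten tags into string keys for the uniqueness index; that is a simplification of my own, not necessarily what the real source does:

```javascript
// An Earley set with its two indices.
function EarleySet(grammar, position) {
  return { grammar, position, items: [], idx: {}, wants: {} };
}

// Flatten a tag (symbol or LR0) plus start position into one key.
function itemKey(tag, start) {
  const t = tag.isLR0
    ? tag.rule.symbol + '::=' + tag.rule.production.join(' ') + '@' + tag.dot
    : tag;
  return t + '/' + start.position;
}

function appendItem(tag, start, end) {
  const item = { tag, start, end };
  end.idx[itemKey(tag, start)] = item;
  end.items.push(item);
  if (tag.isLR0) {  // index incomplete items by the symbol they want next
    const want = tag.rule.production[tag.dot];
    (end.wants[want] = end.wants[want] || []).push(item);
  }
  return item;
}

// Only append if the (tag, start) pair isn't in this set already.
function addItem(tag, start, end) {
  return end.idx[itemKey(tag, start)] || appendItem(tag, start, end);
}

const s0 = EarleySet({ start: 'S', rules: {} }, 0);
const rule = { symbol: 'S', production: ['a', 'S'] };
const lr0 = { rule, dot: 0, isLR0: true };
const first = addItem(lr0, s0, s0);
const again = addItem(lr0, s0, s0);
console.log(first === again);       // true: the duplicate is shared
console.log(s0.wants['a'].length);  // 1: indexed under its next symbol
```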

The Algorithm

We start with an initial set containing rules for the start symbol and any items derived from those. Then we build a set for each successive input symbol. When we reach the end of the string, if we have a complete item for the start symbol matching the entire string then the parse has succeeded.

parse(grammar, input) =
    s0 = process(predict(new Set(grammar, 0), grammar.start))
    sN = foldl(parse_symbol, s0, input)
    sN.idx[grammar.start][s0]  // the completed start item, if the parse succeeded

parse_symbol(set, sym) = process(scan(set, sym))

Like most parsers, Earley's works from both ends. It predicts which rules to match from the top down, but completes matches from the bottom up.

Prediction makes new LR(0) items (with the dot at the beginning) from each of a symbol's rules. A rule with an empty production is complete the moment it is predicted, so we add it as a symbol item directly, preserving the invariant that LR(0) items are always partial.

predict(set, sym) =
    for rule in set.grammar.rules[sym]
        if(rule.production.length == 0) add_item(sym, set, set)
        else add_item(new LR0(rule, 0), set, set)
    set  // return the set

When an item is complete, we look back to the set where it started and advance any partial items there which were looking for the completed symbol.

complete(item) =
    for partial in item.start.wants[item.tag]
        add_item(advance(partial.tag), partial.start, item.end)

To initialize a new set, we scan an input symbol, adding an item for that symbol. Note that the traditional Earley algorithm does not have items for terminals, and the scan operation advances all items which are looking for the input symbol. But it's convenient to have an item for the input symbol when creating the parse forest. And since we don't distinguish between terminal and non-terminal symbols, our scan can just add a single item for the input symbol and let the completer do all the advancing.

scan(set1, sym) =
    set2 = new Set(set1.grammar, set1.position + 1)
    add_item(sym, set1, set2)
    set2  // return the new set

Then we process items recursively with predict() and complete(). We predict rules for the next symbol of any partial matches, and look back and advance items which match any newly completed items.

process_once(set) =
    for(item in set.items)  // including newly-added ones
        if(item.tag.is_lr0) predict(set, next_symbol(item.tag))
        else complete(item)

If there are rules which produce the empty string, this may fail to produce all the necessary items (see Aycock and Horspool's Practical Earley Parsing for more info). There are better ways to deal with this, but we're just going to keep trying again until it fails to add any items.

process(set) =
    do
        old = set.items.length
        process_once(set)
    while(set.items.length > old)
    set  // return the set for foldl's convenience

The Parser

To extend the recognizer to a parser, we add a shared packed parse forest (SPPF) node to each Earley item, as described in Elizabeth Scott's paper SPPF-style Parsing from Earley Recognizers. She uses a pointer to a separate object, but I am using the Earley items themselves as parse forest nodes.

Scanning produces input symbol nodes which are their own derivation. Predictions start with no derivations. The completer appends the derivation of the newly completed symbol to the derivation of the item it is advancing. So we just need to augment the Earley items with two pointers: a left pointer to the previous Earley item (if any) and a right pointer to the newly completed item.

The scan and predict routines are unchanged, since they add nodes with no derivation. But complete needs to (potentially) add a new derivation, regardless of whether it adds a new item.

complete(item) =
    for partial in item.start.wants[item.tag]
        a = add_item(advance(partial.tag), partial.start, item.end)
        add_derivation(a, partial, item)

The contents of the parse pointers fall into three classes:

    left        right  meaning
    null        null   no derivations
    Item/null   Item   one derivation
    Derivation  null   multiple derivations

Items with ambiguous parses have a linked list of objects, each of which represents one derivation.

Derivation : { Item left, Item right, Derivation next }

add_derivation needs to handle each of the three classes. It also removes some unnecessary nodes on the left edge of the tree. If the left child is an LR(0) item with one or no children, then we skip it and use its child (if any) directly.

add_derivation(item, left, right) =
    if(!(left || right)) return  // empty derivation

    // remove trivial nodes on the left
    if(left && left.tag.is_lr0 && left.tag.dot <= 1) left = left.right

    if(!(item.left || item.right)) set_derivation(item, left, right)
    else if(item.right) add_second_derivation(item, left, right)
    else add_another_derivation(item, left, right)

If there aren't any derivations yet, we just set the new one. But if there are existing derivations we'll need to check whether the new one is already present.

set_derivation(item, left, right) =
    item.left = left
    item.right = right

same_derivation(item, left, right) =
    item.left == left && item.right == right

If there is one existing derivation, we need to turn it into a list of one item and then add the new one.

add_second_derivation(item, left, right) =
    if(!same_derivation(item, left, right))
        old = new Derivation(item.left, item.right, null)
        item.left = new Derivation(left, right, old)
        item.right = null

Finally, if there are multiple existing derivations, we need to search the list to see if the derivation is present.

add_another_derivation(item, left, right) =
    d = item.left
    while(d)
        if(same_derivation(d, left, right)) return
        d = d.next
    item.left = new Derivation(left, right, item.left)
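The three cases can be exercised on their own. Here is a standalone JavaScript sketch of the derivation bookkeeping using plain objects for items; it omits the trivial-left-node removal (which needs the LR(0) tags), and the names are mine:

```javascript
// A linked-list cell holding one derivation of an ambiguous item.
function Derivation(left, right, next) { return { left, right, next }; }

function sameDerivation(d, left, right) {
  return d.left === left && d.right === right;
}

function addDerivation(item, left, right) {
  if (!left && !right) return;              // empty derivation
  if (!item.left && !item.right) {          // none yet: store it inline
    item.left = left; item.right = right;
  } else if (item.right) {                  // one: promote to a list of two
    if (!sameDerivation(item, left, right)) {
      const old = Derivation(item.left, item.right, null);
      item.left = Derivation(left, right, old);
      item.right = null;
    }
  } else {                                  // many: search, then prepend
    for (let d = item.left; d; d = d.next)
      if (sameDerivation(d, left, right)) return;
    item.left = Derivation(left, right, item.left);
  }
}

const item = { left: null, right: null };
const x = {}, y = {}, z = {};
addDerivation(item, x, y);        // first derivation, stored inline
addDerivation(item, x, y);        // exact duplicate, ignored
console.log(item.right === y);    // true: still a single derivation
addDerivation(item, x, z);        // a second, distinct derivation
console.log(item.right === null); // true: now a list of two in item.left
```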

Walking the Forest

It's a little bit awkward to walk the forest, since the links go back to the left. But collecting the children into their proper order isn't too bad. You basically just do a post-order traversal, accumulating the children before performing the node's action, and for intermediate nodes (those tagged with an LR(0) item) you do nothing but accumulate the children.

So we can provide a forest walker which takes a set of handler functions: one to process a derivation (an array of child values, one per symbol in the production), one for a set of choices (an array of alternative derivations), and one for a completed symbol (with its derivation).

WalkerFns : {
    Function(Array) derivation,
    Function(Array) choices,
    Function(Item, Array) symbol
}

process_forest(f, fns) =
    if(!f) return null

    if(f is ambiguous)
        // f.left is a linked list of Derivations: one result per choice
        c = map(d => fns.derivation(collect_children(d, fns)), f.left)
        result = fns.choices(c)
    else if(f is intermediate)
        result = collect_children(f, fns)
    else
        children = collect_children(f, fns)
        d = children && fns.derivation(children)
        result = fns.symbol(f, d)
    result

The helper function pretty much just puts the children into an array, except that if the left child is an intermediate node, it uses that array directly rather than creating a new level of nesting.

collect_children(f, fns) =
    left = process_forest(f.left, fns)
    right = process_forest(f.right, fns)

    if(left)
        result = (f.left is intermediate) ? left : [left]
        if(right) result.push(right)
    else if(right) result = [right]
    else result = null
    result

Then we can easily visualize a parse forest as nested HTML tables.

html_derivation(d) = tr(map(td, d))
html_choices(c) = table(map(tr . td, c))
html_symbol(item, derivation) =
    if(derivation)
        symbol = tr(td(item.tag, colSpan = derivation.length))
        table(symbol, derivation)
    else item.tag

And here it is in action.

g = new Grammar('S',
    S ::= a X X c
    X ::= X b
    X ::=
)

parse(g, 'abbc')
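If you want to see the whole recognizer end to end, here is a self-contained JavaScript sketch run on this grammar. It follows the pseudocode above, but the names and the string-keyed indexes are simplifications of my own, not the project's actual source:

```javascript
function Rule(symbol, production) { return { symbol, production }; }
function Grammar(start, rules) { return { start, rules }; }
function LR0(rule, dot) { return { rule, dot, isLR0: true }; }

function nextSymbol(lr0) { return lr0.rule.production[lr0.dot]; }

// Advance the dot, promoting to the bare symbol on completion.
function advance(lr0) {
  return lr0.dot + 1 === lr0.rule.production.length
    ? lr0.rule.symbol : LR0(lr0.rule, lr0.dot + 1);
}

function EarleySet(grammar, position) {
  return { grammar, position, items: [], idx: {}, wants: {} };
}

// Flatten a tag plus start position into a string key for uniqueness.
function key(tag, start) {
  const t = tag.isLR0
    ? tag.rule.symbol + '::=' + tag.rule.production.join(' ') + '@' + tag.dot
    : tag;
  return t + '/' + start.position;
}

function addItem(tag, start, end) {
  const k = key(tag, start);
  if (end.idx[k]) return end.idx[k];
  const item = { tag, start, end };
  end.idx[k] = item;
  end.items.push(item);
  if (tag.isLR0) {  // index incomplete items by their next symbol
    const want = nextSymbol(tag);
    (end.wants[want] = end.wants[want] || []).push(item);
  }
  return item;
}

function predict(set, sym) {
  for (const rule of set.grammar.rules[sym] || []) {
    if (rule.production.length === 0) addItem(sym, set, set); // empty rule
    else addItem(LR0(rule, 0), set, set);
  }
  return set;
}

function complete(item) {
  for (const partial of item.start.wants[item.tag] || [])
    addItem(advance(partial.tag), partial.start, item.end);
}

function processOnce(set) {
  for (let i = 0; i < set.items.length; i++) {  // includes newly-added items
    const item = set.items[i];
    if (item.tag.isLR0) predict(set, nextSymbol(item.tag));
    else complete(item);
  }
}

function processSet(set) {  // re-run until no new items appear (for ε-rules)
  let old;
  do { old = set.items.length; processOnce(set); }
  while (set.items.length > old);
  return set;
}

function scan(set1, sym) {
  const set2 = EarleySet(set1.grammar, set1.position + 1);
  addItem(sym, set1, set2);
  return set2;
}

function parse(grammar, input) {
  const s0 = processSet(predict(EarleySet(grammar, 0), grammar.start));
  let set = s0;
  for (const sym of input) set = processSet(scan(set, sym));
  return set.idx[key(grammar.start, s0)] || null;  // completed start item
}

const g = Grammar('S', {
  S: [Rule('S', ['a', 'X', 'X', 'c'])],
  X: [Rule('X', ['X', 'b']), Rule('X', [])],
});
console.log(!!parse(g, 'abbc'));  // true
console.log(!!parse(g, 'abd'));   // false
```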

Choosing a Parse Tree

It's also fairly easy to walk the forest and reduce it to a single tree. I'm using rule priorities (earlier rules have higher priority) to decide. This gives you the nice ordered choice behavior of PEGs without the strange behaviors PEGs can exhibit (e.g. “S ::= a S a / a a” matching only strings of a's whose length is a power of two, instead of all even length ones).

I think this ordered choice behavior gives you enough control for most practical use. For instance, applying it to our example above gives you longest-match behavior, while reversing the order of the rules for X would give you shortest-match behavior instead.

To implement this, we need to modify several things. Obviously we need to modify our grammar initialization code to assign priority numbers.

We need to make sure we can tell which rule an item derives from. This isn't directly accessible when we have a complete item: they just have the symbol. You could probably look back to the left to find an LR(0) item, or look it up in the grammar for null derivations. But I have chosen to add a rule reference to each item and derivation, and to update add_second_derivation so that it moves the rule from the (now ambiguous) item to the derivation.

Then we simply walk the tree, choosing one option for each ambiguous derivation.

disambiguate(f, gt) =
    if(!f) return
    if(is_ambiguous(f))
        best = f.left,  d = f.left
        while(d = d.next)
            best = (cmp(d, best, gt) > 0) ? d : best
        // replace the list with the chosen derivation
        f.left = best.left,  f.right = best.right,  f.rule = best.rule
    // handle our children's derivations
    disambiguate(f.left, gt)
    disambiguate(f.right, gt)

is_ambiguous(f) = f && f.left && !f.right

The tricky stuff is in the actual comparison function. We make the decision based on the rule, and only if the rules are equal do we need to recurse deeper. Since equal rules mean we are comparing corresponding children, we are only ever comparing two items of the same type (partial, complete, or input symbol). So:

cmp(a, b, gt) =
    if(!(a && b)) result = 0  // treat missing children as equal
    else
        disambiguate(a, gt)
        disambiguate(b, gt)
        if(a.rule == b.rule)
            result = cmp(a.left, b.left, gt)
                || cmp(a.right, b.right, gt)
        else if(gt(a.rule, b.rule)) result = 1
        else result = -1  // must be less - we know they're not equal
    result

Then disambiguating by priority is trivial.

higher_priority(a, b) = a.priority > b.priority

prioritized_tree(f) = disambiguate(f, higher_priority)
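Here is a standalone JavaScript sketch of the disambiguation pass run on a tiny hand-built forest. The object shapes follow the pseudocode (an ambiguous item holds a Derivation list in left with right null, and rules carry a priority), but the fixture itself is invented for illustration:

```javascript
function isAmbiguous(f) { return !!(f && f.left && !f.right); }

// Compare two subtrees; rules decide, recursing only on equal rules.
function cmp(a, b, gt) {
  if (!a || !b) return 0;          // treat missing children as equal
  disambiguate(a, gt);
  disambiguate(b, gt);
  if (a.rule === b.rule)
    return cmp(a.left, b.left, gt) || cmp(a.right, b.right, gt);
  return gt(a.rule, b.rule) ? 1 : -1;
}

function disambiguate(f, gt) {
  if (!f) return;
  if (isAmbiguous(f)) {
    let best = f.left;
    for (let d = f.left.next; d; d = d.next)
      if (cmp(d, best, gt) > 0) best = d;
    // replace the list with the chosen derivation
    f.left = best.left; f.right = best.right; f.rule = best.rule;
  }
  disambiguate(f.left, gt);
  disambiguate(f.right, gt);
}

const higherPriority = (a, b) => a.priority > b.priority;

// Two competing derivations of the same span, from rules r1 and r2.
const r1 = { priority: 1 }, r2 = { priority: 2 };
const a = { rule: null, left: null, right: null };   // an input symbol leaf
const d1 = { rule: r1, left: null, right: a, next: null };
const d2 = { rule: r2, left: null, right: a, next: d1 };
const f = { rule: null, left: d2, right: null };     // ambiguous item

disambiguate(f, higherPriority);
console.log(f.rule === r2);  // true: the higher-priority rule won
```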

Adding Parse Actions

Then adding parse actions turns out to be trivial: just add an action to each rule and call them in the tree walker.

eval_fns = {
    choices: error,  // a disambiguated tree has no remaining choices
    derivation: function(d) = d,
    symbol: function(item, derivation) =
        if(item.rule) item.rule.action(derivation)
        else if(derivation) derivation
        else item.tag  // input symbol; just return it.
}

Let's try it.

(Note the way the priority is "backwards": since all the rules are going to be part of the parse, the priority just chooses which ones are closer to the top of the tree. But operator precedence works the other way: higher precedence operators bind more tightly to the leaves.)

expr = new Grammar('E',
    E ::= E + E → d[0] + d[2]
    E ::= E * E → d[0] * d[2]
    E ::= 2 → 2
    E ::= 3 → 3
    E ::= 5 → 5
    E ::= 7 → 7
)

tree = prioritized_tree(parse(expr, '2*3+5*7'))
process_forest(tree, eval_fns)

Future Work

I'm excited about finally getting this algorithm figured out and running, so I'm writing it up and posting it right away. Since I first posted it I've been working on it quite a bit. I think that to be useful it mostly just needs a way to enter grammars in a convenient notation. I got bogged down deciding how far to go with a BNF syntax, so I've set it aside until I have some use cases and know what I want...