Better parser errors for Lua

One of the passion projects I love to work on, but don’t talk about very often is illuaminate. illuaminate aims to be a comprehensive static analysis toolkit for Lua, combining type-checking, linting and documentation generation into one program.

This is ambitious goal, and I’ve almost definitely bitten off more than I can chew! For now, illuaminate is in the awkward state of being “good enough” that I can cope with the current functionality, but a long way from being usable by anyone else.

One thing I’ve been wanting to improve for a while is parser error messages. Being able to parse an input program is the most basic part of the whole of illuaminate, and we still do an embarrassingly bad job!

Let’s take a common mistake we’ve all made, using = to check for equality rather than ==.

if a = b then
  print("Hello")
end

If we feed this into illuaminate, it’ll spit out this message:

Unexpected `=`: This is unexpected after a simple expression.
   │
 1 │ if a = b then
   │      ^

It’s technically better than what PUC Lua does ('then' expected near '='), but only in the sense that it’s prettier! The message is still rather useless.

Under the hood, illuaminate uses Menhir, an LR(1) parser generator for OCaml.

Shift-reduce and all that guff

Oh gosh, I’m going to have to talk about LR(1) parsers, aren’t I? The whole subject of parsing programming languages is a wide field (and often a whole course at university), so I’ll try to keep this brief.

When we talk about parsing programming languages, it’s useful to distinguish between the syntax of the language (or “grammar”), and implementation details of how parsing actually happens.

The grammar of a language is defined using three sets of objects:

Terminals: Terminals are the individual tokens fed into the parser, such as keywords (if), symbols (=) or identifiers (a).
Non-terminals: A non-terminal is something the parser can produce by consuming other terminals and non-terminals. These correspond to syntactic elements in a programming language, such as expressions or statements.
Productions: Productions are the rules that describe how to parse a non-terminal from a sequence of terminals and non-terminals. For instance, the production stmt -> IF expr THEN block END says “a statement can be formed by the if keyword, an expression, the THEN keyword, a block, and finally the END keyword).

Parser generators then have the tricky job of taking all of this, and building an efficient parser out of it. Menhir does this by building a complex state machine called an LR(1) parser.

Rather than having a single state, like finite state machines, LR(1) parsers have a stack of states. For each input non-terminal (called the lookahead¹), they take the state at the top of the stack, and use a big lookup table to decide whether to push (or “shift”) a new state onto the stack, or to apply a production — popping several items off the stack and “reducing” it to a single new state.

I’m definitely glossing over some of the details here (for instance, how reductions are performed), but it’s not really relevant here.

Instead, we’ll talk about one more useful bit of parser terminology: “items”. These are effectively in-progress productions, using an extra . to show how far we’ve parsed something. For instance, stmt -> IF . expr THEN block END is an item where we’ve consumed an if keyword and are now trying to parse the rest of the if statement.

One useful property of LR(1) parsers is that every item has one or more states associated with it (and visa versa). For instance, our stmt -> IF . expr THEN block END is represented by exactly one LR(1) state — the state we get after shifting an if keyword.

Right, let’s get back on track…

Error messages in Menhir

LR(1) parser generators are a little notorious for leading to bad error messages. Thankfully, Menhir has a simple but clever solution — it finds every state in the LR(1) machine which can error, and gets the developer to write an error message for each state.

In the case with the unexpected =, our parser is in a state where it is trying to parse an expression, and hit an unexpected token. Menhir will show us the exact productions its trying to parse in this state, and prompt us to write an error message:

program: RETURN IDENT TRUE
##
## Ends in an error in state: 35.
##
## atom -> simple_expr . [ WHILE ... ]
## call -> simple_expr . COLON ident call_args [ WHILE ... ]
## name -> simple_expr . DOT ident [ WHILE ... ]
## name -> simple_expr . OSQUARE expr CSQUARE [ WHILE ... ]
##
## The known suffix of the stack is as follows:
## simple_expr
##

This is unexpected after a simple expression.

This actually can work surprisingly well, but it’s also incredibly particular. As an expression can never be followed by an =, due to how LR(1) parsers work, we’ll get an immediate error. However, if we’d written a different symbol (maybe ]), then the parser will finish parsing the expression, and then try to look for a then, giving us an entirely different error!

Unexpected `]`: Expected `then` after `if` condition.
   │
 1 │ if a ] b then
   │      ^

At the start of 2023 (yes, this is another blog post for work I did a year ago!), I started looking for alternative options, and came across the incredible lrgrep by Frédéric Bour.

Rather than writing error messages for particular LR(1) states, lrgrep allows you to write patterns which match against the whole parser stack, allowing far more powerful and context-aware error reporting.

A basic lrgrep rule

Let’s see this in practice. We’d like to write an error message for the case where an expression is followed by an equals sign. In lrgrep’s pattern-matching language, this would look something like this:

(* Match `expr() = ` in any other context: probably wanted `expr() == `. *)
| [expr] @ EQUALS
  { Error.Use_double_equals (token, $startloc(token), $endloc(token)) }

Let’s take this bit by bit:

[expr]: This matches any parser stack where the head of it reduces (the [...]) to an expression (expr).
@ EQUALS: Anything after the @ matches the lookahead token. In this case, we want to report an error when the next token is an EQUALS (=).
{ ... }: This then constructs our error message and returns it. Here, we include the position of the erroneous token.

Now then, all we need to do is write an error message and try it out!

Unexpected `=` in expression.
   │
 1 │ if a = b then
   │      ^
Tip: Replace this with `==` to check if two values are equal.

Filtering rules

The peril with witting error messages like this is that you don’t want to predict your user too much. There’s always going to be some case you haven’t considered, and you don’t want to provide a misleading error message!

For instance, a programmer might instead be trying to create a table with an expression key. The correct syntax for this is { ["some expression"] = 123 }, but it’s easy to forget those square brackets:

return {
  "some expression" = 123,
}

If we then start yelling about == here, then it’s probably not very helpful. Fortunately, lrgrep allows us to write a more precise error message:

(* Match `expr() =` in tables: probably wanted `[expr()] = ...` *)
| [_ / table_entry: expr .] @ EQUALS
  { Error.Table_key_equals (token, $startloc(token), $endloc(token)) }

This is a familiar pattern to what we had before ([...] @ EQUALS), but instead of matching an expression, we’ve got something a little more complex. Let’s break it down again.

_ is just a wildcard, matching anything.
/ then filters this wildcard, restricting us to a set of LR(1) states.
table_entry: expr . filters us to match all LR(1) states corresponding to the table_entry -> expr . item.

The effect of this rule is similar to our much-simpler [expr] @ EQUALS, but instead only applies when immediately inside a table!

Unexpected `=` in expression.
   │
 2 │   "some expression" = 123,
   │                     ^
Tip: Wrap the preceding expression in `[` and `]` to use it as a table key.

In closing

I’ve been really impressed with lrgrep so far. I’ve been using it both in illuaminate, and the much more popular CC: Tweaked (blog post about how that works soon™️), and it’s been working flawlessly.

One thing I cannot shout enough about is how easy it is to write context-sensitive error messages. These are definitely possible to do in recursive descent parsers, but can end up being a little inelegant.

Many, many thanks to Frédéric Bour for all their work on lrgrep. They’ve been incredibly helpful in answering all the questions I’ve had.

If you’re interested, do check out the full lrgrep rules used in illuaminate².

Thanks for reading!

The “1” in LR(1) refers to the fact that it only needs a single “lookahead” token.↩︎
Though maybe not the rest of the code — it’s pretty nasty in places!↩︎