GopherCon 2018 - How to Write a Parser in Go

By Nick Snyder for the GopherCon Liveblog on August 29, 2018


Presenter: Sugu Sougoumarane

Liveblogger: Nick Snyder

Sugu Sougoumarane is the co-creator of Vitess, which contains a SQL parser that is now used by many other projects. In this talk he demonstrates how to write a parser using goyacc.

Summary

  • goyacc is almost an exact translation of the original yacc, except it is written in Go.
  • Using a parser generator like goyacc is a quick way to get a working parser for a LALR(1) grammar.
  • Lex is not code that you live in; it is code you write once and then use for a long time. It is ok if the code is not clean.
  • Code examples

Parser use cases

  • Language
  • Data files (most common)
  • Network wire format
  • Complex state machines

How to write a parser in two easy steps

image

goyacc

  • Translates a formal grammar into code.
  • Creates a state machine that it runs through.
  • Works with LALR(1) grammars (look head one token and decide what action to take).

    • A surprising number of languages can be parsed with LALR(1) parsers.
  • Embeds custom code.

Why goyacc

  • Readable - the file format reads like natural language
  • Extensible - easy to add rules
  • Easy testing - yacc is already tested, so only need to test your own parts as input to yacc
  • Efficient - because it uses a state machine
  • Detect conflicts - it will tell you if you add grammar rules that conflict (this is the best part)

Why not goyacc

  • If you need error handling. It can tell you syntax error at position, but can't recover from error or give nicer error messaging.
  • If you need extreme efficiency; can't beat hand-coded.
  • If you need to parse a complex (non-LALR(1)) grammar.

Using goyacc

General steps:

  1. Express grammar.
  2. Grammar rules can return values and assign types.
  3. Write code to return values.
  4. Write lexical analyzer.

Goyacc is almost an exact translation of the original yacc so some of the idiosyncrasies have been inherited. For example, C programs return only 1 value: 0 for success and 1 for failure. This means you need awkward boilerplate to give values to the lexer:

%{
package jsonparser
func setResult(l yyLexer, v Result) {
  l.(*lex).result = v
}
%}
%union{
}
%start main
%%
main:
  {
    setResult(yylex, 0)
  }

Use go generate to create the actual parser.go file:

//go:generate goyacc -l -o parser.go parser.y

How to parse a phone number

Area code has three parts: area code, first part, second part.

%token D

phone:
  area part1 part2
| area '-' part1 '-' part2
area: D D D
part1: D D D
part2: D D D D

Captital letters signify tokens.

How to return values

The generated parser is just a single function that runs a state machine and uses local variables.

These variables are saved in a union data structure:

%union{
  result Result
  part string
  ch byte
}

%type <result> phone
%type <part> area part1 part2

Actions run Go code (i.e. everything inside the braces) when a rule matches. Dollar variables address a variable that is a value returned by the parser.

part2: D D D D
  {
    $$ = cat($1, $2, $3, $4)
  }

Lexing

Two things are happening concurrently during lexing:

  1. Rules are getting matched. The int returned determines which rules match.
  2. Code is getting executed depending on which rule is matched. The result value is used inside the code you write.

Sometimes lex can return the byte itself as an int. Yacc has builtin predetermined tokens so all first 127 bytes are reserved and can be returned without telling the parser you are returning them

b := l.nextb()
if unicode.IsDigit(rune(b)) {
    lval.ch = b
    return D
}
return b

How does it work?

image

Goyacc

  1. Generates a const string per symbol
  2. Defines an interface for the Lexer
  3. Defines a parse function that accepts the lex as input

Options for lexing

Lex is not code that you live in. It is code you write once and then use for a long time. Ok if the code is not clean.

Future improvements

For complicated grammars (e.g. SQL), Goyacc can generate a big result structure that is expensive to pass around. Goyacc actually assigns this structure every time there is a state transition.

C (yacc) has structure called union which efficiently packs the datastructre, but there is no equivalent in Go....except interfaces are a very close equivalent!

Unlike C union type, you can type assert an interface in Go. One limitation with using a type asserted Go interface is that it is an rvalue which means you can't assign to it.

Switching Vitess to use an interface instead of struct doubles performance, but would be a backward incompatible change to goyacc.