strain: The Slightly Painful Lexer Generator

(full source available here)

strain is a lexer generator in the spirit of GNU flex. I initially developed it to generate syntax highlighters for the code I was showcasing at the archive of interesting code after I realized that reading code in black and white is significantly trickier than reading code with visual cues. You might ask - why do we need another lexer generator? And why didn't you just use one of the myriad existing syntax highlighting packages? The answers are, respectively: we don't, and I probably should have. My main interest in developing strain was not so much to revolutionize the lexing world as much as to prove to myself that I could apply the theoretical techniques I had learned as an undergraduate, and I have to admit that it was very rewarding building up the entire lexer from scratch. I had seen subset construction and DFA minimization before in class, but had never actually tried to sit down and code it from scratch. Similarly, I had never actually written an LR parser by hand, and in the course of learning that the CFG I was using for regular expressions was not LR(0), SLR, or LALR(1), I got a chance to play around with all of those techniques. The end result has its warts - it's usable but there's still a long ways to go - but I'm pretty satisfied.

strain is a lexer generator. It accepts as input a file containing a series of regular expressions and token classes, then outputs a C++ class that reads input from a std::istream, possibly calling user-specified semantic actions defined in the input file.

Rather than using an existing regular expression format in strain, strain uses its own simple regular expression grammar. More concretely, a regular expression can be:

A character class expression is a series of set operations on characters. For example, {[alpha]+[digit]} would represent all letters or digits (the same as [alnum]), while {[xdigit]-[lower]} would be all digits and the letters ABCDEF. Arbitrary character classes can be inserted into character class expressions, including complex ones such as {{G+[xdigit]}-[lower]+[space]}.

Each strain configuration file is broken up into four logical pieces:

  1. The preamble code, which is C++ code copied directly into the top of the generated C++ code for the lexer.
  2. The postamble code, which is C++ code copied directly after the C++ code for the lexer.
  3. A list of token type definitions, which define a set of tokens associated with strongly-typed C++ expressions.
  4. A list of regular expressions, annotated with semantic actions to take when those expressions are encountered.

As an example, consider the following strain configuration file:

/* A strain file to lex simple arithmetic expressions. */
using namespace std;

/* Utility function to convert a string to an integer, assuming it's a valid int. */
int ValidStringToInteger(string token, int radix) {
	stringstream converter(token);
	converter << setbase(radix);
	int result;
	converter >> result;
	return result;

/* No postamble code required. */
Number: int
Operator: char
 RETURN (Number, ValidStringToInteger(TOKEN, 8));

 RETURN (Number, ValidStringToInteger(TOKEN, 10));

 RETURN (Number, ValidStringToInteger(TOKEN, 16));

 RETURN (Operator, TOKEN[0]);

 /* Take no action */

This script begins by writing a C++ function to convert a string to an integer in some base. It then creates two token types, Number (of type int) and Operator (of type char). Finally, it lists several regular expressions which parse a stream into these types. An integer can be a number in octal, decimal, or hexadecimal; operators can be either +, -, *, or /, and any sequence of whitespace tokens is silently skipped.

The lexer generated by strain uses the maximal-munch, first-precedence rule, which means that the lexer will continuously read characters until it finds a maximal sequence of characters it can interpret. It then looks at all of the rules that match, then reports the token as being a member of the lexeme class that was defined first in the file. For example, given this script:

/* No preamble. */
/* No postamble. */
RuleOne: string
RuleTwo: string
RuleThree: string


 RETURN (RuleThree, TOKEN);

RuleOne will almost never be matched, because any sequence of characters of length greater than one will be longer than a and match RuleTwo. RuleOne will be invoked, however, if the input is identically the character a, since that sequence matches all three rules but RuleOne was defined first. RuleThree is never matched, since any time it could match an input sequence RuleOne will also match and has higher priority.

Internally, strain works by converting each regular expression into an ε-NFA that has exactly one start state and one terminal state. It then annotates that terminal state with the number of the regular expression (with 0 being the first defined in the file, 1 being the second, etc.). Next, it constructs a new ε-NFA by adding a start state with ε-transitions to each of the start states of these automata. This automaton will then match any of the regular expressions matched by any of the input regexps. It then uses subset construction to convert this ε-NFA into a DFA, which it minimizes. During the subset construction and minimization, the automaton is careful to take the number of each terminal state into account. Finally, strain outputs a C++ class that encapsulates the transition tables and semantic actions.

The C++ source code for strain can be found by following the indicated link. This directory contains several samples of syntax highlighting programs written using strain, which are currently being used on the archive of interesting code site to make code pretty. If you have any comments or questions about strain, shoot me an email! I'd love to hear your thoughts.