The first task for any lexical analysis engine, such as the one I building for Sotue, is to turn a regular expression such as “[A-Z][a-z]*” into a state machine that can be used by the engine itself to efficiently determine whether or not the input text matches the expression. Sotue generates these state machines in two phases:
- Construct an un-optimized state machine from the regular expression
- Optimize the state machine
The un-optimized state machine is formally known in computation theory as a non-deterministic finite automaton, or NFA. The optimized is formally known as a deterministic finite automaton, or DFA. Sotue generates an NFA from a regular expression, and then generates a DFA from the NFA.
Sotue’s unit tests perform a programmatic analysis of text-controlled regular expressions to ensure that the NFA generation is working as intended. However, I was also looking for something more visual, since state machines can be drawn out and easily examined. To this end, I built a small WPF app that displays NFAs generated by Sotue’s NFA engine:
The WPF app is a simple one, with a text box, a button, and a grid pane for the NFA image. The user types a regular expression in the text box and clicks the “Display” button. Sotue’s NFA engine does its thing and produces an NFA. Once the NFA is generated, the WPF app calls out to Graphviz, generating a Graphviz input file describing the generated NFA in Graphviz’ DOT language. The WPF app then kicks off a Graphviz process which generates a PNG image from the described data, and the PNG is loaded as an image into the WPF app. Pretty straightforward.
Currently, Sotue’s NFA engine handles regular expressions containing grouping constructs (parenthetical expressions) and OR expressions such as (“a|b”). I will now move onto adding support for additional constructs, such as closure operators. This NFA viewer will, along with the unit tests, help ensure that the additional expressions are generating the correct NFA.