As I mentioned back in this post, the initial phase of work needed to allow Sotue to recognize data in input streams is to build a state machine that input characters can move through as they are read. If the state machine ends up in what is called an accepting state, then the input characters match a pattern. To review, Sotue’s process for building these state machines are as follows:

  1. Construct a non-deterministic finite automaton (NFA) from a regular expression.
  2. Convert the NFA into a deterministic finite automaton (DFA) that can be used by Sotue’s analysis engine to recognize input.

(It is possible to create DFAs directly from regular expressions, bypassing the NFA construction step, but I may save that optimization for a later effort. The construction of NFAs from regular expressions is implemented by a well-known process called Thompson’s Construction, developed by Ken Thompson at Bell Labs for the QED editor and patented in 1971.)

I have moved forward with Sotue to the point that I can construct DFAs from NFAs that have an input alphabet of single characters (in other words, I can’t get generate DFAs from NFAs that use character classes – that will come later). I have also upgraded my crummy WPF-based NFA viewer into a crummy WPF-based state machine viewer that displays a graphical view of the NFA as well as a graphical view of the equivalent DFA.

Consider, for example, a regular expression of ((a|b)|cd), which reads as “match a or b or cd”. Using my newly minted, yet still crummy, viewer, I get the following NFA:

AorBorCDNFA

I also get the following, more straightforward, DFA:

AorBorCDDFA

Now it’s time to write unit tests for the DFA code and to add support for character classes in DFAs.