The .NET-based lexical analysis engine on which I have been working, dubbed “Sotue” for no particular reason, now supports building NFAs for regular expressions that includes what is known as the closure operators, all of which you should recognize as soon as you see them:
- ?, which means “match on zero or one occurrences of the preceding expression”
- *, which means “match on zero or more occurrences of the preceding expression”
- +, which means “match on one or more occurrences of the preceding expression”
Numbers, for example, can be recognized as a sequence of one or more digits, so a regular expression to recognize a number might look like this:
(0|1|2|3|4|5|6|7|8|9)+
This regular expression can be read as “one or more occurrences of a ‘0’, a ‘1’, a ‘2’, a ‘3’, a ‘4’, a ‘5’, a ‘6’, a ‘7’, a ‘8’, or a ‘9’”. Since Sotue’s NFA generation code already supports the | and the + operators, this regular expression can be handled. The current build of Sotue produces the following very ugly NFA (according to my WPF-based NFA viewer):
Yuck. That NFA state machine is valid, but, boy, is that ugly.
(This drawing also demonstrates a flaw in my current NFA Viewer: the machine’s end state is shown with a double circle, and, in the rendering above, is State 39; however, the machine’s start state is not necessarily State 0, and it’s just about impossible to see where the machine starts. I’ll have to fix that.)
My next addition to Sotue will be to support character classes in regular expressions. Character classes are ranges of characters that can all be used in a single transition, as apposed to what you see above, in which only one character can cause a move from one state to the next. Character classes are enclosed in square brackets with the endpoints of the range specified with a hyphen between them. For example, the same “number” regular expression above can be written this way:
[0-9]+
This is much nicer, as this is what many regular expression authors are used to writing. In the next blog post, I will show that regular expression based on a character class also produces an NFA with a lot less states.
(Of course, all of this is academic, since NFAs are optimized into DFAs, which are actually used at runtime, and NFAs are only the first step in a process in converting a regular expression to a state machine, but I digress.)