In my post on adding closure operator support for regular expressions input to Sotue, I showed the unoptimized NFA generated by Sotue for a regular expression built to match a number. Since the regular expression used only the OR operator, the generated NFA contained some 40 states. While the state machine was perfectly valid, asking a developer to write a regular expression such as

(0|1|2|3|4|5|6|7|8|9)+

seems like a bit too much to write. The de facto standard for specifying ranges of characters in regular expressions is to write a character class, which specifies one or more ranges of characters within square brackets and separated by a hyphen. I have added support for character classes in Sotue’s regular expression code, and their use can simplify an NFA even further. For example, a developer working with Sotue can now write a regular expression for a number as

[0-9]+

Plugging this regular expression into my NFA Viewer shows me the following NFA:

NFAZeroToNinePlus

This NFA is much simpler (although no more or less valid) than the NFA generated by the number regular expression using only the OR operator.

Sotue supports multiple characters and character ranges in character classes. For example, Sotue supports a character class such as

[A-Za-z]

to specify “any upper or lowercase alphabetic character”. A character class to specify a hexadecimal digit might look something like

[0-9A-Fa-f]

Sotue also supports exclusive character classes, which can be used to specify a range of characters to be excluded from a match. Exclusive character classes contain a caret as the first character inside the character class specfication. For example, a character class such as

[^A-Za-z]

is to be read as “any character except an upper or lowercase alphabetic character”.

I have a few more unit tests to write, but I am pretty much done with Sotue’s NFA generation code. The next step is to optimize these NFAs into the final, deterministic state machine, or DFA, that will be used at runtime.