The DFA creation code in Sotue is well underway. I have more to do on the DFA front (namely, supporting exclusive character classes), but I am happy with the results I have seen so far. With that work in good shape, I can finally turn to the meat of this entire effort: the lexical analysis engine itself.
I have written the first unit test for the engine, and I think I have settled on an overall design. At this point, the design looks something like this:
string DataString = "72 (seventy-two) is the natural number following 71 and preceding 73. It is half a gross or 6 dozen (i.e., 60 in duodecimal).";
byte[] DataBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(DataString);
MemoryStream DataStream = new MemoryStream(DataBytes);
StreamReader DataReader = new StreamReader(DataStream);
LexicalAnalyzer Lex = new LexicalAnalyzer();
Lex.AddToken("[0-9]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
Lex.AddToken("[ ]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
Lex.AddToken("[A-Za-z]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
Lex.AddToken("[().,]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
Lex.Analyze(DataReader);
DataReader.Close();
DataStream.Close();
Let’s break this down. I start by creating the data to be analyzed. In this case, it’s a string that I have wrapped in a StreamReader object:
string DataString = "72 (seventy-two) is the natural number following 71 and preceding 73. It is half a gross or 6 dozen (i.e., 60 in duodecimal).";
byte[] DataBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(DataString);
MemoryStream DataStream = new MemoryStream(DataBytes);
StreamReader DataReader = new StreamReader(DataStream);
Next, I create an instance of Sotue’s lexical analysis engine:
LexicalAnalyzer Lex = new LexicalAnalyzer();
I can then define my tokens. Within Sotue’s lexical analysis engine, a token definition consists of two items:
- a regular expression
- a delegate that is called when a match is found in the input
The first token defines a string of digits, illustrated by the regular expression read as “one or more occurrences of a character in the range of ‘0’ to ‘9’ inclusive”:
Lex.AddToken("[0-9]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
The second token defines whitespace:
Lex.AddToken("[ ]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
The third token defines a word, illustrated by the regular expression read as “one or more occurrences of a character in the range of ‘A’ to ‘Z’ or ‘a’ to ‘z’ inclusive”:
Lex.AddToken("[A-Za-z]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
The fourth and final token defines any punctuation in the input:
Lex.AddToken("[().,]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
    }
);
Note two things about the delegates in the code shown above:
- The delegates called when matches are found in the input don’t do anything. I will fill that in later.
- The delegates shown above make use of C# 2.0 anonymous method syntax (often called anonymous delegates). If I decide to release Sotue, either as a product or as open source, I will be sure to support explicit delegate definitions, anonymous delegate definitions, and, if possible, lambda expressions. Supporting all of these will allow users to define the delegates in whatever way is meaningful to them.
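To make the three delegate styles concrete, here is a minimal sketch. It assumes the second parameter of AddToken() is compatible with EventHandler<LexemeFoundEventArgs>, which the post does not confirm. The LexemeFoundEventArgs class below is a hypothetical stand-in that models only the lexeme, and DelegateStyles, Seen, OnNumber and BuildHandlers are names invented for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Sotue's event-args type; only the lexeme is modeled.
public class LexemeFoundEventArgs : EventArgs
{
    public readonly string Lexeme;

    public LexemeFoundEventArgs(string lexeme)
    {
        Lexeme = lexeme;
    }
}

public static class DelegateStyles
{
    // Records which handler ran, so the three styles can be compared.
    public static readonly List<string> Seen = new List<string>();

    // 1. Explicit (named) method with the delegate's signature.
    public static void OnNumber(object sender, LexemeFoundEventArgs args)
    {
        Seen.Add("named:" + args.Lexeme);
    }

    public static EventHandler<LexemeFoundEventArgs>[] BuildHandlers()
    {
        // 2. C# 2.0 anonymous method.
        EventHandler<LexemeFoundEventArgs> anonymous =
            delegate(object sender, LexemeFoundEventArgs args)
            {
                Seen.Add("anonymous:" + args.Lexeme);
            };

        // 3. C# 3.0 lambda expression.
        EventHandler<LexemeFoundEventArgs> lambda =
            (sender, args) => Seen.Add("lambda:" + args.Lexeme);

        return new EventHandler<LexemeFoundEventArgs>[] { OnNumber, anonymous, lambda };
    }
}
```

All three values have the same delegate type, so an AddToken() overload that accepts that type would not need to know which style the caller used.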
The LexemeFoundEventArgs class holds a reference to the Token object whose regular expression matched the input, as well as a string containing the matched text (in lexical analysis terminology, the string that matches a regular expression is called a lexeme). With this in mind, the delegate code will be able to reference both items in a manner similar to the following:
Lex.AddToken("[0-9]+",
    delegate(object sender, LexemeFoundEventArgs args)
    {
        // args.MatchToken contains a reference to a Token object containing:
        //   * the regular expression
        //   * the state machines that encode the regular expression
        // args.Lexeme contains a string representing the matching data found in the input
    }
);
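For readers who want to picture the types involved, the following is a rough sketch of what the event arguments and a filled-in handler could look like. Only the names MatchToken and Lexeme come from the comments above. The constructors, the RegularExpression property and the NumberTotal class are assumptions made for illustration, and the state machines held by Token are left out entirely.

```csharp
using System;
using System.Globalization;

// Hypothetical shape of Token; the DFA state machines are omitted from this sketch.
public class Token
{
    private readonly string regularExpression;

    public Token(string regularExpression)
    {
        this.regularExpression = regularExpression;
    }

    public string RegularExpression
    {
        get { return regularExpression; }
    }
}

// Hypothetical shape of the event arguments described in the text.
public class LexemeFoundEventArgs : EventArgs
{
    private readonly Token matchToken;
    private readonly string lexeme;

    public LexemeFoundEventArgs(Token matchToken, string lexeme)
    {
        this.matchToken = matchToken;
        this.lexeme = lexeme;
    }

    public Token MatchToken
    {
        get { return matchToken; }
    }

    public string Lexeme
    {
        get { return lexeme; }
    }
}

// Example of a handler that does real work: it sums every digit string it is given.
public class NumberTotal
{
    public int Total;

    public void OnNumber(object sender, LexemeFoundEventArgs args)
    {
        Total += int.Parse(args.Lexeme, CultureInfo.InvariantCulture);
    }
}
```

A handler like OnNumber could be registered for the "[0-9]+" token in place of the empty delegate body.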
Once the tokens are defined, the data can be analyzed:
Lex.Analyze(DataReader);
I only hope that when the data is analyzed, a weakness can be found. It’s not over yet. (Obscure? Don’t get the reference? Shame on you. Go sit in the corner.)
Once the data is analyzed, the streams and readers can be closed:
DataReader.Close();
DataStream.Close();
The call to Analyze() will kick things off and do all of the work of reading from the stream, matching input against the DFAs representing the regular expressions in the tokens, and calling the lexeme match delegates when a match is found. I’m happy with the design as a first draft. Now it’s time to see how close I can come to pulling it off.
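As a way of thinking about what Analyze() has to do, here is a small stand-in that follows the same read, match and dispatch steps. It is not Sotue's code. It uses .NET's Regex class in place of Sotue's DFAs, reads the whole input up front rather than streaming it, picks the longest match with ties going to the token added first, and skips and counts any character that no token matches. All of those choices, and the SketchAnalyzer name, are assumptions. The last one matters for the sample data: the hyphen in "seventy-two" is not covered by any of the four patterns above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical stand-in for Sotue's event arguments; carries the pattern text and the lexeme.
public class LexemeFoundEventArgs : EventArgs
{
    public readonly string Pattern;
    public readonly string Lexeme;

    public LexemeFoundEventArgs(string pattern, string lexeme)
    {
        Pattern = pattern;
        Lexeme = lexeme;
    }
}

public class SketchAnalyzer
{
    private readonly List<string> sources = new List<string>();
    private readonly List<Regex> patterns = new List<Regex>();
    private readonly List<EventHandler<LexemeFoundEventArgs>> handlers =
        new List<EventHandler<LexemeFoundEventArgs>>();

    public void AddToken(string pattern, EventHandler<LexemeFoundEventArgs> handler)
    {
        // \G forces the match to begin exactly at the start position given to Match(),
        // which mimics running a DFA from the current input position.
        sources.Add(pattern);
        patterns.Add(new Regex(@"\G(?:" + pattern + ")"));
        handlers.Add(handler);
    }

    // Returns the number of input characters that no token matched.
    public int Analyze(TextReader reader)
    {
        string input = reader.ReadToEnd();
        int position = 0;
        int unmatched = 0;

        while (position < input.Length)
        {
            int best = -1;
            int bestLength = 0;

            // Try every token at this position and keep the longest non-empty match.
            for (int i = 0; i < patterns.Count; i++)
            {
                Match m = patterns[i].Match(input, position);
                if (m.Success && m.Length > bestLength)
                {
                    best = i;
                    bestLength = m.Length;
                }
            }

            if (best < 0)
            {
                // No token matches here: skip one character and keep going.
                unmatched++;
                position++;
                continue;
            }

            string lexeme = input.Substring(position, bestLength);
            handlers[best](this, new LexemeFoundEventArgs(sources[best], lexeme));
            position += bestLength;
        }

        return unmatched;
    }
}
```

Skipping unmatched characters keeps the sketch short. A real engine might instead raise an error or call a dedicated delegate at that point.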