The Sotue project has passed its first full lexical analysis unit test!
The unit test’s problem statement is as follows:
Parse the string “abc 123” into a word, whitespace, and a number.
The source for the unit test is as follows:
[TestMethod]
[Description("Successfully parse the string 'abc 123' into a word, a whitespace, and a numeric lexeme.")]
public void LexicalAnalyzerTestMethod1()
{
    LexicalAnalyzer Lex = new LexicalAnalyzer();
    List<LexemeFoundEventArgs> LexemeFoundList = new List<LexemeFoundEventArgs>();
    Token NumberToken = new Token("[0-9]+");
    Token WhitespaceToken = new Token("[ ]+");
    Token WordToken = new Token("[A-Za-z]+");
    Lex.AddToken(NumberToken,
        delegate(object sender, LexemeFoundEventArgs args)
        {
            LexemeFoundList.Add(args);
        }
    );
    Lex.AddToken(WhitespaceToken,
        delegate(object sender, LexemeFoundEventArgs args)
        {
            LexemeFoundList.Add(args);
        }
    );
    Lex.AddToken(WordToken,
        delegate(object sender, LexemeFoundEventArgs args)
        {
            LexemeFoundList.Add(args);
        }
    );
    Lex.Analyze("abc 123");
    Assert.AreEqual<int>(3, LexemeFoundList.Count);
    Assert.AreEqual<string>("abc", LexemeFoundList[0].Lexeme);
    Assert.AreEqual<Token>(WordToken, LexemeFoundList[0].MatchedToken);
    Assert.AreEqual<string>(" ", LexemeFoundList[1].Lexeme);
    Assert.AreEqual<Token>(WhitespaceToken, LexemeFoundList[1].MatchedToken);
    Assert.AreEqual<string>("123", LexemeFoundList[2].Lexeme);
    Assert.AreEqual<Token>(NumberToken, LexemeFoundList[2].MatchedToken);
}
Let’s break this down. It starts by creating a new lexical analyzer instance:
LexicalAnalyzer Lex = new LexicalAnalyzer();
Next, it creates a list that will be used to hold the returned results:
List<LexemeFoundEventArgs> LexemeFoundList = new List<LexemeFoundEventArgs>();
Next, the tokens are created from regular expressions:
Token NumberToken = new Token("[0-9]+");
Token WhitespaceToken = new Token("[ ]+");
Token WordToken = new Token("[A-Za-z]+");
The NumberToken contains a regular expression that matches a run of one or more digits, 0 through 9. The WhitespaceToken contains a regular expression that matches a run of one or more space characters (spaces only, not tabs or newlines). The WordToken contains a regular expression that matches a run of one or more uppercase or lowercase letters, A through Z.
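To see what each pattern picks out of the test input, here is a small sketch that runs the same three expressions through .NET's own Regex class. It assumes that Token hands its pattern to a .NET-style regular expression engine, which the post does not confirm. TokenPatternDemo and FirstMatch are names made up for this illustration and are not part of Sotue.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustration only: uses System.Text.RegularExpressions directly,
// not the Sotue Token class (whose internals are not shown in this post).
public static class TokenPatternDemo
{
    // Returns the first (leftmost) substring of input matched by pattern,
    // or null when the pattern does not match anywhere.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        const string input = "abc 123";

        Console.WriteLine(FirstMatch(input, "[0-9]+"));            // 123
        Console.WriteLine("[" + FirstMatch(input, "[ ]+") + "]");  // [ ]
        Console.WriteLine(FirstMatch(input, "[A-Za-z]+"));         // abc
    }
}
```

Each pattern finds exactly one of the three pieces of "abc 123", which is why the test expects three lexemes.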
Once the tokens are created, each one is added to the lexical analysis engine along with a delegate that is called whenever input is found to match that token:
Lex.AddToken(NumberToken,
    delegate(object sender, LexemeFoundEventArgs args)
    {
        LexemeFoundList.Add(args);
    }
);
Lex.AddToken(WhitespaceToken,
    delegate(object sender, LexemeFoundEventArgs args)
    {
        LexemeFoundList.Add(args);
    }
);
Lex.AddToken(WordToken,
    delegate(object sender, LexemeFoundEventArgs args)
    {
        LexemeFoundList.Add(args);
    }
);
Once that is all in place, the lexical analyzer’s Analyze() method can be called. In this test, the input to be analyzed is in a string:
Lex.Analyze("abc 123");
Once Analyze() is called, the engine scans the input, and each time it finds text that matches a token, it calls the delegate registered for that token. Pretty straightforward.
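The post does not show what happens inside Analyze(), so the following is only a sketch of one common way such an engine could work. At each position it tries every registered pattern and takes the longest match. The longest-match rule is an assumption, and SketchLexer, AddRule, and the Action<string> callback are hypothetical stand-ins rather than Sotue's LexicalAnalyzer, AddToken, or LexemeFoundEventArgs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a lexical analysis engine; NOT the Sotue implementation.
public sealed class SketchLexer
{
    // Each registered pattern is paired with the callback to run when it matches.
    private readonly List<KeyValuePair<Regex, Action<string>>> rules =
        new List<KeyValuePair<Regex, Action<string>>>();

    public void AddRule(string pattern, Action<string> onLexeme)
    {
        // \G forces the match to begin exactly where scanning resumes.
        Regex anchored = new Regex(@"\G(?:" + pattern + ")");
        rules.Add(new KeyValuePair<Regex, Action<string>>(anchored, onLexeme));
    }

    public void Analyze(string input)
    {
        int position = 0;
        while (position < input.Length)
        {
            Match best = null;
            Action<string> bestCallback = null;

            // Try every rule at the current position and keep the longest match.
            foreach (KeyValuePair<Regex, Action<string>> rule in rules)
            {
                Match m = rule.Key.Match(input, position);
                if (m.Success && m.Length > 0 && (best == null || m.Length > best.Length))
                {
                    best = m;
                    bestCallback = rule.Value;
                }
            }

            if (best == null)
                throw new FormatException("No rule matches the input at index " + position);

            bestCallback(best.Value);   // report the lexeme to its callback
            position += best.Length;    // resume scanning right after the lexeme
        }
    }
}

public static class SketchLexerDemo
{
    public static void Main()
    {
        List<string> found = new List<string>();
        SketchLexer lexer = new SketchLexer();
        lexer.AddRule("[0-9]+", found.Add);
        lexer.AddRule("[ ]+", found.Add);
        lexer.AddRule("[A-Za-z]+", found.Add);

        lexer.Analyze("abc 123");
        Console.WriteLine(string.Join("|", found));   // abc| |123
    }
}
```

As in the unit test, the rules are registered in the order number, whitespace, word, yet the lexemes come back in input order. Under this design, the registration order only matters when two patterns match the same length at the same position.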
The last part of the test contains the assertions that verify the results:
Assert.AreEqual<int>(3, LexemeFoundList.Count);
Assert.AreEqual<string>("abc", LexemeFoundList[0].Lexeme);
Assert.AreEqual<Token>(WordToken, LexemeFoundList[0].MatchedToken);
Assert.AreEqual<string>(" ", LexemeFoundList[1].Lexeme);
Assert.AreEqual<Token>(WhitespaceToken, LexemeFoundList[1].MatchedToken);
Assert.AreEqual<string>("123", LexemeFoundList[2].Lexeme);
Assert.AreEqual<Token>(NumberToken, LexemeFoundList[2].MatchedToken);
The first assertion checks that three lexemes (strings matching a token) were found. I am expecting to find three lexemes:
- A lexeme of “abc” matching the WordToken
- A lexeme of “ ” matching the WhitespaceToken
- A lexeme of “123” matching the NumberToken
Each of the delegates defined with the tokens simply stores the LexemeFoundEventArgs object supplied by the lexical analysis engine in a list, which the test can then inspect to ensure that the right lexemes were found in the right order.
The remaining assertions verify that:
- The first lexeme found was “abc” and matched the regular expression in the WordToken
- The second lexeme found was “ ” and matched the regular expression in the WhitespaceToken
- The third lexeme found was “123” and matched the regular expression in the NumberToken
All of the assertions held true, and that, to me, is success! Now it’s time for some more complicated scenarios.