Sotue

The lex tool is defined in Wikipedia as follows:

lex is a program that generates lexical analyzers ("scanners" or "lexers"). Lex is commonly used with the yacc parser generator. Lex, originally written by Eric Schmidt and Mike Lesk, is the standard lexical analyzer generator on Unix systems, and is included in the POSIX standard. Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language.

The tool accepts an input file that describes the patterns that it needs to find in the input stream. The input file looks something like this:

/*** Definition section ***/

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

/* This tells flex to read only one input file */
%option noyywrap

%%
    /*** Rules section ***/

    /* [0-9]+ matches a string of one or more digits */
[0-9]+  {
            /* yytext is a string containing the matched text. */
            printf("Saw an integer: %s\n", yytext);
        }

.       {   /* Ignore all other characters. */   }

%%
/*** C Code section ***/

int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}

The lex tool generates C code that can be folded into a larger application and used to parse the application's input.

This all made sense in the 1970s, when Unix was the operating system of choice for software development and C was the language of choice. But it's 2007 now, and many of us are on desktops with .NET runtimes. We're working in Unicode now, and not just ASCII. We're working in C# and VB.NET now, and not just C (in fact, we're probably not working in C much at all). Our brave new .NET world (and by "new" I mean "six and a half years old") has moved beyond cryptic input files and unmaintainable generated code and into a language-agnostic and dynamic environment. I believe that the .NET world can do better. Projects like C# Flex exist but, in my opinion, don't go far enough. I believe that a fresh approach to lexical analysis and grammatical parsing is needed (of course, I also believe that people should stop putting two spaces after their periods when they type, so my beliefs may often be called into question).

My vision is to create a lexical analysis and grammatical parsing engine, built from the ground up for the .NET platform and all of its language-agnostic, Unicode-aware goodness. The engine, which I have dubbed Sotue (no, the name doesn't mean anything), would ship as a .NET assembly (or set of assemblies) whose classes handle all of the details. My ultra-high-level vision looks something like this:

LexicalAnalyzer MyParser = new LexicalAnalyzer();

Token NumericToken = new Token("[0-9]+");
NumericToken.OnLexemeFound += <delegate>;
MyParser.Tokens.Add(NumericToken);

MyParser.Parse(<input stream object implementing IStream>);

(That's a C# sample, but, since the vision is to supply a .NET assembly, the classes would be usable from any .NET language.)

The Token class would represent an input pattern and the OnLexemeFound event would fire when data matching the token's regular expression was found in an input stream. (In lexical analysis terms, a lexeme is a string of data that matches a regular expression defined by a token; for example, a lexeme of 123 would match a token with a regular expression of [0-9]+.)
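
To make the event model a little more concrete, here is a minimal sketch of how a Token and a LexicalAnalyzer along these lines might be wired together. None of this is actual Sotue code. LexemeFoundEventArgs, MatchAt, and RaiseLexemeFound are hypothetical names, TextReader stands in for whatever input stream type the engine ends up accepting, and the longest-match rule and the skipping of unmatched characters (which mirrors the catch-all rule in the lex sample) are assumptions. The matching itself leans on System.Text.RegularExpressions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical event payload: the matched text and where it started.
public sealed class LexemeFoundEventArgs : EventArgs
{
    public LexemeFoundEventArgs(string lexeme, int position)
    {
        Lexeme = lexeme;
        Position = position;
    }

    public string Lexeme { get; }
    public int Position { get; }
}

// Sketch of a Token: a regular expression plus an event raised per lexeme.
public sealed class Token
{
    private readonly Regex _regex;

    public Token(string pattern)
    {
        // \G anchors the match at the start index passed to Regex.Match,
        // so a token can only match at the analyzer's current position.
        _regex = new Regex(@"\G(?:" + pattern + ")");
    }

    public event EventHandler<LexemeFoundEventArgs> OnLexemeFound;

    // Hypothetical helper: try to match this token at the given position.
    internal Match MatchAt(string input, int position)
    {
        return _regex.Match(input, position);
    }

    // Hypothetical helper: notify subscribers that a lexeme was found.
    internal void RaiseLexemeFound(string lexeme, int position)
    {
        OnLexemeFound?.Invoke(this, new LexemeFoundEventArgs(lexeme, position));
    }
}

// Sketch of the analyzer: at each position, the longest-matching token wins.
public sealed class LexicalAnalyzer
{
    public List<Token> Tokens { get; } = new List<Token>();

    // Assumption: input arrives as a TextReader and is read in full up front.
    public void Parse(TextReader reader)
    {
        string input = reader.ReadToEnd();
        int position = 0;

        while (position < input.Length)
        {
            Token bestToken = null;
            Match bestMatch = null;

            foreach (Token token in Tokens)
            {
                Match match = token.MatchAt(input, position);
                bool usable = match.Success && match.Length > 0;
                if (usable && (bestMatch == null || match.Length > bestMatch.Length))
                {
                    bestToken = token;
                    bestMatch = match;
                }
            }

            if (bestToken == null)
            {
                position++; // No token matches here: skip one character.
                continue;
            }

            bestToken.RaiseLexemeFound(bestMatch.Value, position);
            position += bestMatch.Length;
        }
    }
}
```

With a single Token built from [0-9]+ and a handler subscribed to OnLexemeFound, calling Parse(new StringReader("abc 123 x45")) in this sketch would raise the event twice, once for 123 and once for 45. Reading the whole input into a string keeps the example short; a real engine would more likely scan the stream incrementally.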

I started work on this vision once before, but progress was hampered by all of life's little distractions. I have more time now, so I can re-engage. Work towards this Sotue vision is already underway, and I'll keep this blog updated with my progress.

posted @ Sunday, October 19, 2008 12:10 PM
