Defining Lines with MGrammar

Some of my previous posts have described a lexical analysis engine called Sotue. This work came from my passion for the business of parsing strings into contextual elements to be used in applications such as the language analysis phase of compilers. I have spent some time with the Jan 2009 CTP of Microsoft “Oslo”, and have found the MGrammar language construction technology particularly fascinating. I thought that I would take some time in blog entries describing my findings.

I have a calculus book in my library, and flipping through the book made me wonder if I could write some sort of domain-specific grammar for the calculus and analytic geometry problems it presented. It seemed like using MGrammar for this work would help me understand how MGrammar worked, and so I set out to write a domain specific language for calculus and analytic geometry, which I dubbed Calculatix. The first pages of the book contains a review of two-dimensional lines, and I decided to start small and build a language to define lines.

The design I came up with for my language (which only supports a single statement type at this point) looks like this:

line lineidentifier (XCoordinate1, YCoordinate1) (XCoordinate2, YCoordinate2);

This syntax allows me to write code like this to define lines:

line LineA (4,3) (2,5);
line LineB (5,6) (5,1);

This simple code defines two lines:

  • a line named LineA that runs from point (4, 3) to point (2, 5)
  • a line named LineB that runs from point (5, 6) to point (5, 1)

The MGrammar definition for this language is as follows:

module JeffFerguson.Calculatix.Grammars
{
    language Calculus
    {

        // ignore whitespace
        syntax LF = "\u000A";
        syntax CR = "\u000D";
        syntax Space = "\u0020";
        interleave Whitespace = LF | CR | Space;

         // numbers
        token Digit = "0".."9";
        token WholeNumber = Digit+;

        // identifiers
        token Uppercase = "A".."Z";
        token Lowercase = "a".."z";
        token Alphabetic = Uppercase | Lowercase;
        token Identifier = Alphabetic (Alphabetic | Digit)*;

        // points
        token XCoordinate = WholeNumber;
        token YCoordinate = WholeNumber; 
        syntax Point = "(" x:XCoordinate "," y:YCoordinate ")" =>Point[x,y];

        // lines
        token LineToken = "line";
        syntax LineExpression = LineToken i:Identifier p1:Point p2:Point => Line[i, p1, p2];
   
        // syntax
        syntax Statement = e:(LineExpression) ";" => Statement[valuesof(e)];
        syntax Main = s:Statement* => Main[valuesof(s)];
    }
}

Remember our input?

line LineA (4,3) (2,5);
line LineB (5,6) (5,1);

This input, processed against the language defined in the MGrammar definition shown above, produces a graph that, at a conceptual level, looks like this (the unshaded boxes represent higher-level constructs that are broken down, while the shaded boxes represent pieces of data):

 

image

 

I will take a deeper look at this grammar in the next few blog posts, which will provide answers to the following questions:

  • “What are those funny => symbols in the token definitions and what good are they doing?”
  • “So, now that I have this language defined, how do I use it? How do I write code in this language, and what processes it at runtime?”

Print | posted @ Sunday, March 15, 2009 12:07 AM

Comments on this entry:

Gravatar # re: Defining Lines with MGrammar
by Justin Chase at 3/16/2009 3:15 PM

Have you used Lexx / Yacc before? How does it compare to that? What are the differences similarities?
Gravatar # re: Defining Lines with MGrammar
by Jeff Ferguson at 3/16/2009 3:43 PM

Justin:

Yes, I have. In fact, my early blog posts document some work I did on a lexical analysis engine called Sotue, which was to be a .NET replacement for lex.

Your question is probably best answered in a full blog post, but I will tell you what I have discovered so far, based on my limited understanding of MGrammar:

* lex performs lexical analysis. It is concerned with turning regular expressions into tokens.
* yacc performs grammatical parsing. It is concerned with ensuring that tokens are found in input in an order described by the grammar defined by yacc.
* MGrammar supports lex functionality via its "token" construct.
* MGrammar supports yacc functionality via its "syntax" construct.

It's not a 1:1, though. lex and yacc allow definitions to be sufficed with C code, roughly equivalent to a .NET anonymous delegate, that can specify special processing to be performed when a token or grammar syntax is found. This isn't availabe in MGrammar, and any processing has to be done after all of the input is read in by the code that is driving the parse.

With all of that said, from what I have seen, I much prefer MGrammar. It's lex and yacc in a single tool, and it has a much newer architecture and design to it. lex and yacc were designed in the Unix days, for C, and they aren't very 21st-century friendly.

Thanks for the idea for a future blog post. :)
Gravatar # re: Defining Lines with MGrammar
by Hamid MOUTAWAKKIL at 3/23/2009 6:51 PM

Agree with the comparison. MGrammar is more to be compared with Java tools like Antlr or JavaCC. The tools describe token rules and grammar rules in the same file.

A feature of antlr or JavaCC, like Yacc is to make possible to augment the generated parser with custom code. As mentionned, this feature is not available in MGrammar, and I don't see where it can be applied, assuming that, so far, the purpose of MGrammar is to produce MGraph, from any defined language.
Post A Comment
Title:
Name:
Email:
Website:
Comment:
Verification:
 
 
Twitter