Consider a language authored in MGrammar that recognizes a keyword named “keyword” as well as string literals:
module MyModule
{
    language MyLanguage
    {
        // ignore whitespace
        syntax LF = "\u000A";
        syntax CR = "\u000D";
        syntax Space = "\u0020";
        interleave Whitespace = LF | CR | Space;

        // string literals
        token Quote = "\"";
        token StringLiteral = Quote a:any* Quote => StringLiteral[a];
        token Keyword = "keyword";

        // statements
        token Semicolon = ";";
        syntax Statement = s:(Keyword | StringLiteral) Semicolon => Statement[s];
        syntax Main = Statement*;
    }
}
Now consider some sample input:
keyword;
"this is a string";
keyword;
Given this input, you would expect the abstract syntax tree (AST) to contain two “keyword” statements and one string literal. This is, in fact, exactly what happens:
Main[
  [
    Statement[
      [
        "keyword"
      ]
    ],
    Statement[
      [
        StringLiteral[
          "this is a string"
        ]
      ]
    ],
    Statement[
      [
        "keyword"
      ]
    ]
  ]
]
But what happens when two string literals appear in the input, as in the following example?
keyword;
"this is a string";
keyword;
"this is another string";
keyword;
Given this input, you would expect the AST to contain three “keyword” statements and two string literals. That is not what happens. The AST you’ll get back looks like this:
Main[
  [
    Statement[
      [
        "keyword"
      ]
    ],
    Statement[
      [
        StringLiteral[
          "this is a string\";\r\nkeyword;\r\n\"this is another string"
        ]
      ]
    ],
    Statement[
      [
        "keyword"
      ]
    ]
  ]
]
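The problem lies in the grammar’s definition of a string literal. The definition states that a string literal starts with a quotation mark, continues with any number of characters, and finishes with a quotation mark. Because any matches every character, including a quotation mark, nothing in the rule says where the literal should stop. The input parsing engine for M is as greedy as possible, consuming input until the definition can no longer be satisfied. In this case, it runs on to the very last quotation mark in the input and treats everything between the first and last quotation marks as a single string literal. The first example worked only because it contained a single pair of quotation marks. This is not what we want.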
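The same greedy behavior can be reproduced outside M. The following is only a rough analogy that uses Python's re module, not the M parsing engine; the names source and greedy_literal are illustrative assumptions. With the DOTALL flag, "." matches every character, including quotation marks and line breaks, much as any does in the grammar above.

```python
import re

# Sample input from above, with CRLF line endings as seen in the AST output.
source = (
    'keyword;\r\n'
    '"this is a string";\r\n'
    'keyword;\r\n'
    '"this is another string";\r\n'
    'keyword;\r\n'
)

# Rough analogue of: Quote a:any* Quote
# re.DOTALL lets "." match every character, including '"' and line breaks.
greedy_literal = re.compile(r'"(.*)"', re.DOTALL)

match = greedy_literal.search(source)
captured = match.group(1)

# The capture runs from the first quotation mark to the last one.
print(repr(captured))
```

The captured text spans both literals and the keyword statement between them, which matches the single oversized StringLiteral node in the AST above.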
The workaround is to narrow the definition of what may appear between the quotation marks: instead of any, which accepts any character, use a definition that accepts any character except a quotation mark. The definition now looks like this:
token StringLiteral = Quote a:(any - Quote)* Quote => StringLiteral[a];
The AST for the input now looks as we expect:
Main[
  [
    Statement[
      [
        "keyword"
      ]
    ],
    Statement[
      [
        StringLiteral[
          "this is a string"
        ]
      ]
    ],
    Statement[
      [
        "keyword"
      ]
    ],
    Statement[
      [
        StringLiteral[
          "this is another string"
        ]
      ]
    ],
    Statement[
      [
        "keyword"
      ]
    ]
  ]
]
Much better!
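For readers who want to experiment without the M tooling, here is a minimal sketch of the corrected rules in Python. It is an approximation under stated assumptions, not how the M engine works internally: TOKEN_PATTERN and parse_statements are hypothetical names, and the character class [^"] stands in for (any - Quote).

```python
import re

# Hypothetical token table mirroring the grammar; names are illustrative.
TOKEN_PATTERN = re.compile(
    r'''
    (?P<Whitespace>[\n\r\x20]+)      # interleave Whitespace = LF | CR | Space
  | (?P<Keyword>keyword)             # token Keyword
  | "(?P<StringLiteral>[^"]*)"       # Quote (any - Quote)* Quote
  | (?P<Semicolon>;)                 # token Semicolon
    ''',
    re.VERBOSE,
)


def parse_statements(source):
    """Return a list of (kind, text) pairs, one per statement."""
    statements = []
    pending = None  # the Keyword or StringLiteral waiting for its Semicolon
    position = 0
    while position < len(source):
        match = TOKEN_PATTERN.match(source, position)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {position}")
        position = match.end()
        kind = match.lastgroup
        if kind == "Whitespace":
            continue  # whitespace is skipped between tokens
        if kind == "Semicolon":
            if pending is None:
                raise SyntaxError("semicolon without a statement body")
            statements.append(pending)
            pending = None
        else:
            if pending is not None:
                raise SyntaxError("missing semicolon")
            pending = (kind, match.group(kind))
    if pending is not None:
        raise SyntaxError("missing semicolon at end of input")
    return statements


sample = (
    'keyword;\r\n'
    '"this is a string";\r\n'
    'keyword;\r\n'
    '"this is another string";\r\n'
    'keyword;\r\n'
)

for kind, text in parse_statements(sample):
    print(kind, repr(text))
```

Because the string rule can never consume a quotation mark, each literal ends at the first closing quotation mark after it opens, so the sample yields five statements in the same order as the AST above.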