Lexical Analysis
CDTk's lexer is driven entirely by Token declarations inside your Grammar subclass. There is no external lexer configuration file — every token is a public static field, and CDTk discovers them automatically via reflection.
How the Lexer Works
When Compiler.CompileText() is called, CDTk:
- Reflects over your Grammar subclass and collects all public static Token fields.
- Assigns each token the name of its C# field (e.g., KW_IF, IDENT) so that error messages and structural role maps work correctly.
- Runs the lexer over the source string, trying tokens in declaration order (top to bottom, left to right).
- Emits a flat Token[] stream and forwards it to the parser.
Because the lexer tries tokens in the order their fields are declared, field order matters. Declare more-specific tokens (keywords, multi-char operators) before catch-all tokens (identifiers, single-char operators).
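The discovery step can be sketched with plain .NET reflection, independent of CDTk. The Tok and DemoGrammar types below are hypothetical stand-ins, not CDTk classes. One caveat: the .NET documentation states that Type.GetFields does not return fields in any guaranteed order, so this sketch sorts by MetadataToken, a commonly used approximation of declaration order. Whether CDTk does the same is not stated here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a token type; not CDTk's Token class.
public sealed class Tok
{
    public string Pattern { get; }
    public string Name { get; set; } = "";
    public Tok(string pattern) => Pattern = pattern;
}

// Hypothetical grammar with tokens declared as public static fields.
public class DemoGrammar
{
    public static Tok KW_IF = new Tok(@"\bif\b");
    public static Tok IDENT = new Tok(@"[A-Za-z_][A-Za-z0-9_]*");
    public static Tok INT = new Tok(@"[0-9]+");
}

public static class TokenDiscovery
{
    // Collects every public static Tok field and names each token after its field.
    public static List<Tok> Collect(Type grammarType)
    {
        var tokens = new List<Tok>();
        var fields = grammarType
            .GetFields(BindingFlags.Public | BindingFlags.Static)
            .Where(f => f.FieldType == typeof(Tok))
            .OrderBy(f => f.MetadataToken); // approximates declaration order
        foreach (var field in fields)
        {
            var tok = (Tok)field.GetValue(null)!; // static field: no instance needed
            tok.Name = field.Name;
            tokens.Add(tok);
        }
        return tokens;
    }
}
```

Calling TokenDiscovery.Collect(typeof(DemoGrammar)) returns the three tokens with their Name set to KW_IF, IDENT, and INT.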
Token Factory Methods
All factory methods are inherited from Grammar:
| Method | Matches | Example Declaration |
|---|---|---|
| Kw(string) | Exact keyword string (word-boundary aware) | public static Token KW_IF = Kw("if"); |
| Id() | [A-Za-z_][A-Za-z0-9_]* identifier | public static Token IDENT = Id(); |
| Num() | Integer or float numeric literal | public static Token INT = Num(); |
| Str() | Double-quoted string literal | public static Token STR = Str(); |
| Punct(string) | Exact punctuation or multi-char operator | public static Token LBRACE = Punct("{"); |
| Op(string) | Alias for Punct (semantic clarity) | public static Token PLUS = Op("+"); |
| Custom(Regex) | Arbitrary regular expression | public static Token HEX = Custom(new Regex(@"0x[0-9A-Fa-f]+")); |
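The word-boundary behaviour listed for Kw can be checked with plain .NET regular expressions. This is a minimal sketch that builds a \bword\b pattern by hand. The KeywordDemo class and its helpers are hypothetical and not part of CDTk.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordDemo
{
    // Builds a word-boundary-aware pattern for a keyword (hypothetical helper).
    public static Regex KeywordPattern(string word) =>
        new Regex(@"\b" + Regex.Escape(word) + @"\b");

    // True when the keyword matches starting exactly at index 0 of the input.
    public static bool MatchesAtStart(string word, string input)
    {
        Match m = KeywordPattern(word).Match(input);
        return m.Success && m.Index == 0;
    }
}
```

With this helper, MatchesAtStart("if", "if (x)") is true, while MatchesAtStart("if", "iffy") is false because no word boundary follows the first two characters.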
Kw("if") matches the word "if" only at a word boundary, so it will not match the start of "iffy". The underlying regex is \bif\b. Always declare keywords before Id(), because Id() also matches keyword text and the first matching token wins.
Token Priority
The lexer tries each Token in declaration order at the current position. The first match wins. This means you must declare more-specific patterns before less-specific ones:
public class MyLang : Grammar {
    // Keywords must come before IDENT, or "return" would be scanned as IDENT
    public static Token KW_RETURN = Kw("return");
    public static Token KW_IF     = Kw("if");
    public static Token KW_ELSE   = Kw("else");
    public static Token KW_WHILE  = Kw("while");

    // Multi-char operators before single-char operators
    public static Token OP_EQ     = Op("==");
    public static Token OP_NEQ    = Op("!=");
    public static Token OP_ARROW  = Op("->");
    public static Token OP_ASSIGN = Op("=");
    public static Token OP_MINUS  = Op("-");

    // Catch-all: identifiers and literals last
    public static Token IDENT = Id();
    public static Token INT   = Num();
    public static Token STR   = Str();
}
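Why order matters can be seen in a small first-match-wins scanner. This is a sketch under the assumption that the lexer tries each pattern at the current position in list order, as described above. MiniLexer and its rule tuples are illustrative and do not reflect CDTk's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MiniLexer
{
    // Scans input, trying rules in list order at each position; the first match wins.
    // Each result is formatted as "RuleName:matchedText".
    public static List<string> Scan(
        string input, IReadOnlyList<(string Name, string Pattern)> rules)
    {
        var output = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            if (char.IsWhiteSpace(input[pos])) { pos++; continue; }
            bool matched = false;
            foreach (var (name, pattern) in rules)
            {
                // \G anchors the match at the start position passed to Match.
                // A real lexer would compile each Regex once, not per attempt.
                Match m = new Regex(@"\G(?:" + pattern + ")").Match(input, pos);
                if (m.Success && m.Length > 0)
                {
                    output.Add(name + ":" + m.Value);
                    pos += m.Length;
                    matched = true;
                    break;
                }
            }
            if (!matched)
                throw new FormatException($"Unexpected character at index {pos}");
        }
        return output;
    }
}
```

Scanning "return a == b" with the rules ordered KW_RETURN (\breturn\b), OP_EQ (==), OP_ASSIGN (=), IDENT yields KW_RETURN:return, IDENT:a, OP_EQ:==, IDENT:b. Moving IDENT to the front and OP_ASSIGN ahead of OP_EQ yields IDENT:return, IDENT:a, OP_ASSIGN:=, OP_ASSIGN:=, IDENT:b, which is the failure mode the ordering rule prevents.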
Whitespace & Comments
CDTk automatically skips whitespace (spaces, tabs, newlines) between tokens. Single-line comments starting with // and block comments delimited by /* ... */ are also skipped by default. To change this behaviour, override the WhitespacePattern property in your Grammar subclass:
using System.Text.RegularExpressions;
using CDTk;

public class PythonGrammar : Grammar {
    // Python uses indentation: skip only spaces and tabs here,
    // so that newlines can be handled separately for INDENT/DEDENT
    protected override Regex WhitespacePattern =>
        new Regex(@"[ \t]+");
}
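For comparison, the default skipping described at the start of this section can be approximated with a single regular expression. The pattern below is an assumption about what "whitespace, // comments, and /* ... */ comments" looks like as a regex, not CDTk's actual default. SkipDemo is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipDemo
{
    // One or more runs of whitespace, // line comments, or /* ... */ block comments.
    // \G anchors the match at the start position passed to Match.
    private static readonly Regex Skippable = new Regex(
        @"\G(?:\s+|//[^\n]*|/\*[\s\S]*?\*/)+");

    // Returns the index of the first character at or after pos that is not skippable.
    public static int SkipFrom(string input, int pos)
    {
        Match m = Skippable.Match(input, pos);
        return m.Success ? pos + m.Length : pos;
    }
}
```

A lexer loop would call SkipFrom before each token attempt. Replacing the pattern with [ \t]+ reproduces the narrower behaviour of the PythonGrammar override, where newlines are no longer skipped.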
Structural Roles
A structural role is a string label you attach to a token that tells CDTk how to translate it across grammars. Roles are declared in a public static Map Structural field:
public static Map Structural = new() {
    { KW_IF,     "IfKeyword" },
    { KW_ELSE,   "ElseKeyword" },
    { KW_WHILE,  "LoopKeyword" },
    { KW_RETURN, "ReturnKeyword" },
    { KW_VOID,   "TypeKeyword" }, // TypeKeyword: preserved in return position
    { KW_INT,    "TypeKeyword" },
};
CDTk's translation engine (Step 3) and PrettyPrinter.format both read these roles to decide how to map tokens from the source grammar to the target grammar. Tokens with the role "TypeKeyword" are preserved as return-type annotations in the output, for example the void in function void Main(...).
| Built-in Role | Effect |
|---|---|
| IfKeyword | Maps to conditional keyword in target grammar |
| ElseKeyword | Maps to else/otherwise keyword |
| LoopKeyword | Maps to while/for keyword |
| ReturnKeyword | Maps to return/yield keyword |
| FuncKeyword | Maps to function/def/fn keyword |
| ClassKeyword | Maps to class/struct keyword |
| TypeKeyword | Preserved in return-type position; not stripped by Step 3 |
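One way to picture role-based translation: the source grammar maps a keyword to a role, and the target grammar maps the same role back to its own keyword. The dictionaries and the RoleTranslator class below are hypothetical and only illustrate the idea. CDTk's actual translation engine is not shown in this section.

```csharp
using System;
using System.Collections.Generic;

public static class RoleTranslator
{
    // Translates a source keyword to the target keyword that carries the same role.
    // Returns the input unchanged when it has no role or the target lacks that role.
    public static string Translate(
        string sourceKeyword,
        IReadOnlyDictionary<string, string> sourceRoles,    // keyword -> role
        IReadOnlyDictionary<string, string> targetKeywords) // role -> keyword
    {
        if (sourceRoles.TryGetValue(sourceKeyword, out var role) &&
            targetKeywords.TryGetValue(role, out var target))
        {
            return target;
        }
        return sourceKeyword;
    }
}
```

Given source roles { "fn": "FuncKeyword", "void": "TypeKeyword" } and target keywords { "FuncKeyword": "def" }, Translate maps "fn" to "def" and returns "void" unchanged. That mirrors, in simplified form, how a TypeKeyword token is preserved instead of being replaced.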
Applying Roles Automatically
You do not need a constructor. CDTk auto-discovers all public static Map fields in your grammar class via ApplyStaticMaps(), which is called once during grammar initialization. Each entry in the map is passed to AssignTokenStructuralRole(token, role):
// CDTk internal: called once per Grammar instance
void ApplyStaticMaps(Grammar g) {
    foreach (var field in g.GetType()
        .GetFields(BindingFlags.Public | BindingFlags.Static)
        .Where(f => f.FieldType == typeof(Map))) {
        var map = (Map)field.GetValue(null)!;
        foreach (var (token, role) in map)
            AssignTokenStructuralRole(token, role);
    }
}
Complete Example
The following is a minimal, fully working grammar with keywords, operators, delimiters, and structural roles:
using CDTk;

public class ArrowLang : Grammar {
    // Keywords (declared first, highest priority)
    public static Token KW_FN     = Kw("fn");
    public static Token KW_LET    = Kw("let");
    public static Token KW_RETURN = Kw("return");
    public static Token KW_IF     = Kw("if");
    public static Token KW_VOID   = Kw("void");
    public static Token KW_INT    = Kw("int");

    // Multi-char operators before single-char
    public static Token ARROW  = Op("->");
    public static Token OP_EQ  = Op("==");
    public static Token PLUS   = Op("+");
    public static Token MINUS  = Op("-");
    public static Token ASSIGN = Op("=");

    // Delimiters
    public static Token LBRACE = Punct("{");
    public static Token RBRACE = Punct("}");
    public static Token LPAREN = Punct("(");
    public static Token RPAREN = Punct(")");
    public static Token SEMI   = Punct(";");

    // Catch-all tokens (lowest priority)
    public static Token IDENT = Id();
    public static Token INT   = Num();

    // Structural roles: discovered automatically by CDTk
    public static Map Structural = new() {
        { KW_FN,     "FuncKeyword" },
        { KW_RETURN, "ReturnKeyword" },
        { KW_IF,     "IfKeyword" },
        { KW_VOID,   "TypeKeyword" },
        { KW_INT,    "TypeKeyword" },
    };
}
Compiler.Verbose = true.