Lexical Analysis

CDTk's lexer is driven entirely by Token declarations inside your Grammar subclass. There is no external lexer configuration file — every token is a public static field, and CDTk discovers them automatically via reflection.

On This Page

How the Lexer Works
Token Factory Methods
Token Priority
Whitespace & Comments
Structural Roles
Applying Roles Automatically
Complete Example

How the Lexer Works

When Compiler.CompileText() is called, CDTk:

1. Reflects over your Grammar subclass and collects all public static Token fields.
2. Assigns each token its C# field name (e.g., KW_IF, IDENT) so that error messages and structural role maps work correctly.
3. Runs the lexer over the source string, trying tokens in declaration order (top to bottom, left to right) at each position.
4. Emits a flat Token[] stream and forwards it to the parser.

Because the lexer tries tokens in the order their fields are declared, field order matters. Declare more-specific tokens (keywords, multi-char operators) before catch-all tokens (identifiers, single-char operators).
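The discovery step can be sketched with plain reflection. The block below is a minimal illustration, not CDTk's source: DemoToken, DemoGrammar, and TokenDiscovery are hypothetical stand-ins for the real Token and Grammar types. .NET does not document any ordering for Type.GetFields, so the sketch sorts by MetadataToken, which in practice typically follows source order. How CDTk itself orders fields is an assumption here.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for CDTk's Token type.
public sealed class DemoToken
{
    public string Name { get; set; } = "";
    public string Pattern { get; }
    public DemoToken(string pattern) => Pattern = pattern;
}

// Hypothetical stand-in for a Grammar subclass.
public class DemoGrammar
{
    public static DemoToken KW_IF = new DemoToken(@"\bif\b");
    public static DemoToken IDENT = new DemoToken(@"[A-Za-z_][A-Za-z0-9_]*");
}

public static class TokenDiscovery
{
    // Collects every public static DemoToken field and stamps each token
    // with the name of the field that holds it.
    public static DemoToken[] Collect(Type grammarType)
    {
        return grammarType
            .GetFields(BindingFlags.Public | BindingFlags.Static)
            .Where(f => f.FieldType == typeof(DemoToken))
            .OrderBy(f => f.MetadataToken) // assumption: approximates declaration order
            .Select(f =>
            {
                var token = (DemoToken)f.GetValue(null)!;
                token.Name = f.Name;
                return token;
            })
            .ToArray();
    }
}
```

Calling TokenDiscovery.Collect(typeof(DemoGrammar)) returns both tokens with Name set to KW_IF and IDENT, which is the information a lexer needs for error messages and role lookups.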

Token Factory Methods

All factory methods are inherited from Grammar:

Method	Matches	Example Declaration
`Kw(string)`	Exact keyword string (word-boundary aware)	`public static Token KW_IF = Kw("if");`
`Id()`	`[A-Za-z_][A-Za-z0-9_]*` identifier	`public static Token IDENT = Id();`
`Num()`	Integer or float numeric literal	`public static Token INT = Num();`
`Str()`	Double-quoted string literal	`public static Token STR = Str();`
`Punct(string)`	Exact punctuation or multi-char operator	`public static Token LBRACE = Punct("{");`
`Op(string)`	Alias for `Punct` (semantic clarity)	`public static Token PLUS = Op("+");`
`Custom(Regex)`	Arbitrary regular expression	`public static Token HEX = Custom(new Regex(@"0x[0-9A-Fa-f]+"));`

💡

Kw() vs Id()

Kw("if") matches the word if only when it is delimited by word boundaries, so it will not match inside iffy. The underlying regex is \bif\b. Always declare keywords before Id(), because Id() also matches keyword text and the first match wins.
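The word-boundary behaviour can be checked directly with .NET's Regex class. This sketch uses only the \bif\b pattern quoted above; the KeywordBoundaryDemo class and its MatchesAtStart helper are hypothetical names introduced for illustration and are not part of CDTk.

```csharp
using System.Text.RegularExpressions;

public static class KeywordBoundaryDemo
{
    // The pattern the tip above attributes to Kw("if").
    private static readonly Regex IfKeyword = new Regex(@"\bif\b");

    // True only when the keyword matches at the very start of the input,
    // which mimics a lexer testing a token at its current position.
    public static bool MatchesAtStart(string input)
    {
        Match m = IfKeyword.Match(input);
        return m.Success && m.Index == 0;
    }
}
```

MatchesAtStart("if (x)") is true, while MatchesAtStart("iffy") and MatchesAtStart("if_else") are both false, because f and _ are word characters and no boundary follows the keyword text.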

Token Priority

The lexer tries each Token in declaration order at the current position. The first match wins. This means you must declare more-specific patterns before less-specific ones:

public class MyLang : Grammar {
    // Keywords must come before IDENT, or "return" would be scanned as IDENT
    public static Token KW_RETURN = Kw("return");
    public static Token KW_IF     = Kw("if");
    public static Token KW_ELSE   = Kw("else");
    public static Token KW_WHILE  = Kw("while");

    // Multi-char operators before single-char operators
    public static Token OP_EQ     = Op("==");
    public static Token OP_NEQ    = Op("!=");
    public static Token OP_ARROW  = Op("->");
    public static Token OP_ASSIGN = Op("=");
    public static Token OP_MINUS  = Op("-");

    // Catch-all: identifiers and literals last
    public static Token IDENT = Id();
    public static Token INT   = Num();
    public static Token STR   = Str();
}
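To see why ordering matters, here is a small first-match-wins scanner. It is a sketch under stated assumptions, not CDTk's lexer: FirstMatchLexer, Scan, and the two rule tables are hypothetical, and rules are modelled as plain (name, pattern) pairs instead of Token fields.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstMatchLexer
{
    // Specific patterns first: "==" is tried before "=".
    public static readonly (string Name, string Pattern)[] SpecificFirst =
    {
        ("OP_EQ", "=="), ("OP_ASSIGN", "="), ("IDENT", "[A-Za-z_][A-Za-z0-9_]*"),
    };

    // Same rules in the wrong order: "=" shadows "==".
    public static readonly (string Name, string Pattern)[] GeneralFirst =
    {
        ("OP_ASSIGN", "="), ("OP_EQ", "=="), ("IDENT", "[A-Za-z_][A-Za-z0-9_]*"),
    };

    // Tries each rule in list order at the current position and takes the first hit.
    public static List<string> Scan(string source, IReadOnlyList<(string Name, string Pattern)> rules)
    {
        var output = new List<string>();
        int pos = 0;
        while (pos < source.Length)
        {
            if (char.IsWhiteSpace(source[pos])) { pos++; continue; }

            bool matched = false;
            foreach (var (name, pattern) in rules)
            {
                // \G anchors the match at exactly 'pos'. A real lexer would
                // build these Regex objects once instead of on every attempt.
                Match m = new Regex(@"\G(?:" + pattern + ")").Match(source, pos);
                if (m.Success && m.Length > 0)
                {
                    output.Add(name);
                    pos += m.Length;
                    matched = true;
                    break; // first match wins
                }
            }
            if (!matched)
                throw new FormatException($"Unexpected character at index {pos}");
        }
        return output;
    }
}
```

Scanning "a == b" with SpecificFirst yields IDENT, OP_EQ, IDENT. With GeneralFirst the same input yields IDENT, OP_ASSIGN, OP_ASSIGN, IDENT, because "=" matches before "==" is ever tried.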

Whitespace & Comments

CDTk automatically skips whitespace (spaces, tabs, newlines) between tokens. Single-line comments starting with // and block comments /* ... */ are also skipped by default. To change this behaviour, override WhitespacePattern in your Grammar subclass and provide a custom pattern:

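using System.Text.RegularExpressions;

public class PythonGrammar : Grammar {
    // Python is indentation-sensitive: skip only spaces and tabs so that
    // newlines stay visible and can be handled separately for INDENT/DEDENT.
    private static readonly Regex InlineWhitespace = new Regex(@"[ \t]+");

    protected override Regex WhitespacePattern => InlineWhitespace;
}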
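The difference between the default skipping and an indentation-aware override can be sketched as follows. The patterns are assumptions modelled on the description above (whitespace, // comments, /* ... */ comments), not CDTk's actual defaults, and TriviaSkipper is a hypothetical helper.

```csharp
using System.Text.RegularExpressions;

public static class TriviaSkipper
{
    // Assumed default: any run of whitespace, line comments, and block comments.
    public static readonly Regex DefaultTrivia = new Regex(
        @"\G(?:\s+|//[^\n]*|/\*.*?\*/)+", RegexOptions.Singleline);

    // Indentation-aware variant: spaces and tabs only, newlines are kept.
    public static readonly Regex InlineOnly = new Regex(@"\G[ \t]+");

    // Returns the index of the first character at or after 'pos' that the
    // given trivia pattern does not consume.
    public static int Skip(Regex trivia, string source, int pos)
    {
        Match m = trivia.Match(source, pos);
        return m.Success ? pos + m.Length : pos;
    }
}
```

With DefaultTrivia, skipping from index 0 of "  // hi\n/* c */x" lands on the x at index 15. With InlineOnly, skipping from index 0 of " \t\nx" stops at the newline at index 2, leaving it available for INDENT/DEDENT handling.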

Structural Roles

A structural role is a string label you attach to a token that tells CDTk how to translate it across grammars. Roles are declared in a public static Map Structural field:

public static Map Structural = new() {
    { KW_IF,     "IfKeyword"     },
    { KW_ELSE,   "ElseKeyword"   },
    { KW_WHILE,  "LoopKeyword"   },
    { KW_RETURN, "ReturnKeyword" },
    { KW_VOID,   "TypeKeyword"   },  // TypeKeyword: preserved in return position
    { KW_INT,    "TypeKeyword"   },
};

CDTk's translation engine (Step 3) and PrettyPrinter.format both read these roles to decide how to map tokens from the source grammar to the target grammar. Tokens with the role "TypeKeyword" are kept as return-type annotations in the output, for example the void in function void Main(...).

Built-in Role	Effect
`IfKeyword`	Maps to conditional keyword in target grammar
`ElseKeyword`	Maps to else/otherwise keyword
`LoopKeyword`	Maps to while/for keyword
`ReturnKeyword`	Maps to return/yield keyword
`FuncKeyword`	Maps to function/def/fn keyword
`ClassKeyword`	Maps to class/struct keyword
`TypeKeyword`	Preserved in return-type position; not stripped by Step 3
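Conceptually, a role acts as a shared key between two grammars: the source grammar maps a token to a role, and the target grammar maps that role back to its own keyword. The sketch below illustrates only this idea. RoleTranslationSketch and its two dictionaries are hypothetical, keywords are modelled as plain strings, and nothing here reflects how CDTk's translation engine is actually implemented.

```csharp
using System.Collections.Generic;

public static class RoleTranslationSketch
{
    // Source grammar: keyword text -> structural role (mirrors a Structural map).
    private static readonly Dictionary<string, string> SourceRoles =
        new Dictionary<string, string>
        {
            { "fn",     "FuncKeyword"   },
            { "if",     "IfKeyword"     },
            { "return", "ReturnKeyword" },
        };

    // Target grammar: structural role -> keyword text (assumed Python-like target).
    private static readonly Dictionary<string, string> TargetKeywords =
        new Dictionary<string, string>
        {
            { "FuncKeyword",   "def"    },
            { "IfKeyword",     "if"     },
            { "ReturnKeyword", "return" },
        };

    // Translates one lexeme via its role; lexemes without a role pass through unchanged.
    public static string Translate(string lexeme)
    {
        if (SourceRoles.TryGetValue(lexeme, out var role) &&
            TargetKeywords.TryGetValue(role, out var target))
        {
            return target;
        }
        return lexeme;
    }
}
```

Here Translate("fn") returns "def", while an identifier such as "total" has no role and is returned as-is.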

Applying Roles Automatically

You do not need a constructor. CDTk auto-discovers all public static Map fields in your grammar class via ApplyStaticMaps(), which is called once during grammar initialization. Each entry in the map is passed to AssignTokenStructuralRole(token, role):

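using System.Linq;
using System.Reflection;

// CDTk internal — called once per Grammar instance
void ApplyStaticMaps(Grammar g) {
    // Find every public static field of type Map on the concrete grammar class
    foreach (var field in g.GetType()
                            .GetFields(BindingFlags.Public | BindingFlags.Static)
                            .Where(f => f.FieldType == typeof(Map))) {
        var map = (Map)field.GetValue(null)!;
        // Register each (token, role) pair with the grammar
        foreach (var (token, role) in map)
            AssignTokenStructuralRole(token, role);
    }
}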

Complete Example

The following is a minimal, fully-working grammar with keywords, operators, and structural roles:

using CDTk;

public class ArrowLang : Grammar {
    // Keywords (declared first — highest priority)
    public static Token KW_FN     = Kw("fn");
    public static Token KW_LET    = Kw("let");
    public static Token KW_RETURN = Kw("return");
    public static Token KW_IF     = Kw("if");
    public static Token KW_VOID   = Kw("void");
    public static Token KW_INT    = Kw("int");

    // Multi-char operators before single-char
    public static Token ARROW  = Op("->");
    public static Token OP_EQ  = Op("==");
    public static Token PLUS   = Op("+");
    public static Token MINUS  = Op("-");
    public static Token ASSIGN = Op("=");

    // Delimiters
    public static Token LBRACE = Punct("{");
    public static Token RBRACE = Punct("}");
    public static Token LPAREN = Punct("(");
    public static Token RPAREN = Punct(")");
    public static Token SEMI   = Punct(";");

    // Catch-all tokens (lowest priority)
    public static Token IDENT = Id();
    public static Token INT   = Num();

    // Structural roles — discovered automatically by CDTk
    public static Map Structural = new() {
        { KW_FN,     "FuncKeyword"   },
        { KW_RETURN, "ReturnKeyword" },
        { KW_IF,     "IfKeyword"     },
        { KW_VOID,   "TypeKeyword"   },
        { KW_INT,    "TypeKeyword"   },
    };
}

✓

Grammar Token Counts

CSharp grammar: 107 tokens. Python: 85. WASM: 107. LLVM IR: 147. Token counts are logged when Compiler.Verbose is set to true.
