Assembler Phases
The Keurnel Assembler transforms assembly source code into executable machine code through a series of well-defined phases. Each phase performs specific transformations and validations.
Pipeline Overview
The Keurnel Assembler follows a traditional multi-pass architecture, where each phase builds upon the output of the previous phase. This modular design ensures clean separation of concerns and allows for easier debugging and optimization.
Phase Details
Pre-processing
Implemented — The pre-processing phase prepares the source code for assembly by handling directives, macros, and file inclusions.
Lexical Analysis
Implemented — The lexer tokenizes the pre-processed source code into a stream of tokens for the parser. It is architecture-agnostic and uses an ArchitectureProfile for classification.
Parsing (Syntax Analysis)
Implemented — The parser analyzes the token stream to build an Abstract Syntax Tree (AST) representing the program structure.
Semantic Analysis
Implemented — Semantic analysis validates the logical correctness of the assembly program beyond syntax.
Code Generation
Planned — The code generator translates the validated AST into machine code instructions.
Linking & Output
Planned — The final phase combines all code segments and produces the executable binary output.
Pre-processor
Processing Pipeline
The pre-processor executes in a specific order to ensure correct dependency resolution:
1. Include File Processing
The %include directive allows you to split your assembly code across multiple files. The pre-processor handles includes through a three-pass algorithm for validation and replacement.
Internal Data Structure:

```go
// PreProcessingInclusion - tracks each include directive
type PreProcessingInclusion struct {
	IncludedFilePath string // Path of the included file
	LineNumber       int    // Line number for error reporting
}
```

Three-Pass Processing:
- Pass 1: Collection & Validation — Collects all `%include "path"` directives. Validates that only `.kasm` files are included (any other extension causes a pre-processing error)
- Pass 2: Duplicate Detection — Detects duplicate include directives. A file may only be included once; duplicates are a pre-processing error
- Pass 3: Content Replacement — Replaces each directive with file content wrapped in traceability comments: `; FILE: path` and `; END FILE: path`
Supported file types:
.kasm — Assembly source files ; Include shared macros
%include "lib/macros.kasm"
; Relative path support
%include "../common/utils.kasm"Validation Rules:
- Only
.kasmfiles may be included - Each file can only be included once (duplicates cause errors)
- File content is wrapped in boundary comments for traceability
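Passes 1 and 2 can be sketched as a small validation helper. This is a minimal sketch, assuming a hypothetical `validateIncludes` function; the real pre-processor's API and its `PreProcessingInclusion` bookkeeping are not reproduced here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validateIncludes sketches passes 1 and 2: collect every %include
// directive, reject non-.kasm extensions, and reject duplicates.
// The function name and error strings are illustrative only.
func validateIncludes(source string) []string {
	re := regexp.MustCompile(`%include\s+"([^"]+)"`)
	seen := map[string]bool{}
	var errs []string
	for i, line := range strings.Split(source, "\n") {
		m := re.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		path := m[1]
		if !strings.HasSuffix(path, ".kasm") {
			errs = append(errs, fmt.Sprintf("line %d: only .kasm files may be included: %q", i+1, path))
			continue
		}
		if seen[path] {
			errs = append(errs, fmt.Sprintf("line %d: duplicate include of %q", i+1, path))
			continue
		}
		seen[path] = true
	}
	return errs
}

func main() {
	src := "%include \"lib/macros.kasm\"\n%include \"notes.txt\"\n%include \"lib/macros.kasm\""
	for _, e := range validateIncludes(src) {
		fmt.Println(e)
	}
}
```

Pass 3 (content replacement) is omitted; it would substitute each validated directive with the file's contents wrapped in the boundary comments described above.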
2. Macro Expansion & Substitution
Macros allow you to define reusable code blocks that are expanded inline during pre-processing. The Keurnel pre-processor uses a multi-pass approach: first detecting macros, building a macro table, collecting all calls, and then performing the expansion.
Internal Data Structures:

```go
// MacroParameter - represents a single macro parameter
type MacroParameter struct {
	Name string // Parameter name (e.g., "paramA", "paramB")
}

// MacroCall - tracks each invocation of a macro
type MacroCall struct {
	Name       string   // Name of the macro being called
	Arguments  []string // Arguments in order they are provided
	LineNumber int      // Line number for error reporting
}

// Macro - complete macro definition
type Macro struct {
	Name       string                    // Macro identifier
	Parameters map[string]MacroParameter // Parameters indexed by name
	Body       string                    // Code to expand
	Calls      []MacroCall               // All invocations found
}
```

Processing Stages:
1. `PreProcessingMacroTable()` — Extracts all macro definitions into a table indexed by name. Internally uses `PreProcessingHasMacros()` to detect macros via regex pattern `%macro\s+\w+\s*\d*`. Parses parameter count and captures body until `%endmacro`
2. `PreProcessingColectMacroCalls()` — Finds all macro invocations in source, validates argument count matches parameter count, and stores line numbers for error reporting
3. `PreProcessingReplaceMacroCalls()` — Performs actual expansion by replacing `%1`, `%2`, etc. with arguments and inserting a `; MACRO: name` comment
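The substitution at the heart of stage 3 can be sketched as follows. `expandMacroBody` is a hypothetical helper, not the assembler's `PreProcessingReplaceMacroCalls()`, which additionally inserts the `; MACRO: name` comment and rewrites the whole source.

```go
package main

import (
	"fmt"
	"strings"
)

// expandMacroBody replaces the positional placeholders %1, %2, ...
// in a macro body with the call's arguments. Illustrative only.
func expandMacroBody(body string, args []string) string {
	// Replace higher numbers first so %10 is not clobbered by %1.
	for i := len(args) - 1; i >= 0; i-- {
		body = strings.ReplaceAll(body, fmt.Sprintf("%%%d", i+1), args[i])
	}
	return body
}

func main() {
	body := "push %1\nmov rdi, %2\ncall print_value\npop %1"
	fmt.Println("; MACRO: PRINT_REG")
	fmt.Println(expandMacroBody(body, []string{"rax", "format_hex"}))
}
```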
Syntax Reference:
- `%macro NAME N` — Begins macro definition with N parameters (N is optional, defaults to 0)
- `%endmacro` — Ends the macro definition block
- `%1, %2, ...` — Positional parameter placeholders replaced during expansion
Error Handling:
If the number of arguments in a macro call doesn't match the expected parameter count, the assembler will panic with a detailed error message including the macro name, expected count, actual count, and line number.
```asm
; Define a macro with 2 parameters
%macro PRINT_REG 2
push %1
mov rdi, %2
call print_value
pop %1
%endmacro

; Usage in source code
PRINT_REG rax, format_hex
PRINT_REG rbx, format_dec

; After pre-processing expands to:
; MACRO: PRINT_REG
push rax
mov rdi, format_hex
call print_value
pop rax
; MACRO: PRINT_REG
push rbx
mov rdi, format_dec
call print_value
pop rbx
```

3. Constant Definitions & Symbol Table
The pre-processor builds a symbol table for use in conditional assembly. Symbols can be defined explicitly with %define or implicitly through macro definitions.
Three-Pass Symbol Table Building:
- Pass 1Collection & Validation— Collects all
%define SYMBOL_NAMEdirectives. Validates that symbol names are non-empty valid identifiers - Pass 2Duplicate Detection— Detects duplicate
%definedirectives. A symbol may only be defined once; duplicates are a pre-processing error - Pass 3Macro Integration— Adds all macro names from the macro table as defined symbols, so
%ifdef/%ifndefcan test for macro existence
Definition syntax:
- `%define SYMBOL` — Defines a symbol (sets it to `true` in the symbol table)

Symbol Table Behavior:
- Symbols are stored as `map[string]bool` — presence indicates definition
- Macro names are automatically added to the symbol table
- Use `%ifdef MACRO_NAME` to check if a macro is defined
```asm
; Define symbols for conditional assembly
%define DEBUG
%define PLATFORM_X86_64

; Define a macro - automatically added to symbol table
%macro LOG_MSG 1
push rdi
mov rdi, %1
call print_string
pop rdi
%endmacro

; Check if macro exists before using
%ifdef LOG_MSG
LOG_MSG "Application started"
%else
; No logging macro available
%endif

; Platform-specific code
%ifdef PLATFORM_X86_64
; 64-bit specific instructions
mov rax, [rbp + 8]
%endif
```

4. Conditional Assembly Directives
Conditional directives allow you to include or exclude code based on symbol definitions. The pre-processor uses a two-pass algorithm with stack-based block matching.
Internal Data Structure:

```go
// conditionalBlock - tracks a complete conditional section
type conditionalBlock struct {
	ifDirective string // "ifdef" or "ifndef"
	symbol      string // symbol being tested
	ifStart     int    // byte offset of %ifdef/%ifndef start
	ifEnd       int    // byte offset of %ifdef/%ifndef end
	elseStart   int    // byte offset of %else start (-1 if absent)
	elseEnd     int    // byte offset of %else end (-1 if absent)
	endifStart  int    // byte offset of %endif start
	endifEnd    int    // byte offset of %endif end
	lineNumber  int    // line number for error reporting
}
```

Two-Pass Processing:
- Pass 1: Structural Validation — Uses a stack to match every `%ifdef`/`%ifndef` with its `%endif`. Validates at most one `%else` per block. Panics on structural errors
- Pass 2: Evaluation & Replacement — Evaluates each block against the defined symbols map. Processes blocks in reverse order to preserve byte offsets during replacement
Available directives:
- `%ifdef SYMBOL` — Include if symbol is defined
- `%ifndef SYMBOL` — Include if symbol is NOT defined
- `%else` — Alternative branch (optional, max one per block)
- `%endif` — End conditional block (required)

Error Conditions:
- `%else` without matching `%ifdef`/`%ifndef`
- Duplicate `%else` within the same conditional block
- `%endif` without matching `%ifdef`/`%ifndef`
- Unclosed `%ifdef`/`%ifndef` (no matching `%endif`)
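Pass 1's stack-based matching can be sketched as a line-oriented check. `checkConditionalStructure` is a hypothetical helper; the real pass works on byte offsets, records them in `conditionalBlock` values, and panics rather than returning an error.

```go
package main

import (
	"fmt"
	"strings"
)

// checkConditionalStructure matches every %ifdef/%ifndef with its
// %endif using a stack, and rejects stray or duplicate %else.
func checkConditionalStructure(lines []string) error {
	type frame struct {
		line    int
		sawElse bool
	}
	var stack []frame
	for i, raw := range lines {
		fields := strings.Fields(raw)
		if len(fields) == 0 {
			continue
		}
		switch fields[0] {
		case "%ifdef", "%ifndef":
			stack = append(stack, frame{line: i + 1})
		case "%else":
			if len(stack) == 0 {
				return fmt.Errorf("line %d: %%else without matching %%ifdef/%%ifndef", i+1)
			}
			if stack[len(stack)-1].sawElse {
				return fmt.Errorf("line %d: duplicate %%else", i+1)
			}
			stack[len(stack)-1].sawElse = true
		case "%endif":
			if len(stack) == 0 {
				return fmt.Errorf("line %d: %%endif without matching %%ifdef/%%ifndef", i+1)
			}
			stack = stack[:len(stack)-1]
		}
	}
	if len(stack) != 0 {
		return fmt.Errorf("line %d: unclosed conditional (no matching %%endif)", stack[len(stack)-1].line)
	}
	return nil
}

func main() {
	nested := []string{"%ifdef FEATURE_A", "%ifdef FEATURE_B", "%endif", "%endif"}
	fmt.Println(checkConditionalStructure(nested)) // nested blocks are valid
	fmt.Println(checkConditionalStructure([]string{"%else"}))
}
```

The stack is what makes nested conditionals (shown in the example below) work: each `%endif` closes only the innermost open block.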
```asm
%define DEBUG

%ifdef DEBUG
; Debug-only code - included because DEBUG is defined
call log_registers
call dump_stack
%else
; Production code - excluded
%endif

%ifndef RELEASE
; Include assertions in non-release builds
%include "debug/assertions.kasm"
%endif

; Nested conditionals are supported
%ifdef FEATURE_A
%ifdef FEATURE_B
; Code when both FEATURE_A and FEATURE_B are defined
%endif
%endif
```

Lexical Analysis (Lexer) — In Progress
The lexer (tokenizer) transforms a pre-processed .kasm source string into an ordered sequence of tokens. Each token carries a type, literal value, and source location. The lexer sits between the pre-processor and the parser in the assembly pipeline.
Architecture-Agnostic Design
The lexer does not hardcode any register names, instruction mnemonics, or keywords. Instead, it receives an ArchitectureProfile at construction time that supplies these sets for the target architecture. This means the same lexer can tokenize source code for x86_64, ARM, RISC-V, or any future architecture without modification.
```text
pre-processed source
          │
          ▼
┌───────────────────────────────────────────────┐
│                     Lexer                     │
│  LexerNew(input, profile) → Start() → []Token │
│                                               │
│   ┌─────────────────────┐                     │
│   │ ArchitectureProfile │ ← injected at       │
│   │  · Registers()      │   construction      │
│   │  · Instructions()   │                     │
│   │  · Keywords()       │                     │
│   └─────────────────────┘                     │
└──────────────────────┬────────────────────────┘
                       │ ordered token slice
                       ▼
                  parser input
```

Architecture Profile Interface
An ArchitectureProfile represents a validated, immutable vocabulary for a specific hardware architecture. It provides three lookup maps for registers, instructions, and keywords.
Interface Definition:

```go
type ArchitectureProfile interface {
	// Registers returns the set of recognised register names (lower-case).
	Registers() map[string]bool
	// Instructions returns the set of recognised instruction mnemonics (lower-case).
	Instructions() map[string]bool
	// Keywords returns the set of reserved language keywords (lower-case).
	Keywords() map[string]bool
}
```

Profile Properties:
- All maps store lower-case keys — classification is case-insensitive
- Maps are guaranteed non-nil — no nil guards needed before lookup
- Profile is immutable after construction — safe for concurrent use
- Each lookup is O(1) using pre-built maps
Built-in Profiles:
NewX8664Profile() returns an ArchitectureProfile populated with the x86_64 register set, instruction set, and default keywords. Additional profiles for ARM64 and RISC-V may be added in the future.
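A hypothetical minimal profile illustrates what implementing the interface takes. `NewToyProfile` and its vocabulary are invented for this sketch; they are not part of the assembler.

```go
package main

import "fmt"

// ArchitectureProfile as defined above: three non-nil, lower-case
// lookup maps.
type ArchitectureProfile interface {
	Registers() map[string]bool
	Instructions() map[string]bool
	Keywords() map[string]bool
}

// toyProfile is an illustrative implementation, not NewX8664Profile.
type toyProfile struct {
	regs, instrs, kws map[string]bool
}

func (p toyProfile) Registers() map[string]bool    { return p.regs }
func (p toyProfile) Instructions() map[string]bool { return p.instrs }
func (p toyProfile) Keywords() map[string]bool     { return p.kws }

// NewToyProfile builds the maps once so every lookup is O(1) and the
// value can be shared safely after construction.
func NewToyProfile() ArchitectureProfile {
	return toyProfile{
		regs:   map[string]bool{"rax": true, "rbx": true},
		instrs: map[string]bool{"mov": true, "ret": true},
		kws:    map[string]bool{"namespace": true},
	}
}

func main() {
	p := NewToyProfile()
	fmt.Println(p.Registers()["rax"], p.Instructions()["mov"], p.Keywords()["use"])
}
```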
Lexer Construction
The lexer is constructed via LexerNew(input, profile) which accepts the pre-processed source string and an ArchitectureProfile. Construction is infallible — any valid string (including empty) is accepted.
Construction Semantics:

```go
// LexerNew is the sole constructor
lexer := LexerNew(preProcessedSource, NewX8664Profile())

// After construction:
// - Position and ReadPosition start at 0
// - Line starts at 1
// - Column starts at 0 (incremented to 1 on first char)
// - Tokens slice is initialized as empty, non-nil
// - First character is loaded via readChar()

tokens := lexer.Start() // Perform tokenization
```

State After Construction:
- `Position` — Starts at 0
- `ReadPosition` — Starts at 0
- `Line` — Starts at 1 (1-based)
- `Column` — Starts at 0, becomes 1 on first char
- `Ch` — Holds first character (or 0/NUL if empty)
- `Tokens` — Empty, non-nil slice

Token Types
Every token emitted by the lexer is classified into exactly one token type. The classification is determined by the character context at the point of consumption, combined with the ArchitectureProfile lookup tables.
| Type | Constant | Description |
|---|---|---|
| Whitespace | TokenWhitespace | Spaces, tabs, \r, \n. Never emitted. |
| Comment | TokenComment | ; to end of line. Never emitted. |
| Directive | TokenDirective | %-prefixed word (e.g. %define, %include) |
| Instruction | TokenInstruction | Known mnemonic from profile (e.g. mov, add, syscall) |
| Register | TokenRegister | Known register from profile (e.g. rax, x0, a0) |
| Immediate | TokenImmediate | Decimal (42) or hex (0xFF) numeric literal |
| String | TokenString | "…" delimited literal (quotes not stored) |
| Keyword | TokenKeyword | Reserved keyword from profile (e.g. namespace) |
| Identifier | TokenIdentifier | Other words, labels (_start:), or single punctuation |
Note:
Whitespace and comments are consumed but never emitted — the token stream contains only semantically meaningful tokens.
Word Classification
Words (contiguous sequences of letters, digits, underscores, and dots) are classified using the ArchitectureProfile via case-insensitive lookup. The original casing is preserved in the literal.
Classification Priority:
1. Context Check — If previous token is `TokenKeyword`, classify as `TokenIdentifier` regardless of lookup (prevents `namespace mov` from classifying `mov` as an instruction)
2. Label Check — If `:` follows the word, append it and classify as `TokenIdentifier` (e.g. `_start:`)
3. Register Lookup — If lower-cased word matches `profile.Registers()` → `TokenRegister`
4. Instruction Lookup — If lower-cased word matches `profile.Instructions()` → `TokenInstruction`
5. Keyword Lookup — If lower-cased word matches `profile.Keywords()` → `TokenKeyword`
6. Fallback — Otherwise → `TokenIdentifier`
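The priority order can be sketched as a single function. `classifyWord` and its boolean context flags are illustrative stand-ins for the lexer's internal state, not its real API.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyWord applies the six-step priority order. The profile maps
// hold lower-case entries, so lookup is case-insensitive while the
// original casing survives in the token literal.
func classifyWord(word string, prevWasKeyword, isLabel bool,
	registers, instructions, keywords map[string]bool) string {
	lower := strings.ToLower(word)
	switch {
	case prevWasKeyword: // 1. e.g. the word right after "namespace"
		return "Identifier"
	case isLabel: // 2. word followed by ':'
		return "Identifier"
	case registers[lower]: // 3.
		return "Register"
	case instructions[lower]: // 4.
		return "Instruction"
	case keywords[lower]: // 5.
		return "Keyword"
	default: // 6.
		return "Identifier"
	}
}

func main() {
	regs := map[string]bool{"rax": true}
	instrs := map[string]bool{"mov": true}
	kws := map[string]bool{"namespace": true}
	fmt.Println(classifyWord("MOV", false, false, regs, instrs, kws))    // case-insensitive
	fmt.Println(classifyWord("mov", true, false, regs, instrs, kws))     // after a keyword
	fmt.Println(classifyWord("_start", false, true, regs, instrs, kws))  // label
}
```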
```asm
; Input source
namespace myModule ; "namespace" → Keyword, "myModule" → Identifier
mov rax, rbx       ; "mov" → Instruction, "rax"/"rbx" → Register
_start:            ; "_start:" → Identifier (label)
add eax, 42        ; "add" → Instruction, "eax" → Register, "42" → Immediate
```

Tokenization Process (Start)
Start() performs a single-pass, left-to-right scan of the input and returns an ordered slice of tokens. It is the sole public method that drives tokenization and is guaranteed to be infallible.
Guarantees:
- Single pass — no backtracking or multi-pass scanning
- Infallible — cannot fail or panic on any input
- Complete coverage — every character is handled by exactly one branch
- Graceful termination — stops when `Ch` equals `0` (NUL)
- Accurate positions — each token carries correct `Line` and `Column` values
Important:
Start() may be called only once per Lexer instance. Calling it again would re-scan from the exhausted position and return an empty slice.
Token Structure
Each token is a value type carrying four fields. Tokens are safe to copy, compare, and store without aliasing concerns.
Token Fields:

```go
type Token struct {
	Type    TokenType // Classification (TokenRegister, TokenInstruction, etc.)
	Literal string    // Verbatim text from source (without delimiters for strings)
	Line    int       // 1-based line number where token starts
	Column  int       // 1-based column number where token starts
}
```

TokenType Methods:
`Identifier()`, `Directive()`, `Instruction()`, `Register()`, `Immediate()`, `StringLiteral()`, `Whitespace()`, `Comment()`

x86_64 Register Set
The x86_64 profile includes the following register names. All entries are lower-case in the lookup table; classification is case-insensitive.
64-bit General Purpose:
`rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8–r15`

32-bit General Purpose:
`eax, ebx, ecx, edx, esi, edi, ebp, esp, r8d–r15d`

16-bit Registers:
`ax, bx, cx, dx, si, di, bp, sp`

8-bit Registers:
`al, bl, cl, dl, ah, bh, ch, dh, sil, dil, bpl, spl`

Segment Registers:
`cs, ds, es, fs, gs, ss`

Special Registers:
`rip, eip, rflags, eflags`

x86_64 Instruction Categories
The x86_64 profile includes a comprehensive set of instruction mnemonics organized by category.
Data Transfer:
`mov, movzx, movsx, lea, push, pop, xchg`

Arithmetic:
`add, sub, mul, imul, div, idiv, inc, dec, neg`

Bitwise / Shift:
`and, or, xor, not, shl, shr, sal, sar, rol, ror`

Comparison:
`cmp, test`

Control Flow:
`jmp, je, jne, jz, jnz, jg, jge, jl, jle, ja, jae, jb, jbe, call, ret, syscall, int`

System / Misc:
`nop, hlt, cli, sti, loop, loope, loopne`

Architecture & File Layout
The lexer is split across multiple files within v0/internal/kasm. Each file owns a single concern, ensuring modifications to one area don't affect others.
| File | Responsibility |
|---|---|
| lexer.go | Lexer struct, LexerNew, Start, scanning methods |
| token.go | Token struct definition |
| token_types.go | TokenType enum and convenience methods |
| architecture_profile.go | ArchitectureProfile interface, defaultKeywords() |
| profile_x86_64.go | NewX8664Profile() — concrete x86_64 profile |
Adding a New Architecture:
Adding support for a new architecture requires only two steps: (1) Create a profile file implementing ArchitectureProfile, and (2) Wire the profile in the orchestrator. No changes to the lexer core are required.
Parsing (Syntax Analysis)
The parser transforms an ordered sequence of Token values (produced by the lexer) into a structured Abstract Syntax Tree (AST). Each AST node represents a syntactic construct in the .kasm language — an instruction with its operands, a label declaration, a namespace block, a use import, or a directive. The parser sits between the lexer and the semantic analyser / code-generation stages in the assembly pipeline.
Architecture-Agnostic Design
The parser is architecture-agnostic: it does not validate instruction mnemonics, register names, or operand counts. It recognises the shape of constructs (e.g. "instruction followed by operands separated by commas") but defers validation to a later semantic-analysis pass. Because the parser operates on token types — not literal values — the same parser handles any architecture for which a lexer profile exists.
```text
lexer output ([]Token)
          │
          ▼
┌──────────────────────────────────────────────────────────┐
│                          Parser                          │
│  ParserNew(tokens) → Parse() → (*Program, []ParseError)  │
└──────────────────────┬───────────────────────────────────┘
                       │ AST + diagnostics
                       ▼
        semantic analysis / code generation
```

Parser Construction
A Parser represents a ready-to-parse consumer of a token slice. If a Parser value exists, it is guaranteed to hold a valid token slice and initialised position state. There is no uninitialised or partially-constructed state.
Parser Struct:

```go
type Parser struct {
	Position int          // Current index into the Tokens slice
	Tokens   []Token      // The input token slice from the lexer
	errors   []ParseError // Accumulated parse errors
}
```

Construction Requirements:
- `ParserNew(tokens)` is the sole constructor — accepts the `[]Token` slice from `Lexer.Start()`
- Infallible — cannot fail. An empty slice produces an empty Program; nil is treated as empty
- `Position` starts at `0`, pointing to the first token
- Token slice stored by reference — parser does not copy or modify tokens
```go
// Lexer produces token slice
tokens := lexer.Start()

// Parser consumes the token slice
parser := ParserNew(tokens)

// Parse returns AST and any errors
program, errors := parser.Parse()
```

Parsing Process (Parse)
Parse() performs a single left-to-right pass over the token slice and returns a *Program AST and a slice of ParseError values. It is the sole public method that drives parsing.
Parse() Guarantees:
- Complete consumption — consumes the entire token slice, stopping when Position reaches the end
- Progress guarantee — each loop branch consumes at least one token (no infinite loops)
- Partial results — returns all successfully parsed nodes even when errors occur
- Source positions — each error carries `Line` and `Column` from the originating token
- Single use — may be called only once per Parser instance
Error Handling:
The parser does not abort on the first error — it continues parsing to report as many issues as possible. If no errors occurred, the error slice is empty (not nil).
Token Consumption Helpers
The parser advances through the token slice using a set of helper methods. All advancement goes through these helpers — bounds-checking is centralised and out-of-bounds access is impossible.
| Method | Behavior |
|---|---|
| current() | Returns token at Position, or sentinel zero-value Token if at/past end |
| peek() | Returns token at Position + 1 without advancing; sentinel if no next token |
| advance() | Increments Position by one, returns token at previous position; sentinel if at end |
| expect(tokenType) | If current matches, consume and return; otherwise record ParseError (no advance) |
| isAtEnd() | Returns true when Position is at/past token slice length |
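The centralised bounds-checking can be sketched as below. `Token` is simplified to its literal and `expect` is omitted for brevity; this is illustrative, not the parser's actual file.

```go
package main

import "fmt"

// Token is stripped down to its Literal for this sketch.
type Token struct{ Literal string }

type Parser struct {
	Position int
	Tokens   []Token
}

// current returns the token at Position, or a zero-value sentinel
// when the slice is exhausted, so callers never index out of bounds.
func (p *Parser) current() Token {
	if p.isAtEnd() {
		return Token{}
	}
	return p.Tokens[p.Position]
}

// peek looks one token ahead without advancing.
func (p *Parser) peek() Token {
	if p.Position+1 >= len(p.Tokens) {
		return Token{}
	}
	return p.Tokens[p.Position+1]
}

// advance consumes and returns the current token; at the end it
// returns the sentinel and leaves Position untouched.
func (p *Parser) advance() Token {
	tok := p.current()
	if !p.isAtEnd() {
		p.Position++
	}
	return tok
}

func (p *Parser) isAtEnd() bool { return p.Position >= len(p.Tokens) }

func main() {
	p := &Parser{Tokens: []Token{{"mov"}, {"rax"}}}
	fmt.Println(p.current().Literal, p.peek().Literal)
	p.advance()
	p.advance()
	fmt.Println(p.isAtEnd(), p.current().Literal == "") // exhausted: sentinel
}
```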
AST Node Types
Every construct in the .kasm language maps to exactly one AST node type. The parser produces a flat list of top-level statements inside a Program. Because .kasm is a line-oriented assembly language, there is no nested expression tree — operands are leaves, not recursive sub-expressions.
Statement Types:

```go
type Program struct {
	Statements []Statement // Ordered slice in source order
}

// Statement kinds:
type InstructionStmt struct {
	Mnemonic     string    // Instruction name (e.g., "mov", "add")
	Operands     []Operand // Zero or more operands
	Line, Column int       // Source position
}

type LabelStmt struct {
	Name         string // Label name WITHOUT trailing ":"
	Line, Column int    // Source position
}

type NamespaceStmt struct {
	Name         string // Namespace identifier
	Line, Column int    // Source position
}

type UseStmt struct {
	ModuleName   string // Module to import
	Line, Column int    // Source position
}

type DirectiveStmt struct {
	Literal      string  // Full directive including "%" prefix
	Args         []Token // Argument tokens
	Line, Column int     // Source position
}
```

| Statement Kind | Description | Example |
|---|---|---|
| InstructionStmt | An instruction mnemonic followed by zero or more operands | mov rax, 1 |
| LabelStmt | A label declaration (identifier ending in :) | _start: |
| NamespaceStmt | A namespace keyword followed by a name | namespace myModule |
| UseStmt | A use instruction followed by a module name | use stdio |
| DirectiveStmt | A pre-processor directive that survived into the token stream | %section .text |
Operand Types
An Operand represents a single argument to an instruction. Operands are not recursive — there are no sub-expressions. Each operand is one of the following kinds:
Operand Types:

```go
type RegisterOperand struct {
	Name         string // Register name (original casing preserved)
	Line, Column int
}

type ImmediateOperand struct {
	Value        string // Numeric literal as string ("42", "0xFF")
	Line, Column int
}

type IdentifierOperand struct {
	Name         string // Symbolic reference (label name, data symbol)
	Line, Column int
}

type StringOperand struct {
	Value        string // String content (delimiters already stripped)
	Line, Column int
}

type MemoryOperand struct {
	Components   []MemoryComponent // Base, displacement, index, operators
	Line, Column int
}
```

| Operand Kind | Token Type | Examples |
|---|---|---|
| RegisterOperand | TokenRegister | rax, r8, eax |
| ImmediateOperand | TokenImmediate | 42, 0xFF, 0b1010 |
| IdentifierOperand | TokenIdentifier | label, msg, data_ptr |
| StringOperand | TokenString | "Hello", "World\n" |
| MemoryOperand | composite | [rbp], [rax + 8], [rbx + rcx*4] |
Memory Operand Parsing:
Memory operands are enclosed in [ and ]. The parser consumes the opening bracket, collects inner tokens (base register, optional displacement, optional index), and consumes the closing bracket. An unterminated [ produces a ParseError.
Statement Dispatch
The main parsing loop inspects the current token's type to determine which parsing method to invoke. Because each token type maps to at most one statement kind, dispatch is a simple switch — there is no ambiguity.
Dispatch Rules:
1. `TokenInstruction` → parse as `InstructionStmt` (or `UseStmt` if literal is `use`)
2. `TokenIdentifier` with trailing `:` → parse as `LabelStmt`
3. `TokenIdentifier` without `:` → operand outside instruction context = parse error, recover
4. `TokenKeyword` → dispatch by literal: `namespace` → `NamespaceStmt`; unknown → error
5. `TokenDirective` → parse as `DirectiveStmt`
6. `TokenRegister`, `TokenImmediate`, `TokenString` outside instruction → parse error (operand without instruction)
7. Any other token at top level → parse error, advance past token
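The dispatch switch can be sketched with string stand-ins for token types and statement kinds. These strings are illustrative; the real parser returns AST nodes, not names.

```go
package main

import (
	"fmt"
	"strings"
)

// dispatch names the parsing path the rules above would select for
// a token of the given type and literal.
func dispatch(tokenType, literal string) string {
	switch tokenType {
	case "Instruction":
		if strings.EqualFold(literal, "use") { // rule 1, use special case
			return "UseStmt"
		}
		return "InstructionStmt"
	case "Identifier":
		if strings.HasSuffix(literal, ":") { // rule 2
			return "LabelStmt"
		}
		return "error: operand without instruction" // rule 3
	case "Keyword":
		if literal == "namespace" { // rule 4
			return "NamespaceStmt"
		}
		return "error: unknown keyword"
	case "Directive": // rule 5
		return "DirectiveStmt"
	case "Register", "Immediate", "String": // rule 6
		return "error: operand without instruction"
	default: // rule 7
		return "error: unexpected token"
	}
}

func main() {
	fmt.Println(dispatch("Instruction", "mov"))
	fmt.Println(dispatch("Identifier", "_start:"))
	fmt.Println(dispatch("Keyword", "namespace"))
}
```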
Instruction Parsing
When the parser encounters a TokenInstruction, it collects the instruction's operands. Operands are separated by , tokens — the comma is consumed but not stored.
Instruction Parsing Rules:
- Consume instruction token, record literal as mnemonic
- Consume zero or more operands separated by commas
- Parser accepts any number of operands — operand-count validation is a semantic concern
- If literal (case-insensitive) is `use`, delegate to `UseStmt` parsing
```asm
; Zero operands
ret            ; InstructionStmt { Mnemonic: "ret", Operands: [] }
syscall        ; InstructionStmt { Mnemonic: "syscall", Operands: [] }
nop            ; InstructionStmt { Mnemonic: "nop", Operands: [] }

; One operand
push rax       ; InstructionStmt { Mnemonic: "push", Operands: [RegisterOperand{rax}] }
jmp _exit      ; InstructionStmt { Mnemonic: "jmp", Operands: [IdentifierOperand{_exit}] }

; Two operands
mov rax, 60    ; InstructionStmt { Mnemonic: "mov", Operands: [Register, Immediate] }
add rbx, [rsp] ; InstructionStmt { Mnemonic: "add", Operands: [Register, Memory] }
```

Error Handling and Recovery
The parser must be resilient. A syntax error in one statement must not prevent parsing of subsequent statements. Because .kasm is line-oriented, recovery is straightforward — skip to the next statement boundary.
ParseError Structure:

```go
type ParseError struct {
	Message string // Human-readable error description
	Line    int    // 1-based line number
	Column  int    // 1-based column number
}
```

Error Handling Guarantees:
- No panics — malformed sequences, empty slices, and unexpected tokens are handled gracefully
- Error accumulation — multiple errors may be reported in a single Parse() call
- Recovery strategy — advance past tokens until a recognisable statement start is found
- Source order — errors are returned in the order they were encountered
Common Parse Errors:
- `namespace` without a following identifier
- `use` without a following module name
- Unclosed memory operand (`[` without matching `]`)
- Operand without preceding instruction
- Unknown token at top level
Architecture & File Layout
The parser lives in v0/kasm alongside the lexer and token definitions. Because the parser consumes Token and TokenType from the same package, no cross-package import is required for the core data types.
| File | Responsibility |
|---|---|
| parsing.go | Parser struct, ParserNew, Parse, parsing methods |
| ast.go | AST node types (Program, Statement, Operand, etc.) |
| parse_error.go | ParseError type definition |
Separation of Concerns:
- Parser does not import any architecture-specific package — operates on token types only
- AST nodes in `ast.go` are separate from parsing logic for reusability
- `ParseError` is a plain data struct, not an `error` interface
Semantic Analysis
The semantic analyser validates a *Program AST (produced by the parser) against the rules of the .kasm language and the target architecture. It detects errors that are syntactically legal but semantically invalid — unknown instructions, wrong operand counts, mismatched operand types, duplicate labels, unresolved symbol references, and namespace violations. The semantic analyser sits between the parser and the code-generation stage in the assembly pipeline.
Architecture-Aware Design
The semantic analyser is architecture-aware: it receives an architecture description (instruction groups with their variants) at construction time and uses it to validate instruction operands. Because the architecture description is injected, the same analyser logic handles any architecture for which instruction metadata exists.
```text
parser output (*Program AST)
          │
          ▼
┌──────────────────────────────────────────────────────────────────┐
│                        Semantic Analyser                         │
│  AnalyserNew(program, instructions) → Analyse() → []SemanticError│
│                                                                  │
│   ┌─────────────────────────────┐                                │
│   │ Instruction metadata        │ ← injected at construction     │
│   │ (groups, variants, operand  │                                │
│   │  types)                     │                                │
│   └─────────────────────────────┘                                │
└──────────────────────┬───────────────────────────────────────────┘
                       │ validated AST + diagnostics
                       ▼
                code generation
```

Analyser Construction
An Analyser represents a ready-to-validate consumer of a *Program AST. If an Analyser value exists, it is guaranteed to hold a valid program reference and initialised internal state.
Analyser Struct:

```go
type Analyser struct {
	program      *Program                 // The AST to analyse
	instructions map[string]Instruction   // Instruction lookup (upper-case keys)
	labels       map[string]labelDecl     // Label name → declaration location
	namespaces   map[string]namespaceDecl // Namespace name → declaration location
	modules      map[string]useDecl       // Module name → import location
	errors       []SemanticError          // Accumulated semantic errors
}

// Helper types for tracking declarations
type labelDecl struct {
	Name         string
	Line, Column int
}

type namespaceDecl struct {
	Name         string
	Line, Column int
}

type useDecl struct {
	Name         string
	Line, Column int
}
```

Construction Requirements:
- `AnalyserNew(program, instructions)` is the sole constructor
- Infallible — cannot fail. An empty program produces zero errors; nil is treated as empty
- Instruction table must provide O(1) lookup via `map[string]Instruction`
- Internal tables (labels, namespaces, modules) initialised as empty during construction
```go
// Parser produces AST
program, parseErrors := parser.Parse()

// Build instruction table from architecture groups
instructions := buildInstructionTable(x86_64Groups)

// Analyser validates the AST
analyser := AnalyserNew(program, instructions)
semanticErrors := analyser.Analyse()
```

Analysis Process (Analyse)
Analyse() performs a single left-to-right pass over the Program.Statements slice and returns a []SemanticError slice. It is the sole public method that drives analysis.
Two-Phase Analysis:
- Phase 1: Collection — Gather all label declarations and namespace declarations into lookup tables so that forward references can be resolved
- Phase 2: Validation — Validate every statement against the collected tables and the instruction metadata
Analyse() Guarantees:
- Single pass per statement — visits every statement exactly once, in source order
- Read-only — does not modify the AST (inspects and records diagnostics only)
- Multi-error reporting — continues analysing to report as many issues as possible
- Forward reference support — `jmp label` before `label:` resolves correctly
- Single use — may be called only once per Analyser instance
Forward References:
Because .kasm allows forward references (e.g. jmp label before label: is declared), the collection phase must complete before the validation phase begins.
Instruction Validation
When the analyser encounters an InstructionStmt, it must validate the mnemonic and its operands against the architecture's instruction metadata.
Mnemonic Validation:
- Look up mnemonic (case-insensitive) in the instruction table
- If not found: `"unknown instruction '<mnemonic>'"`

Operand Count Validation:
- If instruction has variants, check operand count matches at least one variant
- If no match: `"instruction '<mnemonic>' expects <n> operand(s), got <m>"`
- If no variants defined, skip count validation (allows partial metadata)
Operand Type Mapping:
| AST Node Kind | Semantic Type |
|---|---|
| RegisterOperand | "register" |
| ImmediateOperand | "immediate" |
| MemoryOperand | "memory" |
| IdentifierOperand | "identifier" (compatible with "relative", "far") |
| StringOperand | "string" |
Operand Type Validation:
- Use `Instruction.FindVariant(operandTypes...)` to match a variant
- If no match: `"no variant of '<mnemonic>' accepts operands (<type1>, <type2>, ...)"`
```asm
; Valid - matches variant (register, immediate)
mov rax, 60    ; ✓ Found: MOV r64, imm32

; Invalid - unknown instruction
xyz rax, rbx   ; ✗ Error: "unknown instruction 'xyz'"

; Invalid - wrong operand count
push rax, rbx  ; ✗ Error: "instruction 'push' expects 1 operand(s), got 2"

; Invalid - wrong operand types
mov 42, rax    ; ✗ Error: "no variant of 'mov' accepts operands (immediate, register)"
```

Label Validation
Labels are declaration-site identifiers. The analyser must ensure they are unique within their scope and that all references can be resolved.
Duplicate Label Detection:
- Maintain label table (map of name → location)
- First declaration accepted
- Second+ produces error with original location
Undefined Reference Detection:
- Check every `IdentifierOperand`
- Run after all labels are collected (phase 2)
- Non-instruction identifiers are not checked
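Duplicate detection can be sketched with a hypothetical `collectLabels` helper over `labelDecl` entries; the real analyser does this while walking the AST in phase 1.

```go
package main

import "fmt"

// labelDecl mirrors the helper type from the Analyser struct.
type labelDecl struct {
	Name         string
	Line, Column int
}

// collectLabels builds the label table: the first declaration wins,
// later ones report the original location.
func collectLabels(decls []labelDecl) (map[string]labelDecl, []string) {
	labels := map[string]labelDecl{}
	var errs []string
	for _, d := range decls {
		if prev, ok := labels[d.Name]; ok {
			errs = append(errs, fmt.Sprintf(
				"duplicate label '%s', previously declared at %d:%d",
				d.Name, prev.Line, prev.Column))
			continue
		}
		labels[d.Name] = d
	}
	return labels, errs
}

func main() {
	_, errs := collectLabels([]labelDecl{
		{"_start", 4, 1}, {"_exit", 7, 1}, {"_start", 6, 1},
	})
	for _, e := range errs {
		fmt.Println(e)
	}
}
```

Undefined-reference checking then becomes a lookup of every `IdentifierOperand` against the returned table.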
```asm
; Forward reference - valid
jmp _exit     ; ✓ Resolved in phase 2

_start:       ; ✓ First declaration
mov rax, 60
_start:       ; ✗ Error: "duplicate label '_start', previously declared at 4:1"
nop

_exit:        ; ✓ Resolves the forward reference
syscall

jmp undefined ; ✗ Error: "undefined reference to 'undefined'"
```

Namespace Validation
Namespaces group related code under a name. The analyser validates namespace declarations for uniqueness.
Validation Rules:
- Record namespace name when NamespaceStmt encountered
- If duplicate: "duplicate namespace '<name>', previously declared at <line>:<column>"
- Name must be valid identifier (non-empty, doesn't start with digit)
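Duplicate detection here is the same record-once pattern used for labels and (below) for use statements. A small sketch, with hypothetical helper names (`recordOnce`, `validIdent`, `Loc`):

```go
package main

import "fmt"

// Loc records a declaration site for error reporting.
type Loc struct{ Line, Col int }

// recordOnce inserts name into seen and returns an error message when
// the name was already recorded. The same pattern serves namespace and
// use-statement duplicate detection; only the kind word differs.
func recordOnce(seen map[string]Loc, kind, name string, at Loc) string {
	if prev, ok := seen[name]; ok {
		return fmt.Sprintf("duplicate %s '%s', previously declared at %d:%d",
			kind, name, prev.Line, prev.Col)
	}
	seen[name] = at
	return ""
}

// validIdent reports whether a name is a valid identifier:
// non-empty and not starting with a digit.
func validIdent(name string) bool {
	return name != "" && !(name[0] >= '0' && name[0] <= '9')
}

func main() {
	seen := map[string]Loc{}
	recordOnce(seen, "namespace", "kernel", Loc{1, 1}) // first: accepted
	fmt.Println(recordOnce(seen, "namespace", "kernel", Loc{9, 1}))
	fmt.Println(validIdent("9lives")) // false
}
```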
Future Extension:
Future versions may introduce namespace-scoped label resolution (e.g. namespace.label). The namespace table is preserved for downstream stages that implement scoped resolution.
Use Statement Validation
use imports a module by name. The analyser validates the module reference and detects duplicates.
Validation Rules:
- Record module name when UseStmt encountered
- If duplicate: "duplicate use of module '<name>', previously imported at <line>:<column>"
- Module name must be valid identifier (non-empty)
Note:
Module resolution (locating the module's source file or compiled artefact) is not the analyser's responsibility. The analyser validates the statement and records it — a later linker or module resolver consumes the information.
Directive Validation
Directives that survive into the AST (not consumed by the pre-processor) are captured as DirectiveStmt nodes.
Validation Rules:
- If directive not recognised: "unrecognised directive '<literal>'"
- Pre-processor consumes: %include, %macro, %endmacro, %define, %ifdef, %ifndef, %else, %endif
- Any surviving directive is either undefined or a user error
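The check reduces to set membership against the directives listed above. A minimal sketch (the `checkDirective` helper and `recognised` variable are hypothetical names):

```go
package main

import "fmt"

// recognised lists the directives the pre-processor consumes; none of
// them should survive into the AST, so any DirectiveStmt whose literal
// is outside this set is reported as unrecognised.
var recognised = map[string]bool{
	"%include": true, "%macro": true, "%endmacro": true, "%define": true,
	"%ifdef": true, "%ifndef": true, "%else": true, "%endif": true,
}

// checkDirective returns an error message for an unrecognised
// directive literal, or "" when the literal is known.
func checkDirective(literal string) string {
	if !recognised[literal] {
		return fmt.Sprintf("unrecognised directive '%s'", literal)
	}
	return ""
}

func main() {
	fmt.Println(checkDirective("%sectoin"))
}
```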
Future Directives:
Future language-level directives (e.g. %section, %align) will be recognised and validated with their arguments.
Immediate Value Validation
ImmediateOperand values are stored as verbatim strings by the parser. The analyser validates they represent legal numeric values.
Validation Rules:
- Decimal: one or more digits (0–9)
- Hexadecimal: 0x or 0X followed by hex digits
- If invalid: "invalid immediate value '<value>'"
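Both forms fit a single pattern. A sketch using Go's regexp package (the `checkImmediate` helper is hypothetical; the accepted grammar is exactly the two rules above):

```go
package main

import (
	"fmt"
	"regexp"
)

// immediatePattern accepts decimal (one or more digits) or
// hexadecimal (0x/0X followed by at least one hex digit) literals.
var immediatePattern = regexp.MustCompile(`^([0-9]+|0[xX][0-9a-fA-F]+)$`)

// checkImmediate returns an error message for a malformed immediate
// value, or "" when the verbatim string is a legal numeric literal.
func checkImmediate(value string) string {
	if !immediatePattern.MatchString(value) {
		return fmt.Sprintf("invalid immediate value '%s'", value)
	}
	return ""
}

func main() {
	fmt.Println(checkImmediate("60") == "")   // true
	fmt.Println(checkImmediate("0xFF") == "") // true
	fmt.Println(checkImmediate("0x") == "")   // false: no hex digits
	fmt.Println(checkImmediate("6g") == "")   // false
}
```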
Overflow Detection:
Overflow detection is optional in the initial implementation. When implemented, the analyser will warn (not error) when an immediate exceeds the maximum value for the instruction's operand size.
Memory Operand Validation
MemoryOperand nodes contain a Components slice of raw tokens. The analyser validates the structure of the memory reference.
Validation Rules:
- Non-empty: must contain at least one component
- Base must be register or identifier: first non-operator component cannot be immediate
- Valid operators only: only + and - allowed
- Displacement: registers, immediates, or identifiers after operators
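The structural rules can be sketched over the Components slice like this (the `Component` shape and `checkMemoryOperand` helper are illustrative; the base check is simplified to the first component):

```go
package main

import "fmt"

// Component is a raw token inside a memory operand's brackets.
type Component struct {
	Kind  string // "register", "immediate", "identifier", "operator"
	Value string
}

// checkMemoryOperand applies the structural rules: non-empty,
// register/identifier base, and only + and - as operators.
// It returns "" when the operand is well-formed.
func checkMemoryOperand(comps []Component) string {
	if len(comps) == 0 {
		return "empty memory operand"
	}
	if comps[0].Kind == "immediate" {
		return "memory operand base must be a register or identifier, got immediate"
	}
	for _, c := range comps {
		if c.Kind == "operator" && c.Value != "+" && c.Value != "-" {
			return fmt.Sprintf("invalid operator '%s' in memory operand", c.Value)
		}
	}
	return ""
}

func main() {
	fmt.Println(checkMemoryOperand(nil))
	fmt.Println(checkMemoryOperand([]Component{{"immediate", "42"}}))
	fmt.Println(checkMemoryOperand([]Component{
		{"register", "rbp"}, {"operator", "*"}, {"immediate", "2"},
	}))
	fmt.Println(checkMemoryOperand([]Component{
		{"register", "rbp"}, {"operator", "+"}, {"immediate", "8"},
	}) == "") // true
}
```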
```asm
; Valid memory operands
mov rax, [rbp]       ; ✓ Register base
mov rax, [rbp + 8]   ; ✓ Register + immediate displacement
mov rax, [rsp - 16]  ; ✓ Register - immediate displacement
mov rax, [data_ptr]  ; ✓ Identifier base

; Invalid memory operands
mov rax, []          ; ✗ Error: "empty memory operand"
mov rax, [42]        ; ✗ Error: "memory operand base must be a register or identifier, got immediate"
mov rax, [rbp * 2]   ; ✗ Error: "invalid operator '*' in memory operand"
```
Validation Summary
The following table summarises all validation checks performed by the semantic analyser.
| Check | Statement | Error Condition |
|---|---|---|
| Unknown instruction | InstructionStmt | Mnemonic not in table |
| Operand count mismatch | InstructionStmt | No variant matches count |
| Operand type mismatch | InstructionStmt | No variant matches types |
| Duplicate label | LabelStmt | Name already declared |
| Undefined reference | InstructionStmt | Identifier not in label table |
| Duplicate namespace | NamespaceStmt | Name already declared |
| Duplicate use | UseStmt | Module already imported |
| Unrecognised directive | DirectiveStmt | Literal not in recognised set |
| Invalid immediate | InstructionStmt | Cannot parse as number |
| Empty memory operand | InstructionStmt | Components slice empty |
| Invalid memory base | InstructionStmt | First component is immediate |
| Invalid memory operator | InstructionStmt | Operator not + or - |
Architecture & File Layout
The semantic analyser lives in v0/kasm alongside the parser, lexer, and AST definitions.
| File | Responsibility |
|---|---|
| semantic.go | Analyser struct, AnalyserNew, Analyse, validation methods |
| semantic_error.go | SemanticError type definition |
Dependencies:
- Imports v0/architecture for Instruction and InstructionVariant types
- Does not import architecture-specific packages; the instruction table is supplied via the constructor
- SemanticError is a plain data struct (like ParseError)
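Taken together, a plausible shape for the analyser follows. This is a speculative sketch built only from the names listed above (Analyser, AnalyserNew, SemanticError); the field layout and the InstructionTable alias are assumptions, and the real Instruction type lives in v0/architecture:

```go
package main

import "fmt"

// Instruction stands in for the v0/architecture type; only the
// mnemonic matters for this sketch.
type Instruction struct{ Mnemonic string }

// InstructionTable is the architecture-provided lookup injected via
// the constructor, keeping this package free of architecture-specific
// imports.
type InstructionTable map[string]*Instruction

// SemanticError is a plain data struct, like ParseError.
type SemanticError struct {
	Message string
	Line    int
	Column  int
}

// Analyser walks the AST and collects semantic errors.
type Analyser struct {
	instructions InstructionTable
	errors       []SemanticError
}

// AnalyserNew constructs an analyser around a pre-built table.
func AnalyserNew(table InstructionTable) *Analyser {
	return &Analyser{instructions: table}
}

func main() {
	a := AnalyserNew(InstructionTable{"mov": {Mnemonic: "mov"}})
	fmt.Println(len(a.errors)) // 0
}
```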
Development Roadmap
The Keurnel Assembler is under active development. The phases up to and including semantic analysis are implemented; code generation and linking are planned and are being developed iteratively to ensure a robust and efficient assembly pipeline.