GacUI/.github/KnowledgeBase/manual/vlppparser2/lexer.md
2026-02-04 14:57:24 -08:00

Lexer Definition

A lexer definition consists of multiple tokens. Names of non-discarded tokens can be used in syntax definitions.

Single-line comments are allowed using **//**.

A token can be defined by:

TokenName           :REGEX

Spaces are allowed before **":"** but not after it: everything after **":"** on this line is considered the regular expression that defines the token.
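The splitting rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the actual VlppParser2 implementation; `parse_token_line` is a hypothetical helper, and it assumes the first **":"** on the line is the separator.

```python
def parse_token_line(line: str) -> tuple[str, str]:
    """Split 'TokenName   :REGEX' into (name, regex).

    Assumption: the first ':' on the line separates the name from the regex.
    """
    name, sep, regex = line.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in token definition: {line!r}")
    # Spaces before ':' are padding and are dropped from the name.
    # Everything after ':' is kept verbatim, including leading spaces.
    return name.strip(), regex


print(parse_token_line("ID      :[a-zA-Z_]/w*"))  # ('ID', '[a-zA-Z_]/w*')
print(parse_token_line("SPACE   : +"))            # ('SPACE', ' +')
```

The second call shows why a space after **":"** matters: read literally, it becomes the first character of the regular expression.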

A discarded token can be defined by:

discard TokenName   :REGEX

Such a token is recognized and then removed from the input. You can use it for comments or whitespace.
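To illustrate what "recognized and removed" means, here is a minimal tokenizer sketch in Python. It uses Python's `re` module as a stand-in for the Vlpp regex engine, the token table is hypothetical, and trying the tokens in table order is a simplification rather than a description of how the generated lexer chooses a match.

```python
import re

# Hypothetical token table: (name, Python regex, discarded?)
TOKENS = [
    ("NUM", r"\d+", False),
    ("ADD", r"\+", False),
    ("SPACE", r"\s+", True),         # like: discard SPACE   :/s+
    ("COMMENT", r"//[^\n]*", True),  # like: discard COMMENT :a line-comment regex
]


def tokenize(text):
    """Return (name, text) pairs, skipping tokens marked as discarded."""
    result = []
    pos = 0
    while pos < len(text):
        for name, pattern, discarded in TOKENS:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                # A discarded token still consumes input, but is not emitted.
                if not discarded:
                    result.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unrecognized input at position {pos}")
    return result


print(tokenize("1 + 22 // sum"))
# [('NUM', '1'), ('ADD', '+'), ('NUM', '22')]
```

The whitespace and the trailing comment are matched like any other token, so they do not cause an error, but they never reach the output.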

A token fragment can be defined by:

$FragmentName       :REGEX

This is not a token, but it can be used in other regular expressions using **{$FragmentName}**, which pastes the fragment at that position and avoids repeating the same pattern everywhere. Here is an example of defining integer, hex and binary literals in C++:

$CPP_NUMBER_POSTFIX  :[uU]|[lL]|[uU][lL]|[lL][uU]|[lL][lL]|[uU][lL][lL]|[lL][uU][lL]|[lL][lL][uU]|i8|ui8|i16|ui16|i32|ui32|i64|ui64
INT                  :(/d+('/d+)*)({$CPP_NUMBER_POSTFIX})?
HEX                  :0[xX][0-9a-fA-F]+({$CPP_NUMBER_POSTFIX})?
BIN                  :0[bB][01]+({$CPP_NUMBER_POSTFIX})?

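Token fragments are simply copy-pasted into the referencing expression, which is why **"()"** is needed around **{$CPP_NUMBER_POSTFIX}** here. Without the parentheses, the **|** alternatives in the fragment would split the whole expression, and the trailing **?** would bind only to the end of the last alternative.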
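The effect of the parentheses can be checked with a small Python sketch. It uses Python's `re` module as a stand-in for the Vlpp regex engine, a shortened hypothetical fragment `POSTFIX`, and a hypothetical `expand` helper that pastes fragments textually; the real expansion code may differ.

```python
import re

# Shortened, hypothetical fragment standing in for $CPP_NUMBER_POSTFIX
FRAGMENTS = {"POSTFIX": "[uU]|[lL]"}


def expand(pattern, fragments):
    """Paste each {$Name} reference textually, with no implicit grouping."""
    return re.sub(r"\{\$(\w+)\}", lambda m: fragments[m.group(1)], pattern)


with_parens = expand(r"0[xX][0-9a-fA-F]+({$POSTFIX})?", FRAGMENTS)
without_parens = expand(r"0[xX][0-9a-fA-F]+{$POSTFIX}?", FRAGMENTS)

print(with_parens)     # 0[xX][0-9a-fA-F]+([uU]|[lL])?
print(without_parens)  # 0[xX][0-9a-fA-F]+[uU]|[lL]?

# With parentheses the postfix is optional as a whole.
print(bool(re.fullmatch(with_parens, "0xFF")))     # True
print(bool(re.fullmatch(with_parens, "0xFFu")))    # True

# Without them, '|' splits the pattern into two unrelated alternatives.
print(bool(re.fullmatch(without_parens, "0xFF")))  # False
print(bool(re.fullmatch(without_parens, "L")))     # True
```

In the unparenthesized form, a plain `0xFF` is rejected while a lone `L` is accepted, which is clearly not the intended hex literal.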