ParserGen 2.0

Goal

Parsing
- Explicitly declare the boundary of ambiguity resolution (e.g. on EXPR or on STAT)
- Full CFG power, no limitation
  - Experiment: expand all left-recursive grammars into right-recursive grammars with instructions
  - Experiment: optionally inline all rules which don't generate parser functions
- Error message generation
- Error recovery
Serializing
- Escaping and Unescaping pairs (instead of only unescaping)
- Calculate ambiguous ToString cases
- Generate ToString algorithm
AST
- Low overhead AST with reflection
- Optionally create AST nodes from a pool
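
The "Escaping and Unescaping pairs" goal can be illustrated with a minimal sketch. It assumes a double-quoted string token with backslash escapes; the token format and the names Escape and Unescape are hypothetical, not part of ParserGen. The point is only that ToString needs Escape to be the inverse of Unescape.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: turn a raw value into a quoted token (used by ToString).
std::string Escape(const std::string& raw)
{
    std::string out = "\"";
    for (char c : raw)
    {
        switch (c)
        {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        default:   out += c;      break;
        }
    }
    out += '"';
    return out;
}

// Hypothetical: turn a quoted token back into its raw value (used by the parser).
std::string Unescape(const std::string& token)
{
    assert(token.size() >= 2 && token.front() == '"' && token.back() == '"');
    std::string out;
    // Skip the opening and closing quotes.
    for (std::size_t i = 1; i + 1 < token.size(); ++i)
    {
        char c = token[i];
        if (c == '\\' && i + 2 < token.size())
        {
            c = token[++i];
            if (c == 'n') c = '\n';
        }
        out += c;
    }
    return out;
}
```

With such a pair, Unescape(Escape(x)) gives back x for any x, which is the property a generated ToString would rely on.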

AST Definition (compatible with Workflow)

class CLASS_NAME [: BASE_CLASS]
{
  var FIELD_NAME : TYPE;
  ...
}

Configurations

Include files in generated C++ header
AST definition files that this one depends on
Visitors selected to generate
Optional reflection support
- All AST constructors are protected
- Generated factory class
- If AST object pool is enabled
  - reflection is disabled
  - Ptr<T> for every AST type is generated with an enumerated Cast function.
  - Use generated RTTI constructions (e.g. enum class tag for type)
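
A minimal sketch of what tag-based RTTI with protected constructors and a factory might look like. All names (AstTag, AstNode, NumberExpr, BinaryExpr, AstFactory, CastTo) are hypothetical. Raw pointers stand in for Ptr<T>, and vectors of unique_ptr stand in for a real object pool.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical tag listing every concrete AST type (used instead of reflection).
enum class AstTag { NumberExpr, BinaryExpr };

class AstFactory;

struct AstNode
{
    const AstTag tag;
protected:
    explicit AstNode(AstTag t) : tag(t) {}
};

struct NumberExpr : AstNode
{
    static constexpr AstTag Tag = AstTag::NumberExpr;
    int value = 0;
protected:
    NumberExpr() : AstNode(Tag) {}   // only the factory may construct
    friend class AstFactory;
};

struct BinaryExpr : AstNode
{
    static constexpr AstTag Tag = AstTag::BinaryExpr;
    AstNode* left = nullptr;
    AstNode* right = nullptr;
protected:
    BinaryExpr() : AstNode(Tag) {}
    friend class AstFactory;
};

// Hypothetical factory; it owns every node it creates.
class AstFactory
{
    std::vector<std::unique_ptr<NumberExpr>> numbers_;
    std::vector<std::unique_ptr<BinaryExpr>> binaries_;
public:
    NumberExpr* CreateNumber(int value)
    {
        numbers_.emplace_back(new NumberExpr());
        numbers_.back()->value = value;
        return numbers_.back().get();
    }

    BinaryExpr* CreateBinary(AstNode* left, AstNode* right)
    {
        assert(left && right);
        binaries_.emplace_back(new BinaryExpr());
        binaries_.back()->left = left;
        binaries_.back()->right = right;
        return binaries_.back().get();
    }
};

// Cast by comparing the stored tag; returns nullptr on mismatch.
template <typename T>
T* CastTo(AstNode* node)
{
    return (node != nullptr && node->tag == T::Tag) ? static_cast<T*>(node) : nullptr;
}
```

The cast costs one enum comparison and needs no virtual table, which is one way to read "low overhead" here.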

Types

Token: In the previous version, Token was a value type; now it is a reference type.
CLASS-NAME: Another class.
TYPE[]: Array, whose element is not allowed to be another array.
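
As a rough sketch, the class syntax and the three field types could map onto C++ like this. The definitions in the comment, the use of std::shared_ptr for reference types, and every name below are assumptions for illustration, not actual generated output.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical input:
//   class Expr {}
//   class NumberExpr : Expr { var value : Token; }
//   class CallExpr : Expr { var name : Token; var arguments : Expr[]; }

struct Token { std::string text; };

struct NumberExpr;
struct CallExpr;

struct Expr
{
    // One Visit overload per concrete class.
    struct IVisitor
    {
        virtual ~IVisitor() = default;
        virtual void Visit(NumberExpr& node) = 0;
        virtual void Visit(CallExpr& node) = 0;
    };

    virtual ~Expr() = default;
    virtual void Accept(IVisitor& visitor) = 0;
};

struct NumberExpr : Expr
{
    std::shared_ptr<Token> value;                   // Token: reference type
    void Accept(IVisitor& visitor) override { visitor.Visit(*this); }
};

struct CallExpr : Expr
{
    std::shared_ptr<Token> name;
    std::vector<std::shared_ptr<Expr>> arguments;   // Expr[]: array of another class
    void Accept(IVisitor& visitor) override { visitor.Visit(*this); }
};

// Example visitor: counts all nodes in a tree.
struct NodeCounter : Expr::IVisitor
{
    int count = 0;
    void Visit(NumberExpr&) override { ++count; }
    void Visit(CallExpr& node) override
    {
        ++count;
        for (auto& argument : node.arguments)
        {
            assert(argument);
            argument->Accept(*this);
        }
    }
};
```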

MISC

Define a ToString algorithm with customizable configurations.

Lexical Analyzer

Pair names with regular expressions.
Extendable tokens.
- For example, recognize R"[^\s(]*\( and invoke a callback function to determine the end of the string
Pair a name with each token subset, and give a default name to the full token set
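
A minimal sketch of such a callback for C++ raw string literals. It assumes the regular expression has already matched the opening part (R", an optional delimiter, and the open parenthesis). The function name and its signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// prefixBegin/prefixLength describe the text matched by the regular expression.
// Returns the index one past the closing quote, or std::string::npos if the
// literal is not terminated.
std::size_t FindRawStringEnd(const std::string& input, std::size_t prefixBegin, std::size_t prefixLength)
{
    assert(prefixLength >= 3);                           // at least R"(
    assert(prefixBegin + prefixLength <= input.size());

    // The delimiter sits between R" and the open parenthesis.
    const std::string delimiter = input.substr(prefixBegin + 2, prefixLength - 3);
    const std::string terminator = ")" + delimiter + "\"";

    const std::size_t pos = input.find(terminator, prefixBegin + prefixLength);
    return pos == std::string::npos ? pos : pos + terminator.size();
}
```

The end of the token depends on the delimiter captured at the start, which a plain regular expression cannot express. That is why a callback is needed here.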

Error Messages

Generate error messages in C++ code

Syntax Analyzer

Priority of loops:
- +[ RULE ] means that if RULE succeeds, skipping RULE is not considered, even if the rest doesn't parse.
- -[ RULE ] means that the result with RULE is kept only if skipping RULE makes the clause unable to parse.
- [ RULE ] means keeping both results.
- +{ RULE }, -{ RULE } and { RULE } are similar, but { RULE } may generate more than two results, while the others generate only one.
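
The three optional priorities can be sketched as a selection between two branches. This is an illustration under assumptions: results are modeled as strings, an empty vector means the branch failed, and the names Priority and ResolveOptional are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Priority
{
    Prefer,   // +[ RULE ]
    Avoid,    // -[ RULE ]
    Equal     //  [ RULE ]
};

// ruleMatched: whether RULE itself parsed.
// withRule:    results of the whole clause when RULE is taken (empty if the rest fails).
// withoutRule: results of the whole clause when RULE is skipped (empty if that fails).
std::vector<std::string> ResolveOptional(
    Priority priority,
    bool ruleMatched,
    const std::vector<std::string>& withRule,
    const std::vector<std::string>& withoutRule)
{
    // The clause cannot succeed through RULE if RULE itself failed.
    assert(ruleMatched || withRule.empty());

    switch (priority)
    {
    case Priority::Prefer:
        // Once RULE succeeds, skipping is never considered, even if the rest fails.
        return ruleMatched ? withRule : withoutRule;
    case Priority::Avoid:
        // Taking RULE is a fallback used only when skipping cannot parse.
        return withoutRule.empty() ? withRule : withoutRule;
    case Priority::Equal:
    default:
    {
        // Keep everything; the caller has to deal with the ambiguity.
        std::vector<std::string> both = withRule;
        both.insert(both.end(), withoutRule.begin(), withoutRule.end());
        return both;
    }
    }
}
```
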
Being able to change the token subset during parsing.
Being able to specify an error message when a certain action fails.
Generate a SAX-like parser, with a default handler to create the AST.
- Generate a POOLED tuple struct type for each of
  - Loop body. A delimited list is considered as [ITEM {DELIMITER ITEM}]
    - Loop records a pointer to the reversed linked list of the last item during calculation
    - Loop records an array of items as the result
  - Alternative as Union<Ts...> storing {TYPE-FLAG, ITEM*} (value type)
  - Optional as Optional<T> storing {ITEM*} (value type)
  - Sequential as {A, B, C ...} with generated field names (value type)
    - Type is rule or rule fragment, not the result AST type
    - If there are multiple fields of the same type, an index of the position in the rule is appended (optionals, alternatives and loops are packed as one)
    - If a tuple is created directly from a rule, there will be a static field to indicate which rule it comes from
  - Rule reference as Reference<Ts...> storing {RULE, FRAGMENT, TYPE-FLAG, ITEM*}
    - Each Reference<Ts...> is aliased
    - Consider forward declarations
- All types have an un-templated partner so that the core SAX-like instruction execution doesn't need to know concrete types
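
The loop representation described above (a reversed linked list during calculation, an array as the result) can be sketched as follows. LoopNode, LoopPool and ToArray are hypothetical names, and std::deque is only a stand-in for a real pool allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// One loop item, pointing back to the item parsed before it.
template <typename T>
struct LoopNode
{
    T item;
    const LoopNode* previous;
};

template <typename T>
class LoopPool
{
    // std::deque never relocates existing elements on push_back,
    // so pointers to nodes stay valid.
    std::deque<LoopNode<T>> nodes_;
public:
    // Appends an item after `last` and returns the new last node.
    const LoopNode<T>* Append(const LoopNode<T>* last, T item)
    {
        nodes_.push_back(LoopNode<T>{std::move(item), last});
        return &nodes_.back();
    }
};

// Turns the reversed list into an array in source order.
template <typename T>
std::vector<T> ToArray(const LoopNode<T>* last)
{
    std::size_t count = 0;
    for (const LoopNode<T>* node = last; node != nullptr; node = node->previous) ++count;

    std::vector<T> result(count);
    for (const LoopNode<T>* node = last; node != nullptr; node = node->previous)
    {
        result[--count] = node->item;
    }
    assert(count == 0);
    return result;
}
```

One possible advantage of the reversed list: two ambiguous continuations can extend the same last node, so they share the common prefix instead of copying it.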

Supported EBNF

TOKEN [: PROPERTY-NAME]
RULE [: PROPERTY-NAME]
Optional:
- +[ EBNF ]
- -[ EBNF ]
- [ EBNF ]
Loop:
- +{ EBNF }
- -{ EBNF }
- { EBNF }
with{ PROPERTY-ASSIGNMENT ... }

EBNF Program

RULE {::= CLAUSE as CLASS-NAME} ;
- Consider a syntax here to switch token set

ToString Algorithm Requirements

Every clause should create an AST node. EXP ::= '(' !EXP ')' is not allowed, except that this clause has only one node.
Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
Loops cannot be embedded in another loop. This doesn't limit the syntax, but it limits the shape of the AST.

Development Project Structure

Original ParserGen code will be separated from Vlpp.
Development Steps
1. Symbols for ParserGen AST
2. Manually: symbols -> ParserGen AST declaration in C++
3. Manually: ParserGen Syntax described in ParserGen AST declaration in C++ -> ParserGen Parser declaration in C++
4. Integrate
TODO: Reorganize unit test projects into pure unit tests and code generation steps
- Code generation steps are also multiple projects
- Because some projects and their partner unit tests rely on generated code from the projects they depend on
AstGen:
- Goal: given symbols, generate C++ code for the AST
  - Produce (from unit test):
    - Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
- AST symbols and C++ code generation.
- Generate visitors.
- Generate easy builder.
- Generate reflection.
Execution:
- Goal: given instructions, parse input text with SAX-like callbacks
  - An instruction could generate multiple continuations
- Serialization for parser-generated automaton and instructions.
- Run the automaton on an input
  - Determine a path of state transitions from the input first, and then follow this path to execute instructions.
  - Introduce garbage-collectable memory allocation for paths; it could be ref counted.
- Execute instructions as a SAX-like parser, with notification of ambiguous nodes, error message generation and error recovery.
  - If there is ambiguity, different callbacks could be called on the same position, and results could be discarded in later execution.
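
A minimal sketch of the ref-counted path idea, assuming transitions are identified by non-negative integers. PathNode, Path, Extend and Transitions are hypothetical names, and std::shared_ptr is just one way to get reference counting.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// One step of a candidate path; candidates share their common prefix.
struct PathNode
{
    int transition;
    std::shared_ptr<const PathNode> previous;
};

using Path = std::shared_ptr<const PathNode>;

// Returns a new path that extends `path` by one transition.
Path Extend(const Path& path, int transition)
{
    assert(transition >= 0);
    return std::make_shared<PathNode>(PathNode{transition, path});
}

// Once a path is chosen, list its transitions in input order so that
// the instructions attached to them can be executed front to back.
std::vector<int> Transitions(const Path& path)
{
    std::vector<int> result;
    for (const PathNode* node = path.get(); node != nullptr; node = node->previous.get())
    {
        result.push_back(node->transition);
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Dropping the last handle to a failed candidate frees only the nodes that no surviving candidate still references.
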
Compiler -> AstGen, Execution
- Goal: take input described using Generated source code (AST part) and generate instructions (text parser)
  - Produce (from unit test)
    - Generated C++ source code (parser part) for ParserGen input
- Take the ParserGen AST declaration and generate instructions.
- Generate the default handler to create AST for the SAX-like parser.
- ToString algorithm.
- Bidirectional binding between the AST and the text.
ParserGen -> Compiler
- Goal: CLI Tool
- Integrate Generated source code (AST part)
- Integrate Generated source code (parser part)
- Handle command line arguments
UnitTestAst:
- Unit test of AstGen building block and pool allocation etc.
- Produces steps
  - Hand-written AST for ParserGen symbols.
  - Codegen symbols and get Generated source code (AST part) for ParserGen input
UnitTestExecution:
- Unit test of Execution.
- Assert directly on SAX-like parser.
UnitTestCompiler:
- Unit test of Compiler; inputs are all UnitTestExecution test cases rewritten using the generated easy builder for the ParserGen AST.
- Assert on the ToString-ed AST. (shared)
- Produces steps
  - Implement ParserGen AST Input syntax using the generated easy builder from Generated source code (AST part)
  - Serialize instructions and get C++ source code (parser part) for ParserGen input
UnitTestParserGen:
- Unit test of ParserGen; inputs are all UnitTestExecution test cases rewritten in text format.
- Assert on the ToString-ed AST. (shared)
- Generate C++ code for all parsers written in text format
UnitTest:
- Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
- Test all generated parsers in UnitTestParserGen.
- Assert on the ToString-ed AST. (shared)
Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
Do not really write a file if the generated content doesn't change.
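
A minimal sketch of this rule using only standard streams. The function name WriteFileIfChanged is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <iterator>
#include <stdexcept>
#include <string>

// Writes `content` to `path` unless the file already holds exactly that content.
// Returns true if the file was written, false if it was left untouched.
bool WriteFileIfChanged(const std::string& path, const std::string& content)
{
    assert(!path.empty());
    {
        std::ifstream existing(path, std::ios::binary);
        if (existing)
        {
            const std::string current(
                (std::istreambuf_iterator<char>(existing)),
                std::istreambuf_iterator<char>());
            if (current == content) return false;
        }
    }

    std::ofstream output(path, std::ios::binary | std::ios::trunc);
    if (!output) throw std::runtime_error("cannot open file for writing: " + path);
    output.write(content.data(), static_cast<std::streamsize>(content.size()));
    return true;
}
```

Skipping the write leaves the file's timestamp unchanged, so a build system has no reason to recompile projects that depend on the generated code. Presumably that is the motivation for this rule.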

Note

Pooled data structure (tree/linked list) for intermediate parsing result before turning into AST
Multiple ways of AST codegen (normal/reflection/pooled)
Imperative & concurrent parsing and AST building instructions, potential to generate parser in different languages
Serialized lexical table instead of storing regex