diff --git a/TODO_ParserGen.md b/TODO_ParserGen.md
index 2be316c9..e78ebfcd 100644
--- a/TODO_ParserGen.md
+++ b/TODO_ParserGen.md
@@ -1,4 +1,6 @@
-# Goal
+# ParserGen 2.0
+
+## Goal
 
 * Parsing
   * Explicily declare the boundary of ambiguity resolving (e.g. on EXPR or on STAT)
@@ -15,7 +17,7 @@
   * Low overhead AST with reflection
   * Optional creating AST from a pool
 
-# AST Definition (compatible with Workflow)
+## AST Definition (compatible with Workflow)
 
 ```
 class CLASS_NAME [: BASE_CLASS]
@@ -25,128 +27,149 @@ class CLASS_NAME [: BASE_CLASS]
 }
 ```
 
-## Configurations
+### Configurations
 
-- Include files in generated C++ header
-- Depended AST definition files
-- Visitors selected to generate
-- Optional reflection support
-  - All AST constructors are protected
-  - Generated factory class
-  - If AST object pool is enabled
-    - reflection is disabled
-    - `Ptr` for all AST types are generated will enumerated `Cast` function.
-    - Use generated RTTI constructions (e.g. enum class tag for type)
+* Include files in generated C++ header
+* Depended AST definition files
+* Visitors selected to generate
+* Optional reflection support
+  * All AST constructors are protected
+  * Generated factory class
+  * If AST object pool is enabled
+    * reflection is disabled
+    * `Ptr` for all AST types are generated with enumerated `Cast` function.
+    * Use generated RTTI constructions (e.g. enum class tag for type)
 
-## Types
+### Types
 
-- `Token`: In the previous version, `Token` is a value type, now it is a reference type.
-- `CLASS-NAME`: Another class.
-- `TYPE[]`: Array, whose element is not allowed to be another array.
+* `Token`: In the previous version, `Token` is a value type, now it is a reference type.
+* `CLASS-NAME`: Another class.
+* `TYPE[]`: Array, whose element is not allowed to be another array.
 
-## MISC
+### MISC
 
-- Define a `ToString` algorithm with customizable configurations.
+* Define a `ToString` algorithm with customizable configurations.
 
-# Lexical Analyzer
+## Lexical Analyzer
 
-- Pair name with regular expressions.
-- Extendable tokens.
-  - For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
-- Pair a name with the token subset, and give a default name to a token full set
+* Pair name with regular expressions.
+* Extendable tokens.
+  * For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
+* Pair a name with the token subset, and give a default name to a token full set
 
-# Error Messages
+## Error Messages
 
-- Generate error messages in C++ code
+* Generate error messages in C++ code
 
-# Syntax Analyzer
+## Syntax Analyzer
 
-- Priority of loops:
-  - `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
-  - `-[ RULE ]` means only if skipping `RULE` makes the clause not able to parse, the result of having RULE is not discarded.
-  - `[ RULE ]` means keep both result
-  - `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, meanwhile others only generate one result.
-- Being able to change token subset during parsing.
-- Being able to specify a error message when a certain action fails.
-- Generate SAX-like parser, with a default handler to create AST.
+* Priority of loops:
+  * `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
+  * `-[ RULE ]` means only if skipping `RULE` makes the clause not able to parse, the result of having RULE is not discarded.
+  * `[ RULE ]` means keep both results
+  * `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others generate only one result.
+* Being able to change token subset during parsing.
+* Being able to specify an error message when a certain action fails.
+* Generate SAX-like parser, with a default handler to create AST.
+  * Generate each **POOLED** tuple struct type for
+    * Loop body. Delimited list is considered as [ITEM {DELIMITER ITEM}]
+      * Loop records a pointer to the reversed linked list of the last item during calculation
+      * Loop records an array of items as the result
+    * Alternative as `Union` storing `{TYPE-FLAG, ITEM*}` (value type)
+    * Optional as `Optional` storing `{ITEM*}` (value type)
+    * Sequential as `{A, B, C ...}` with generated field names (value type)
+      * Type is rule or rule fragment, not the result AST type
+      * If there are multiple fields of the same type, their names are appended with an index of the position in the rule (optionals, alternatives and loops are packed as one)
+      * If a tuple is created directly from a rule, there will be a static field to indicate which rule it comes from
+    * Rule reference as `Reference` storing `{RULE, FRAGMENT, TYPE-FLAG, ITEM*}`
+      * A `Reference` is aliased
+      * Consider forward declarations
+  * All types have an un-templated partner so that the core SAX-like instruction execution doesn't need to know concrete types
 
-## Supported EBNF
+### Supported EBNF
 
-- TOKEN [`:` PROPERTY-NAME]
-- RULE [`:` PROPERTY-NAME]
-- Optional:
-  - `+[` EBNF `]`
-  - `-[` EBNF `]`
-  - `[` EBNF `]`
-- Loop:
-  - `+{` EBNF `}`
-  - `-{` EBNF `}`
-  - `{` EBNF `}`
-- `with{` PROPERTY-ASSIGNMENT ... `}`
-
-## EBNF Program
+* TOKEN [`:` PROPERTY-NAME]
+* RULE [`:` PROPERTY-NAME]
+* Optional:
+  * `+[` EBNF `]`
+  * `-[` EBNF `]`
+  * `[` EBNF `]`
+* Loop:
+  * `+{` EBNF `}`
+  * `-{` EBNF `}`
+  * `{` EBNF `}`
+* `with{` PROPERTY-ASSIGNMENT ... `}`
 
-- RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
-  - Consider a syntax here to switch token set
+### EBNF Program
 
-## ToString Algorithm Requirements
-- Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
-- Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
-- Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limit the shape of AST.
+* RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
+  * Consider a syntax here to switch token set
 
-## Development Project Structure
-- Original ParserGen code will be separated from Vlpp.
-- **Development Steps**
+### ToString Algorithm Requirements
+
+* Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
+* Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
+* Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
+
+### Development Project Structure
+
+* Original ParserGen code will be separated from Vlpp.
+* **Development Steps**
   1. Symbols for ParserGen AST
-  2. Manually: symbols -> `ParserGen AST described in C++`
-  3. Manually: ParserGen Syntax described in `ParserGen AST described in C++` -> `ParserGen Parser described in C++`
+  2. Manually: symbols -> `ParserGen AST declaration in C++`
+  3. Manually: ParserGen Syntax described in `ParserGen AST declaration in C++` -> `ParserGen Parser declaration in C++`
   4. Integrate
-- **AstGen**:
-  - **Goal**: given symbols and generated C++ code for AST
-  - **Produce** (from unit test):
-    - Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
-  - AST symbols and C++ code generation.
-  - Generate visitors.
-  - Generate easy builder.
-  - Generate reflection.
-- **Execution**:
-  - **Goal**: given instructions and parse input text with SAX-like callback
-  - Parser-generated instructions serialization.
-  - Execute instructions as a SAX-like parser, with notification on ambiguous node, error message generation and error recovering.
-- **Compiler** -> **AstGen**, **Execution**
-  - **Goal**: input described using `Generated source code (AST part)` and generate instructions (text parser)
-  - **Produce** (from unit test)
-    - Generated C++ source code (parser part) for ParserGen input
-  - Take the `ParserGen AST declaration` and generate instructions.
-  - Generate the default handler to create AST for the SAX-like parser.
-  - ToString algorithm.
-  - Bidirection binding with AST the text.
-- **ParserGen** -> **Compiler**
-  - **Goal**: CLI Tool
-  - Integrate `Generated source code (AST part)`
-  - Integrate `Generated source code (parser part)`
-  - Handle command line arguments
-- **UnitTestAst**:
-  - Unit test of **AstGen** building block and pool allocation etc.
-  - **Produces** steps
-    - Hand-written `AST for ParserGen` symbols.
-    - Codegen symbols and get `Generated source code (AST part)` for ParserGen input
-- **UnitTestExecution**:
-  - Unit test of **Execution**.
-  - Assert directly on SAX-like parser.
-- **UnitTestCompiler**:
-  - Unit test of **Compiler**, input are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST.
-  - Assert on the ToString-ed AST. (shared)
-  - **Produces** steps
-    - Implementa ParserGen AST Input syntax using generated easy builder form `Generated source code (AST part)`
-    - Serialize instructions and get `C++ source code (parser part)` for ParserGen input
-- **UnitTestParserGen**:
-  - Unit test of **ParserGen**, input are all **UnitTestExecution** test cases rewritten in text format.
-  - Assert on the ToString-ed AST. (shared)
-  - Generate all parser in text format to C++ code
-- **UnitTest**:
-  - Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
-  - Test all generated parsers in **UnitTestParserGen**.
-  - Assert on the ToString-ed AST. (shared)
-- Since parser are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
-- Do not really write a file if the generated content doesn't change.
+* **TODO**: Reorganize unit test projects to pure unit test and code generation steps
+  * Code generation steps are also multiple projects
+    * Because there are projects and the partner unit test that rely on generated code from depended projects
+* **AstGen**:
+  * **Goal**: given symbols and generated C++ code for AST
+  * **Produce** (from unit test):
+    * Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
+  * AST symbols and C++ code generation.
+  * Generate visitors.
+  * Generate easy builder.
+  * Generate reflection.
+* **Execution**:
+  * **Goal**: given instructions and parse input text with SAX-like callback
+  * An instruction could generate multiple continuations
+  * Parser-generated instructions serialization.
+  * Execute instructions as a SAX-like parser, with notification on ambiguous node, error message generation and error recovering.
+  * If there is ambiguity, different callbacks could be called on the same position, and results could be discarded in the future execution.
+* **Compiler** -> **AstGen**, **Execution**
+  * **Goal**: input described using `Generated source code (AST part)` and generate instructions (text parser)
+  * **Produce** (from unit test)
+    * Generated C++ source code (parser part) for ParserGen input
+  * Take the `ParserGen AST declaration` and generate instructions.
+  * Generate the default handler to create AST for the SAX-like parser.
+  * ToString algorithm.
+  * Bidirectional binding between the AST and the text.
+* **ParserGen** -> **Compiler**
+  * **Goal**: CLI Tool
+  * Integrate `Generated source code (AST part)`
+  * Integrate `Generated source code (parser part)`
+  * Handle command line arguments
+* **UnitTestAst**:
+  * Unit test of **AstGen** building block and pool allocation etc.
+  * **Produces** steps
+    * Hand-written `AST for ParserGen` symbols.
+    * Codegen symbols and get `Generated source code (AST part)` for ParserGen input
+* **UnitTestExecution**:
+  * Unit test of **Execution**.
+  * Assert directly on SAX-like parser.
+* **UnitTestCompiler**:
+  * Unit test of **Compiler**, inputs are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST.
+  * Assert on the ToString-ed AST. (shared)
+  * **Produces** steps
+    * Implement ParserGen AST Input syntax using generated easy builder from `Generated source code (AST part)`
+    * Serialize instructions and get `C++ source code (parser part)` for ParserGen input
+* **UnitTestParserGen**:
+  * Unit test of **ParserGen**, inputs are all **UnitTestExecution** test cases rewritten in text format.
+  * Assert on the ToString-ed AST. (shared)
+  * Generate all parsers in text format to C++ code
+* **UnitTest**:
+  * Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
+  * Test all generated parsers in **UnitTestParserGen**.
+  * Assert on the ToString-ed AST. (shared)
+* Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
+* Do not really write a file if the generated content doesn't change.
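
Review note on the final bullet ("Do not really write a file if the generated content doesn't change"): one way this could work is sketched below. It is only an illustration under stated assumptions, not ParserGen's actual code. `WriteFileIfChanged` is a hypothetical name, and the sketch assumes generated files are small enough to compare in memory.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper, not taken from ParserGen: writes `content` to `path`
// only when the file is missing or its current bytes differ from `content`.
// Returns true if the file was (re)written, false if it was left untouched.
inline bool WriteFileIfChanged(const std::string& path, const std::string& content)
{
    {
        // Binary mode so the comparison is byte-for-byte on every platform.
        std::ifstream input(path, std::ios::binary);
        if (input)
        {
            std::string existing{
                std::istreambuf_iterator<char>(input),
                std::istreambuf_iterator<char>()};
            if (existing == content)
            {
                return false; // unchanged: do not touch the file
            }
        }
    } // the input stream is closed here, before the file is reopened for writing

    std::ofstream output(path, std::ios::binary | std::ios::trunc);
    output.write(content.data(), static_cast<std::streamsize>(content.size()));
    return true;
}
```

Skipping the write leaves the file's modification time alone, which is presumably the point of the bullet: a build system that compares timestamps would then not recompile generated sources that did not change.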