Update TODO_ParserGen.md

This commit is contained in:
Zihan Chen
2020-07-19 12:58:15 -07:00
parent e1f0951f21
commit 61ee6e4a3c
+131 -108
@@ -1,4 +1,6 @@
# Goal # ParserGen 2.0
## Goal
* Parsing * Parsing
* Explicitly declare the boundary of ambiguity resolution (e.g. on EXPR or on STAT) * Explicitly declare the boundary of ambiguity resolution (e.g. on EXPR or on STAT)
@@ -15,7 +17,7 @@
* Low overhead AST with reflection * Low overhead AST with reflection
* Optional creating AST from a pool * Optional creating AST from a pool
# AST Definition (compatible with Workflow) ## AST Definition (compatible with Workflow)
``` ```
class CLASS_NAME [: BASE_CLASS] class CLASS_NAME [: BASE_CLASS]
@@ -25,128 +27,149 @@ class CLASS_NAME [: BASE_CLASS]
} }
``` ```
## Configurations ### Configurations
- Include files in generated C++ header * Include files in generated C++ header
- Depended AST definition files * Depended AST definition files
- Visitors selected to generate * Visitors selected to generate
- Optional reflection support * Optional reflection support
- All AST constructors are protected * All AST constructors are protected
- Generated factory class * Generated factory class
- If AST object pool is enabled * If AST object pool is enabled
- reflection is disabled * reflection is disabled
- `Ptr<T>` for all AST types is generated with an enumerated `Cast` function. * `Ptr<T>` for all AST types is generated with an enumerated `Cast` function.
- Use generated RTTI constructions (e.g. enum class tag for type) * Use generated RTTI constructions (e.g. enum class tag for type)
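The configuration items above (protected constructors, a generated factory, and an enum class tag with a `Cast` function in place of reflection) could look roughly like the following. This is only a sketch: `AstTag`, `Expr`, `NumberExpr`, `BinaryExpr`, `AstFactory` and `kTag` are hypothetical names, and `std::shared_ptr` stands in for `Ptr<T>`.

```cpp
#include <memory>

// Hypothetical generated RTTI tag: one enumerator per AST class.
enum class AstTag { NumberExpr, BinaryExpr };

class Expr {
public:
    virtual ~Expr() = default;
    AstTag tag() const { return tag_; }
protected:
    // Protected: nodes are created only through the factory.
    explicit Expr(AstTag tag) : tag_(tag) {}
private:
    AstTag tag_;
};

class NumberExpr : public Expr {
public:
    static constexpr AstTag kTag = AstTag::NumberExpr;
    int value = 0;
protected:
    NumberExpr() : Expr(kTag) {}
    friend class AstFactory;
};

class BinaryExpr : public Expr {
public:
    static constexpr AstTag kTag = AstTag::BinaryExpr;
    std::shared_ptr<Expr> left;
    std::shared_ptr<Expr> right;
protected:
    BinaryExpr() : Expr(kTag) {}
    friend class AstFactory;
};

// Stand-in for the generated factory class. A pooled variant would
// allocate from its own storage instead of calling new.
class AstFactory {
public:
    template <typename T>
    std::shared_ptr<T> Create() { return std::shared_ptr<T>(new T()); }
};

// Tag-based cast: compares the stored tag instead of using dynamic_cast.
// It only matches the exact class, which is enough for this sketch.
template <typename T>
std::shared_ptr<T> Cast(const std::shared_ptr<Expr>& node) {
    if (node && node->tag() == T::kTag) {
        return std::static_pointer_cast<T>(node);
    }
    return nullptr;
}
```

A generated `Cast` would probably also need to handle casts to intermediate base classes; that is left out here.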
## Types ### Types
- `Token`: In the previous version, `Token` is a value type, now it is a reference type. * `Token`: In the previous version, `Token` is a value type, now it is a reference type.
- `CLASS-NAME`: Another class. * `CLASS-NAME`: Another class.
- `TYPE[]`: Array, whose element is not allowed to be another array. * `TYPE[]`: Array, whose element is not allowed to be another array.
## MISC ### MISC
- Define a `ToString` algorithm with customizable configurations. * Define a `ToString` algorithm with customizable configurations.
# Lexical Analyzer ## Lexical Analyzer
- Pair name with regular expressions. * Pair name with regular expressions.
- Extendable tokens. * Extendable tokens.
- For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string * For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
- Pair a name with the token subset, and give a default name to a token full set * Pair a name with the token subset, and give a default name to a token full set
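To illustrate the extendable-token idea, here is a minimal sketch of a callback that the lexer could invoke once a regular expression has recognized the opening of a C++ raw string. `FindRawStringEnd` is a hypothetical name, and so is the contract that it returns the offset one past the token or `std::string::npos`; neither comes from ParserGen.

```cpp
#include <cstddef>
#include <string>

// Given the offset of the leading R, return the offset one past the closing
// quote of the raw string, or std::string::npos if it is unterminated.
std::size_t FindRawStringEnd(const std::string& input, std::size_t start) {
    if (start > input.size() || input.compare(start, 2, "R\"") != 0) {
        return std::string::npos;
    }
    // Everything between R" and ( is the user-chosen delimiter.
    const std::size_t open = input.find('(', start + 2);
    if (open == std::string::npos) {
        return std::string::npos;
    }
    const std::string delimiter = input.substr(start + 2, open - (start + 2));
    // The token ends at the first )delimiter" after the opening parenthesis.
    const std::string closing = ")" + delimiter + "\"";
    const std::size_t close = input.find(closing, open + 1);
    if (close == std::string::npos) {
        return std::string::npos;
    }
    return close + closing.size();
}
```

A regular expression alone cannot match such a token because the closing sequence depends on the delimiter captured at the start, which is why a callback is useful here.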
# Error Messages ## Error Messages
- Generate error messages in C++ code * Generate error messages in C++ code
# Syntax Analyzer ## Syntax Analyzer
- Priority of loops: * Priority of loops:
- `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse. * `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
- `-[ RULE ]` means the result of having `RULE` is kept only if skipping `RULE` makes the clause unable to parse. * `-[ RULE ]` means the result of having `RULE` is kept only if skipping `RULE` makes the clause unable to parse.
- `[ RULE ]` means keep both results * `[ RULE ]` means keep both results
- `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others generate only one result. * `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others generate only one result.
- Being able to change token subset during parsing. * Being able to change token subset during parsing.
- Being able to specify an error message when a certain action fails. * Being able to specify an error message when a certain action fails.
- Generate SAX-like parser, with a default handler to create AST. * Generate SAX-like parser, with a default handler to create AST.
* Generate each **POOLED** tuple struct type for
* Loop body. A delimited list is treated as [ITEM {DELIMITER ITEM}]
* Loop records a pointer to the reversed linked list of the last item during calculation
* Loop records an array of items as the result
* Alternative as `Union<Ts...>` storing `{TYPE-FLAG, ITEM*}` (value type)
* Optional as `Optional<T>` storing `{ITEM*}` (value type)
* Sequential as `{A, B, C ...}` with generated field names (value type)
* The type is a rule or rule fragment, not the resulting AST type
* If there are multiple fields of the same type, each name is appended with the index of its position in the rule (optionals, alternatives and loops are packed as one)
* If a tuple is created directly from a rule, there will be a static field to indicate which rule it comes from
* Rule reference as `Reference<Ts...>` storing `{RULE, FRAGMENT, TYPE-FLAG, ITEM*}`
* Each `Reference<Ts...>` is aliased
* Consider forward declarations
* All types have an un-templated partner so that the core SAX-like instruction execution doesn't need to know concrete types
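The loop representation described above (a pointer to a reversed linked list during calculation, an array as the final result) can be sketched as follows. `LoopNode` and `LoopBuilder` are hypothetical names, and `std::deque` is only a stand-in for the pool. One plausible benefit of the reversed list, which the source does not state, is that several pending results can share a common prefix while parsing is still ambiguous.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// One cell of the reversed list: each new item points back to the previous one.
template <typename T>
struct LoopNode {
    T* item;
    LoopNode* previous;
};

template <typename T>
struct LoopBuilder {
    LoopNode<T>* last = nullptr;   // pointer to the most recently parsed item
    std::size_t count = 0;
    std::deque<LoopNode<T>> pool;  // stand-in for pooled allocation

    // During calculation: push the item onto the head of the reversed list.
    void Append(T* item) {
        pool.push_back({item, last});
        last = &pool.back();  // deque::push_back keeps earlier elements in place
        ++count;
    }

    // As the result: walk backwards once and fill the array in source order.
    std::vector<T*> Finish() const {
        std::vector<T*> items(count);
        std::size_t index = count;
        for (const LoopNode<T>* node = last; node != nullptr; node = node->previous) {
            items[--index] = node->item;
        }
        return items;
    }
};
```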
## Supported EBNF ### Supported EBNF
- TOKEN [`:` PROPERTY-NAME] * TOKEN [`:` PROPERTY-NAME]
- RULE [`:` PROPERTY-NAME] * RULE [`:` PROPERTY-NAME]
- Optional: * Optional:
- `+[` EBNF `]` * `+[` EBNF `]`
- `-[` EBNF `]` * `-[` EBNF `]`
- `[` EBNF `]` * `[` EBNF `]`
- Loop: * Loop:
- `+{` EBNF `}` * `+{` EBNF `}`
- `-{` EBNF `}` * `-{` EBNF `}`
- `{` EBNF `}` * `{` EBNF `}`
- `with{` PROPERTY-ASSIGNMENT ... `}` * `with{` PROPERTY-ASSIGNMENT ... `}`
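The three optional forms `+[ ]`, `-[ ]` and `[ ]` differ only in which parse results survive. The sketch below encodes one reading of the rules given under Syntax Analyzer; `OptionalPriority` and `ResolveOptional` are hypothetical names, and results are represented as plain strings for brevity.

```cpp
#include <string>
#include <vector>

enum class OptionalPriority {
    PreferTake,  // +[ RULE ]
    PreferSkip,  // -[ RULE ]
    Equal        //  [ RULE ]
};

// withRule:    complete parses of the clause that went through RULE
// withoutRule: complete parses of the clause that skipped RULE
// ruleMatched: whether RULE itself succeeded, regardless of what follows
std::vector<std::string> ResolveOptional(
    OptionalPriority priority,
    bool ruleMatched,
    const std::vector<std::string>& withRule,
    const std::vector<std::string>& withoutRule)
{
    switch (priority)
    {
    case OptionalPriority::PreferTake:
        // Once RULE succeeds, skipping is never considered,
        // even if the rest of the clause then fails to parse.
        return ruleMatched ? withRule : withoutRule;
    case OptionalPriority::PreferSkip:
        // Results that use RULE are kept only when skipping cannot parse.
        return withoutRule.empty() ? withRule : withoutRule;
    case OptionalPriority::Equal:
        {
            // Keep both results.
            std::vector<std::string> both = withRule;
            both.insert(both.end(), withoutRule.begin(), withoutRule.end());
            return both;
        }
    }
    return {};
}
```

Note the asymmetry: `PreferTake` can make the whole clause fail when `RULE` matches but the remainder does not, while `PreferSkip` falls back to the results that use `RULE`.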
## EBNF Program ### EBNF Program
- RULE {`::=` CLAUSE `as` CLASS-NAME} `;` * RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
- Consider a syntax here to switch token set * Consider a syntax here to switch token set
## ToString Algorithm Requirements ### ToString Algorithm Requirements
- Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
- Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
- Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
## Development Project Structure * Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
- Original ParserGen code will be separated from Vlpp. * Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
- **Development Steps** * Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
### Development Project Structure
* Original ParserGen code will be separated from Vlpp.
* **Development Steps**
1. Symbols for ParserGen AST 1. Symbols for ParserGen AST
2. Manually: symbols -> `ParserGen AST described in C++` 2. Manually: symbols -> `ParserGen AST declaration in C++`
3. Manually: ParserGen Syntax described in `ParserGen AST described in C++` -> `ParserGen Parser described in C++` 3. Manually: ParserGen Syntax described in `ParserGen AST declaration in C++` -> `ParserGen Parser declaration in C++`
4. Integrate 4. Integrate
- **AstGen**: * **TODO**: Reorganize unit test projects to pure unit test and code generation steps
- **Goal**: given symbols and generated C++ code for AST * Code generation steps are also multiple projects
- **Produce** (from unit test): * Because some projects and their partner unit tests rely on generated code from the projects they depend on
- Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input * **AstGen**:
- AST symbols and C++ code generation. * **Goal**: given symbols and generated C++ code for AST
- Generate visitors. * **Produce** (from unit test):
- Generate easy builder. * Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
- Generate reflection. * AST symbols and C++ code generation.
- **Execution**: * Generate visitors.
- **Goal**: given instructions and parse input text with SAX-like callback * Generate easy builder.
- Parser-generated instructions serialization. * Generate reflection.
- Execute instructions as a SAX-like parser, with notification on ambiguous node, error message generation and error recovering. * **Execution**:
- **Compiler** -> **AstGen**, **Execution** * **Goal**: given instructions and parse input text with SAX-like callback
- **Goal**: input described using `Generated source code (AST part)` and generate instructions (text parser) * An instruction could generate multiple continuations
- **Produce** (from unit test) * Parser-generated instructions serialization.
- Generated C++ source code (parser part) for ParserGen input * Execute instructions as a SAX-like parser, with notification on ambiguous node, error message generation and error recovering.
- Take the `ParserGen AST declaration` and generate instructions. * If there is ambiguity, different callbacks could be called at the same position, and results could be discarded later in the execution.
- Generate the default handler to create AST for the SAX-like parser. * **Compiler** -> **AstGen**, **Execution**
- ToString algorithm. * **Goal**: input described using `Generated source code (AST part)` and generate instructions (text parser)
- Bidirectional binding between the AST and the text.
- **ParserGen** -> **Compiler** * Generated C++ source code (parser part) for ParserGen input
- **Goal**: CLI Tool * Take the `ParserGen AST declaration` and generate instructions.
- Integrate `Generated source code (AST part)` * Generate the default handler to create AST for the SAX-like parser.
- Integrate `Generated source code (parser part)` * ToString algorithm.
- Handle command line arguments * Bidirectional binding between the AST and the text.
- **UnitTestAst**: * **ParserGen** -> **Compiler**
- Unit test of **AstGen** building block and pool allocation etc. * **Goal**: CLI Tool
- **Produces** steps * Integrate `Generated source code (AST part)`
- Hand-written `AST for ParserGen` symbols. * Integrate `Generated source code (parser part)`
- Codegen symbols and get `Generated source code (AST part)` for ParserGen input * Handle command line arguments
- **UnitTestExecution**: * **UnitTestAst**:
- Unit test of **Execution**. * Unit test of **AstGen** building block and pool allocation etc.
- Assert directly on SAX-like parser. * **Produces** steps
- **UnitTestCompiler**: * Hand-written `AST for ParserGen` symbols.
- Unit test of **Compiler**, inputs are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST. * **UnitTestExecution**:
- Assert on the ToString-ed AST. (shared) * **UnitTestExecution**:
- **Produces** steps * Unit test of **Execution**.
- Implement ParserGen AST Input syntax using generated easy builder from `Generated source code (AST part)` * Assert directly on SAX-like parser.
- Serialize instructions and get `C++ source code (parser part)` for ParserGen input * **UnitTestCompiler**:
- **UnitTestParserGen**: * Unit test of **Compiler**, inputs are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST.
- Unit test of **ParserGen**, inputs are all **UnitTestExecution** test cases rewritten in text format. * Assert on the ToString-ed AST. (shared)
- Assert on the ToString-ed AST. (shared) * **Produces** steps
- Generate all parsers in text format to C++ code * Implement ParserGen AST Input syntax using generated easy builder from `Generated source code (AST part)`
- **UnitTest**: * Serialize instructions and get `C++ source code (parser part)` for ParserGen input
- Link all cpp files in all other unit test projects so that all test cases can be run in one F5. * **UnitTestParserGen**:
- Test all generated parsers in **UnitTestParserGen**. * Unit test of **ParserGen**, inputs are all **UnitTestExecution** test cases rewritten in text format.
- Assert on the ToString-ed AST. (shared) * Assert on the ToString-ed AST. (shared)
- Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files. * Generate all parsers in text format to C++ code
- Do not really write a file if the generated content doesn't change. * **UnitTest**:
* Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
* Test all generated parsers in **UnitTestParserGen**.
* Assert on the ToString-ed AST. (shared)
* Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
* Do not really write a file if the generated content doesn't change.
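The last point, skipping the write when the generated content is unchanged, keeps file timestamps stable so that dependent projects are not rebuilt needlessly. A minimal sketch using only the standard library follows; `WriteFileIfChanged` is a hypothetical helper name.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Writes content to path only when the file is missing or differs.
// Returns true if the file was actually written.
bool WriteFileIfChanged(const std::string& path, const std::string& content) {
    {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            const std::string existing(
                (std::istreambuf_iterator<char>(in)),
                std::istreambuf_iterator<char>());
            if (existing == content) {
                return false;  // identical: leave the file and its timestamp alone
            }
        }
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << content;
    return true;
}
```

Binary mode is used on both sides so that the comparison is not affected by newline translation.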