Update TODO_ParserGen.md

This commit is contained in:
Zihan Chen
2020-07-19 12:58:15 -07:00
parent e1f0951f21
commit 61ee6e4a3c
+132 -109
@@ -1,4 +1,6 @@
# Goal
# ParserGen 2.0
## Goal
* Parsing
* Explicitly declare the boundary of ambiguity resolution (e.g. on EXPR or on STAT)
@@ -15,7 +17,7 @@
* Low overhead AST with reflection
* Optionally create the AST from a pool
# AST Definition (compatible with Workflow)
## AST Definition (compatible with Workflow)
```
class CLASS_NAME [: BASE_CLASS]
@@ -25,128 +27,149 @@ class CLASS_NAME [: BASE_CLASS]
}
```
## Configurations
### Configurations
- Include files in generated C++ header
- Depended AST definition files
- Visitors selected to generate
- Optional reflection support
- All AST constructors are protected
- Generated factory class
- If AST object pool is enabled
- reflection is disabled
- `Ptr<T>` for all AST types is generated with an enumerated `Cast` function.
- Use generated RTTI constructions (e.g. enum class tag for type)
* Include files in generated C++ header
* Depended AST definition files
* Visitors selected to generate
* Optional reflection support
* All AST constructors are protected
* Generated factory class
* If AST object pool is enabled
* reflection is disabled
* `Ptr<T>` for all AST types is generated with an enumerated `Cast` function.
* Use generated RTTI constructions (e.g. enum class tag for type)
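
The configuration items above (protected constructors, a generated factory, an enum class tag in place of reflection, and an enumerated `Cast`) could fit together roughly as follows. This is a minimal sketch under assumptions: `AstTag`, `Expr`, `NumberExpr`, `NameExpr` and `AstFactory` are hypothetical names, and `std::shared_ptr` stands in for `Ptr<T>`; the actual generated code may look quite different.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical tag enum standing in for the generated RTTI.
enum class AstTag { NumberExpr, NameExpr };

class Expr {
public:
    virtual ~Expr() = default;
    AstTag tag() const { return tag_; }
protected:
    // Protected: user code cannot construct nodes directly.
    explicit Expr(AstTag tag) : tag_(tag) {}
private:
    AstTag tag_;
};

class NumberExpr : public Expr {
public:
    static constexpr AstTag Tag = AstTag::NumberExpr;
    int value = 0;
protected:
    NumberExpr() : Expr(Tag) {}
    friend class AstFactory;  // only the factory may create nodes
};

class NameExpr : public Expr {
public:
    static constexpr AstTag Tag = AstTag::NameExpr;
    std::string name;
protected:
    NameExpr() : Expr(Tag) {}
    friend class AstFactory;
};

// Hypothetical generated factory; a pooled variant would allocate from a pool here.
class AstFactory {
public:
    std::shared_ptr<NumberExpr> CreateNumber(int value) {
        std::shared_ptr<NumberExpr> node(new NumberExpr());
        node->value = value;
        return node;
    }
    std::shared_ptr<NameExpr> CreateName(std::string name) {
        std::shared_ptr<NameExpr> node(new NameExpr());
        node->name = std::move(name);
        return node;
    }
};

// Tag-based cast: compares the enum tag instead of using dynamic_cast or reflection.
template <typename T>
std::shared_ptr<T> Cast(const std::shared_ptr<Expr>& node) {
    if (node && node->tag() == T::Tag) {
        return std::static_pointer_cast<T>(node);
    }
    return nullptr;
}
```

With this shape, `Cast<NumberExpr>(node)` returns a null pointer when the tag does not match, so callers can branch on node kind without reflection being enabled.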
## Types
### Types
- `Token`: In the previous version, `Token` was a value type; now it is a reference type.
- `CLASS-NAME`: Another class.
- `TYPE[]`: Array, whose element is not allowed to be another array.
* `Token`: In the previous version, `Token` was a value type; now it is a reference type.
* `CLASS-NAME`: Another class.
* `TYPE[]`: Array, whose element is not allowed to be another array.
## MISC
### MISC
- Define a `ToString` algorithm with customizable configurations.
* Define a `ToString` algorithm with customizable configurations.
# Lexical Analyzer
## Lexical Analyzer
- Pair name with regular expressions.
- Extendable tokens.
- For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
- Pair a name with the token subset, and give a default name to a token full set
* Pair name with regular expressions.
* Extendable tokens.
* For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
* Pair a name with the token subset, and give a default name to a token full set
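
The extendable-token idea above (recognize a prefix, then let a callback decide where the token ends) can be sketched for C++ raw string literals. The names `TokenEndCallback`, `FindRawStringEnd` and `ScanExtendableToken` are hypothetical and not part of ParserGen; the sketch also assumes a delimiter of any length, which is looser than the single-character pattern shown in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical callback: given the input and the offset where the prefix starts,
// return the offset one past the end of the token, or npos if it never ends.
using TokenEndCallback =
    std::function<std::size_t(const std::string&, std::size_t)>;

// End finder for raw strings of the form R"delim( ... )delim".
std::size_t FindRawStringEnd(const std::string& input, std::size_t start) {
    if (start + 2 > input.size() || input.compare(start, 2, "R\"") != 0) {
        return std::string::npos;
    }
    const std::size_t open = input.find('(', start + 2);
    if (open == std::string::npos) {
        return std::string::npos;
    }
    // The delimiter is whatever sits between the quote and the opening parenthesis.
    const std::string delimiter = input.substr(start + 2, open - (start + 2));
    const std::string closing = ")" + delimiter + "\"";
    const std::size_t close = input.find(closing, open + 1);
    if (close == std::string::npos) {
        return std::string::npos;
    }
    return close + closing.size();
}

// The lexer matched the prefix at `start`; the callback determines the end.
// Returns the token text, or an empty string if the token is unterminated.
std::string ScanExtendableToken(const std::string& input, std::size_t start,
                                const TokenEndCallback& findEnd) {
    const std::size_t end = findEnd(input, start);
    if (end == std::string::npos) {
        return std::string();
    }
    return input.substr(start, end - start);
}
```

Keeping the end finder behind a callback means the regular expression only has to describe the prefix, which is the part a regular language can express.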
# Error Messages
## Error Messages
- Generate error messages in C++ code
* Generate error messages in C++ code
# Syntax Analyzer
## Syntax Analyzer
- Priority of loops:
- `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
- `-[ RULE ]` means the result of having `RULE` is kept only if skipping `RULE` makes the clause unable to parse.
- `[ RULE ]` means keep both results.
- `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others generate only one result.
- Being able to change token subset during parsing.
- Being able to specify an error message when a certain action fails.
- Generate SAX-like parser, with a default handler to create AST.
* Priority of loops:
* `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
* `-[ RULE ]` means the result of having `RULE` is kept only if skipping `RULE` makes the clause unable to parse.
* `[ RULE ]` means keep both results.
* `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others generate only one result.
* Being able to change token subset during parsing.
* Being able to specify an error message when a certain action fails.
* Generate SAX-like parser, with a default handler to create AST.
* Generate each **POOLED** tuple struct type for
* Loop body. Delimited list is considered as [ITEM {DELIMITER ITEM}]
* Loop records a pointer to the reversed linked list of the last item during calculation
* Loop records an array of items as the result
* Alternative as `Union<Ts...>` storing `{TYPE-FLAG, ITEM*}` (value type)
* Optional as `Optional<T>` storing `{ITEM*}` (value type)
* Sequential as `{A, B, C ...}` with generated field names (value type)
* Type is rule or rule fragment, not the result AST type
* If there are multiple fields of the same type, their names are appended with an index of the position in the rule (optionals, alternatives and loops are packed as one)
* If a tuple is created directly from a rule, there will be a static field to indicate which rule it comes from
* Rule reference as `Reference<Ts...>` storing `{RULE, FRAGMENT, TYPE-FLAG, ITEM*}`
* `Reference<Ts...>` types are aliased
* Consider forward declarations
* All types have an un-templated partner so that the core SAX-like instruction execution doesn't need to know concrete types
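
The three priorities of an optional, as described in the list above, can be illustrated with a small resolution function. This is only a sketch of the stated semantics, not the planned implementation: `Priority`, `Attempt`, `OptionalOutcome` and `Resolve` are hypothetical names, and each parse result is reduced to a string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// PreferRule ~ `+[ RULE ]`, PreferSkip ~ `-[ RULE ]`, KeepBoth ~ `[ RULE ]`.
enum class Priority { PreferRule, PreferSkip, KeepBoth };

// Outcome of parsing the whole clause along one path.
struct Attempt {
    bool parsed;
    std::string text;
};

struct OptionalOutcome {
    bool ruleMatched;  // did RULE itself succeed at this position?
    Attempt withRule;  // whole clause parsed with RULE taken
    Attempt skipped;   // whole clause parsed with RULE skipped
};

std::vector<std::string> Resolve(Priority priority, const OptionalOutcome& outcome) {
    // The clause cannot parse through RULE if RULE itself did not match.
    assert(outcome.ruleMatched || !outcome.withRule.parsed);

    std::vector<std::string> results;
    switch (priority) {
    case Priority::PreferRule:
        // Once RULE succeeds, skipping is never considered,
        // even if the rest of the clause then fails to parse.
        if (outcome.ruleMatched) {
            if (outcome.withRule.parsed) results.push_back(outcome.withRule.text);
        } else if (outcome.skipped.parsed) {
            results.push_back(outcome.skipped.text);
        }
        break;
    case Priority::PreferSkip:
        // The result with RULE survives only if skipping cannot parse.
        if (outcome.skipped.parsed) {
            results.push_back(outcome.skipped.text);
        } else if (outcome.withRule.parsed) {
            results.push_back(outcome.withRule.text);
        }
        break;
    case Priority::KeepBoth:
        // Ambiguity is preserved: every successful path is reported.
        if (outcome.withRule.parsed) results.push_back(outcome.withRule.text);
        if (outcome.skipped.parsed) results.push_back(outcome.skipped.text);
        break;
    }
    return results;
}
```

Note the asymmetry: under `PreferRule` a clause can fail outright when `RULE` matches but the rest does not, whereas `PreferSkip` always falls back to the other path.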
## Supported EBNF
### Supported EBNF
- TOKEN [`:` PROPERTY-NAME]
- RULE [`:` PROPERTY-NAME]
- Optional:
- `+[` EBNF `]`
- `-[` EBNF `]`
- `[` EBNF `]`
- Loop:
- `+{` EBNF `}`
- `-{` EBNF `}`
- `{` EBNF `}`
- `with{` PROPERTY-ASSIGNMENT ... `}`
## EBNF Program
* TOKEN [`:` PROPERTY-NAME]
* RULE [`:` PROPERTY-NAME]
* Optional:
* `+[` EBNF `]`
* `-[` EBNF `]`
* `[` EBNF `]`
* Loop:
* `+{` EBNF `}`
* `-{` EBNF `}`
* `{` EBNF `}`
* `with{` PROPERTY-ASSIGNMENT ... `}`
- RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
- Consider a syntax here to switch token set
### EBNF Program
## ToString Algorithm Requirements
- Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
- Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
- Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
* RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
* Consider a syntax here to switch token set
## Development Project Structure
- Original ParserGen code will be separated from Vlpp.
- **Development Steps**
### ToString Algorithm Requirements
* Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
* Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
* Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
### Development Project Structure
* Original ParserGen code will be separated from Vlpp.
* **Development Steps**
1. Symbols for ParserGen AST
2. Manually: symbols -> `ParserGen AST described in C++`
3. Manually: ParserGen Syntax described in `ParserGen AST described in C++` -> `ParserGen Parser described in C++`
2. Manually: symbols -> `ParserGen AST declaration in C++`
3. Manually: ParserGen Syntax described in `ParserGen AST declaration in C++` -> `ParserGen Parser declaration in C++`
4. Integrate
- **AstGen**:
- **Goal**: given symbols, generate C++ code for the AST
- **Produce** (from unit test):
- Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
- AST symbols and C++ code generation.
- Generate visitors.
- Generate easy builder.
- Generate reflection.
- **Execution**:
- **Goal**: given instructions, parse input text with SAX-like callbacks
- Serialization of parser-generated instructions.
- Execute instructions as a SAX-like parser, with notification on ambiguous nodes, error message generation and error recovery.
- **Compiler** -> **AstGen**, **Execution**
- **Goal**: take input described using `Generated source code (AST part)` and generate instructions (text parser)
- **Produce** (from unit test)
- Generated C++ source code (parser part) for ParserGen input
- Take the `ParserGen AST declaration` and generate instructions.
- Generate the default handler to create AST for the SAX-like parser.
- ToString algorithm.
- Bidirectional binding between the AST and the text.
- **ParserGen** -> **Compiler**
- **Goal**: CLI Tool
- Integrate `Generated source code (AST part)`
- Integrate `Generated source code (parser part)`
- Handle command line arguments
- **UnitTestAst**:
- Unit test of **AstGen** building block and pool allocation etc.
- **Produces** steps
- Hand-written `AST for ParserGen` symbols.
- Codegen symbols and get `Generated source code (AST part)` for ParserGen input
- **UnitTestExecution**:
- Unit test of **Execution**.
- Assert directly on SAX-like parser.
- **UnitTestCompiler**:
- Unit test of **Compiler**, input are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST.
- Assert on the ToString-ed AST. (shared)
- **Produces** steps
- Implement ParserGen AST Input syntax using the generated easy builder from `Generated source code (AST part)`
- Serialize instructions and get `C++ source code (parser part)` for ParserGen input
- **UnitTestParserGen**:
- Unit test of **ParserGen**, input are all **UnitTestExecution** test cases rewritten in text format.
- Assert on the ToString-ed AST. (shared)
- Generate C++ code for all parsers written in text format
- **UnitTest**:
- Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
- Test all generated parsers in **UnitTestParserGen**.
- Assert on the ToString-ed AST. (shared)
- Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
- Do not really write a file if the generated content doesn't change.
* **TODO**: Reorganize unit test projects to pure unit test and code generation steps
* Code generation steps are also multiple projects
* Because some projects and their partner unit tests rely on code generated by the projects they depend on
* **AstGen**:
* **Goal**: given symbols, generate C++ code for the AST
* **Produce** (from unit test):
* Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
* AST symbols and C++ code generation.
* Generate visitors.
* Generate easy builder.
* Generate reflection.
* **Execution**:
* **Goal**: given instructions, parse input text with SAX-like callbacks
* An instruction could generate multiple continuations
* Serialization of parser-generated instructions.
* Execute instructions as a SAX-like parser, with notification on ambiguous nodes, error message generation and error recovery.
* If there is ambiguity, different callbacks could be called on the same position, and results could be discarded later in the execution.
* **Compiler** -> **AstGen**, **Execution**
* **Goal**: take input described using `Generated source code (AST part)` and generate instructions (text parser)
* **Produce** (from unit test)
* Generated C++ source code (parser part) for ParserGen input
* Take the `ParserGen AST declaration` and generate instructions.
* Generate the default handler to create AST for the SAX-like parser.
* ToString algorithm.
* Bidirectional binding between the AST and the text.
* **ParserGen** -> **Compiler**
* **Goal**: CLI Tool
* Integrate `Generated source code (AST part)`
* Integrate `Generated source code (parser part)`
* Handle command line arguments
* **UnitTestAst**:
* Unit test of **AstGen** building block and pool allocation etc.
* **Produces** steps
* Hand-written `AST for ParserGen` symbols.
* Codegen symbols and get `Generated source code (AST part)` for ParserGen input
* **UnitTestExecution**:
* Unit test of **Execution**.
* Assert directly on SAX-like parser.
* **UnitTestCompiler**:
* Unit test of **Compiler**, input are all **UnitTestExecution** test cases rewritten using the generated easy-builder for `AST for ParserGen` AST.
* Assert on the ToString-ed AST. (shared)
* **Produces** steps
* Implement ParserGen AST Input syntax using the generated easy builder from `Generated source code (AST part)`
* Serialize instructions and get `C++ source code (parser part)` for ParserGen input
* **UnitTestParserGen**:
* Unit test of **ParserGen**, input are all **UnitTestExecution** test cases rewritten in text format.
* Assert on the ToString-ed AST. (shared)
* Generate C++ code for all parsers written in text format
* **UnitTest**:
* Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
* Test all generated parsers in **UnitTestParserGen**.
* Assert on the ToString-ed AST. (shared)
* Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
* Do not really write a file if the generated content doesn't change.
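
The last item, skipping the write when the generated content is unchanged, could be handled by a helper along these lines. `WriteFileIfChanged` is a hypothetical name, and the sketch assumes the whole generated file fits comfortably in memory.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Writes `content` to `path` only if the file is missing or differs.
// Returns true if the file was written, false if it was left untouched.
bool WriteFileIfChanged(const std::string& path, const std::string& content) {
    {
        std::ifstream existing(path, std::ios::binary);
        if (existing) {
            // Read the whole existing file and compare byte for byte.
            const std::string current((std::istreambuf_iterator<char>(existing)),
                                      std::istreambuf_iterator<char>());
            if (current == content) {
                return false;  // unchanged: keep the old timestamp
            }
        }
    }  // the input stream is closed here, before the file is reopened for writing

    std::ofstream output(path, std::ios::binary | std::ios::trunc);
    output << content;
    return true;
}
```

Leaving an unchanged file alone keeps its modification time, so build systems that depend on timestamps do not recompile generated sources after every code generation run.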