mirror of
https://github.com/vczh-libraries/Release.git
synced 2026-02-06 03:42:11 +08:00
ParserGen 2.0
Goal
- Parsing
- Explicitly declare the boundary of ambiguity resolving (e.g. on EXPR or on STAT)
- Full CFG power, no limitation
- Experiment: expanding all left-recursive grammar to right-recursive grammar with instructions
- Experiment: optionally inline all rules which don't generate parser functions
- Error message generation
- Error recovery
- Serializing
- Escaping and Unescaping pairs (instead of only unescaping)
- Calculate ambiguous ToString cases
- Generate ToString algorithm
- AST
- Low overhead AST with reflection
- Optional creating AST from a pool
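The "Escaping and Unescaping pairs" goal can be illustrated with a minimal sketch. The function names `EscapeString` and `UnescapeString` and the escape set (`\\`, `\"`, `\n`) are assumptions made for this example, not part of the design; the point is only that the two directions are defined together so that unescaping an escaped value gives back the original.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical escaping half of the pair: raw text -> quoted literal.
std::string EscapeString(const std::string& raw)
{
    std::string result = "\"";
    for (char c : raw)
    {
        switch (c)
        {
        case '\\': result += "\\\\"; break;
        case '"':  result += "\\\""; break;
        case '\n': result += "\\n";  break;
        default:   result += c;
        }
    }
    result += '"';
    return result;
}

// Hypothetical unescaping half of the pair: quoted literal -> raw text.
std::string UnescapeString(const std::string& literal)
{
    if (literal.size() < 2 || literal.front() != '"' || literal.back() != '"')
        throw std::invalid_argument("not a quoted literal");
    std::string result;
    for (std::size_t i = 1; i + 1 < literal.size(); i++)
    {
        char c = literal[i];
        if (c == '\\')
        {
            // The escaped character must not be the closing quote itself.
            if (i + 2 >= literal.size())
                throw std::invalid_argument("dangling escape");
            char next = literal[++i];
            result += (next == 'n') ? '\n' : next;
        }
        else
        {
            result += c;
        }
    }
    return result;
}
```

With such a pair, a ToString algorithm can call the escaping half for every token property whose parser side calls the unescaping half.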
AST Definition (compatible with Workflow)
class CLASS_NAME [: BASE_CLASS]
{
var FIELD_NAME : TYPE;
...
}
Configurations
- Include files in generated C++ header
- AST definition files that this one depends on
- Visitors selected to generate
- Optional reflection support
- All AST constructors are protected
- Generated factory class
- If AST object pool is enabled
- reflection is disabled
- `Ptr<T>` for all AST types is generated with an enumerated `Cast` function.
- Use generated RTTI constructions (e.g. enum class tag for type)
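A rough sketch of how the configuration items above could fit together in generated C++ follows. Everything here is hypothetical: `AstTag`, `AstFactory`, the two expression classes and the free function `Cast` are invented for illustration, and `std::shared_ptr` stands in for `Ptr<T>`. It shows protected constructors, a factory as the only way to create nodes, and a cast that checks an enum class tag instead of using built-in RTTI.

```cpp
#include <memory>

// Hypothetical generated tag: one enumerator per concrete AST class.
enum class AstTag { NumberExpr, BinaryExpr };

class Expr
{
protected:
    explicit Expr(AstTag tag) : tag(tag) {}
public:
    const AstTag tag;
    virtual ~Expr() = default;
};

class NumberExpr : public Expr
{
    friend class AstFactory;
protected:
    NumberExpr() : Expr(AstTag::NumberExpr) {}
public:
    static constexpr AstTag Tag = AstTag::NumberExpr;
    int value = 0;
};

class BinaryExpr : public Expr
{
    friend class AstFactory;
protected:
    BinaryExpr() : Expr(AstTag::BinaryExpr) {}
public:
    static constexpr AstTag Tag = AstTag::BinaryExpr;
    std::shared_ptr<Expr> left, right;
};

// Hypothetical generated factory: the only code allowed to call the constructors.
class AstFactory
{
public:
    template<typename T>
    std::shared_ptr<T> Create()
    {
        return std::shared_ptr<T>(new T());
    }
};

// Tag-based cast: returns null when the node is not exactly a T.
template<typename T>
std::shared_ptr<T> Cast(const std::shared_ptr<Expr>& node)
{
    if (node && node->tag == T::Tag) return std::static_pointer_cast<T>(node);
    return nullptr;
}
```

Routing all creation through a factory is what would let a pooled allocator be swapped in later without touching the code that builds nodes.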
Types
- `Token`: In the previous version, `Token` is a value type; now it is a reference type.
- `CLASS-NAME`: Another class.
- `TYPE[]`: Array, whose element is not allowed to be another array.
MISC
- Define a `ToString` algorithm with customizable configurations.
Lexical Analyzer
- Pair name with regular expressions.
- Extendable tokens.
- For example, recognize `R"[^\s(]\(` and invoke a callback function to determine the end of the string
- Pair a name with each token subset, and give a default name to the full token set
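The extendable-token example can be sketched as a callback that receives the input and the range matched by the regular-expression prefix, and returns where the token ends. The signature and the name `FindRawStringEnd` are assumptions for illustration only; the sketch follows the C++ raw string rule that the literal ends at `)` followed by the same delimiter and a closing quote.

```cpp
#include <cstddef>
#include <string>

// Hypothetical callback: [prefixBegin, prefixEnd) is the matched prefix R"delim(
// Returns the position one past the end of the token, or npos if unterminated.
std::size_t FindRawStringEnd(
    const std::string& input,
    std::size_t prefixBegin,
    std::size_t prefixEnd)
{
    // The delimiter sits between the leading R" and the trailing (.
    std::string delimiter = input.substr(prefixBegin + 2, prefixEnd - prefixBegin - 3);
    std::string terminator = ")" + delimiter + "\"";
    std::size_t found = input.find(terminator, prefixEnd);
    return found == std::string::npos ? found : found + terminator.size();
}
```

The lexer stays table-driven for ordinary tokens and only hands control to such a callback for tokens whose end cannot be described by a regular expression.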
Error Messages
- Generate error messages in C++ code
Syntax Analyzer
- Priority of loops:
- `+[ RULE ]` means if `RULE` succeeds, skipping `RULE` is not considered even if the rest doesn't parse.
- `-[ RULE ]` means the result of having `RULE` is kept only if skipping `RULE` makes the clause unable to parse.
- `[ RULE ]` means keep both results.
- `+{ RULE }`, `-{ RULE }`, `{ RULE }` are similar, but `{ RULE }` may generate more than two results, while the others only generate one result.
- Being able to change token subset during parsing.
- Being able to specify an error message when a certain action fails.
- Generate SAX-like parser, with a default handler to create AST.
- Generate each POOLED tuple struct type for
- Loop body. A delimited list is considered as [ITEM {DELIMITER ITEM}]
- Loop records a pointer to the reversed linked list of the last item during calculation
- Loop records an array of items as the result
- Alternative as `Union<Ts...>` storing `{TYPE-FLAG, ITEM*}` (value type)
- Optional as `Optional<T>` storing `{ITEM*}` (value type)
- Sequential as `{A, B, C ...}` with generated field names (value type)
- Type is rule or rule fragment, not the result AST type
- If there are multiple fields of the same type, their names are appended with an index of the position in the rule (optionals, alternatives and loops are packed as one)
- If a tuple is created directly from a rule, there will be a static field to indicate which rule it comes from
- Rule reference as `Reference<Ts...>` storing `{RULE, FRAGMENT, TYPE-FLAG, ITEM*}`
- A `Reference<Ts...>` is aliased
- Consider forward declarations
- All types have an un-templated partner so that the core SAX-like instruction execution doesn't need to know concrete types
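The "SAX-like parser, with a default handler to create AST" item can be pictured as follows. This is a minimal sketch under stated assumptions: the instruction set (`Begin`, `Token`, `End`), the `IParserHandler` interface and the untyped `Node` are all invented here, and a real generated handler would create typed AST classes instead.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical untyped node, standing in for generated AST classes.
struct Node
{
    std::string type;
    std::vector<std::pair<std::string, std::string>> tokens;             // field -> token text
    std::vector<std::pair<std::string, std::unique_ptr<Node>>> children; // field -> child node
};

// Hypothetical SAX-like callback interface.
class IParserHandler
{
public:
    virtual ~IParserHandler() = default;
    virtual void BeginObject(const std::string& type) = 0;
    virtual void TokenField(const std::string& field, const std::string& text) = 0;
    virtual void EndObject(const std::string& field) = 0; // field is ignored for the root
};

// Default handler: builds a tree from the callbacks using a stack.
class DefaultAstHandler : public IParserHandler
{
    std::vector<std::unique_ptr<Node>> stack;
public:
    std::unique_ptr<Node> root;

    void BeginObject(const std::string& type) override
    {
        stack.push_back(std::make_unique<Node>());
        stack.back()->type = type;
    }
    void TokenField(const std::string& field, const std::string& text) override
    {
        stack.back()->tokens.emplace_back(field, text);
    }
    void EndObject(const std::string& field) override
    {
        auto node = std::move(stack.back());
        stack.pop_back();
        if (stack.empty()) root = std::move(node);
        else stack.back()->children.emplace_back(field, std::move(node));
    }
};

enum class Op { Begin, Token, End };
struct Instruction { Op op; std::string a, b; };

// Replays an instruction stream against any handler.
void Execute(const std::vector<Instruction>& program, IParserHandler& handler)
{
    for (const auto& ins : program)
    {
        switch (ins.op)
        {
        case Op::Begin: handler.BeginObject(ins.a); break;
        case Op::Token: handler.TokenField(ins.a, ins.b); break;
        case Op::End:   handler.EndObject(ins.a); break;
        }
    }
}
```

Because `Execute` only knows the handler interface, the same instruction stream could drive a handler that builds pooled tuples, one that builds the AST, or one that only validates input.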
Supported EBNF
- TOKEN [`:PROPERTY-NAME`]
- RULE [`:PROPERTY-NAME`]
- Optional: `+[EBNF]`, `-[EBNF]`, `[EBNF]`
- Loop: `+{EBNF}`, `-{EBNF}`, `{EBNF}`
- `with` `{PROPERTY-ASSIGNMENT ...}`
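The three optional forms differ only in which candidate parses survive. The sketch below models a clause of the shape "optional RULE, then REST" with a literal `rule` and a predicate `rest`; the `Priority` enum, the function name and the `"take"`/`"skip"` labels are assumptions made for this illustration, with `PreferTake` standing for `+[EBNF]`, `PreferSkip` for `-[EBNF]` and `Equal` for `[EBNF]`.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Priority { PreferTake, PreferSkip, Equal };

// Returns which candidate parses survive: "take" (RULE consumed) and/or "skip".
std::vector<std::string> ParseOptionalThenRest(
    Priority priority,
    const std::string& rule,
    const std::function<bool(const std::string&)>& rest,
    const std::string& input)
{
    bool ruleSucceeds = input.compare(0, rule.size(), rule) == 0;
    bool takeParses = ruleSucceeds && rest(input.substr(rule.size()));
    bool skipParses = rest(input);

    std::vector<std::string> results;
    switch (priority)
    {
    case Priority::PreferTake:
        // Once RULE succeeds, skipping is never considered, even if the rest then fails.
        if (ruleSucceeds) { if (takeParses) results.push_back("take"); }
        else if (skipParses) results.push_back("skip");
        break;
    case Priority::PreferSkip:
        // Taking RULE is kept only when skipping cannot parse.
        if (skipParses) results.push_back("skip");
        else if (takeParses) results.push_back("take");
        break;
    case Priority::Equal:
        // Keep both results; this is where ambiguity comes from.
        if (takeParses) results.push_back("take");
        if (skipParses) results.push_back("skip");
        break;
    }
    return results;
}
```

For example, with `rule` set to `"a"` and `rest` accepting one or more `a`, the input `"a"` fails entirely under `PreferTake`, because the rule consumes the only `a` and skipping is not retried.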
EBNF Program
- RULE {`::=` CLAUSE `as` CLASS-NAME} `;`
- Consider a syntax here to switch token set
ToString Algorithm Requirements
- Every clause should create an AST node. `EXP ::= '(' !EXP ')'` is not allowed, except that this clause has only one node.
- Every rule-name node should be assigned to a property. Token nodes are optional but those properties will be auto-generated.
- Loops cannot be embedded in another loop. It doesn't limit the syntax, but it limits the shape of the AST.
Development Project Structure
- Original ParserGen code will be separated from Vlpp.
- Development Steps
- Symbols for ParserGen AST
- Manually: symbols -> `ParserGen AST declaration in C++`
- Manually: ParserGen Syntax described in `ParserGen AST declaration in C++` -> `ParserGen Parser declaration in C++`
- Integrate
- TODO: Reorganize unit test projects into pure unit test steps and code generation steps
- Code generation steps are also split into multiple projects
- Because some projects and their partner unit tests rely on code generated by the projects they depend on
- AstGen:
- Goal: given symbols, generate C++ code for the AST
- Produce (from unit test):
- Generated source code (AST part): declaration, visitors, builder, reflection for ParserGen input
- AST symbols and C++ code generation.
- Generate visitors.
- Generate easy builder.
- Generate reflection.
- Execution:
- Goal: given instructions, parse input text with SAX-like callbacks
- An instruction could generate multiple continuations
- Serialization for parser-generated automaton and instructions.
- Run the automaton on an input
- Determine a path on state transition from the input first, and then follow this path to execute instructions.
- Introduce a garbage-collectable memory allocation for path, could be ref counted.
- Execute instructions as a SAX-like parser, with notification on ambiguous node, error message generation and error recovering.
- If there is ambiguity, different callbacks could be called on the same position, and results could be discarded in the future execution.
- Compiler -> AstGen, Execution
- Goal: take input described using `Generated source code (AST part)` and generate instructions (text parser)
- Produce (from unit test)
- Generated C++ source code (parser part) for ParserGen input
- Take the `ParserGen AST declaration` and generate instructions.
- Generate the default handler to create AST for the SAX-like parser.
- ToString algorithm.
- Bidirectional binding between the AST and the text.
- ParserGen -> Compiler
- Goal: CLI Tool
- Integrate `Generated source code (AST part)`
- Integrate `Generated source code (parser part)`
- Handle command line arguments
- UnitTestAst:
- Unit test of AstGen building block and pool allocation etc.
- Production steps
- Hand-written `AST for ParserGen` symbols.
- Codegen symbols and get `Generated source code (AST part)` for ParserGen input
- UnitTestExecution:
- Unit test of Execution.
- Assert directly on SAX-like parser.
- UnitTestCompiler:
- Unit test of Compiler; inputs are all UnitTestExecution test cases rewritten using the generated easy-builder for the `AST for ParserGen` AST.
- Assert on the ToString-ed AST. (shared)
- Production steps
- Implement ParserGen AST Input syntax using the generated easy builder from `Generated source code (AST part)`
- Serialize instructions and get `C++ source code (parser part)` for ParserGen input
- UnitTestParserGen:
- Unit test of ParserGen; inputs are all UnitTestExecution test cases rewritten in text format.
- Assert on the ToString-ed AST. (shared)
- Generate C++ code from all parsers written in text format
- UnitTest:
- Link all cpp files in all other unit test projects so that all test cases can be run in one F5.
- Test all generated parsers in UnitTestParserGen.
- Assert on the ToString-ed AST. (shared)
- Since parsers are written in different ways for different unit test projects, they are stored separately from unit test projects to share necessary files.
- Do not really write a file if the generated content doesn't change.
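The last item, skipping the write when generated content is unchanged, can be sketched in a few lines. `WriteFileIfChanged` is a hypothetical helper name and error handling is deliberately omitted; the idea is only that leaving an identical file untouched keeps its timestamp, so build tools do not see it as modified.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if the file was (re)written,
// false if it already held exactly this content.
bool WriteFileIfChanged(const std::string& path, const std::string& content)
{
    {
        std::ifstream in(path, std::ios::binary);
        if (in)
        {
            std::ostringstream existing;
            existing << in.rdbuf();
            if (existing.str() == content) return false;
        }
    }
    // File is missing or different: overwrite it. Errors are not reported in this sketch.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << content;
    return true;
}
```

Each code generation step would call this instead of writing directly, so rerunning a step that produces the same output does not trigger a rebuild of dependent projects.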
Note
- Pooled data structure (tree/linked list) for intermediate parsing result before turning into AST
- Multiple ways of AST codegen (normal/reflection/pooled)
- Imperative & concurrent parsing and AST building instructions, potential to generate parser in different languages
- Serialized lexical table instead of storing regex
- VlppParser2
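The pooled intermediate structure mentioned in the first note, together with the earlier remark that a loop records a pointer to a reversed linked list and yields an array as its result, might look like the sketch below. `LoopItem`, `LoopItemPool` and `ToArray` are names invented for this example, and `std::deque` is used as a stand-in pool because it keeps element addresses stable as it grows.

```cpp
#include <algorithm>
#include <deque>
#include <utility>
#include <vector>

// One parsed loop item, linked back to the item parsed before it.
template<typename T>
struct LoopItem
{
    T value;
    const LoopItem<T>* previous;
};

// Hypothetical pool: items are never freed individually, only with the pool.
template<typename T>
class LoopItemPool
{
    std::deque<LoopItem<T>> storage; // push_back does not move existing elements
public:
    // Appends after `last` (null for the first item) and returns the new last item.
    const LoopItem<T>* Append(const LoopItem<T>* last, T value)
    {
        storage.push_back({ std::move(value), last });
        return &storage.back();
    }
};

// Walks the reversed list from the last item and returns the items in parse order.
template<typename T>
std::vector<T> ToArray(const LoopItem<T>* last)
{
    std::vector<T> items;
    for (auto current = last; current; current = current->previous)
        items.push_back(current->value);
    std::reverse(items.begin(), items.end());
    return items;
}
```

One plausible benefit of linking backwards is that two ambiguous continuations of the same loop can share their common prefix: each holds its own last-item pointer and nothing is copied until a result is finalized into an array.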