Writing Treesitter Parser
Treesitter CLI Util
AST & CST
- A tree that contains only named nodes is an abstract syntax tree (AST).
- A tree that contains both named and unnamed nodes is a concrete syntax tree (CST).
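The difference between the two views can be sketched in a few lines. This is a minimal illustration that assumes a hypothetical plain-object node shape (`type`, `named`, `children`); it is not the tree-sitter API. It prints the same tree once with unnamed nodes (CST) and once without them (AST).

```javascript
// Hypothetical node shape for illustration only; this is not the tree-sitter API.
// named: true  -> node produced by a named rule
// named: false -> node produced by a string literal inside a rule
const tree = {
  type: "binary_expression", named: true, children: [
    { type: "number", named: true, children: [] },
    { type: "+", named: false, children: [] },
    { type: "number", named: true, children: [] },
  ],
};

// Render a node as an S-expression. With includeUnnamed = false the
// unnamed nodes are skipped, which gives the AST view of the same tree.
function toSexp(node, includeUnnamed) {
  if (!node.named) return JSON.stringify(node.type);
  const parts = node.children
    .filter((child) => includeUnnamed || child.named)
    .map((child) => toSexp(child, includeUnnamed));
  return "(" + [node.type, ...parts].join(" ") + ")";
}

const cst = toSexp(tree, true);
const ast = toSexp(tree, false);
console.log(cst); // (binary_expression (number) "+" (number))
console.log(ast); // (binary_expression (number) (number))
```

Both strings describe the same parse; the AST view simply filters out the nodes that came from string literals.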
Grammar Structure
name
: name of the parser
rules
: rules for generating nodes
See the doc for more properties of a grammar.
Tree Structure
A Treesitter syntax tree (and its queries) is generally written as S-expressions, in the style of Scheme, a Lisp dialect. Each () opens a new list, and each element in the list is either a node type (named node) or a string literal (unnamed node). The node type is the name of the rule that matched that section of the source.
- field: an element may carry a field name, such as kind: "const", which gives the element (node) a descriptive name in its context.
- token: each atomic node is considered a token, such as (number) and (comment) in the example below.
NOTE
An unnamed node can also have a field name. The field name describes the node's role in the tree; it is not the nominal identity (type) of that node.
; generated tree for javascript code
; const foo = 1 + 2; // this is foo
(program ; [0, 0] - [1, 0]
  (lexical_declaration ; [0, 0] - [0, 18]
    kind: "const" ; [0, 0] - [0, 5]
    (variable_declarator ; [0, 6] - [0, 17]
      name: (identifier) ; [0, 6] - [0, 9]
      "=" ; [0, 10] - [0, 11]
      value: (binary_expression ; [0, 12] - [0, 17]
        left: (number) ; [0, 12] - [0, 13]
        operator: "+" ; [0, 14] - [0, 15]
        right: (number))) ; [0, 16] - [0, 17]
    ";") ; [0, 17] - [0, 18]
  (comment)) ; [0, 19] - [0, 33]
Writing Rules
- The top level rule: the most generic wrapper rule, covering all possible content to be parsed.
  - The top level rule MUST be the first rule property declared in the rules field.
  - The name of the top level rule can be arbitrary and usually follows the language specification. C#, for example, uses compilation_unit as the name of its top level rule.
module.exports = grammar({
  name: 'c_sharp',
  rules: {
    /*...*/
    compilation_unit: $ => seq( // must be the first rule
      optional($.shebang_directive),
      repeat($._top_level_item),
    ),
    _top_level_item: $ => prec(2, choice(
      $._top_level_item_no_statement,
      $.global_statement,
    )),
    /*...*/
  }
});
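Since the top level rule is identified by its position, the declaration order inside rules carries meaning. The sketch below uses a plain stand-in object rather than a real grammar (the rule bodies are placeholders, an assumption made for illustration) to show that JavaScript enumerates string keys in insertion order, so the first declared property can be recovered reliably.

```javascript
// Stand-in for the `rules` object of a grammar. The rule bodies are
// placeholders (assumption); only the declaration order matters here.
const rules = {
  compilation_unit: ($) => null,
  _top_level_item: ($) => null,
  global_statement: ($) => null,
};

// String keys are enumerated in insertion order, so the first declared
// property is also the first key.
const ruleNames = Object.keys(rules);
const topLevelRule = ruleNames[0];

console.log(topLevelRule); // compilation_unit
```

Moving compilation_unit further down the object would change which rule comes first, which is why the ordering requirement needs care when rules are reorganized.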
Named & Unnamed Nodes
A node generated by a rule that is assigned to a property of rules is called a named node. A node generated by a string literal or regex written inline in a rule is an unnamed node.
module.exports = grammar({
  name: 'foo',
  rules: {
    // "if", "(" and ")" are string literals, so they produce unnamed nodes
    if_statement: $ => seq("if", "(", $._expression, ")", $._statement),
  },
});
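To make the split visible, the following sketch replaces the grammar DSL with tiny stand-ins. `seq` and `$` here are simplified assumptions that only record their arguments; they are not the real tree-sitter implementation. The sketch separates the members of the if_statement rule into string literals (which yield unnamed nodes) and references to other rules.

```javascript
// Stand-ins for the grammar DSL (assumption: simplified, not the real
// implementation). They only record what was passed in.
const seq = (...members) => ({ kind: "seq", members });
const $ = new Proxy({}, {
  get: (_target, name) => ({ kind: "rule_reference", name }),
});

const ifStatement = seq("if", "(", $._expression, ")", $._statement);

// String literal members yield unnamed nodes; rule references point at
// rules declared as properties of `rules`.
const unnamed = ifStatement.members.filter((m) => typeof m === "string");
const referenced = ifStatement.members
  .filter((m) => typeof m !== "string")
  .map((m) => m.name);

console.log(unnamed);    // [ 'if', '(', ')' ]
console.log(referenced); // [ '_expression', '_statement' ]
```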
NOTE
Unnamed nodes are not shown in the printed Treesitter syntax tree by default, but they do exist in the structure and can be inspected. They simply have no rule name as their node type.
Aliased Rule
Tokenized Rule
token(rule) turns a complex rule into an atomic node: tree-sitter only matches the rule and does not generate a concrete sub-tree for that node. The following rule produces a comment as a single (comment) node in the concrete tree, without the unnamed nodes that matched the pattern.
module.exports = grammar({
  name: 'foo',
  rules: {
    /* ... */
    comment: _ => token(choice(
      seq('//', /[^\n\r]*/),
      seq(
        '/*',
        /[^*]*\*+([^/*][^*]*\*+)*/,
        '/',
      ),
    )),
  },
});
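The patterns inside the comment rule can be tried out with ordinary JavaScript regular expressions. This is only an approximation: it assumes that each seq(...) inside token(...) behaves like the concatenation of its parts, and it says nothing about how tree-sitter actually builds its lexer. `matchComment` is a hypothetical helper written for this sketch.

```javascript
// Plain-RegExp approximation of the two alternatives in the comment rule.
// Each seq(...) is written as one concatenated, start-anchored pattern.
const lineComment = /^\/\/[^\n\r]*/;
const blockComment = /^\/\*[^*]*\*+([^/*][^*]*\*+)*\//;

// Hypothetical helper: return the comment at the start of `source`,
// or null if the source does not start with a comment.
function matchComment(source) {
  const m = lineComment.exec(source) || blockComment.exec(source);
  return m ? m[0] : null;
}

const line = matchComment("// this is foo\nconst x = 1");
const block = matchComment("/* multi\n * line **/ rest");
const none = matchComment("const x = 1");

console.log(JSON.stringify(line));  // "// this is foo"
console.log(JSON.stringify(block)); // "/* multi\n * line **/"
console.log(none);                  // null
```

In both matching cases the whole comment comes back as one string, which mirrors how the tokenized rule yields one atomic (comment) node rather than separate nodes for the delimiters.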
Node Description
A field is a descriptive name for the role a node plays in a certain context.
The following rule gives a descriptive name to each child rule of the function node.
module.exports = grammar({
  name: 'foo',
  rules: {
    /* ... */
    function_definition: $ =>
      seq(
        "func",
        field("name", $.identifier),
        field("parameters", $.parameter_list),
        field("return_type", $._type),
        field("body", $.block),
      )
  },
});
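Field names pay off when a tree is consumed, because children can be addressed by role instead of by position. The sketch below assumes a hypothetical plain-object model of a parsed function_definition node; the node shape, the sample text, the `type_identifier` type, and the `childByField` helper are all made up for illustration and are not part of any tree-sitter binding.

```javascript
// Hypothetical plain-object model of a parsed function_definition node.
// Each child records the field name it was given by field(), if any.
const functionNode = {
  type: "function_definition",
  children: [
    { field: null, type: "func", text: "func" },
    { field: "name", type: "identifier", text: "add" },
    { field: "parameters", type: "parameter_list", text: "(a, b)" },
    { field: "return_type", type: "type_identifier", text: "int" },
    { field: "body", type: "block", text: "{ return a + b }" },
  ],
};

// Hypothetical helper: find a child by its field name instead of its index.
function childByField(node, fieldName) {
  return node.children.find((child) => child.field === fieldName) ?? null;
}

const nameNode = childByField(functionNode, "name");
const missing = childByField(functionNode, "receiver");

console.log(nameNode.text); // add
console.log(missing);       // null
```

Looking up "name" keeps working even if the rule later gains extra children before the identifier, whereas an index-based lookup would silently point at the wrong node.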