Skip to content

Commit b13f7f9

Browse files
committed
[pseudo] Document disambiguation design progress
Need to take a break from this, so write down where we got to. Differential Revision: https://reviews.llvm.org/D135696
1 parent b6da16f commit b13f7f9

File tree

1 file changed

+367
-0
lines changed

1 file changed

+367
-0
lines changed
Lines changed: 367 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,367 @@
1+
# Disambiguation
2+
3+
The C++ grammar is highly ambiguous, so the GLR parser produces a forest of
4+
parses, represented compactly by a DAG.
5+
A real C++ parser finds the correct parse through semantic analysis: mostly
6+
resolving names. But we can't do that, as we don't parse the headers.
7+
8+
Our disambiguation phase should take the parse forest, and choose a single parse
9+
tree that is most likely.
10+
It might **optionally** use some other hints (e.g. coding style, or what
11+
specific names tend to mean in this codebase).
12+
13+
There are some grammatical ambiguities that can be resolved without semantic
14+
analysis, e.g. whether `int <declarator>{}` is a function-definition.
15+
We eliminate these earlier e.g., with rule guards. By "disambiguation" we mean
16+
choosing between interpretations that we can't reject confidently and locally.
17+
18+
## Types of evidence
19+
20+
We have limited information to go on, and strive to use similar heuristics a
21+
human reader might.
22+
23+
### Likely and unlikely structure
24+
25+
In some cases, the shape of a particular interpretation is unlikely but not
26+
impossible. For example, the statement `x(a);` might:
27+
28+
- call a function `x` (likely)
29+
- construct a temporary of class type `x` (less likely)
30+
- define a variable `a` of type `x`, which is an alias for e.g. `int`
31+
(unlikely!)
32+
33+
We model this as a bonus/penalty for including a particular forest node in the
34+
chosen parse. For each rule we want to apply, we can write some code to
35+
recognize the corresponding pattern in the parse tree, and run these recognizers
36+
at each node to assign the bonuses.
37+
38+
### Interpreting names
39+
40+
Just as resolving names allows a C++ parser to choose the right parse (rejecting
41+
others), chunks of a parse tree imply things about how names resolve.
42+
43+
Because the same name often means the same thing in different contexts, we can
44+
apply knowledge from elsewhere. This can be as simple as knowing "`vector` is
45+
usually a type", and adding bonuses to nodes that include that interpretation.
46+
47+
However we can also transfer knowlegde across the same file we're parsing:
48+
49+
```cpp
50+
// Is `Builder` a class or a namespace?
51+
void Builder::build() { ... }
52+
// ...
53+
// `Builder` is a type.
54+
Builder b;
55+
```
56+
57+
We can use this to understand more-ambiguous code based on names in a section
58+
we're more sure about. It also pushes us to provide a consistent parse, rather
59+
than interpreting each occurrence of an unclear name differently.
60+
61+
Again, we can define bonuses/penalties for forest nodes that interpret names,
62+
but this time those bonuses change as we disambiguate. Specifically:
63+
64+
- we can group identifiers into **classes**, most importantly "all identifiers
65+
with text 'foo'" but also "all snake_case identifiers".
66+
- clusters of nodes immediately above the identifiers in the parse forest are
67+
**interpretations**, they bind the identifier to a **kind** such "type",
68+
"value", "namespace", "other".
69+
- for each class we can query information once from an external source (such as
70+
an index or hard-coded list), yielding a vector of weights (one per kind)
71+
- at each point we can compute metrics based on interpretations in the forest:
72+
- the number of identifiers in the class that are interpreted as each kind
73+
(e.g. *all* remaining interpretations of 'foo' at 3:7 are 'type')
74+
- the number of identifiers in the class that *may* be interpereted as each
75+
kind (e.g. 'foo' at 3:7 might be a 'type').
76+
- we can mash these metrics together into a vector of bonuses for a class (e.g.
77+
for identifiers with text 'bar', 'type'=>+5, 'namespace'=>+1, 'value'=>-2).
78+
- these bonuses are assigned to corresponding interpretations in the graph
79+
80+
### Templates
81+
82+
Another aspect of a name is whether it names a template (type or value). This
83+
is ambiguous in many more cases since CTAD allowed template arguments to be
84+
omitted.
85+
86+
A fairly simple heuristic appears sufficient here: things that look like
87+
templates usually are, so if a node for certain rules exists in the forest
88+
(e.g. `template-id := template-name < template-argument-list >`) then we treat
89+
the template name as a probable template, and apply a bonus to every node that
90+
interprets it that way. We do this even if alternate parses are possible
91+
(`a < b > :: c` might be a comparison, but is still evidence `a` is a template).
92+
93+
## Algorithm sketch
94+
95+
With static node scores, finding the best tree is a very tractable problem
96+
with an efficient solution.
97+
With dynamic scores it becomes murky and we have to settle for approximations.
98+
These build on the same idea, so we'll look at the simple version first.
99+
100+
### Naive version (static scores)
101+
102+
At a high level, we want to assign bonuses to nodes, and find the tree that
103+
maximizes the total score. If bonuses were fixed, independent of other
104+
disambiguation decisions, then we could simply walk bottom-up, aggregating
105+
scores and replacing each ambiguous node with the top-scoring alternative
106+
subtree. This could happen directly on the parse tree.
107+
108+
Given this tree as input:
109+
110+
```mermaid
111+
flowchart TB
112+
subgraph &nbsp;
113+
idA["a"]
114+
open["("]
115+
idB["b"]
116+
close[")"]
117+
semi[";"]
118+
end
119+
class idA,open,idB,close,semi token;
120+
121+
typeA["type := IDENTIFIER"] --- idA
122+
exprA["expr := IDENTIFIER"] --- idA
123+
exprB["expr := IDENTIFIER"] --- idB
124+
declB["declarator := IDENTIFIER"] --- idB
125+
stmtExpr --- semi
126+
stmtDecl --- semi
127+
128+
stmtAmbig["stmt?"]:::ambig
129+
stmtAmbig === stmtExpr["stmt := expr ;"]
130+
stmtExpr --- exprAmbig["expr?"]:::ambig
131+
exprAmbig === funcall["expr := expr ( expr )"]:::good
132+
funcall --- exprA
133+
funcall --- open
134+
funcall --- exprB["expr := IDENTIFIER"]
135+
funcall --- close
136+
exprAmbig -.- construct["expr := type ( expr )"]:::bad
137+
construct --- typeA
138+
construct --- open
139+
construct --- exprB
140+
construct --- close
141+
stmtAmbig -.- stmtDecl["stmt := decl"]
142+
stmtDecl --- decl["decl := type declarator ;"]
143+
decl --- typeA
144+
decl --- declParens["declarator := ( declarator )"]:::bad
145+
declParens --- open
146+
declParens --- declB
147+
declParens --- close
148+
149+
classDef ambig fill:blue,color:white;
150+
classDef token fill:yellow;
151+
classDef good fill:lightgreen
152+
classDef bad fill:pink
153+
```
154+
155+
A post-order traversal reaches the ambiguous node `expr?` first.
156+
The left alternative has a total score of +1 (green bonus for
157+
`expr := expr (expr)`) and the right alternative has a total bonus of -1
158+
(red penalty for `expr := type (expr)`). So we replace `expr?` with its left
159+
alternative.
160+
161+
As we continue traversing, we reach `stmt?` next: again we have +1 in the left
162+
subtree and -1 in the right subtree, so we pick the left one. Result:
163+
164+
```mermaid
165+
flowchart TB
166+
subgraph &nbsp;
167+
idA["a"]
168+
open["("]
169+
idB["b"]
170+
close[")"]
171+
semi[";"]
172+
end
173+
class idA,open,idB,close,semi token;
174+
175+
typeA["type := IDENTIFIER"] --- idA
176+
exprA["expr := IDENTIFIER"] --- idA
177+
exprB["expr := IDENTIFIER"] --- idB
178+
declB["declarator := IDENTIFIER"] --- idB
179+
stmtExpr --- semi
180+
stmtDecl --- semi
181+
182+
stmtExpr["stmt := expr ;"]
183+
stmtExpr --- funcall["expr := expr ( expr )"]
184+
funcall --- exprA
185+
funcall --- open
186+
funcall --- exprB["expr := IDENTIFIER"]
187+
funcall --- close
188+
189+
classDef token fill:yellow;
190+
```
191+
192+
### Degrees of freedom
193+
194+
We must traverse the DAG bottom-up in order to make score-based decisions:
195+
if an ambiguous node has ambiguous descendants then we can't calculate the score
196+
for that subtree.
197+
198+
This gives us a topological **partial** order, but we don't have to go from
199+
left-to-right. At any given point there is a "frontier" of ambiguous nodes with
200+
no ambiguous descendants. The sequence we choose matters: each choice adds
201+
more interpretations that should affect future choices.
202+
203+
Initially, most of the ambiguous nodes in the frontier will be e.g. "is this
204+
identifier a type or a value". If we had to work left-to-right then we'd
205+
immediately be forced to resolve the first name in the file, likely with
206+
little to go on and high chance of a mistake.
207+
But there are probably names where we have strong evidence, e.g. we've seen an
208+
(unambiguous) declaration of a variable `foo`, so other occurrences of `foo`
209+
are very likely to be values rather than types. We can disambiguate these with
210+
high confidence, and these choices are likely to "unlock" other conclusions that
211+
we can use for further disambiguation.
212+
213+
This is intuitively similar to how people decipher ambiguous code: they find a
214+
piece that's easy to understand, read that to learn what names mean, and use
215+
the knowledge gained to study the more difficult parts.
216+
217+
To formalize this a little:
218+
- we prioritize making the **highest confidence** decisions first
219+
- we define **confidence** as the score of the accepted alternative, minus the
220+
score of the best rejected alternative.
221+
222+
### Removing the bottom-up restriction
223+
224+
Strictly only resolving "frontier" ambiguities may cause problems.
225+
Consider the following example:
226+
227+
```mermaid
228+
flowchart TB
229+
subgraph &nbsp;
230+
a:::token
231+
b:::token
232+
end
233+
234+
ambig1:::ambig
235+
ambig1 --- choice1:::good
236+
ambig1 --- choice2
237+
choice1 --- ambig2:::ambig
238+
ambig2 --- aType["a is class"] --- a
239+
ambig2 --- aTemplate["a is CTAD"] --- a
240+
choice1 --- bValue["b is variable"] --- b
241+
242+
classDef ambig fill:blue,color:white;
243+
classDef token fill:yellow;
244+
classDef good fill:lightgreen
245+
classDef bad fill:pink
246+
```
247+
248+
We have some evidence that `choice1` is good. If we selected it, we would know
249+
that `b` is a variable and could use this in disambiguating the rest of the
250+
file. However we can't select `choice1` until we decide exactly how to interpret
251+
`a`, and there might be little info to do so. Gating higher-confidence decisions
252+
on lower-confidence ones increases our chance of making an error.
253+
254+
A possible fix would be to generalize to a **range** of possible scores for
255+
nodes above the frontier, and rank by **minimum confidence**, i.e. the highest
256+
min-score of the accepted alternative, minus the highest max-score among the
257+
rejected alternative.
258+
259+
## Details
260+
261+
The remaining challenges are mainly:
262+
- defining the score function for an alternative. This is TBD, pending
263+
experiments.
264+
- finding a data structure and algorithm to efficiently resolve/re-evaluate
265+
in a loop until we've resolved all ambiguities.
266+
267+
### Disambiguation DAG
268+
269+
Rather than operate on the forest directly, it's simpler to consider a reduced
270+
view that hides complexity unrelated to disambiguation:
271+
272+
**Forest:**
273+
274+
```mermaid
275+
flowchart TB
276+
subgraph &nbsp;
277+
open["{"]
278+
a
279+
star["*"]
280+
b
281+
semi[";"]
282+
close["}"]
283+
end
284+
class open,a,star,b,semi,close token
285+
286+
compound-stmt --- open
287+
compound-stmt --- stmt?
288+
compound-stmt --- close
289+
290+
stmt?:::ambig --- decl-stmt
291+
decl-stmt --- type-name
292+
type-name --a is type--- a
293+
decl-stmt --- declarator
294+
declarator --- star
295+
declarator --- declarator_b["declarator"]
296+
declarator_b --b is value--- b
297+
decl-stmt --- semi
298+
299+
stmt?:::ambig --- expr-stmt
300+
expr-stmt --- expr1["expr"]
301+
expr-stmt --- star
302+
expr-stmt --- expr_b["expr"]
303+
expr-stmt --- semi
304+
expr_a --a is value--- a
305+
expr_b --b is value--- b
306+
307+
classDef ambig fill:blue,color:white;
308+
classDef token fill:yellow;
309+
```
310+
311+
**Ambiguity graph:**
312+
313+
```mermaid
314+
flowchart TB
315+
subgraph &nbsp;
316+
a
317+
b
318+
end
319+
class a,b token
320+
321+
root --- stmt?
322+
323+
stmt?:::ambig --- decl-stmt
324+
decl-stmt --a is type--- a
325+
decl-stmt --b is value--- b
326+
327+
stmt?:::ambig --- expr-stmt
328+
expr-stmt --a is value--- a
329+
expr-stmt --b is value--- b
330+
331+
classDef ambig fill:blue,color:white;
332+
classDef token fill:yellow;
333+
```
334+
335+
Here the clusters of non-ambiguous forest nodes are grouped together, so that the DAG is bipartite with ambiguous/cluster nodes, and interpretation edges at the bottom.
336+
337+
Scoring the clusters and selecting which to include is equivalent to disambiguating the full graph.
338+
339+
### Resolving the ambiguity DAG
340+
341+
The static scores of the forest nodes are aggregated into static scores for the clusters.
342+
The interpretation edges of the frontier clusters can be scored based on the context available so far, and the scores "bubble up" to parent nodes, with ambiguous nodes creating score ranges as described above.
343+
344+
The main dynamic signal is when a token has been fully resolved, which happens when all the interpretations leading to it have the same label.
345+
346+
The naive algorithm is to score all clusters, choose the best to resolve and repeat.
347+
However this is very slow:
348+
- there are **many** ambiguities available at first, therefore many clusters to score
349+
- each time we resolve an ambiguity, we invalidate previously computed scores
350+
- while the clusters become fewer over time, there are more interpretations per cluster
351+
352+
It's tempting to use a priority queue to avoid repeatedly scanning clusters. However if we invalidate a large fraction of a heap's elements each round, we lose the efficiency benefits it brings.
353+
We could reuse scores if the resolved cluster doesn't tell us much about the target cluster.
354+
The simplest idea is to only recalculate clusters with an overlapping word, this may not save much (consider `std`) as clusters get larger.
355+
A heuristic to estimate how much a cluster affects another may help.
356+
357+
To stop the clusters having too many interpretation edges (and thus take too long to score), we can drop the edges for any token that is fully resolved. We need to track these anyway (for scoring of interpretations of other identifiers with the same text). And once only a single interpretation exists, removing it has no impact on scores.
358+
359+
So for now the sketch is:
360+
- build the ambiguity DAG
361+
- compute scores for all clusters
362+
- place confidences (score difference) for each cluster in a priority queue
363+
- while there is still ambiguity:
364+
- take the most confident cluster C and resolve it
365+
- propagate the score change to all of C's ancestors
366+
- work out which identifiers are now resolved, record that and remove the interpretations from the graph
367+
- recompute scores for the K clusters most affected by resolving those identifiers, and their ancestors

0 commit comments

Comments
 (0)