Skip to content

Add a flag to dump the AST as JSON. (Second version) #78463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 24, 2025

Conversation

allevato
Copy link
Member

@allevato allevato commented Jan 7, 2025

This is a second attempt (first one is at #77321), based on feedback I was given that introducing a second distinct AST dumper would make evolution of the compiler's AST internals more difficult (I can't disagree with that).

So this version takes the existing ASTDumper and refactors it as needed to support emitting structured JSON, while otherwise being an NFC for the default format. This was tricky in some places—the primary issue being that the default format allows value labels to be elided for simplicity, so I had to introduce a notion of "always vs. optional labels" that the default formatter could ignore but that JSON always had to print.

I've tried to use the commit history here to separate the refactoring changes from the other functional changes to make it easier to review.


Original PR description

The new -dump-ast-format flag can be supplied when invoking the compiler in -dump-ast mode. Valid values for this flag are default (the default pseudo-S-expression format), json, and json-zlib (compression is actually quite significant for these; I've measured anywhere between 4–10x savings depending on the content).

Motivation

This is meant to be used by clients who want to do large-scale semantic analysis of type-checked Swift code. It is written with the understanding that there are no guarantees of stability for the semantic AST in the compiler, so clients must read the version information encoded in the JSON output and adjust their behavior accordingly.

Why not use other solutions, like indexstore? We do! But some of the information we're interested in (e.g., type information, generic substitutions at declref usage sites, etc.) is not available in indexstore, nor is everything we need appropriate to add to indexstore in the first place. Some type information is available from SourceKit, but not often in the format that we need, and since SourceKit is intended for IDE use cases, it is not optimized for making large numbers of large requests. Having a JSON AST dump post-build frees us up to do analysis anywhere (e.g., on a machine that doesn't even have SourceKit or the same exact version of Xcode).

@hamishknight
Copy link
Contributor

FYI I am slowly but surely working my way through reviewing this

Copy link
Contributor

@hamishknight hamishknight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I definitely prefer this approach over the previous one

Comment on lines +4147 to +4225
if (Writer.isParsable()) {
printComponents();
} else {
// `printList` in default mode simply indents, so we need to preserve the
// additional level of `(components ...)` for that mode.
printRecArbitrary([&](Label label) {
printHead("components", ExprColor, label);
printComponents();
printFoot();
}, Label::optional("components"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this use printRecRange?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but I'd like to remove printRecRange in a follow-up, so my low-risk option for this PR was to retain the existing behavior as closely as possible.

Copy link
Member Author

@allevato allevato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough feedback! I believe I've addressed everything (aside from some minor things that I'd like to hold as follow-ups because they might cause more churn, like removing printRecRange and changing interface type to interface_type).

Comment on lines +232 to +241
if (type->hasArchetype()) {
// We can't generate USRs for types that contain archetypes. Replace them
// with their interface types.
type = type.transformRec([&](TypeBase *t) -> std::optional<Type> {
if (auto AT = dyn_cast<ArchetypeType>(t)) {
return AT->getInterfaceType();
}
return std::nullopt;
});
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely hit some limitations during type mangling/USR generation where some nested packs would cause a segfault (I have a FIXME marked in the no-crash test), which sounds a lot like what you're describing. I'm definitely interested in fixing those to make sure we don't hit them in production use; thanks for the pointer, I can try to take a closer look as a follow-up.

Comment on lines 2409 to 2418
if (Writer.isParsable() && FD->hasInterfaceType()) {
printTypeField(FD->getResultInterfaceType(), Label::always("result"));
if (auto opaque = FD->getOpaqueResultTypeDecl()) {
printRec(opaque, "opaque_result_decl");
printRec(opaque, Label::always("opaque_result_decl"));
}
} else if (FD->getResultTypeRepr()) {
printRec(FD->getResultTypeRepr(), Label::always("result"));
if (auto opaque = FD->getOpaqueResultTypeDecl()) {
printRec(opaque, Label::always("opaque_result_decl"));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added the getCached* functions for these and a helper function in the dumper to print either the Type or TypeRepr based on which is present.

Copy link
Contributor

@hamishknight hamishknight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

allevato and others added 10 commits January 22, 2025 14:23
ASTDumper allows nodes/values to be printed without labels, which
works fine for the default output but won't work for JSON when every
value needs to have a valid key. To balance these needs, we create
a `Label` abstraction that can be created as either `always` or
`optional`. All the current labels are treated as `always`, and all
other values have had `optional` labels associated with them, which
won't be printed in the default AST dump.

This change also adds a `printList` function that replaces anywhere
that `for` loops with `printRec` calls in their bodies. This will
be used to provide the necessary array structuring for JSON output
later. (There are some places where `for` loops call `printFlag`
which will need to be dealt with later.)
Availability specs are currently dumped by writing a pre-formatted
string directly to the output stream instead of using the structured
primitives in `PrintBase`. This needs to be fixed before we can
introduce the generalized writers.

The format here is meant to be the same as the format printed by
the original methods, but some colors may differ slightly when
dumping to the terminal.
This makes the low-level `print*` methods in `PrintBase` simply
forward to the writer, which does the actual work. Right now,
there's only the abstract base class for the writer and the
default (S-expression-like) writer. A later commit will
introduce the JSON writer.

Also update the `printField*` methods to take `Label`s instead
of `StringRef`s.
This only takes the existing AST information and writes it as JSON
instead of S-expressions. Since many of these fields are stringified,
they're not ideal for the kind of analysis clients of the JSON format
would want to do. A future commit will update these values to use a
more structured representation.
Specifically, this means `-dump-ast-format json` is
incompatible with `-dump-parse`. This is because the JSON
format is meant to export more details about the AST that
require type checking to have been performed.

I'm open to lifting this restriction in the future.
Also improve the output for thrown error destinations in
parsable modes.
@allevato
Copy link
Member Author

@swift-ci please smoke test

@allevato
Copy link
Member Author

@swift-ci please smoke test

@allevato allevato merged commit 9907afa into swiftlang:main Jan 24, 2025
3 checks passed
@allevato allevato deleted the json-ast-new branch January 25, 2025 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants