Add a flag to dump the AST as JSON. #77321

Closed
wants to merge 3 commits into from

allevato
Member

The new -dump-ast-format flag can be supplied when invoking the compiler in -dump-ast mode. Valid values for this flag are default (the default pseudo-S-expression format), json, and json-zlib (compression is actually quite significant for these; I've measured anywhere between 4–10x savings depending on the content).

Motivation

This is meant to be used by clients who want to do large-scale semantic analysis of type-checked Swift code. It is written with the understanding that there are no guarantees of stability for the semantic AST in the compiler, so clients must read the version information encoded in the JSON output and adjust their behavior accordingly. I certainly do not want this to hamper evolution of the compiler internals, so this doesn't make any promises that -dump-ast doesn't already.
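A minimal sketch of the client-side version check described above, written in Python. The key name `version`, the set of supported versions, and the assumption that a `json-zlib` dump is a plain zlib stream are all hypothetical; the description only says that version information is encoded in the output.

```python
import json
import zlib

# Hypothetical: the dump format versions this tool knows how to read.
SUPPORTED_VERSIONS = {1}

def load_ast_dump(data: bytes) -> dict:
    """Parse a dump that is either plain JSON or zlib-compressed JSON."""
    try:
        # Assumption: a compressed dump is a standard zlib stream.
        text = zlib.decompress(data).decode("utf-8")
    except zlib.error:
        # Not a zlib stream, so treat the bytes as plain JSON text.
        text = data.decode("utf-8")
    root = json.loads(text)
    if not isinstance(root, dict):
        raise ValueError("expected a JSON object at the top level")
    # "version" is an assumed key name, not one confirmed by the PR.
    version = root.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported AST dump version: {version!r}")
    return root
```

Failing loudly on an unknown version is one way to keep format drift between compiler releases from silently producing wrong analysis results.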

I've implemented this as a totally separate type rather than trying to integrate this as an alternative printer for the existing ASTDumper class. I looked at that approach first, especially because there's a comment indicating that the design goal of ASTDumper was to allow other printers to be substituted in, but making that work would be a much more significant refactoring, and the data I'm outputting here is different enough in content and format from ASTDumper that it really needs to be its own separate thing.

Why not use other solutions, like indexstore? We do! But some of the information we're interested in (e.g., type information, generic substitutions at declref usage sites, etc.) is not available in indexstore, nor is everything we need appropriate to add to indexstore in the first place. Some type information is available from SourceKit, but not often in the format that we need, and since SourceKit is intended for IDE use cases, it is not optimized for making large numbers of large requests. Having a JSON AST dump post-build frees us up to do analysis anywhere (e.g., on a machine that doesn't even have SourceKit or the same exact version of Xcode).

The new `-dump-ast-format` flag can be supplied when invoking the compiler
in `-dump-ast` mode. Valid values for this flag are `default` (the default
pseudo-S-expression format) and `json`.

The JSON AST dumps get quite large, even when they're not
pretty-printed. Since they're just text with a large number of
repeated sequences, they compress extremely well with LLVM's
built-in zlib compression.
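
To illustrate why text with many repeated sequences compresses well, here is a small Python sketch using the standard `zlib` module on synthetic JSON. The node shape and key names are made up for the illustration and are not the dumper's actual schema, so the ratio it prints says nothing about real dumps.

```python
import json
import zlib

# Synthetic stand-in for a dump: many nodes sharing the same keys and
# near-identical values. All key names and values are placeholders.
nodes = [
    {"kind": "declref_expr", "type": "placeholder_type", "line": i, "implicit": False}
    for i in range(5000)
]
raw = json.dumps({"children": nodes}, separators=(",", ":")).encode("utf-8")
packed = zlib.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, "
      f"ratio: {len(raw) / len(packed):.1f}x")
```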
@allevato
Member Author

@swift-ci please smoke test

@allevato
Member Author

@swift-ci please smoke test

@ahoppen
Member

ahoppen commented Nov 4, 2024

Do you have an example of the kind of analysis you could do with this information? I think that would make it easier to understand the purpose of this printing mode.

@allevato
Member Author

allevato commented Nov 5, 2024

> Do you have an example of the kind of analysis you could do with this information? I think that would make it easier to understand the purpose of this printing mode.

The tl;dr would be "think clang-tidy, but for Swift". clang-tidy is essentially a custom tool that walks the actual AST in memory and diagnoses issues. We don't have that kind of API in Swift yet, and binary module versioning issues would likely make it extremely difficult to build one, because your tools would have to be built against the exact version of the compiler that you're using (and in our case, we need to support multiple versions of Xcode simultaneously). So, having the compiler dump the type-checked AST is the next best thing (and in fact is actually somewhat better, because we can move the analysis to distributed machines that don't necessarily have Xcode installed).

The current -dump-ast flag has more of the data that we'd like to use, but it suffers from things like being not easily machine-parsable, and expressing type names in human-readable format rather than something we can work with in analysis, like a mangled name/USR.

We've managed to get quite far with just indexstore data, but not all of the data that we want to access is encoded in indexstore (nor should it be, given its purpose as a code navigation database). Off the top of my head, we're missing finely grained type information at arbitrary locations in the AST, like the substitution map used for concrete declrefs when invoking generic functions, and AnyObject dynamic dispatch has caused problems.

We've tried to avoid using SourceKit for a few reasons: it forces us into doing our analysis on machines with specific Xcode versions instead of distributing it anywhere, it's slower, and it's fiddly to get the invocations working right—raw PCMs vs. obj PCMs, relative path issues since it's running as a service, etc.

The other thing that indexstore doesn't give us is the "shape" of some of the references. There are some high-level relations encoded between references, but basic things like "is this expression a parameter to this function call", "is this call site using the default value of an argument", etc. are much easier to determine when we can walk the AST directly. Likewise, concurrency-related information doesn't show up in any existing data sources.
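
As a rough illustration of the "is this call site using the default value of an argument" question, here is a Python sketch that walks a parsed JSON tree. The node layout and the names `kind`, `args`, `callee`, `call_expr`, and `default_argument_expr` are hypothetical placeholders, not the schema this PR emits.

```python
from typing import Any, Iterator, List

def walk(node: Any) -> Iterator[dict]:
    """Yield every dict node in a nested JSON structure, depth first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

def calls_using_default_args(root: dict) -> List[dict]:
    """Return call nodes that have at least one defaulted argument."""
    hits = []
    for node in walk(root):
        # Hypothetical node kind for a function call.
        if node.get("kind") != "call_expr":
            continue
        args = node.get("args", [])
        # Hypothetical node kind marking an argument filled in by its default.
        if any(isinstance(a, dict) and a.get("kind") == "default_argument_expr"
               for a in args):
            hits.append(node)
    return hits

# Hand-written sample tree in the assumed layout.
sample = {
    "kind": "source_file",
    "children": [
        {"kind": "call_expr", "callee": "f",
         "args": [{"kind": "integer_literal_expr"},
                  {"kind": "default_argument_expr"}]},
        {"kind": "call_expr", "callee": "g",
         "args": [{"kind": "integer_literal_expr"}]},
    ],
}
print([call["callee"] for call in calls_using_default_args(sample)])  # prints ['f']
```

The walker is schema-agnostic, so only the two kind checks would need to change to match whatever names a real dump uses.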

Even in situations where we could add information that we want to indexstore (we've merged some PRs for that recently), having data from the raw AST in some format would be a good fallback option when we need it, because as folks add new features to the language, they sometimes forget to make sure that indexstore also gets updated. (Of course, the same thing could happen to this AST dumper, but I plan to make efforts to make sure it stays up-to-date and have designed it so that it's straightforward to do so.)

@allevato
Member Author

allevato commented Nov 5, 2024

There's also precedent for this in Clang, which has both the debug-output-style AST dumper and a separate JSON AST dumper.

I do think that one day it may be possible to converge ASTDumper and JSONASTDumper into a common shared framework, but I didn't have much success refactoring ASTDumper into something more general without disturbing a lot more of the compiler internals, so I don't know if that's the best path to that goal.

@CodaFi
Contributor

CodaFi commented Dec 4, 2024

While I'll admit there are quite a few holes in the story of semantically analyzing Swift from outside of the compiler, I'm still generally very skeptical of using supplementary outputs - especially textual ones - as the transfer format for this kind of information. While acknowledging that there are always engineering tradeoffs (as you have noted above) in any approach, I think the cost to the compiler, and to the tools that result from consuming this data, is still too high. I recall that this is not the first time folks have tried to rely on a textual dump of the AST to build tooling on top. There, as now, without stability guarantees your tooling is still tied to one or more compiler releases. Rather than the evolution of the underlying transfer format being baked into the design of the tooling via API contract, it is instead largely implicit, and small fluctuations in the format because of e.g. the addition or subtraction of AST information become compatibility hazards that build up over time.

I would very much prefer these services to be offered to clients via the LLVM-y promise of "compiler as libraries". From that point of view, SourceKit is still not the correct abstraction for a client such as yourself, but neither is an AST dump. My strong preference is for the project to continue to evolve towards offering Sema as a library to clients a la SwiftSyntax. But that preference ignores the fact that this PR has the ultimate engineering advantage: It exists, and it works.

All of that said, this is a fantastic amount of work that you've done to solve a real need. I don't wish to detract from that. But I would also definitely encourage you to continue extending IndexStore, SwiftSyntax, and SourceKit(-LSP) when and if this merges.

@allevato
Member Author

allevato commented Dec 4, 2024

@CodaFi, I agree wholeheartedly with everything you've said. Sema-as-a-library would be the ideal end goal here. Even if the library changed in API-breaking ways between versions (just as SwiftSyntax does), source compatibility (of the source being compiled) would let us continuously upgrade and use the single latest version in our tool regardless of the underlying compiler our users were using. It would be particularly interesting to see how such a tool would perform when analyzing multiple modules expressed entirely as source code in a single pass, rather than precompiling the dependencies.

Unfortunately, even in a world where we can imagine Sema-as-a-library, things start to get complicated when C/Obj-C/C++ interop enters the picture. We would still need a way to type-check declarations imported from C modules, so I guess Sema-as-a-library would depend on ClangImporter-as-a-library, which would depend on... something still built from the majority of the Clang codebase. At least, without being more imaginative about ways to get the information about C declarations out of Clang into a form that the Swift compiler can use.

There are also some logistical limitations thanks to the introduction of macros in Apple's SDKs. Those are distributed as pre-built dylibs, which we don't have the source code for. Even a Sema-as-a-library implementation would have to load those, which means we'd be limited to doing our analysis entirely on macOS hardware. At least with supplementary output files, we can move that analysis anywhere we want once we've gotten the outputs from the compiler.

I do hope we get to see a further realized library-based model in the future, but I'm imagining that it's multiple years away at the least. I do want to contribute to those efforts as they arise, but in the meantime, if there is anything I can do to land this so we can move forward with our work sooner than that, I'm happy to accommodate.

@allevato
Member Author

allevato commented Jan 7, 2025

I've posted a rewrite of this over at #78463, which builds it on top of the existing ASTDumper instead of introducing a totally distinct JSON printer.

@allevato
Member Author

Closing out this version of the feature implementation since I'm fairly certain #78463 will be the one that goes forward.

allevato closed this Jan 16, 2025
allevato added a commit to allevato/swift-driver that referenced this pull request Jan 24, 2025
allevato added a commit to allevato/swift-driver that referenced this pull request Jan 24, 2025
allevato added a commit to allevato/swift-driver that referenced this pull request Feb 10, 2025
allevato added a commit to allevato/swift-driver that referenced this pull request Feb 11, 2025
artemcm pushed a commit to swiftlang/swift-driver that referenced this pull request Feb 11, 2025