Add a flag to dump the AST as JSON. #77321

allevato · 2024-10-31T13:49:42Z

The new -dump-ast-format flag can be supplied when invoking the compiler in -dump-ast mode. Valid values for this flag are default (the default pseudo-S-expression format), json, and json-zlib (compression is actually quite significant for these; I've measured anywhere between 4–10x savings depending on the content).

Motivation

This is meant to be used by clients who want to do large-scale semantic analysis of type-checked Swift code. It is written with the understanding that there are no guarantees of stability for the semantic AST in the compiler, so clients must read the version information encoded in the JSON output and adjust their behavior accordingly. I certainly do not want this to hamper evolution of the compiler internals, so this doesn't make any promises that -dump-ast doesn't already.
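A minimal sketch of the version check a client might perform before trusting the rest of a dump. The key name `version`, the sample document, and the set of supported versions are all assumptions made for illustration; the real key and values are whatever the compiler actually emits.

```python
import json

# Hypothetical: the dump versions this tool was written and tested against.
SUPPORTED_VERSIONS = {1}


def load_ast_dump(text, version_key="version"):
    """Parse a JSON AST dump and refuse versions this tool does not understand.

    `version_key` is a placeholder name, not the compiler's documented key.
    """
    root = json.loads(text)
    version = root.get(version_key)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported AST dump version: {version!r}")
    return root


# Made-up sample document, not real compiler output.
sample = '{"version": 1, "kind": "source_file", "items": []}'
root = load_ast_dump(sample)
```

Failing loudly on an unknown version keeps a tool from silently misreading a dump whose structure has changed between compiler releases.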

I've implemented this as a totally separate type rather than trying to integrate this as an alternative printer for the existing ASTDumper class. I looked at that approach first, especially because there's a comment indicating that the design goal of ASTDumper was to allow other printers to be substituted in, but making that work would be a much more significant refactoring, and the data I'm outputting here is different enough in content and format from ASTDumper that it really needs to be its own separate thing.

Why not use other solutions, like indexstore? We do! But some of the information we're interested in (e.g., type information, generic substitutions at declref usage sites, etc.) is not available in indexstore, nor is everything we need appropriate to add to indexstore in the first place. Some type information is available from SourceKit, but not often in the format that we need, and since SourceKit is intended for IDE use cases, it is not optimized for making large numbers of large requests. Having a JSON AST dump post-build frees us up to do analysis anywhere (e.g., on a machine that doesn't even have SourceKit or the same exact version of Xcode).

The new `-dump-ast-format` flag can be supplied when invoking the compiler in `-dump-ast` mode. Valid values for this flag are `default` (the default pseudo-S-expression format) and `json`. This is meant to be used by clients who want to do large-scale semantic analysis of type-checked Swift code. It is written with the understanding that there are no guarantees of stability for the semantic AST in the compiler, so clients must read the version information encoded in the JSON output and adjust their behavior accordingly.

The JSON AST dumps get quite large, even when they're not pretty-printed. Since they're just text with a large number of repeated sequences, they compress extremely well with LLVM's built-in zlib compression.
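To illustrate why repetitive JSON text compresses so well, here is a small Python sketch using the standard `zlib` module. The node shape is made up purely to produce repeated key names, and nothing here reads real compiler output. Whether a `json-zlib` dump can be decoded as a bare zlib stream like this is an assumption that would need checking against the implementation.

```python
import json
import zlib

# Hypothetical node shape, used only to generate repetitive JSON text.
nodes = [
    {"kind": "call_expr", "type": "$sSi", "range": {"start": i, "end": i + 4}}
    for i in range(2000)
]

# Compact (non-pretty-printed) JSON, as bytes.
raw = json.dumps(nodes, separators=(",", ":")).encode("utf-8")

# Compress at the highest level, then confirm the round trip is lossless.
packed = zlib.compress(raw, 9)
restored = zlib.decompress(packed)

ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes ({ratio:.1f}x smaller)")
```

The exact ratio depends on the content, so this does not reproduce the figures quoted for real dumps.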

allevato · 2024-10-31T14:40:28Z

@swift-ci please smoke test

Companion PR to swiftlang/swift#77321.

allevato · 2024-10-31T20:05:33Z

@swift-ci please smoke test

ahoppen · 2024-11-04T18:19:32Z

Do you have an example of the kind of analysis you could do with this information? I think that would make it easier to understand the purpose of this printing mode.

allevato · 2024-11-05T15:30:52Z

> Do you have an example of the kind of analysis you could do with this information? I think that would make it easier to understand the purpose of this printing mode.

The tl;dr would be "think clang-tidy, but for Swift". clang-tidy is essentially a custom tool that walks the actual in-memory AST and diagnoses issues. We don't have that kind of API in Swift yet, and binary module versioning issues would likely make such a tool extremely difficult to build, because it would have to be built against the exact version of the compiler that you're using (and in our case, we need to support multiple versions of Xcode simultaneously). So, having the compiler dump the type-checked AST is the next best thing (and in fact is actually somewhat better, because we can move the analysis to distributed machines that don't necessarily have Xcode installed).

The current -dump-ast flag has more of the data that we'd like to use, but its output is not easily machine-parsable, and it expresses type names in a human-readable format rather than something we can work with in analysis, like a mangled name/USR.

We've managed to get quite far with just indexstore data, but not all of the data that we want to access is encoded in indexstore (nor should it be, given its purpose as a code navigation database). Off the top of my head, we're missing finely grained type information at arbitrary locations in the AST, like the substitution map used for concrete declrefs when invoking generic functions, and AnyObject dynamic dispatch has caused problems.

We've tried to avoid using SourceKit for a few reasons: it forces us into doing our analysis on machines with specific Xcode versions instead of distributing it anywhere, it's slower, and it's fiddly to get the invocations working right—raw PCMs vs. obj PCMs, relative path issues since it's running as a service, etc.

The other thing that indexstore doesn't give us is the "shape" of some of the references. There are some high-level relations encoded between references, but basic things like "is this expression a parameter to this function call", "is this call site using the default value of an argument", etc. are much easier to determine when we can walk the AST directly. Likewise, concurrency-related information doesn't show up in any existing data sources.
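As a rough sketch of what "walking the AST directly" could look like once the dump is parsed, the Python below finds call sites that rely on a default argument. The schema is entirely hypothetical: the keys `kind` and `args` and the node kinds `call_expr` and `default_argument_expr` are assumed for illustration and are not taken from the dumper's actual output.

```python
def walk(node):
    """Yield every dict node in a nested JSON tree, depth first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def calls_using_default_args(root):
    """Return call nodes with at least one defaulted argument (assumed schema)."""
    hits = []
    for node in walk(root):
        if node.get("kind") != "call_expr":
            continue
        args = node.get("args", [])
        if any(isinstance(a, dict) and a.get("kind") == "default_argument_expr"
               for a in args):
            hits.append(node)
    return hits


# Made-up tree: one call passes a literal, the other relies on a default.
tree = {
    "kind": "source_file",
    "items": [
        {"kind": "call_expr", "callee": "f",
         "args": [{"kind": "integer_literal_expr", "value": "1"}]},
        {"kind": "call_expr", "callee": "g",
         "args": [{"kind": "default_argument_expr"}]},
    ],
}
found = calls_using_default_args(tree)
```

Here `found` contains only the call to `g`. A real tool would key off whatever node kinds and field names the dump actually uses for the compiler version in question.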

Even in situations where we could add information that we want to indexstore (we've merged some PRs for that recently), having data from the raw AST in some format would be a good fallback option when we need it, because as folks add new features to the language, they sometimes forget to make sure that indexstore also gets updated. (Of course, the same thing could happen to this AST dumper, but I plan to make efforts to make sure it stays up-to-date and have designed it so that it's straightforward to do so.)

allevato · 2024-11-05T15:43:59Z

There's also precedent for this in Clang, which has both the debug-output-style AST dumper and a separate JSON AST dumper.

I do think that one day it may be possible to converge ASTDumper and JSONASTDumper into a common shared framework, but I didn't have much success refactoring ASTDumper into something more general without disturbing a lot more of the compiler internals, so I don't know if that's the best path to that goal.

CodaFi · 2024-12-04T21:11:17Z

While I'll admit there are quite a few holes in the story of semantically analyzing Swift from outside of the compiler, I'm still generally very skeptical of using supplementary outputs - especially textual ones - as the transfer format for this kind of information. While acknowledging that there are always engineering tradeoffs (as you have noted above) in any approach, I think the cost to the compiler, and to the tools that result from consuming this data, is still too high. I recall that this is not the first time folks have tried to rely on a textual dump of the AST to build tooling on top. There, as now, without stability guarantees your tooling is still tied to one or more compiler releases. Rather than the evolution of the underlying transfer format being baked into the design of the tooling via API contract, it is instead largely implicit, and small fluctuations in the format because of e.g. the addition or subtraction of AST information become compatibility hazards that build up over time.

I would very much prefer these services to be offered to clients via the LLVM-y promise of "compiler as libraries". From that point of view, SourceKit is still not the correct abstraction for a client such as yourself, but neither is an AST dump. My strong preference is for the project to continue to evolve towards offering Sema as a library to clients a la SwiftSyntax. But that preference ignores the fact that this PR has the ultimate engineering advantage: It exists, and it works.

All of that said, this is a fantastic amount of work that you've done to solve a real need. I don't wish to detract from that. But I would also definitely encourage you to continue extending IndexStore, SwiftSyntax, and SourceKit(-LSP) when and if this merges.

allevato · 2024-12-04T21:46:27Z

@CodaFi , I agree wholeheartedly with everything you've said. Sema-as-a-library would be the ideal end goal here. Even if the library changed in API-breaking ways between versions (just as SwiftSyntax does), source compatibility (of the source being compiled) would let us continuously upgrade and use the single latest version in our tool regardless of the underlying compiler our users were using. It would be particularly interesting to see how such a tool would perform when analyzing multiple modules expressed entirely as source code in a single pass, rather than precompiling the dependencies.

Unfortunately, even in a world where we can imagine Sema-as-a-library, things start to get complicated when C/Obj-C/C++ interop enters the picture. We would still need a way to type-check declarations imported from C modules, so I guess Sema-as-a-library would depend on ClangImporter-as-a-library, which would depend on... something still built from the majority of the Clang codebase. At least, without being more imaginative about ways to get the information about C declarations out of Clang into a form that the Swift compiler can use.

There are also some logistical limitations thanks to the introduction of macros in Apple's SDKs. Those are distributed as pre-built dylibs, which we don't have the source code for. Even a Sema-as-a-library implementation would have to load those, which means we'd be limited to doing our analysis entirely on macOS hardware. At least with supplementary output files, we can move that analysis anywhere we want once we've gotten the outputs from the compiler.

I do hope we get to see a further realized library-based model in the future, but I'm imagining that it's multiple years away at the least. I do want to contribute to those efforts as they arise, but in the meantime, if there is anything I can do to land this so we can move forward with our work sooner than that, I'm happy to accommodate.

allevato · 2025-01-07T13:54:23Z

I've posted a rewrite of this over at #78463, which builds it on top of the existing ASTDumper instead of introducing a totally distinct JSON printer.

allevato · 2025-01-16T15:18:12Z

Closing out this version of the feature implementation since I'm fairly certain #78463 will be the one that goes forward.

Companion PR to swiftlang/swift#77321.

allevato requested review from artemcm, tshortli, hborla, slavapestov and xedin as code owners October 31, 2024 13:49

Add json-zlib as a -dump-ast-format option.

6b5ab1a

The JSON AST dumps get quite large, even when they're not pretty-printed. Since they're just text with a large number of repeated sequences, they compress extremely well with LLVM's built-in zlib compression.

allevato force-pushed the json-ast branch from ae66627 to 6b5ab1a Compare October 31, 2024 13:57

allevato mentioned this pull request Oct 31, 2024

Pass the -dump-ast-format flag down to the frontend. swiftlang/swift-driver#1722

Merged

allevato added a commit to allevato/swift-driver that referenced this pull request Oct 31, 2024

Pass the -dump-ast-format flag down to the frontend.

e77e164

Companion PR to swiftlang/swift#77321.

Fix the Obj-C test, and the output map test on Windows.

8369810

allevato mentioned this pull request Nov 8, 2024

Emit a relationship between a typealias and the referenced type(s) #77437

Closed

allevato mentioned this pull request Jan 7, 2025

Add a flag to dump the AST as JSON. (Second version) #78463

Merged

tshortli mentioned this pull request Jan 9, 2025

Remove outdated warning about swift interfaces without library evolution #78342

Open

allevato closed this Jan 16, 2025

allevato added a commit to allevato/swift-driver that referenced this pull request Jan 24, 2025

Pass the -dump-ast-format flag down to the frontend.

a59c36a

Companion PR to swiftlang/swift#77321.

allevato added a commit to allevato/swift-driver that referenced this pull request Jan 24, 2025

Pass the -dump-ast-format flag down to the frontend.

91f290f

Companion PR to swiftlang/swift#77321.

allevato added a commit to allevato/swift-driver that referenced this pull request Feb 10, 2025

Pass the -dump-ast-format flag down to the frontend.

70adbdf

Companion PR to swiftlang/swift#77321.

allevato added a commit to allevato/swift-driver that referenced this pull request Feb 11, 2025

Pass the -dump-ast-format flag down to the frontend.

55b40b9

Companion PR to swiftlang/swift#77321.

artemcm pushed a commit to swiftlang/swift-driver that referenced this pull request Feb 11, 2025

Pass the -dump-ast-format flag down to the frontend.

9b83358

Companion PR to swiftlang/swift#77321.
