Multithread processing source files #117

keith · 2020-01-15T22:56:03Z

Previously every source file was formatted / linted one by one. On our
codebase, processing files in parallel took a full project format from ~17 minutes to ~5 minutes.

keith · 2020-01-15T22:56:25Z

This is meant as an RFC for how we can achieve this, if it's something we want to do and if it's safe to do. Note: so far I have only tested this on macOS, based on the 5.1 branch.

harlanhaskins · 2020-01-15T23:03:25Z

Could you use DispatchQueue.concurrentPerform?

keith · 2020-01-15T23:04:05Z

@harlanhaskins I was planning on doing that, but the file discovery right now is lazy (which I assume was very intentional), so we don't have a count up front to use.

allevato · 2020-01-15T23:09:16Z

This is definitely something we've been wanting to look into, so thanks for kicking it off for us.

At a very high-level, this feels like it should be safe because the closure being called (formatMain or lintMain) is effectively main after the command line options have been type-safely-parsed, so there's not really any globally mutable state that we should be concerned about. (Famous last words.)

But I also feel like my Dispatch skills have atrophied enough after years of mostly writing command-line Swift and subsets-of-Python-build-rules that I'm not the best person to make such pronouncements 😛 @dylansturg , care to take a look?

One possibility would be to add a --parallel flag to control this at first, if we have any concerns. Since command line parsing is done right before the processSources call, it would be trivial to pipe that in and make a decision.

keith · 2020-01-15T23:14:07Z

Definitely happy to add that flag if that's something folks want!

Sources/swift-format/main.swift

dylansturg · 2020-01-16T19:13:33Z

I patched this change in a couple ways, and the formatter is consistently crashing on:

A single simple file (e.g. just an empty function)
A directory containing 1 simple file
A directory containing many thousands of files

The first configuration was patching this in master, using the 2019-09-26 toolchain snapshot. That consistently crashes a few levels deep in SwiftSyntax's doVisit. LLDB also crashes when I have the debugger connected, so I don't have much detail.

The second configuration that I tried was patching in swift-5.1-branch with Xcode 11.0 and its packaged toolchain. This also consistently crashes in SwiftSyntax, a few levels deep in the visitor methods. LLDB doesn't crash, so I was able to get an error message:

error: Trying to put the stack in unreadable memory at: 0x700005880850.

@allevato mentioned that he previously encountered an issue where SwiftSyntax's recursive visitor implementation regularly exceeded the stack size on non-main threads. I wonder if that's what is happening here? Could also be a red herring.

Finally, what configuration did you use for testing? I'd like to understand what's different, and why I'm seeing this crash consistently but you didn't.

jpsim · 2020-01-16T19:15:28Z

It might be worth running this with TSan enabled to see if it catches anything.

akyrtzi · 2020-01-16T20:13:07Z

@allevato mentioned that he previously encountered an issue where SwiftSyntax's recursive visitor implementation regularly exceeded the stack size on non-main threads. I wonder if that's what is happening here? Could also be a red herring.

See the JIRA bug (https://bugs.swift.org/browse/SR-11170) for the workaround.
Also it's possible @ahoppen 's architectural changes in master/5.2 have addressed this (reducing the stack usage of visitor).

allevato · 2020-01-16T20:19:10Z

Thanks @akyrtzi , I wasn't able to find that JIRA issue for some reason. So the good news is, that specific problem will at least be mitigated once we migrate master from the 09-26 snapshot to all the newer SwiftSyntax APIs that landed since then. (If that's indeed the problem causing the crashes @dylansturg is seeing, which I'm unsure about given the test case of a file with just an empty function.)

ahoppen · 2020-01-16T20:27:53Z

Are you seeing the stack issues in both debug and release builds or only debug builds? The issue I encountered (and fixed in swiftlang/swift-syntax#147) only occurred in debug builds.

dylansturg · 2020-01-16T21:28:54Z

@ahoppen I only tried debug originally. I tried again on release, and I'm not seeing any crash in release. Sounds like it's the issue that you fixed?

ahoppen · 2020-01-16T21:29:23Z

@dylansturg Yes, the issue should be fixed in master then.

dylansturg

This implementation looks reasonable to me, with 1 question. I think we should wait until we update the formatter to the latest version of swift-syntax though, since it'll be hard to develop if debug builds crash regularly.

Sources/swift-format/main.swift

akyrtzi · 2020-02-06T17:14:26Z

I think we should wait until we update the formatter to the latest version of swift-syntax

Formatter has been updated.🙂

keith · 2020-02-21T06:32:53Z

After rebasing, I believe I'm still seeing the stack issue that was mentioned with a debug build. I can try release tomorrow.

akyrtzi · 2020-02-21T16:23:59Z

Use a Thread with a custom stack size, as I mention in https://bugs.swift.org/browse/SR-11170
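For readers following along, the workaround named above can be sketched as follows. This is a minimal illustration, not the code from SR-11170: `runWithLargeStack`, `ResultBox`, and `depth` are hypothetical names, and the 8 MB figure is an assumption chosen to match the usual main-thread stack on macOS (secondary threads there default to 512 KB).

```swift
import Foundation

/// Holds the result produced on the worker thread (hypothetical helper type).
final class ResultBox<T>: @unchecked Sendable { var value: T? }

/// Runs `work` on a new thread with the requested stack size and waits for it.
/// `runWithLargeStack` is a hypothetical name, not swift-format or SwiftSyntax API.
func runWithLargeStack<T>(stackSize: Int = 8 << 20, _ work: @escaping @Sendable () -> T) -> T {
  let box = ResultBox<T>()
  let done = DispatchSemaphore(value: 0)
  let thread = Thread {
    box.value = work()
    done.signal()
  }
  // The stack size must be set before the thread is started.
  thread.stackSize = stackSize
  thread.start()
  done.wait()
  return box.value!
}

// A deliberately deep, non-tail recursion standing in for a recursive visitor.
func depth(_ n: Int) -> Int { n == 0 ? 0 : 1 + depth(n - 1) }

let reached = runWithLargeStack { depth(10_000) }
print(reached)
```

The semaphore makes the call synchronous, so a caller can wrap an existing per-file entry point without otherwise restructuring it.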

ahoppen · 2020-02-25T09:06:37Z

@keith, I have just gotten a chance to take a look at the stack overflow issue. It seems the problem I had has resurfaced. I just opened swiftlang/swift-syntax#205, which should solve the issue. Could you try your changes in combination with my patch?

keith · 2020-02-26T02:09:35Z

With that change it does work with debug builds

ahoppen · 2020-02-26T08:25:07Z

@keith Great. I have just merged the changes. Once a tag has been created for the new version, could you also adjust the version of SwiftSyntax that swift-format depends on in Package.swift?

keith · 2020-03-01T02:14:26Z

I've updated swift-format to include that change. Unfortunately I had to use a commit instead of a tag here; see https://forums.swift.org/t/no-new-master-snapshots-since-2-21/34200

Sources/swift-format/Utilities/Helpers.swift

keith · 2020-04-01T23:04:11Z

I've rebased this again and dropped my Package.swift changes since those are no longer required. I would love to know what we can do to get this merged!

akyrtzi · 2020-04-07T01:30:36Z

Sources/swift-format/Utilities/Helpers.swift

-      diagnosticEngine.diagnose(
-        Diagnostic.Message(.error, "Unable to create a file handle for source from \(path)."))
-      return
+    concurrentQueue.async(group: group) {


You should use DispatchQueue.concurrentPerform(iterations:execute:), it will simplify the code a bit and is strongly recommended over passing async blocks to a concurrent queue without any upper bound check for how many run concurrently. The latter can lead to thread exhaustion.

I don't see a specific answer about this, but there was a previous conversation about it in #117 (comment). I think the file discovery is intentionally lazy, so we don't know the number of iterations here.

@allevato is there a reason that the file discovery cannot be eager?

@akyrtzi Just that at the time it was written, there was no reason for it to be, so we could avoid the startup delay of collecting everything up front if someone ran the tool with --recursive on a large directory structure.

Is DispatchQueue.concurrentPerform preferred over OperationQueue, which also provides control over the number of concurrent jobs?

I pushed the concurrentPerform option so everyone can see what that would look like, happy to change it to whatever we want here
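A minimal sketch of the `concurrentPerform` shape discussed in this thread, assuming the paths have already been collected into an array. `process` and `processAll` are hypothetical stand-ins for the per-file work (`formatMain` / `lintMain` in the PR), not the pushed implementation.

```swift
import Foundation

// Hypothetical stand-in for the per-file work.
func process(_ path: String) -> Int { path.count }

/// Processes every path in parallel and returns one result per path.
func processAll(_ paths: [String]) -> [String: Int] {
  let lock = NSLock()
  var results = [String: Int]()
  // concurrentPerform blocks until every iteration has finished and lets
  // Dispatch choose how many iterations run at the same time.
  DispatchQueue.concurrentPerform(iterations: paths.count) { index in
    let path = paths[index]
    let value = process(path)  // the expensive part runs outside the lock
    lock.lock()
    results[path] = value      // shared state is only touched while locked
    lock.unlock()
  }
  return results
}

let results = processAll(["a.swift", "dir/b.swift", "dir/sub/c.swift"])
print(results.count)
```

Compared with calling `async` once per file on a concurrent queue, the number of in-flight work items here is bounded by Dispatch rather than by the number of files.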

Sources/swift-format/Subcommands/LintFormatOptions.swift

Sources/swift-format/Frontend/Frontend.swift

Sources/swift-format/Subcommands/LintFormatOptions.swift

akyrtzi · 2020-04-07T16:51:32Z

If we're going to support a parallel mode, I think we either need to provide our own diagnostic consumer that synchronizes its handle method, or make SwiftSyntax's PrintingDiagnosticConsumer do it by default. @akyrtzi

Yes, it makes sense to have PrintingDiagnosticConsumer synchronize its writes to stderr so that it writes out a full diagnostic uninterrupted, but I don't think it is enough. You'd also want to avoid interposing diagnostics from different files.

I would propose that diagnostics should be collected for each file and then printed as a group for each file in a sensible manner at the end, instead of printing to stderr as soon as something comes up.
That would also give you the opportunity to improve presentation as well later on, maybe optionally put out an html report or something, instead of printing to stderr.

harlanhaskins · 2020-04-07T17:02:13Z

You could have one diagnostic engine and one consumer per file, and then they can replay their diagnostics out to the outside engine

akyrtzi · 2020-04-07T17:02:30Z

@akyrtzi Just that at the time it was written, there was no reason for it to be, so we could avoid the startup delay of collecting everything up front if someone ran the tool with --recursive on a large directory structure.

Does this startup delay really matter? The total time to find and process the files will still be the same in the end. And if you go with collecting the diagnostics to avoid having diagnostics printed to stderr interposed from different files, then you'll still not see some diagnostics until all the files are processed.

Is DispatchQueue.concurrentPerform preferred over OperationQueue, which also provides control over the number of concurrent jobs?

The benefit is that you'd just let Dispatch decide how many threads to use, instead of having to manually choose. But either way would be better than just continuously doing async on a concurrent queue.

allevato · 2020-04-07T16:49:23Z

Sources/swift-format/Frontend/Frontend.swift

+    }
+
+    let lock = NSLock()
+    let allFilePaths = Array(FileIterator(paths: paths))


Why not move this into the conditional so that it only eagerly collects the paths in parallel mode? The overhead is unnecessary in the sequential case, so we could retain the original behavior if someone has a huge data set (or a networked mount point, or something else that would be slow to access) with the caveat that they can't use --parallel.

Should we go back to the other option then? Not sure I see a big win from using concurrentPerform at the moment especially if it requires this extra branching

(Pushed this change in the meantime)

That's why I was curious about OperationQueue, because it seems like it still provides a way to let the system choose the parallelism while also letting us walk the file hierarchy lazily, by just calling addOperation repeatedly instead of requiring the iteration count up front. 🤷‍♂

But if that's not advisable for some reason, I don't consider pre-computing the file list only in parallel mode to be extra branching—rather, it's choosing the best distribution of the work given the requirements of the APIs we're using (or not using).
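The `OperationQueue` alternative raised in this thread could look roughly like the sketch below. It is an illustration only: `processLazily`, `work`, and `Counter` are hypothetical names, and the lazy range stands in for the real file iterator.

```swift
import Foundation

/// Thread-safe counter used only to observe the sketch's behavior.
final class Counter: @unchecked Sendable {
  private let lock = NSLock()
  private var value = 0
  func increment() { lock.lock(); value += 1; lock.unlock() }
  var current: Int { lock.lock(); defer { lock.unlock() }; return value }
}

/// Enqueues one operation per path as paths are discovered; no count is needed up front.
func processLazily<S: Sequence>(_ paths: S, work: @escaping @Sendable (String) -> Void)
where S.Element == String {
  let queue = OperationQueue()
  // This is already the default; it is spelled out to show the system chooses the width.
  queue.maxConcurrentOperationCount = OperationQueue.defaultMaxConcurrentOperationCount
  for path in paths {  // the sequence is consumed one element at a time
    queue.addOperation { work(path) }
  }
  queue.waitUntilAllOperationsAreFinished()
}

let counter = Counter()
processLazily((1...5).lazy.map { "file\($0).swift" }) { _ in counter.increment() }
print(counter.current)
```

Note that this bounds how many operations run at once, not how many are queued, so a very large tree still produces one pending operation per file.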

Sources/swift-format/Frontend/Frontend.swift

allevato · 2020-04-07T17:11:55Z

And if you go with collecting the diagnostics to avoid having diagnostics printed to stderr interposed from different files, then you'll still not see some diagnostic until all the files are processed.

I agree that we don't want the diagnostics from different files to be interposed, but I still think we can do better than deferring the output of all diagnostics until all files are processed.

If we have to write our own diagnostic consumer that collects diagnostics and groups them by file, then we can just have a method on that consumer that says to flush the diagnostics for that file, call it when processing of that file is complete, and synchronize around that. (For diagnostics with unknown files, we could just emit those immediately, also synchronized.) Or as @harlanhaskins suggested above, use separate engines for each file, but that one worries me a bit because in the future we might want to have a JSON output mode that keeps all the diagnostics in a single output file and separate engines would make that harder since we'd have to merge them all afterwards.
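The collect-then-flush idea in the paragraph above can be sketched as follows. `GroupedDiagnosticPrinter` is a hypothetical type written for illustration; it does not implement SwiftSyntax's diagnostic consumer protocol, and the `output` array stands in for stderr.

```swift
import Foundation

/// Buffers messages per file and emits each file's batch in one piece.
final class GroupedDiagnosticPrinter: @unchecked Sendable {
  private let lock = NSLock()
  private var pending = [String: [String]]()
  private(set) var output = [String]()  // stand-in for stderr

  func record(_ message: String, file: String?) {
    lock.lock(); defer { lock.unlock() }
    if let file = file {
      pending[file, default: []].append(message)
    } else {
      output.append(message)  // no known file: emit immediately
    }
  }

  /// Call once processing of `file` is complete.
  func flush(file: String) {
    lock.lock(); defer { lock.unlock() }
    output.append(contentsOf: pending.removeValue(forKey: file) ?? [])
  }
}

let printer = GroupedDiagnosticPrinter()
let files = ["a.swift", "b.swift", "c.swift"]
DispatchQueue.concurrentPerform(iterations: files.count) { index in
  let file = files[index]
  printer.record("\(file): first warning", file: file)
  printer.record("\(file): second warning", file: file)
  printer.flush(file: file)
}
print(printer.output.joined(separator: "\n"))
```

Because each flush happens under the lock, one file's messages always appear together, while the order of files depends on which finishes first.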

The benefit is that you'd just let Dispatch decide how many threads to use, instead of having to manually choose. But either way would be better than just continuously doing async on a concurrent queue.

OperationQueue also has defaultMaxConcurrentOperationCount, which lets the system decide how many jobs to run concurrently, but I don't know the details of how it selects that number compared to DispatchQueue.concurrentPerform.

However, as I mentioned in one of my replies to Keith above, I think it would be fine to have --parallel mode eagerly evaluate the iterator into an array while leaving non-parallel mode to do it lazily, and doesn't require any changes to the way we traverse the file system.
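A sketch of that split, under the assumption that the driver receives any sequence of paths plus a flag mirroring the proposed --parallel option. `processSources` here is a hypothetical free function, not the one in the PR.

```swift
import Foundation

/// Walks `paths` lazily in sequential mode; collects them eagerly only when `parallel` is set.
func processSources<S: Sequence>(_ paths: S, parallel: Bool, work: (String) -> Void)
where S.Element == String {
  guard parallel else {
    // Sequential mode keeps the lazy walk: each path is handled as it is found.
    for path in paths { work(path) }
    return
  }
  // concurrentPerform needs an iteration count, so only this branch pays
  // the cost of collecting every path before any work starts.
  let allPaths = Array(paths)
  DispatchQueue.concurrentPerform(iterations: allPaths.count) { index in
    work(allPaths[index])
  }
}

let lock = NSLock()
var seen = Set<String>()
let lazyPaths = (1...4).lazy.map { "file\($0).swift" }
processSources(lazyPaths, parallel: true) { path in
  lock.lock()
  seen.insert(path)
  lock.unlock()
}
print(seen.count)
```

The traversal code is untouched in both modes; only the point at which the iterator is drained changes.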

keith · 2020-04-07T20:14:43Z

At this point with debug or release mode I see crashes in SwiftSyntax:

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
abort() called
swift-format(12906,0x7000042ba000) malloc: Incorrect checksum for freed object 0x7ff84ac8b7a8: probably modified after being freed.
Corrupt value: 0x3000080184b7284f
 

Thread 0:: Dispatch queue: com.apple.root.user-interactive-qos
0   swift-format                  	0x0000000103c5027c partial apply + 108
1   swift-format                  	0x0000000103c4d2ca SyntaxRewriter.visit(_:) + 330 (SyntaxRewriter.swift:4534)
2   swift-format                  	0x0000000103c4dc80 SyntaxRewriter.visitChildren<A>(_:) + 2224 (SyntaxRewriter.swift:5052)
3   swift-format                  	0x0000000103b9bc1e SyntaxRewriter.visit(_:) + 206 (SyntaxRewriter.swift:324)
4   swift-format                  	0x0000000103bcf19d SyntaxRewriter.visitImplBooleanLiteralExprSyntax(_:) + 1661 (SyntaxRewriter.swift:2147)
5   swift-format                  	0x0000000103c52428 partial apply + 104

harlanhaskins · 2020-04-07T20:18:39Z

What does TSan say? These structures should be immutable...

keith · 2020-04-07T20:30:16Z

So with TSan I first hit many other issues. It looks like we're hitting some non-thread-safe code elsewhere, like the use of DiagnosticEngine in rules:

https://github.com/apple/swift-format/blob/5786e26754c100f5e4e7d8df9e75ab50be9a9ce7/Sources/SwiftFormatWhitespaceLinter/WhitespaceLinter.swift#L344-L354

shahmishal · 2020-10-06T20:38:01Z

The Swift project moved the default branch to main and deleted master branch, so GitHub automatically closed the PR. Please re-create the pull request with main branch.

More detail about the branch update - https://forums.swift.org/t/updating-branch-names/40412

keith · 2020-10-06T20:40:36Z

If someone is interested in picking up this change, that would be great! Last I remember, the next blocker is that the diagnostics reporting types do not support multithreading.

mattt · 2020-10-08T21:16:23Z

I'm facing the same threading issues with my use of the SwiftFormatter API for swift-doc. Running with TSAN, it found a data race here:

https://github.com/apple/swift-format/blob/d4bba6e22891ff1813e8267e36f2b00307684366/Sources/SwiftFormatCore/Rule.swift#L34

SUMMARY: ThreadSanitizer: data race Rule.swift:34 in static Rule.ruleName.getter
==================
==================
WARNING: ThreadSanitizer: data race (pid=13716)
  Read of size 8 at 0x0001115c6888 by thread T4:
    #0 static Rule.ruleName.getter Rule.swift:34 (swift-doc:x86_64+0x1003c1c50)
    #1 SyntaxFormatRule.visitAny(_:) SyntaxFormatRule.swift:34 (swift-doc:x86_64+0x1003ce72f)
    #2 SyntaxRewriter.visitImplSourceFileSyntax(_:) SyntaxRewriter.swift:2778 (swift-doc:x86_64+0x100f27e82)

allevato · 2020-10-08T22:14:46Z

Thanks for catching that one; it's not one of the original data races we dealt with in earlier iterations of this PR (the linked code was added later; it wasn't a race before because we inadvertently were never updating the cache 😬) but we'll need to go back and synchronize that now too.

@keith is correct that the original blocking issue is DiagnosticEngine in SwiftSyntax not being thread-safe. Last I chatted with @akyrtzi he wasn't opposed to just synchronizing inside that class by default, so I can try to put together a PR for that soon unless someone else beats me to it.
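One common way to synchronize a shared name cache like the one TSan flagged is to guard every read and write with a lock. The sketch below is an assumption about the general shape of such a fix, not the change that landed; `NameCache` is a hypothetical type and the name computation is a placeholder.

```swift
import Foundation

/// Lock-protected cache mapping a type to its computed name.
final class NameCache: @unchecked Sendable {
  private let lock = NSLock()
  private var names = [ObjectIdentifier: String]()

  func name(for type: Any.Type) -> String {
    let key = ObjectIdentifier(type)
    lock.lock(); defer { lock.unlock() }
    if let cached = names[key] { return cached }
    let computed = String(describing: type)  // placeholder for the real computation
    names[key] = computed
    return computed
  }
}

let cache = NameCache()
DispatchQueue.concurrentPerform(iterations: 100) { _ in
  _ = cache.name(for: Int.self)
  _ = cache.name(for: String.self)
}
print(cache.name(for: Int.self))
```

Holding the lock across the lookup and the insert means a name is computed at most once per type, at the cost of serializing lookups.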

keith · 2020-10-19T15:49:18Z

Here's a change to make DiagnosticEngine thread safe: swiftlang/swift-syntax#243

keith · 2020-10-19T15:50:35Z

Here's a change to fix the issue Mattt mentioned that I also hit once DiagnosticEngine was usable from multiple threads: #242

keith · 2020-10-19T15:53:53Z

Here's a new PR for this change #243

kastiglione reviewed Jan 15, 2020

Sources/swift-format/main.swift

dylansturg reviewed Jan 17, 2020

Sources/swift-format/main.swift

p4checo reviewed Jan 19, 2020

Sources/swift-format/main.swift

keith force-pushed the ks/multithread branch from dd90dde to c8940cc on February 21, 2020 06:30

ahoppen mentioned this pull request Feb 25, 2020

Resolve stack overflow in tree visitation when having a reduced stack size swiftlang/swift-syntax#205

Merged

keith force-pushed the ks/multithread branch from c8940cc to 432f6ba on February 26, 2020 02:13

keith force-pushed the ks/multithread branch from 0558e92 to 45ef20f on March 1, 2020 02:17

keith force-pushed the ks/multithread branch from 45ef20f to 7f8f6bd on April 1, 2020 23:03

keith commented Apr 1, 2020

Sources/swift-format/Utilities/Helpers.swift

akyrtzi reviewed Apr 7, 2020

keith force-pushed the ks/multithread branch from 7f8f6bd to 539d779 on April 7, 2020 02:07

keith commented Apr 7, 2020

Sources/swift-format/Subcommands/LintFormatOptions.swift

keith force-pushed the ks/multithread branch from 539d779 to 6c7f8c1 on April 7, 2020 02:38

keith commented Apr 7, 2020

Sources/swift-format/Frontend/Frontend.swift

allevato reviewed Apr 7, 2020

Sources/swift-format/Frontend/Frontend.swift

Sources/swift-format/Frontend/Frontend.swift

Sources/swift-format/Subcommands/LintFormatOptions.swift

keith force-pushed the ks/multithread branch 2 times, most recently from da6347b to 558a384 on April 7, 2020 16:36

allevato reviewed Apr 7, 2020

Multithread processing source files

5786e26

Previously every source file was formatted / linted one by one. On our codebase this took a full project format from ~17 minutes to ~5 minutes.

keith force-pushed the ks/multithread branch from 558a384 to 5786e26 on April 7, 2020 17:37

shahmishal closed this Oct 6, 2020

keith added a commit to keith/swift-syntax that referenced this pull request Oct 19, 2020

Make DiagnosticEngine thread safe

f34b04c

This allows consumers to emit diagnostics from multiple threads. Primarily motivated by swiftlang/swift-format#117

keith mentioned this pull request Oct 19, 2020

Make DiagnosticEngine thread safe swiftlang/swift-syntax#243

Merged

keith mentioned this pull request Oct 19, 2020

Multithread processing source files #243

Merged

keith deleted the ks/multithread branch October 19, 2020 15:53

akyrtzi pushed a commit to swiftlang/swift-syntax that referenced this pull request Oct 20, 2020

Make DiagnosticEngine thread safe (#243)

f957158

This allows consumers to emit diagnostics from multiple threads. Primarily motivated by swiftlang/swift-format#117