Fix and improve parallel mode #261


Merged: 1 commit, Sep 2, 2021

Conversation

Contributor

@p4checo p4checo commented Aug 16, 2021

Motivation

Parallel mode was crashing with bad memory access due to data races accessing Frontend.configurationLoader, more specifically its cache.

After serializing access (initially with an NSLock), I observed that despite not crashing anymore, performance was surprisingly worse than in single-threaded mode.

That led me down a road of investigating why this was happening, and after some profiling I discovered that Rule's nameCache was the main source of contention, causing around 30% of total time to be spent waiting for locks (ulock_wait) in my tests.

After replacing the synchronization mechanism on nameCache with a more efficient os_unfair_lock_t (pthread_mutex_t on Linux), the contention dropped significantly, and parallel mode outperformed single-threaded mode as expected. As a bonus, these changes also improved single-threaded performance, due to the lower overhead of a lock vs a queue.
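As a rough sketch (not this PR's actual code), a minimal cross-platform lock of this kind could look like the following, using os_unfair_lock on Darwin and pthread_mutex_t elsewhere; the names and structure are illustrative:

```swift
import Dispatch
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Hypothetical sketch of a cross-platform low-level lock.
final class Lock {
    #if canImport(Darwin)
    private let mutex: os_unfair_lock_t = {
        let pointer = os_unfair_lock_t.allocate(capacity: 1)
        pointer.initialize(to: os_unfair_lock_s())
        return pointer
    }()
    deinit {
        mutex.deinitialize(count: 1)
        mutex.deallocate()
    }
    func lock() { os_unfair_lock_lock(mutex) }
    func unlock() { os_unfair_lock_unlock(mutex) }
    #else
    private var mutex = pthread_mutex_t()
    init() { pthread_mutex_init(&mutex, nil) }
    deinit { pthread_mutex_destroy(&mutex) }
    func lock() { pthread_mutex_lock(&mutex) }
    func unlock() { pthread_mutex_unlock(&mutex) }
    #endif

    // Runs `body` while holding the lock.
    func withLock<T>(_ body: () throws -> T) rethrows -> T {
        lock()
        defer { unlock() }
        return try body()
    }
}

let lock = Lock()
var counter = 0
// 1000 concurrent increments; without the lock this would be a data race.
DispatchQueue.concurrentPerform(iterations: 1000) { _ in
    lock.withLock { counter += 1 }
}
print(counter)
```

Unlike a serial DispatchQueue, an uncontended lock acquisition stays in user space, which is where the reduced overhead comes from.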

I then used the same Lock approach to serialize access to Frontend.configurationLoader, which widened the performance gap even further.

After these improvements, I was able to obtain quite significant performance gains:

  • serial (non optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.6x (44s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Tests were made on my MacBookPro16,1 (8-core i9 @2.4GHz), on a project with 2135 Swift files, compiling swift-format in Release mode.

Update 1

Sadly, a custom Lock implementation is not ideal for swift-format to maintain, and Windows support was not provided. As such, the alternative would be to use NSLock, which, being part of Foundation, is supported on all major platforms.

Using NSLock, the improvements were unfortunately not as good:

  • serial (non optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
  • serial (NSLock) vs serial (non optimized): ~1.4x (52s vs 74s)
  • serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)

Update 2

The best solution was then to try and remove all concurrency contention from swift-format. By flattening the FileToProcess array and making the Rule's nameCache static, the best overall results were achieved:

  • serial (non optimized) vs parallel (optimized): ~6.7x (11s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.8x (40.6s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.6x (11s vs 40.6s)

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool.

To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.
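For illustration, the generated cache roughly takes the following shape. The rule types below are simplified stand-ins rather than swift-format's real declarations; the actual file (RuleNameCache+Generated.swift) is emitted by generate-pipeline:

```swift
// Stand-in rule types, for illustration only.
protocol Rule {}
final class AlwaysUseLowerCamelCase: Rule {}
final class DoNotUseSemicolons: Rule {}

// Approximate shape of the generated static cache: one entry per concrete
// rule, keyed by the rule type's ObjectIdentifier.
let ruleNameCache: [ObjectIdentifier: String] = [
    ObjectIdentifier(AlwaysUseLowerCamelCase.self): "AlwaysUseLowerCamelCase",
    ObjectIdentifier(DoNotUseSemicolons.self): "DoNotUseSemicolons",
]

// A lookup is a plain dictionary read: no locks, no runtime name derivation.
print(ruleNameCache[ObjectIdentifier(DoNotUseSemicolons.self)] ?? "unknown")
```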

Update 3

Base rules SyntaxFormatRule and SyntaxLintRule should be kept in the SwiftFormatCore target to eventually support dynamic discovery of rules (ditching generate-pipeline) and only require rule implementers to import SwiftFormatCore.

On the other hand, any statically generated ruleName cache must be in SwiftFormatRules target, because that's where rule implementations reside to capture their ObjectIdentifier at compile time.

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

  • serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
  • serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

Changes

  • Make file preparation serial by calculating all FileToProcess before processing them in parallel.

  • Update Context to receive a ruleNameCache of type [ObjectIdentifier: String] on init.

  • Update Context.isRuleEnabled to receive the Rule type instead of the rule's name, to use the new cache.

  • Add a new RuleNameCacheGenerator to generate-pipeline, which outputs a static ruleNameCache dictionary with all existing rules to the RuleNameCache+Generated.swift file in the SwiftFormatRules target.

  • Remove nameCache and nameCacheQueue from Rule.swift.
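A minimal sketch of the reworked Context API described in these bullets, with simplified stand-in types (the real Context carries much more state):

```swift
// Stand-in for swift-format's Rule protocol.
protocol Rule {
    static var ruleName: String { get }
}
extension Rule {
    // Slow fallback, only hit if a rule is missing from the cache.
    static var ruleName: String { String(describing: Self.self) }
}

final class Context {
    private let ruleNameCache: [ObjectIdentifier: String]
    private let enabledRules: Set<String>

    // The statically generated cache is injected on init.
    init(ruleNameCache: [ObjectIdentifier: String], enabledRules: Set<String>) {
        self.ruleNameCache = ruleNameCache
        self.enabledRules = enabledRules
    }

    // Takes the rule *type*, so the name can come from the static cache
    // instead of being derived at runtime.
    func isRuleEnabled<R: Rule>(_ rule: R.Type) -> Bool {
        let name = ruleNameCache[ObjectIdentifier(rule)] ?? rule.ruleName
        return enabledRules.contains(name)
    }
}

struct DoNotUseSemicolons: Rule {}  // hypothetical rule

let context = Context(
    ruleNameCache: [ObjectIdentifier(DoNotUseSemicolons.self): "DoNotUseSemicolons"],
    enabledRules: ["DoNotUseSemicolons"]
)
print(context.isRuleEnabled(DoNotUseSemicolons.self))
```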

Benchmarks

Below are benchmarks of some configurations attempted during development, using hyperfine.
(Some of these configurations' implementations are available in branches on my fork).

The flattened FileToProcess array was an attempt to completely remove the need to synchronize access to Frontend.configurationLoader, by making initial file preparation sequential. It results in a slight performance improvement in single-threaded mode (~3.6%) but a performance loss in parallel mode (~4.6%). I opted to maximize parallel mode's performance, as that should be the more optimized mode.
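The two-phase structure described here can be sketched as follows; FileToProcess and the processing function are simplified stand-ins for swift-format's types:

```swift
import Dispatch

// Stand-in for swift-format's per-file work item.
struct FileToProcess {
    let path: String
}

func process(_ file: FileToProcess) {
    // Formatting/linting work would happen here.
}

let paths = ["A.swift", "B.swift", "C.swift"]

// Phase 1 (serial): prepare every file up front, so the configuration
// loader is only touched from one thread and needs no synchronization.
let filesToProcess = paths.map(FileToProcess.init)

// Phase 2 (parallel): the hot loop shares no mutable state.
DispatchQueue.concurrentPerform(iterations: filesToProcess.count) { index in
    process(filesToProcess[index])
}
print(filesToProcess.count)
```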

Work was done based off swift-5.4-branch, because I'm using Xcode 12.5.

single-thread

no lock in ConfigurationLoader, DispatchQueue in Rule nameCache

branches: swift-5.4-branch, main

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     73.908 s ±  1.882 s    [User: 70.942 s, System: 2.086 s]
  Range (min … max):   70.526 s … 76.510 s    10 runs

Lock (os_unfair_lock_t) in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-lock-configuration-loader-and-rule-name-cache, this PR's

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     44.077 s ±  1.554 s    [User: 41.538 s, System: 1.832 s]
  Range (min … max):   42.809 s … 48.050 s    10 runs

flattened FileToProcess array, Lock (os_unfair_lock_t) in Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     42.365 s ±  0.933 s    [User: 40.165 s, System: 1.629 s]
  Range (min … max):   40.818 s … 43.554 s    10 runs

parallel

NSLock in ConfigurationLoader, DispatchQueue in Rule nameCache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     86.431 s ±  7.208 s    [User: 369.514 s, System: 276.583 s]
  Range (min … max):   72.179 s … 93.882 s    10 runs

Lock (os_unfair_lock_t) in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-lock-configuration-loader-and-rule-name-cache, this PR's

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     13.565 s ±  0.354 s    [User: 101.115 s, System: 69.132 s]
  Range (min … max):   12.921 s … 14.184 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     13.453 s ±  0.448 s    [User: 99.746 s, System: 68.694 s]
  Range (min … max):   12.409 s … 13.992 s    10 runs

flattened FileToProcess array, Lock (os_unfair_lock_t) in Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     14.146 s ±  0.283 s    [User: 101.879 s, System: 73.914 s]
  Range (min … max):   13.711 s … 14.628 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     14.192 s ±  0.537 s    [User: 101.197 s, System: 75.123 s]
  Range (min … max):   13.453 s … 15.096 s    10 runs

flattened FileToProcess array, no Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-no-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     17.373 s ±  0.348 s    [User: 199.595 s, System: 5.241 s]
  Range (min … max):   16.972 s … 17.946 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     17.553 s ±  0.812 s    [User: 198.131 s, System: 5.342 s]
  Range (min … max):   16.491 s … 18.958 s    10 runs

Benchmarks - Update 1

NSLock in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache

serial
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.643 s ±  1.143 s    [User: 41.397 s, System: 1.711 s]
  Range (min … max):   42.309 s … 45.757 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     47.453 s ±  2.253 s    [User: 44.743 s, System: 2.024 s]
  Range (min … max):   42.963 s … 51.134 s    10 runs
parallel
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.957 s ±  4.178 s    [User: 160.686 s, System: 337.798 s]
  Range (min … max):   29.375 s … 44.625 s    10 runs

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.103 s ±  2.587 s    [User: 157.523 s, System: 343.123 s]
  Range (min … max):   32.831 s … 42.022 s    10 runs

Benchmarks - Update 2

flattened FileToProcess array, static Rule nameCache

branch: swift-5.4-branch-static-rule-name-cache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.685 s ±  0.517 s    [User: 38.593 s, System: 1.620 s]
  Range (min … max):   40.013 s … 41.598 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.616 s ±  0.791 s    [User: 38.534 s, System: 1.580 s]
  Range (min … max):   39.264 s … 41.752 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.416 s ±  1.732 s    [User: 111.428 s, System: 4.094 s]
  Range (min … max):    8.246 s … 12.632 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.455 s ±  2.099 s    [User: 122.742 s, System: 4.601 s]
  Range (min … max):    7.896 s … 13.399 s    10 runs

Benchmarks - Update 3

flattened FileToProcess array, static ruleNameCache injected in Context

branch: swift-5.4-branch-rule-name-cache-in-context

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.524 s ±  1.003 s    [User: 41.400 s, System: 1.676 s]
  Range (min … max):   42.648 s … 45.514 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     41.646 s ±  0.714 s    [User: 39.650 s, System: 1.610 s]
  Range (min … max):   40.381 s … 42.525 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.088 s ±  1.160 s    [User: 108.481 s, System: 3.700 s]
  Range (min … max):    8.441 s … 11.655 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.207 s ±  0.735 s    [User: 118.658 s, System: 4.291 s]
  Range (min … max):    9.315 s … 11.871 s    10 runs

@Rostepher Rostepher requested a review from allevato August 19, 2021 21:43
Member

@allevato allevato left a comment


Thanks for looking into this and providing such thorough benchmarks.

My main concern here is that I'd prefer that swift-format not have to implement its own Lock primitive in order to get decent performance. AFAIK, neither implementation in that class will work on Windows, since it doesn't support pthreads (@compnerd correct me if I'm wrong here).

Ideally, I'd like to have this work without needing three (or even two) low-level platform-specific solutions, since it shouldn't be the responsibility of swift-format to maintain those. Can we improve the existing use of Dispatch or figure out why NSLock performed so horribly instead and still get some wins, even if it's not the absolute fastest outcome?

@p4checo
Contributor Author

p4checo commented Aug 26, 2021

Thanks @allevato 🙏🏼

Yeah, I had a feeling that adding our own Lock primitive was risky 😅 And I completely forgot about Windows, so your comment makes complete sense on both fronts 🙈

I'll give it another shot with NSLock, because I believe that with Dispatch the only option besides a DispatchQueue (which we've seen generates a lot of contention) would be a DispatchSemaphore, which is not ideal. Even if the performance is not as good as with a platform-specific low-level lock, I think using an NSLock should still provide some wins.

Will update this in the meantime

@compnerd
Member

@allevato - you are correct, pthreads (POSIX Threads) are not available on Windows, and we would need to use Windows threads. There are already 2 (or was it 3?) implementations of Lock that I've done previously, though using NSLock would work as well as that is available on Windows as well.

@p4checo
Contributor Author

p4checo commented Aug 26, 2021

Sadly, using NSLock results in a significant performance hit vs the low-level locks, but still provides a slight improvement over the previous implementation. Haven't dived deep into why though, as I couldn't find much information about NSLock's semantics (e.g. fairness) or internals on a quick search.

NSLock in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache

serial
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.643 s ±  1.143 s    [User: 41.397 s, System: 1.711 s]
  Range (min … max):   42.309 s … 45.757 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     47.453 s ±  2.253 s    [User: 44.743 s, System: 2.024 s]
  Range (min … max):   42.963 s … 51.134 s    10 runs
parallel
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.957 s ±  4.178 s    [User: 160.686 s, System: 337.798 s]
  Range (min … max):   29.375 s … 44.625 s    10 runs

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.103 s ±  2.587 s    [User: 157.523 s, System: 343.123 s]
  Range (min … max):   32.831 s … 42.022 s    10 runs

EDIT: re-ran benchmarks without the project being formatted being simultaneously opened on Xcode 🤦🏼

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 4449670 to 1fb7c19 Compare August 26, 2021 22:26
@p4checo p4checo requested a review from allevato August 26, 2021 22:28

if let cachedName = cachedName {
nameCacheLock.lock()
Member

At the time, this worked fine, but maybe it's worth revisiting whether it's worth caching this at all: #242 (comment)

Contributor Author

This is a very valid point. I did some tests flattening the FileToProcess array and removing Rule's nameCache, effectively removing all need for synchronization in a concurrent setup (you can find them in one of the benchmarks in the initial PR description). It was indeed faster than this current NSLock version, but only in parallel mode. At the time I reverted back to using a cache with the low-level lock, as it was still more performant (13.5s vs 17.5s in parallel mode) and the fastest setup I could achieve overall.

I've re-run some benchmarks, and I think it's probably not worth removing synchronization, as you can see below.
(I also noticed a very big difference in parallel mode from my previous tests, so it's likely that I screwed up when copy/pasting the previous results 🙈)

flattened FileToProcess array, no Rule nameCache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     96.948 s ±  2.903 s    [User: 93.642 s, System: 2.513 s]
  Range (min … max):   93.368 s … 102.528 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     34.519 s ±  6.421 s    [User: 364.389 s, System: 13.934 s]
  Range (min … max):   18.097 s … 38.766 s    10 runs

After opening this PR I also did some other attempts at rethinking the Rule.ruleName mechanics, but ended up giving up because I never reached an elegant solution. I tried:

  1. Precomputing all rule names statically (during codegen), but I hit some issues related to dynamic dispatch (the Rule.ruleName protocol extension would always be called).
  2. Removing the Rule.ruleName default protocol extension and using inheritance, which worked but forced rules to override ruleName, defeating the purpose of the protocol's abstraction a bit.
  3. Changing the Rule protocol itself (e.g. making ruleName an instance property), which created other issues; the abstraction also felt a bit wrong, because the rule's name should effectively be at the type level.

I definitely think we could start a discussion on how to improve this, because Rule.ruleName is part of a very hot path, especially as a crucial step in visitAny. In the meantime I'll give it another go to see if I can come up with something.

Contributor Author

Well, I think I managed to come up with a pretty good trade off that allows having a static rule name cache, by shuffling some files and using codegen from generate-pipeline 😁

Let's see what @allevato thinks of it

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 1fb7c19 to 9823b4b Compare August 27, 2021 20:44
@p4checo
Contributor Author

p4checo commented Aug 27, 2021

Given that using NSLock didn't come near optimal performance, the best solution was then to try and remove all concurrency contention from swift-format. 🤓

By flattening the FileToProcess array and making the Rule's nameCache static, the best overall results were achieved! 🚀

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool. To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.

I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. SyntaxFormatRule and SyntaxLintRule are rules after all 😅

What do you say, @allevato?

New results:

  • serial (non optimized) vs parallel (optimized): ~6.7x (11s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.8x (40.6s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.6x (11s vs 40.6s)

flattened FileToProcess array, static Rule nameCache

branch: swift-5.4-branch-static-rule-name-cache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.685 s ±  0.517 s    [User: 38.593 s, System: 1.620 s]
  Range (min … max):   40.013 s … 41.598 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.616 s ±  0.791 s    [User: 38.534 s, System: 1.580 s]
  Range (min … max):   39.264 s … 41.752 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.416 s ±  1.732 s    [User: 111.428 s, System: 4.094 s]
  Range (min … max):    8.246 s … 12.632 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.455 s ±  2.099 s    [User: 122.742 s, System: 4.601 s]
  Range (min … max):    7.896 s … 13.399 s    10 runs

@allevato
Member

This is shaping up nicely!

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool. To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.

I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. SyntaxFormatRule and SyntaxLintRule are rules after all 😅

Almost. They're not concrete rules; they're the base types that other rules should extend. It's important that we keep those two types in the SwiftFormatCore module and separate from the SwiftFormatRules module that implements concrete rules, because eventually we'd like to be able to support dynamic discovery of rules and get rid of generate-pipeline completely. That would let folks write their own rules as plug-ins where they would import SwiftFormatCore, and we'd also factor out the built-in rules in the same fashion.

It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind, otherwise we'd just have to back the changes out later anyway.

I don't mind the generation of the cache itself, but can you make it work without moving those types around?

@p4checo
Contributor Author

p4checo commented Aug 30, 2021

Almost. They're not concrete rules; they're the base types that other rules should extend.

Yes, I am aware that they are base classes to the other rules (hence placing them in a directory called Base), but didn't consider the future plans of having dynamic discovery of rules without generate-pipeline.

It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind, otherwise we'd just have to back the changes out later anyway.

It would be quite a trivial change to make, but it's a valid point nonetheless.

I don't mind the generation of the cache itself, but can you make it work without moving those types around?

I can definitely try. But there's both a circular dependency and a dynamic dispatch problem as things are, which makes this complicated:

  1. SyntaxFormatRule, SyntaxLintRule and Rule live in SwiftFormatCore
  2. All current rule implementations live in SwiftFormatRules, which depends on SwiftFormatCore
  3. Generating the cache as it is requires that the types be visible to the compiler at build time, so they need to be in SwiftFormatRules (because that's where the rule implementations are)
  4. We can't import SwiftFormatRules (to access the codegen'ed cache) in Rule.swift because it would create a circular dependency
  5. As things are, SyntaxFormatRule's and SyntaxLintRule's ruleName invocations will always use dynamic dispatch and hit the protocol extension implementation in Rule.swift.

These were the reasons why I ended up moving those two base classes and the Rule.ruleName protocol extension to SwiftFormatRules, and why I consider it an elegant solution given the constraints.

To work around the issue in point 5, a solution would be to implement ruleName in the base classes and override it in subclasses. I attempted to do so using codegen, but it would require the classes to use the ObjC runtime to allow static vars to be declared in extensions, which then creates other issues related to SyntaxRewriter and SyntaxVisitor already being base classes.

The absolute best approach performance-wise would be for rules to simply declare static let ruleName: String = "Foo", abandoning the protocol extension completely. The extension, although well intended and nicely abstracted, is causing a significant performance penalty given the project and code structure.
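A minimal sketch of this tradeoff, with illustrative types rather than swift-format's actual declarations:

```swift
protocol Rule {
    static var ruleName: String { get }
}
extension Rule {
    // Default implementation: derives the name from the metatype via
    // reflection on every call, which is relatively slow on a hot path.
    static var ruleName: String { String(describing: Self.self) }
}

// Relies on the slow protocol-extension default.
struct UseEarlyExits: Rule {}  // hypothetical rule

struct DoNotUseSemicolons: Rule {
    // The fastest option discussed above: a stored constant per rule,
    // at the cost of abandoning the shared default.
    static let ruleName = "DoNotUseSemicolons"
}

print(UseEarlyExits.ruleName)
print(DoNotUseSemicolons.ruleName)
```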

I'll give it yet another go to see what I can come up with. Do you have any idea of what could work?

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 9823b4b to 2c9cfb4 Compare August 31, 2021 17:48
@p4checo
Contributor Author

p4checo commented Aug 31, 2021

Well, I have yet another possible solution, which hopefully will tick all the necessary boxes 😅

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good nonetheless:

  • serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
  • serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

flattened FileToProcess array, static ruleNameCache injected in Context

branch: swift-5.4-branch-rule-name-cache-in-context

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.524 s ±  1.003 s    [User: 41.400 s, System: 1.676 s]
  Range (min … max):   42.648 s … 45.514 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     41.646 s ±  0.714 s    [User: 39.650 s, System: 1.610 s]
  Range (min … max):   40.381 s … 42.525 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.088 s ±  1.160 s    [User: 108.481 s, System: 3.700 s]
  Range (min … max):    8.441 s … 11.655 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.207 s ±  0.735 s    [User: 118.658 s, System: 4.291 s]
  Range (min … max):    9.315 s … 11.871 s    10 runs

@allevato
Member

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton.

It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the SwiftFormatRules module instead of back in SwiftFormatCore). Can you fix those up please?

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new Context will just store a Dictionary value that shares the same underlying storage with the original. That should be relatively fast.
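A quick illustration of this copy-on-write behavior with a plain dictionary:

```swift
// Assigning a dictionary shares its storage; the elements are only
// physically copied if one side mutates afterwards.
var original: [ObjectIdentifier: String] = [ObjectIdentifier(Int.self): "Int"]
let injected = original  // no allocation or element copy here

// Mutating `original` triggers the copy; `injected` is unaffected.
original[ObjectIdentifier(String.self)] = "String"

print(original.count)
print(injected.count)
```

Since the injected ruleNameCache is never mutated after creation, each Context only ever stores a cheap reference to the shared storage.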

@p4checo
Contributor Author

p4checo commented Aug 31, 2021

This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton.

That's great news!

It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the SwiftFormatRules module instead of back in SwiftFormatCore). Can you fix those up please?

Whoops 🙈 Cherry-pick fail. Will fix right away

I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new Context will just store a Dictionary value that shares the same underlying storage with the original. That should be relatively fast.

Yes, that makes sense. COW's got our backs here 😁

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 2c9cfb4 to 39aa5ab Compare August 31, 2021 22:09
allevato added a commit to allevato/swift-format that referenced this pull request Sep 1, 2021
Now that we have the `isOptIn` property for rules, this should be
used to disable rules by default instead of removing them from
pipeline generation entirely. This also unblocks
swiftlang#261, since excluded
rules wouldn't show up in the generated name cache, causing their
tests to crash.
@allevato
Member

allevato commented Sep 1, 2021

We're almost there! I ran the tests locally and realized that the old hardcoded list of rules suppressed from RuleCollector in generate-pipeline means they won't be in the name cache either, and the tests for those rules crash 🙃

I've got #263 out to fix this, which should unblock this change.

@p4checo
Contributor Author

p4checo commented Sep 1, 2021

Nice catch, and thanks for the quick fix! 🙏🏼

If I'm not mistaken, that was caused by the new assert in Context.isRuleEnabled, right? I added it as a reminder to favor using the cache and not rely on the Rule.ruleName fallback, which uses the (slow) protocol extension 😅. I can remove it if you believe it doesn't add value and creates more "friction" than help.

@allevato
Member

allevato commented Sep 1, 2021

I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems (like generate-pipeline not being run after adding something new), and falling back to the slower implementation would lead to slower performance for those rules. I think it's better not to risk that kind of inconsistency.

I merged #263, so if you rebase and rerun generate-pipeline again, the cache should pick those up.

Parallel mode was crashing with bad memory access due to data races
accessing `Frontend.configurationLoader`, more specifically its `cache`.

After serializing access (initially with an `NSLock`), I observed that
despite not crashing anymore, the performance was surprisingly *worst*
than in single threaded mode.

That led me down a road of investigating why this was happening, and
after some profiling I discovered that `Rule`'s `nameCache` was the
main source of contention - causing around 30% of total spent time
waiting for locks (`ulock_wait`) on my tests.

After replacing the synchronization mechanism on `nameCache` with a
more efficient `os_unfair_lock_t` (`pthread_mutex_t` on Linux), the
contention dropped significantly, and parallel mode now outperformed
single threaded mode as expected. As a bonus, these changes also
improved single threaded mode performance, due to the reduced
overhead of using a lock vs a queue.
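A minimal sketch of what such a `Lock` might look like, assuming the platform split described above (`os_unfair_lock` on Darwin, `pthread_mutex_t` elsewhere). The type and method names here are illustrative, not swift-format's actual implementation:

```swift
import Foundation
import Dispatch

#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Illustrative low-overhead mutex: `os_unfair_lock` on Darwin,
// `pthread_mutex_t` on other platforms. Both types must live at a
// stable memory address, hence the manual pointer allocation.
final class Lock {
  #if canImport(Darwin)
  private let mutex: os_unfair_lock_t = {
    let pointer = os_unfair_lock_t.allocate(capacity: 1)
    pointer.initialize(to: os_unfair_lock())
    return pointer
  }()

  func lock() { os_unfair_lock_lock(mutex) }
  func unlock() { os_unfair_lock_unlock(mutex) }

  deinit {
    mutex.deinitialize(count: 1)
    mutex.deallocate()
  }
  #else
  private let mutex: UnsafeMutablePointer<pthread_mutex_t> = {
    let pointer = UnsafeMutablePointer<pthread_mutex_t>.allocate(capacity: 1)
    pthread_mutex_init(pointer, nil)
    return pointer
  }()

  func lock() { pthread_mutex_lock(mutex) }
  func unlock() { pthread_mutex_unlock(mutex) }

  deinit {
    pthread_mutex_destroy(mutex)
    mutex.deallocate()
  }
  #endif

  /// Runs `body` while holding the lock.
  func withLock<T>(_ body: () throws -> T) rethrows -> T {
    lock()
    defer { unlock() }
    return try body()
  }
}
```

Cache reads and writes can then be wrapped in `lock.withLock { ... }` instead of hopping through a serial `DispatchQueue`, which avoids the queue's scheduling overhead on every lookup.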

I then used the same `Lock` approach to serialize access to
`Frontend.configurationLoader` which increased the performance gap
even further.

After these improvements, I was able to obtain quite significant
performance gains using `Lock`:

- serial (non optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
- serial (optimized) vs serial (non optimized): ~1.6x (44s vs 74s)
- serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Sadly, a custom `Lock` implementation is not ideal for `swift-format`
to maintain, and Windows support was not provided. As such, the
alternative would be to use `NSLock`, which, being part of
`Foundation`, is supported on all major platforms.

Using `NSLock` the improvements were not so good, unfortunately:

- serial (non optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
- serial (NSLock) vs serial (non optimized): ~1.4x (52s vs 74s)
- serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)

The best solution was then to try to remove all concurrency contention
from `swift-format`. By flattening the `FileToProcess` array and
making the `Rule`'s `nameCache` static, close-to-optimal performance
should be achievable.
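The "flattened" driver described above can be sketched as follows: gather every file to process up front (serially, so the configuration cache sees no concurrent access), then fan out with `DispatchQueue.concurrentPerform`, each iteration writing to its own result slot so there is no shared mutable state to contend on. `FileToProcess` and `processFile` here are simplified stand-ins for swift-format's real types:

```swift
import Foundation
import Dispatch

// Simplified stand-in for swift-format's per-file work item.
struct FileToProcess {
  let path: String
}

// Placeholder for formatting/linting a single file.
func processFile(_ file: FileToProcess) -> String {
  return file.path.uppercased()
}

// Step 1: compute the full list serially, before any parallelism.
let files = (1...4).map { FileToProcess(path: "file\($0).swift") }

// Step 2: process in parallel. Each iteration writes a distinct
// index, so no synchronization is needed.
var results = [String?](repeating: nil, count: files.count)
results.withUnsafeMutableBufferPointer { buffer in
  DispatchQueue.concurrentPerform(iterations: files.count) { index in
    buffer[index] = processFile(files[index])
  }
}
```

Because `concurrentPerform` blocks until all iterations finish, `results` is fully populated once the call returns.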

Base rules `SyntaxFormatRule` and `SyntaxLintRule` should be kept
in the `SwiftFormatCore` target to eventually support dynamic
discovery of rules (ditching `generate-pipeline`) and only require
rule implementers to import `SwiftFormatCore`.

On the other hand, any statically generated `ruleName` cache must be
in the `SwiftFormatRules` target, because that's where rule
implementations reside to capture their `ObjectIdentifier`s.

So, in order to have a static rule name cache that's accessible from
`SwiftFormatCore` (especially from `SyntaxFormatRule`), the solution
was to pass a `ruleNameCache` dictionary to the `Context` that is
injected into every `Rule`.

This allows using `generate-pipeline` to produce a `ruleNameCache`
in `SwiftFormatRules` that's injected into all `Context`s by each of
the pipelines (which is done in the `SwiftFormat` target).
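The injection described above might look roughly like this. `Rule`, `DemoRule`, and the `ruleNameCache` literal are illustrative stand-ins, not swift-format's real declarations; in swift-format the dictionary is emitted by `generate-pipeline` into `RuleNameCache+Generated.swift`:

```swift
import Foundation

// Simplified stand-in for swift-format's rule protocol.
protocol Rule {
  static var ruleName: String { get }
}

struct DemoRule: Rule {
  static let ruleName = "DemoRule"
}

// Stand-in for the generated cache: rule metatype -> rule name.
let ruleNameCache: [ObjectIdentifier: String] = [
  ObjectIdentifier(DemoRule.self): DemoRule.ruleName
]

struct Context {
  let ruleNameCache: [ObjectIdentifier: String]
  var disabledRules: Set<String> = []

  // Takes the rule *type*, so the name is a plain dictionary read
  // instead of a (slow) protocol-extension lookup.
  func isRuleEnabled(_ rule: Rule.Type) -> Bool {
    guard let name = ruleNameCache[ObjectIdentifier(rule)] else {
      // Mirrors the assertion discussed above: a missing entry
      // likely means generate-pipeline wasn't rerun after adding
      // a new rule.
      assertionFailure("Rule \(rule) missing from name cache")
      return false
    }
    return !disabledRules.contains(name)
  }
}
```

Note that since Swift's `Dictionary` is copy-on-write, handing the cache to each `Context` copies a reference, not the storage, so the per-`Context` cost stays small as long as the cache is never mutated.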

While not as ideal as having the cache accessed in a single place
(this approach incurs the cache being copied for each `Context`), it
achieves a good tradeoff given current constraints. Let's hope the
compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

- serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
- serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
- serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

Tests were made on my `MacBookPro16,1` (8-core i9 @ 2.4GHz), on a
project with 2135 Swift files, compiling `swift-format` in Release
mode.

## Changes

- Make file preparation serial by calculating all `FileToProcess`
before processing them in parallel.

- Update `Context` to receive a `ruleNameCache` of type
`[ObjectIdentifier: String]` on `init`.

- Update `Context.isRuleEnabled` to receive the `Rule` type instead of
the rule's name, to use the new cache.

- Add new `RuleNameCacheGenerator` to `generate-pipeline`, which
outputs a static `ruleNameCache` dictionary to a
`RuleNameCache+Generated.swift` file in the `SwiftFormatRules`
target with all existing rules.

- Remove the `nameCache` and `nameCacheQueue` from `Rule.swift`.
@p4checo
Contributor Author

p4checo commented Sep 1, 2021

I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems (like generate-pipeline not being run after adding something new), and falling back to the slower implementation would lead to slower performance for those rules. I think it's better not to risk that kind of inconsistency.

Yeah, that was precisely my thought process! 😄

I merged #263, so if you rebase and rerun generate-pipeline again, the cache should pick those up.

Indeed it did! Pushing rebased and updated changes.

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 39aa5ab to 6d500ea Compare September 1, 2021 14:53
Member

@allevato allevato left a comment


Thanks for your patience! I just did some quick testing and it looks good, so I think we're good to merge now.

@allevato allevato merged commit cc40c36 into swiftlang:main Sep 2, 2021
@p4checo p4checo deleted the fix-and-improve-parallel-mode branch September 3, 2021 09:16
allevato added a commit to allevato/swift-format that referenced this pull request Sep 21, 2021
dylansturg pushed a commit to dylansturg/swift-format that referenced this pull request Apr 16, 2022
Now that we have the `isOptIn` property for rules, this should be
used to disable rules by default instead of removing them from
pipeline generation entirely. This also unblocks
swiftlang#261, since excluded
rules wouldn't show up in the generated name cache, causing their
tests to crash.