# Fix and improve parallel mode #261
Conversation
Thanks for looking into this and providing such thorough benchmarks.
My main concern here is that I'd prefer that swift-format not have to implement its own Lock
primitive in order to get decent performance. AFAIK, neither implementation in that class will work on Windows, since it doesn't support pthreads (@compnerd correct me if I'm wrong here).
Ideally, I'd like to have this work without needing three (or even two) low-level platform-specific solutions, since it shouldn't be the responsibility of swift-format to maintain those. Can we improve the existing use of Dispatch or figure out why NSLock
performed so horribly instead and still get some wins, even if it's not the absolute fastest outcome?
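To make the "improve the existing use of Dispatch" idea concrete, here is a hedged sketch (type and label names are hypothetical, not swift-format's actual API) of a Dispatch-only cache: a concurrent queue lets reads run in parallel and reserves an exclusive barrier for the rare write.

```swift
import Dispatch

// Hypothetical sketch: a type-name cache synchronized with a concurrent
// DispatchQueue. Reads run concurrently; only a cache miss pays for an
// exclusive barrier write.
final class QueueBackedNameCache {
  private var storage: [ObjectIdentifier: String] = [:]
  private let queue = DispatchQueue(label: "name-cache", attributes: .concurrent)

  func name(for type: Any.Type) -> String {
    let key = ObjectIdentifier(type)
    // Concurrent read: many threads may execute this sync block at once.
    if let cached = queue.sync(execute: { storage[key] }) {
      return cached
    }
    let computed = String(describing: type)
    // Barrier write: waits for in-flight reads, then runs exclusively.
    queue.sync(flags: .barrier) { storage[key] = computed }
    return computed
  }
}
```

Whether a shape like this beats a plain lock depends on the read/write ratio and contention level; only benchmarks like the ones in this thread can settle that.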
Thanks @allevato 🙏🏼 Yeah, I had a feeling that adding our own […] I'll give it another shot with […]. Will update this in the meantime.
@allevato - you are correct: pthreads (POSIX Threads) are not available on Windows, and we would need to use Windows threads. There are already 2 (or was it 3?) implementations of Lock that I've done previously, though using […]
(force-pushed from 3c5d9fb to 4449670)
Sadly, using […]
(force-pushed from 4449670 to 1fb7c19)
`Sources/SwiftFormatCore/Rule.swift` (outdated)

```swift
if let cachedName = cachedName {
  nameCacheLock.lock()
```
At the time, this worked fine, but maybe it's worth revisiting whether it's worth caching this at all: #242 (comment)
This is a very valid point. I did some tests flattening the `FileToProcess` array and removing `Rule`'s `nameCache`, effectively removing all need for synchronization in a concurrent setup (you can find them in one of the benchmarks in the initial PR description). It was indeed faster than this current `NSLock` version, but only in parallel mode. At the time I reverted back to using a cache with the low-level lock, as it was still more performant (13.5s vs 17.5s in parallel mode) while being the fastest setup I could achieve overall.

I've re-run some benchmarks, and I think it's probably not worth removing synchronization, as you can see below.

(I also noticed a very big difference in parallel mode from my previous tests, so it's likely that I screwed up when copy/pasting the previous results 🙈)
flattened `FileToProcess` array, no `Rule` `nameCache`

serial

```
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     96.948 s ±  2.903 s    [User: 93.642 s, System: 2.513 s]
  Range (min … max):   93.368 s … 102.528 s    10 runs
```

parallel

```
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     34.519 s ±  6.421 s    [User: 364.389 s, System: 13.934 s]
  Range (min … max):   18.097 s … 38.766 s    10 runs
```
After opening this PR I also made some other attempts at rethinking the `Rule.ruleName` mechanics, but ended up giving up because I never reached an elegant solution. I tried:

- Precomputing all rule names statically (during codegen), but hit some issues related to dynamic dispatch (the `Rule.ruleName` protocol extension would always be called).
- Removing the `Rule.ruleName` default protocol extension and using inheritance; this worked but forced rules to override `ruleName`, defeating the purpose of the protocol's abstraction a bit.
- Changing the `Rule` protocol itself (e.g. making `ruleName` an instance property), which created other issues, and the abstraction itself felt a bit wrong, because a rule's name should effectively be at the type level.

I definitely think we could start a discussion on how to improve this, because `Rule.ruleName` is part of a very hot path, especially since it's a crucial step in `visitAny`. In the meantime I'll give it another go to see if I can come up with something.
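The dynamic-dispatch pitfall in the first bullet can be reduced to a small, self-contained example (the types here are illustrative, not swift-format's real ones): when `ruleName` lives only in a protocol extension rather than being a protocol requirement, generic code statically dispatches to the extension and ignores any shadowing implementation on the concrete type.

```swift
protocol Rule {}

extension Rule {
  // Not a protocol requirement: calls made through a generic constraint
  // are statically dispatched here, never to a conforming type's shadow.
  static var ruleName: String { "\(Self.self)" }
}

struct UseEarlyExits: Rule {
  // Shadows the extension, but generic code never sees it.
  static var ruleName: String { "CustomName" }
}

func name<R: Rule>(of rule: R.Type) -> String {
  rule.ruleName  // always resolves to the protocol extension
}
```

Calling `name(of: UseEarlyExits.self)` yields the extension's answer ("UseEarlyExits"), while `UseEarlyExits.ruleName` directly yields "CustomName", which is exactly the kind of surprise that makes precomputing names through the protocol tricky.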
Well, I think I managed to come up with a pretty good trade-off that allows having a static rule name cache, by shuffling some files around and using codegen from `generate-pipeline` 😁

Let's see what @allevato thinks of it.
(force-pushed from 1fb7c19 to 9823b4b)
Given that using […] By flattening the […] In order to have both […] I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. What do you say, @allevato? New results:
flattened […]
This is shaping up nicely!
Almost. They're not concrete rules; they're the base types that other rules should extend. It's important that we keep those two types in the […]. It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind; otherwise we'd just have to back the changes out later anyway. I don't mind the generation of the cache itself, but can you make it work without moving those types around?
Yes, I am aware that they are base classes for the other rules (hence placing them in a directory called Base), but I didn't consider the future plans of having dynamic discovery of rules without […]. It would be quite a trivial change to do, but it's a valid point nonetheless.

I can definitely try. But there's both a circular dependency and a dynamic dispatch problem as things are, which makes this complicated:

These were the reasons why I ended up moving those two base classes and the […]. To work around the issue in point 5, a solution would be to implement […]. The absolute best approach for this performance-wise would be for rules to simply declare a […]. I'll give it yet another go to see what I can come up with. Do you have any idea of what could work?
(force-pushed from 9823b4b to 2c9cfb4)
Well, I have yet another possible solution, which hopefully will tick all the necessary boxes 😅 So, in order to have a static rule name cache that's accessible from […]. This allows using […]. While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each […]). All things considered, results were quite good nonetheless:
flattened […]
This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton. It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the […]).
I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new […]
That's great news!
Whoops 🙈 Cherry-pick fail. Will fix right away
Yes, that makes sense. COW's got our backs here 😁
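The copy-on-write point above can be checked with a tiny example: a Swift `Dictionary` assignment shares storage until one side mutates, so handing the same never-mutated cache to many contexts stays cheap.

```swift
// Copy-on-write in action: assignment shares storage; only mutation
// triggers an actual copy, and only for the mutated instance.
var cache: [ObjectIdentifier: String] = [
  ObjectIdentifier(Int.self): "Int"
]
let sharedView = cache  // no element-by-element copy happens here
cache[ObjectIdentifier(Double.self)] = "Double"  // `cache` now copies its storage
// `sharedView` keeps the old contents, untouched by the mutation.
```

This is why an immutable cache handed to every context is effectively free to "copy".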
(force-pushed from 2c9cfb4 to 39aa5ab)
Now that we have the `isOptIn` property for rules, this should be used to disable rules by default instead of removing them from pipeline generation entirely. This also unblocks swiftlang#261, since excluded rules wouldn't show up in the generated name cache, causing their tests to crash.
We're almost there! I ran the tests locally and realized that the old hardcoded list of rules suppressed from […]. I've got #263 out to fix this, which should unblock this change.
Nice catch, and thanks for the quick fix! 🙏🏼 If I'm not mistaken, that was caused by the new […]
I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems (like […]). I merged #263, so if you rebase and rerun […]
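The assertion being discussed might look roughly like this (a sketch with a made-up helper name; in the PR the lookup lives behind `Context` and the generated cache): a cache miss is treated as a programmer error, such as a rule that was excluded from pipeline generation.

```swift
// Hypothetical helper: resolve a rule type's name from a generated cache,
// treating a missing entry as a programmer error (e.g. a rule that was
// excluded from pipeline generation and so never landed in the cache).
func cachedRuleName(
  for rule: Any.Type,
  in ruleNameCache: [ObjectIdentifier: String]
) -> String {
  guard let name = ruleNameCache[ObjectIdentifier(rule)] else {
    preconditionFailure("\(rule) is missing from the generated rule name cache")
  }
  return name
}
```

Failing loudly here surfaces a stale generated file immediately instead of silently misbehaving at lint time.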
Yeah, that was precisely my thought process! 😄
Indeed it did! Pushing rebased and updated changes.
(force-pushed from 39aa5ab to 6d500ea)
Thanks for your patience! I just did some quick testing and it looks good, so I think we're good to merge now.
Merged: Fix and improve parallel mode
## Motivation

Parallel mode was crashing with bad memory access due to data races accessing `Frontend.configurationLoader`, more specifically its `cache`.

After serializing access (initially with an `NSLock`), I observed that despite no longer crashing, performance was surprisingly *worse* than in single-threaded mode.

That led me down a road of investigating why this was happening, and after some profiling I discovered that `Rule`'s `nameCache` was the main source of contention, causing around 30% of total time to be spent waiting for locks (`ulock_wait`) in my tests.

After replacing the synchronization mechanism on `nameCache` with a more efficient `os_unfair_lock_t` (`pthread_mutex_t` on Linux), the contention dropped significantly, and parallel mode now outperformed single-threaded mode as expected. As a bonus, these changes also improved single-threaded performance, due to the reduced overhead of a lock versus a queue.

I then used the same `Lock` approach to serialize access to `Frontend.configurationLoader`, which increased the performance gap even further.

After these improvements, I was able to obtain quite significant performance gains using `Lock`:

- serial (non-optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
- serial (optimized) vs serial (non-optimized): ~1.6x (44s vs 74s)
- serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Tests were made on my `MacBookPro16,1` (8-core i9 @ 2.4GHz), on a project with 2135 Swift files, compiling `swift-format` in Release mode.

## Update 1

Sadly, a custom `Lock` implementation is not ideal for `swift-format` to maintain, and Windows support was not provided. The alternative would be to use `NSLock`, which, being part of `Foundation`, is supported on all major platforms.

Using `NSLock`, the improvements were unfortunately not as good:

- serial (non-optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
- serial (NSLock) vs serial (non-optimized): ~1.4x (52s vs 74s)
- serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)
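For reference, the lock-guarded cache shape described above can be sketched cross-platform with Foundation's `NSLock` (the PR's actual `Lock` wrapped `os_unfair_lock_t` on Darwin and `pthread_mutex_t` on Linux; the type and method names below are illustrative, not swift-format's):

```swift
import Foundation

// Sketch of a lock-guarded name cache. Unlike a DispatchQueue, an
// uncontended lock acquisition is essentially one atomic operation with
// no queue hop, which is where the serial-mode win comes from.
final class LockedNameCache {
  private var storage: [ObjectIdentifier: String] = [:]
  private let lock = NSLock()

  func name(for type: Any.Type) -> String {
    let key = ObjectIdentifier(type)
    lock.lock()
    defer { lock.unlock() }
    if let cached = storage[key] { return cached }
    let computed = String(describing: type)
    storage[key] = computed
    return computed
  }
}
```

Under heavy parallel contention every caller still serializes on this one lock, which is why removing the shared cache entirely (below) is the next step.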
## Update 2

The best solution was then to try to remove all concurrency contention from `swift-format`. By flattening the `FileToProcess` array and making the `Rule`'s `nameCache` static, the best overall results were achieved (see the benchmarks below).

In order to have `Rule.ruleName` caching while avoiding synchronization, a static cache of rule names is now generated via the `generate-pipeline` tool.

To make the rule types available to the `Rule.ruleName` protocol extension, it was moved to the `SwiftFormatRules` target, together with `SyntaxFormatRule` and `SyntaxLintRule`.

## Update 3
Base rules `SyntaxFormatRule` and `SyntaxLintRule` should be kept in the `SwiftFormatCore` target, to eventually support dynamic discovery of rules (ditching `generate-pipeline`) and only require rule implementers to import `SwiftFormatCore`.

On the other hand, any statically generated `ruleName` cache must live in the `SwiftFormatRules` target, because that's where rule implementations reside to capture their `ObjectIdentifier`s at compile time.

So, in order to have a static rule name cache that's accessible from `SwiftFormatCore` (especially from `SyntaxFormatRule`), the solution was to inject a `ruleNameCache` dictionary into `Context`, which is in turn injected into every `Rule`.

This allows using `generate-pipeline` to produce a `ruleNameCache` in `SwiftFormatRules` that's injected into all `Context`s by each of the pipelines (which is done in the `SwiftFormat` target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each `Context`), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

- serial (non-optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
- serial (optimized) vs serial (non-optimized): ~1.7x (42.5s vs 74s)
- serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)
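The injection described in this update can be sketched as follows (heavily simplified: the real `Context` carries much more state, and the cache contents come from `generate-pipeline`, not hand-written literals):

```swift
// Simplified sketch of the injected rule-name cache.
final class Context {
  // Generated once, injected everywhere; never mutated afterwards, so
  // sharing it across Contexts is cheap thanks to copy-on-write.
  let ruleNameCache: [ObjectIdentifier: String]
  var enabledRuleNames: Set<String> = []

  init(ruleNameCache: [ObjectIdentifier: String]) {
    self.ruleNameCache = ruleNameCache
  }

  // Takes the rule *type*, not a name string, so the check is a plain
  // dictionary lookup with no synchronization at all.
  func isRuleEnabled(_ rule: Any.Type) -> Bool {
    guard let name = ruleNameCache[ObjectIdentifier(rule)] else { return false }
    return enabledRuleNames.contains(name)
  }
}

// Stand-in for what RuleNameCache+Generated.swift would provide
// (DoNotUseSemicolons is a real rule name; the type here is a dummy):
struct DoNotUseSemicolons {}
let generatedRuleNameCache: [ObjectIdentifier: String] = [
  ObjectIdentifier(DoNotUseSemicolons.self): "DoNotUseSemicolons",
]
```

Because `ObjectIdentifier` is derived from the type's identity, the generated dictionary gives each rule an O(1), lock-free path from its metatype to its name.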
## Changes

- Make file preparation serial by calculating all `FileToProcess` before processing them in parallel.
- Update `Context` to receive a `ruleNameCache` of type `[ObjectIdentifier: String]` on `init`.
- Update `Context.isRuleEnabled` to receive the `Rule` type instead of the rule's name, in order to use the new cache.
- Add a new `RuleNameCacheGenerator` to `generate-pipeline`, which outputs a static `ruleNameCache` dictionary to a `RuleNameCache+Generated.swift` file in the `SwiftFormatRules` target, covering all existing rules.
- Remove the `nameCache` and `nameCacheQueue` from `Rule.swift`.
## Benchmarks

Below are benchmarks of some configurations attempted during development, using `hyperfine`. (Some of these configurations' implementations are available in branches on my fork.)

The flattened `FileToProcess` array was an attempt to completely remove the need to synchronize access to `Frontend.configurationLoader`, by making initial file preparation sequential. It results in a slight performance improvement in single-threaded mode (~3.6%), but a performance loss in parallel mode (~4.6%). I opted for maximizing parallel mode's performance, as that should be the more optimized mode.

Work was done based off `swift-5.4-branch`, because I'm using Xcode 12.5.

### single-thread

- No lock in `ConfigurationLoader`, `DispatchQueue` in `Rule` `nameCache` (branches: `swift-5.4-branch`, `main`)
- `Lock` (`os_unfair_lock_t`) in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-lock-configuration-loader-and-rule-name-cache`, this PR's)
- Flattened `FileToProcess` array, `Lock` (`os_unfair_lock_t`) in `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache`)

### parallel

- `NSLock` in `ConfigurationLoader`, `DispatchQueue` in `Rule` `nameCache`
- `Lock` (`os_unfair_lock_t`) in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-lock-configuration-loader-and-rule-name-cache`, this PR's)
- Flattened `FileToProcess` array, `Lock` (`os_unfair_lock_t`) in `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache`)
- Flattened `FileToProcess` array, no `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-no-rule-name-cache`)
## Benchmarks - Update 1

`NSLock` in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache`)

### serial

### parallel

## Benchmarks - Update 2

Flattened `FileToProcess` array, static `Rule` `nameCache` (branch: `swift-5.4-branch-static-rule-name-cache`)

### serial

### parallel

## Benchmarks - Update 3

Flattened `FileToProcess` array, static `ruleNameCache` injected into `Context` (branch: `swift-5.4-branch-rule-name-cache-in-context`)

### serial

### parallel