Fix and improve parallel mode #261


Merged: 1 commit, Sep 2, 2021

Conversation

Contributor

@p4checo p4checo commented Aug 16, 2021

Motivation

Parallel mode was crashing with bad memory access due to data races accessing Frontend.configurationLoader, more specifically its cache.

After serializing access (initially with an NSLock), I observed that despite not crashing anymore, performance was surprisingly worse than in single-threaded mode.

That led me down a road of investigating why this was happening, and after some profiling I discovered that Rule's nameCache was the main source of contention, causing around 30% of total time to be spent waiting for locks (ulock_wait) in my tests.

After replacing the synchronization mechanism on nameCache with a more efficient os_unfair_lock_t (pthread_mutex_t on Linux), the contention dropped significantly, and parallel mode outperformed single-threaded mode as expected. As a bonus, these changes also improved single-threaded performance, due to the lower overhead of a lock vs a queue.
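As a rough sketch (not this PR's actual code), a minimal cross-platform lock of this kind could look like the following, using os_unfair_lock on Darwin and pthread_mutex_t elsewhere; the names and structure are illustrative:

```swift
import Dispatch
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Hypothetical sketch of a cross-platform low-level lock.
final class Lock {
    #if canImport(Darwin)
    private let mutex: os_unfair_lock_t = {
        let pointer = os_unfair_lock_t.allocate(capacity: 1)
        pointer.initialize(to: os_unfair_lock_s())
        return pointer
    }()
    deinit {
        mutex.deinitialize(count: 1)
        mutex.deallocate()
    }
    func lock() { os_unfair_lock_lock(mutex) }
    func unlock() { os_unfair_lock_unlock(mutex) }
    #else
    private var mutex = pthread_mutex_t()
    init() { pthread_mutex_init(&mutex, nil) }
    deinit { pthread_mutex_destroy(&mutex) }
    func lock() { pthread_mutex_lock(&mutex) }
    func unlock() { pthread_mutex_unlock(&mutex) }
    #endif

    // Runs `body` while holding the lock.
    func withLock<T>(_ body: () throws -> T) rethrows -> T {
        lock()
        defer { unlock() }
        return try body()
    }
}

let lock = Lock()
var counter = 0
// 1000 concurrent increments; without the lock this would be a data race.
DispatchQueue.concurrentPerform(iterations: 1000) { _ in
    lock.withLock { counter += 1 }
}
print(counter)
```

Unlike a serial DispatchQueue, an uncontended lock acquisition stays in user space, which is where the reduced overhead comes from.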

I then used the same Lock approach to serialize access to Frontend.configurationLoader, which widened the performance gap even further.

After these improvements, I was able to obtain quite significant performance gains:

  • serial (non optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.6x (44s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Tests were made on my MacBookPro16,1 (8-core i9 @2.4GHz), on a project with 2135 Swift files, compiling swift-format in Release mode.

Update 1

Sadly, a custom Lock implementation is not ideal for swift-format to maintain, and Windows support was not provided. As such, the alternative would be to use NSLock, which, being part of Foundation, is supported on all major platforms.

Using NSLock, the improvements were unfortunately not as good:

  • serial (non optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
  • serial (NSLock) vs serial (non optimized): ~1.4x (52s vs 74s)
  • serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)

Update 2

The best solution was then to try and remove all concurrency contention from swift-format. By flattening the FileToProcess array and making the Rule's nameCache static, the best overall results were achieved:

  • serial (non optimized) vs parallel (optimized): ~6.7x (11s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.8x (40.6s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.6x (11s vs 40.6s)

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool.

To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.
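For illustration, the generated cache roughly takes the following shape. The rule types below are simplified stand-ins rather than swift-format's real declarations; the actual file (RuleNameCache+Generated.swift) is emitted by generate-pipeline:

```swift
// Stand-in rule types, for illustration only.
protocol Rule {}
final class AlwaysUseLowerCamelCase: Rule {}
final class DoNotUseSemicolons: Rule {}

// Approximate shape of the generated static cache: one entry per concrete
// rule, keyed by the rule type's ObjectIdentifier.
let ruleNameCache: [ObjectIdentifier: String] = [
    ObjectIdentifier(AlwaysUseLowerCamelCase.self): "AlwaysUseLowerCamelCase",
    ObjectIdentifier(DoNotUseSemicolons.self): "DoNotUseSemicolons",
]

// A lookup is a plain dictionary read: no locks, no runtime name derivation.
print(ruleNameCache[ObjectIdentifier(DoNotUseSemicolons.self)] ?? "unknown")
```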

Update 3

Base rules SyntaxFormatRule and SyntaxLintRule should be kept in the SwiftFormatCore target to eventually support dynamic discovery of rules (ditching generate-pipeline) and only require rule implementers to import SwiftFormatCore.

On the other hand, any statically generated ruleName cache must be in SwiftFormatRules target, because that's where rule implementations reside to capture their ObjectIdentifier at compile time.

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

  • serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
  • serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

Changes

  • Make file preparation serial by calculating all FileToProcess before processing them in parallel.

  • Update Context to receive a ruleNameCache of type [ObjectIdentifier: String] on init.

  • Update Context.isRuleEnabled to receive the Rule type instead of the rule's name, to use the new cache.

  • Add a new RuleNameCacheGenerator to generate-pipeline, which outputs a static ruleNameCache dictionary with all existing rules to the RuleNameCache+Generated.swift file in the SwiftFormatRules target.

  • Remove nameCache and nameCacheQueue from Rule.swift.
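A minimal sketch of the reworked Context API described in these bullets, with simplified stand-in types (the real Context carries much more state):

```swift
// Stand-in for swift-format's Rule protocol.
protocol Rule {
    static var ruleName: String { get }
}
extension Rule {
    // Slow fallback, only hit if a rule is missing from the cache.
    static var ruleName: String { String(describing: Self.self) }
}

final class Context {
    private let ruleNameCache: [ObjectIdentifier: String]
    private let enabledRules: Set<String>

    // The statically generated cache is injected on init.
    init(ruleNameCache: [ObjectIdentifier: String], enabledRules: Set<String>) {
        self.ruleNameCache = ruleNameCache
        self.enabledRules = enabledRules
    }

    // Takes the rule *type*, so the name can come from the static cache
    // instead of being derived at runtime.
    func isRuleEnabled<R: Rule>(_ rule: R.Type) -> Bool {
        let name = ruleNameCache[ObjectIdentifier(rule)] ?? rule.ruleName
        return enabledRules.contains(name)
    }
}

struct DoNotUseSemicolons: Rule {}  // hypothetical rule

let context = Context(
    ruleNameCache: [ObjectIdentifier(DoNotUseSemicolons.self): "DoNotUseSemicolons"],
    enabledRules: ["DoNotUseSemicolons"]
)
print(context.isRuleEnabled(DoNotUseSemicolons.self))
```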

Benchmarks

Below are benchmarks of some configurations attempted during development, using hyperfine.
(Some of these configurations' implementations are available in branches on my fork).

The flattened FileToProcess array was an attempt to completely remove the need to synchronize access to Frontend.configurationLoader, by making initial file preparation sequential. It results in a slight performance improvement in single-threaded mode (~3.6%) but a performance loss in parallel mode (~4.6%). I opted to maximize parallel mode's performance, as that should be the more optimized mode.
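The two-phase structure described here can be sketched as follows; FileToProcess and the processing function are simplified stand-ins for swift-format's types:

```swift
import Dispatch

// Stand-in for swift-format's per-file work item.
struct FileToProcess {
    let path: String
}

func process(_ file: FileToProcess) {
    // Formatting/linting work would happen here.
}

let paths = ["A.swift", "B.swift", "C.swift"]

// Phase 1 (serial): prepare every file up front, so the configuration
// loader is only touched from one thread and needs no synchronization.
let filesToProcess = paths.map(FileToProcess.init)

// Phase 2 (parallel): the hot loop shares no mutable state.
DispatchQueue.concurrentPerform(iterations: filesToProcess.count) { index in
    process(filesToProcess[index])
}
print(filesToProcess.count)
```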

Work was done based off swift-5.4-branch, because I'm using Xcode 12.5.

single-thread

no lock in ConfigurationLoader, DispatchQueue in Rule nameCache

branches: swift-5.4-branch, main

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     73.908 s ±  1.882 s    [User: 70.942 s, System: 2.086 s]
  Range (min … max):   70.526 s … 76.510 s    10 runs

Lock (os_unfair_lock_t) in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-lock-configuration-loader-and-rule-name-cache, this PR's

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     44.077 s ±  1.554 s    [User: 41.538 s, System: 1.832 s]
  Range (min … max):   42.809 s … 48.050 s    10 runs

flattened FileToProcess array, Lock (os_unfair_lock_t) in Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     42.365 s ±  0.933 s    [User: 40.165 s, System: 1.629 s]
  Range (min … max):   40.818 s … 43.554 s    10 runs

parallel

NSLock in ConfigurationLoader, DispatchQueue in Rule nameCache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     86.431 s ±  7.208 s    [User: 369.514 s, System: 276.583 s]
  Range (min … max):   72.179 s … 93.882 s    10 runs

Lock (os_unfair_lock_t) in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-lock-configuration-loader-and-rule-name-cache, this PR's

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     13.565 s ±  0.354 s    [User: 101.115 s, System: 69.132 s]
  Range (min … max):   12.921 s … 14.184 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     13.453 s ±  0.448 s    [User: 99.746 s, System: 68.694 s]
  Range (min … max):   12.409 s … 13.992 s    10 runs

flattened FileToProcess array, Lock (os_unfair_lock_t) in Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     14.146 s ±  0.283 s    [User: 101.879 s, System: 73.914 s]
  Range (min … max):   13.711 s … 14.628 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     14.192 s ±  0.537 s    [User: 101.197 s, System: 75.123 s]
  Range (min … max):   13.453 s … 15.096 s    10 runs

flattened FileToProcess array, no Rule nameCache

branch: swift-5.4-branch-serial-files-to-process-and-no-rule-name-cache

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     17.373 s ±  0.348 s    [User: 199.595 s, System: 5.241 s]
  Range (min … max):   16.972 s … 17.946 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     17.553 s ±  0.812 s    [User: 198.131 s, System: 5.342 s]
  Range (min … max):   16.491 s … 18.958 s    10 runs

Benchmarks - Update 1

NSLock in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache

serial
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.643 s ±  1.143 s    [User: 41.397 s, System: 1.711 s]
  Range (min … max):   42.309 s … 45.757 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     47.453 s ±  2.253 s    [User: 44.743 s, System: 2.024 s]
  Range (min … max):   42.963 s … 51.134 s    10 runs
parallel
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.957 s ±  4.178 s    [User: 160.686 s, System: 337.798 s]
  Range (min … max):   29.375 s … 44.625 s    10 runs

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.103 s ±  2.587 s    [User: 157.523 s, System: 343.123 s]
  Range (min … max):   32.831 s … 42.022 s    10 runs

Benchmarks - Update 2

flattened FileToProcess array, static Rule nameCache

branch: swift-5.4-branch-static-rule-name-cache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.685 s ±  0.517 s    [User: 38.593 s, System: 1.620 s]
  Range (min … max):   40.013 s … 41.598 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.616 s ±  0.791 s    [User: 38.534 s, System: 1.580 s]
  Range (min … max):   39.264 s … 41.752 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.416 s ±  1.732 s    [User: 111.428 s, System: 4.094 s]
  Range (min … max):    8.246 s … 12.632 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.455 s ±  2.099 s    [User: 122.742 s, System: 4.601 s]
  Range (min … max):    7.896 s … 13.399 s    10 runs

Benchmarks - Update 3

flattened FileToProcess array, static ruleNameCache injected in Context

branch: swift-5.4-branch-rule-name-cache-in-context

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.524 s ±  1.003 s    [User: 41.400 s, System: 1.676 s]
  Range (min … max):   42.648 s … 45.514 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     41.646 s ±  0.714 s    [User: 39.650 s, System: 1.610 s]
  Range (min … max):   40.381 s … 42.525 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.088 s ±  1.160 s    [User: 108.481 s, System: 3.700 s]
  Range (min … max):    8.441 s … 11.655 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.207 s ±  0.735 s    [User: 118.658 s, System: 4.291 s]
  Range (min … max):    9.315 s … 11.871 s    10 runs

@Rostepher Rostepher requested a review from allevato August 19, 2021 21:43
Member

@allevato allevato left a comment


Thanks for looking into this and providing such thorough benchmarks.

My main concern here is that I'd prefer that swift-format not have to implement its own Lock primitive in order to get decent performance. AFAIK, neither implementation in that class will work on Windows, since it doesn't support pthreads (@compnerd correct me if I'm wrong here).

Ideally, I'd like to have this work without needing three (or even two) low-level platform-specific solutions, since it shouldn't be the responsibility of swift-format to maintain those. Can we improve the existing use of Dispatch or figure out why NSLock performed so horribly instead and still get some wins, even if it's not the absolute fastest outcome?

@p4checo
Contributor Author

p4checo commented Aug 26, 2021

Thanks @allevato 🙏🏼

Yeah, I had a feeling that adding our own Lock primitive was risky 😅 And I completely forgot about Windows, so your comment makes complete sense on both fronts 🙈

I'll give it another shot with NSLock, because I believe that with Dispatch the only option besides a DispatchQueue (which we've seen generates a lot of contention) would be a DispatchSemaphore, which is not ideal. Even if the performance is not as good as with a platform-specific low-level lock, I think using an NSLock should still provide some wins.

Will update this in the meantime

@compnerd
Member

@allevato - you are correct, pthreads (POSIX Threads) are not available on Windows, and we would need to use Windows threads. There are already 2 (or was it 3?) implementations of Lock that I've done previously, though using NSLock would work as well as that is available on Windows as well.

@p4checo
Contributor Author

p4checo commented Aug 26, 2021

Sadly, using NSLock results in a significant performance hit vs the low-level locks, but still provides a slight improvement over the previous implementation. Haven't dived deep into why though, as I couldn't find much information about NSLock's semantics (e.g. fairness) or internals on a quick search.

NSLock in ConfigurationLoader and in Rule nameCache

branch: swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache

serial
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.643 s ±  1.143 s    [User: 41.397 s, System: 1.711 s]
  Range (min … max):   42.309 s … 45.757 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     47.453 s ±  2.253 s    [User: 44.743 s, System: 2.024 s]
  Range (min … max):   42.963 s … 51.134 s    10 runs
parallel
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.957 s ±  4.178 s    [User: 160.686 s, System: 337.798 s]
  Range (min … max):   29.375 s … 44.625 s    10 runs

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     39.103 s ±  2.587 s    [User: 157.523 s, System: 343.123 s]
  Range (min … max):   32.831 s … 42.022 s    10 runs

EDIT: re-ran benchmarks without the project being formatted being simultaneously opened on Xcode 🤦🏼

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 4449670 to 1fb7c19 Compare August 26, 2021 22:26
@p4checo p4checo requested a review from allevato August 26, 2021 22:28

if let cachedName = cachedName {
nameCacheLock.lock()
Member

At the time, this worked fine, but maybe it's worth revisiting whether it's worth caching this at all: #242 (comment)

Contributor Author

This is a very valid point. I did some tests flattening the FileToProcess array and removing Rule's nameCache, effectively removing all need for synchronization in a concurrent setup (you can find them in one of the benchmarks in the initial PR description). It was indeed faster than this current NSLock version, but only in parallel mode. At the time I reverted back to using a cache with the low-level lock, as it was still more performant (13.5s vs 17.5s in parallel mode) and the fastest setup I could achieve overall.

I've re-run some benchmarks, and I think it's probably not worth removing synchronization, as you can see below.
(I also noticed a very big difference in parallel mode from my previous tests, so it's likely that I screwed up when copy/pasting the previous results 🙈)

flattened FileToProcess array, no Rule nameCache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     96.948 s ±  2.903 s    [User: 93.642 s, System: 2.513 s]
  Range (min … max):   93.368 s … 102.528 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     34.519 s ±  6.421 s    [User: 364.389 s, System: 13.934 s]
  Range (min … max):   18.097 s … 38.766 s    10 runs

After opening this PR I also did some other attempts at rethinking the Rule.ruleName mechanics, but ended up giving up because I never reached an elegant solution. I tried:

  1. Precomputing all rule names statically (during codegen), but I hit some issues related to dynamic dispatch (the Rule.ruleName protocol extension would always be called).
  2. Removing the Rule.ruleName default protocol extension and using inheritance, which worked but forced rules to override ruleName, defeating the purpose of the protocol's abstraction a bit.
  3. Changing the Rule protocol itself (e.g. making ruleName an instance property), which created other issues; the abstraction also felt a bit wrong, because the rule's name should effectively be at the type level.

I definitely think we could start a discussion on how to improve this, because Rule.ruleName is part of a very hot path, especially as a crucial step in visitAny. In the meantime I'll give it another go to see if I can come up with something.

Contributor Author

Well, I think I managed to come up with a pretty good trade off that allows having a static rule name cache, by shuffling some files and using codegen from generate-pipeline 😁

Let's see what @allevato thinks of it

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 1fb7c19 to 9823b4b Compare August 27, 2021 20:44
@p4checo
Contributor Author

p4checo commented Aug 27, 2021

Given that using NSLock didn't come near optimal performance, the best solution was then to try and remove all concurrency contention from swift-format. 🤓

By flattening the FileToProcess array and making the Rule's nameCache static, the best overall results were achieved! 🚀

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool. To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.

I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. SyntaxFormatRule and SyntaxLintRule are rules after all 😅

What do you say, @allevato?

New results:

  • serial (non optimized) vs parallel (optimized): ~6.7x (11s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.8x (40.6s vs 74s)
  • serial (optimized) vs parallel (optimized): ~3.6x (11s vs 40.6s)

flattened FileToProcess array, static Rule nameCache

branch: swift-5.4-branch-static-rule-name-cache

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.685 s ±  0.517 s    [User: 38.593 s, System: 1.620 s]
  Range (min … max):   40.013 s … 41.598 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     40.616 s ±  0.791 s    [User: 38.534 s, System: 1.580 s]
  Range (min … max):   39.264 s … 41.752 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.416 s ±  1.732 s    [User: 111.428 s, System: 4.094 s]
  Range (min … max):    8.246 s … 12.632 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.455 s ±  2.099 s    [User: 122.742 s, System: 4.601 s]
  Range (min … max):    7.896 s … 13.399 s    10 runs

@allevato
Member

This is shaping up nicely!

In order to have both Rule.ruleName caching and avoid synchronization, a static cache of rule names is now generated via the generate-pipeline tool. To make the rule types available to the Rule.ruleName protocol extension, it was moved to SwiftFormatRules target, together with SyntaxFormatRule and SyntaxLintRule.

I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. SyntaxFormatRule and SyntaxLintRule are rules after all 😅

Almost. They're not concrete rules; they're the base types that other rules should extend. It's important that we keep those two types in the SwiftFormatCore module and separate from the SwiftFormatRules module that implements concrete rules, because eventually we'd like to be able to support dynamic discovery of rules and get rid of generate-pipeline completely. That would let folks write their own rules as plug-ins where they would import SwiftFormatCore, and we'd also factor out the built-in rules in the same fashion.

It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind, otherwise we'd just have to back the changes out later anyway.

I don't mind the generation of the cache itself, but can you make it work without moving those types around?

@p4checo
Contributor Author

p4checo commented Aug 30, 2021

Almost. They're not concrete rules; they're the base types that other rules should extend.

Yes, I am aware that they are base classes to the other rules (hence placing them in a directory called Base), but didn't consider the future plans of having dynamic discovery of rules without generate-pipeline.

It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind, otherwise we'd just have to back the changes out later anyway.

It would be quite a trivial change to make, but it's a valid point nonetheless.

I don't mind the generation of the cache itself, but can you make it work without moving those types around?

I can definitely try. But there's both a circular dependency and a dynamic dispatch problem as things are, which makes this complicated:

  1. SyntaxFormatRule, SyntaxLintRule and Rule live in SwiftFormatCore
  2. All current rule implementations live in SwiftFormatRules, which depends on SwiftFormatCore
  3. Generating the cache as it is requires that the types be visible to the compiler at build time, so they need to be in SwiftFormatRules (because that's where the rule implementations are)
  4. We can't import SwiftFormatRules (to access the codegen'ed cache) in Rule.swift because it would create a circular dependency
  5. As things are, SyntaxFormatRule's and SyntaxLintRule's ruleName invocations will always use dynamic dispatch and hit the protocol extension implementation in Rule.swift.

These were the reasons why I ended up moving those two base classes and the Rule.ruleName protocol extension to SwiftFormatRules, and why I consider it an elegant solution given the constraints.

To work around the issue in point 5, a solution would be to implement ruleName in the base classes and override it in subclasses. I attempted to do so using codegen, but it would require the classes to use the ObjC runtime to allow static vars to be declared in extensions, which then creates other issues related to SyntaxRewriter and SyntaxVisitor already being base classes.

The absolute best approach performance-wise would be for rules to simply declare static let ruleName: String = "Foo", abandoning the protocol extension completely. The extension, although well intended and nicely abstracted, is causing a significant performance penalty given the project and code structure.
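A minimal sketch of this tradeoff, with illustrative types rather than swift-format's actual declarations:

```swift
protocol Rule {
    static var ruleName: String { get }
}
extension Rule {
    // Default implementation: derives the name from the metatype via
    // reflection on every call, which is relatively slow on a hot path.
    static var ruleName: String { String(describing: Self.self) }
}

// Relies on the slow protocol-extension default.
struct UseEarlyExits: Rule {}  // hypothetical rule

struct DoNotUseSemicolons: Rule {
    // The fastest option discussed above: a stored constant per rule,
    // at the cost of abandoning the shared default.
    static let ruleName = "DoNotUseSemicolons"
}

print(UseEarlyExits.ruleName)
print(DoNotUseSemicolons.ruleName)
```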

I'll give it yet another go to see what I can come up with. Do you have any idea of what could work?

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 9823b4b to 2c9cfb4 Compare August 31, 2021 17:48
@p4checo
Contributor Author

p4checo commented Aug 31, 2021

Well, I have yet another possible solution, which hopefully will tick all the necessary boxes 😅

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good nonetheless:

  • serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
  • serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
  • serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

flattened FileToProcess array, static ruleNameCache injected in Context

branch: swift-5.4-branch-rule-name-cache-in-context

serial

Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     43.524 s ±  1.003 s    [User: 41.400 s, System: 1.676 s]
  Range (min … max):   42.648 s … 45.514 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     41.646 s ±  0.714 s    [User: 39.650 s, System: 1.610 s]
  Range (min … max):   40.381 s … 42.525 s    10 runs

parallel

Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     10.088 s ±  1.160 s    [User: 108.481 s, System: 3.700 s]
  Range (min … max):    8.441 s … 11.655 s    10 runs

Benchmark #2: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     11.207 s ±  0.735 s    [User: 118.658 s, System: 4.291 s]
  Range (min … max):    9.315 s … 11.871 s    10 runs

@allevato
Member

So, in order to have a static rule name cache that's accessible from SwiftFormatCore (especially from SyntaxFormatRule), a solution was to inject a ruleNameCache dictionary in Context, which is injected into every Rule.

This allows using generate-pipeline to produce a ruleNameCache in SwiftFormatRules that's injected into all Contexts by each of the pipelines (which is done in the SwiftFormat target).

This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton.

It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the SwiftFormatRules module instead of back in SwiftFormatCore). Can you fix those up please?

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each Context), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new Context will just store a Dictionary value that shares the same underlying storage with the original. That should be relatively fast.
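A quick illustration of this copy-on-write behavior with a plain dictionary:

```swift
// Assigning a dictionary shares its storage; the elements are only
// physically copied if one side mutates afterwards.
var original: [ObjectIdentifier: String] = [ObjectIdentifier(Int.self): "Int"]
let injected = original  // no allocation or element copy here

// Mutating `original` triggers the copy; `injected` is unaffected.
original[ObjectIdentifier(String.self)] = "String"

print(original.count)
print(injected.count)
```

Since the injected ruleNameCache is never mutated after creation, each Context only ever stores a cheap reference to the shared storage.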

@p4checo
Contributor Author

p4checo commented Aug 31, 2021

This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton.

That's great news!

It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the SwiftFormatRules module instead of back in SwiftFormatCore). Can you fix those up please?

Whoops 🙈 Cherry-pick fail. Will fix right away

I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new Context will just store a Dictionary value that shares the same underlying storage with the original. That should be relatively fast.

Yes, that makes sense. COW's got our backs here 😁

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 2c9cfb4 to 39aa5ab Compare August 31, 2021 22:09
allevato added a commit to allevato/swift-format that referenced this pull request Sep 1, 2021
Now that we have the `isOptIn` property for rules, this should be
used to disable rules by default instead of removing them from
pipeline generation entirely. This also unblocks
swiftlang#261, since excluded
rules wouldn't show up in the generated name cache, causing their
tests to crash.
@allevato
Member

allevato commented Sep 1, 2021

We're almost there! I ran the tests locally and realized that the old hardcoded list of rules suppressed from RuleCollector in generate-pipeline means they won't be in the name cache either, and the tests for those rules crash 🙃

I've got #263 out to fix this, which should unblock this change.

@p4checo
Contributor Author

p4checo commented Sep 1, 2021

Nice catch, and thanks for the quick fix! 🙏🏼

If I'm not mistaken, that was caused by the new assert in Context.isRuleEnabled, right? I added it as a reminder to favor using the cache and not rely on the Rule.ruleName fallback, which uses the (slow) protocol extension 😅. I can remove it if you believe it doesn't add value and creates more "friction" than help.

@allevato
Member

allevato commented Sep 1, 2021

I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems (like generate-pipeline not being run after adding something new), and falling back to the slower implementation would lead to slower performance for those rules. I think it's better not to risk that kind of inconsistency.

I merged #263, so if you rebase and rerun generate-pipeline again, the cache should pick those up.

Parallel mode was crashing with bad memory access due to data races
accessing `Frontend.configurationLoader`, more specifically its `cache`.

After serializing access (initially with an `NSLock`), I observed that
despite not crashing anymore, the performance was surprisingly *worst*
than in single threaded mode.

That led me down a road of investigating why this was happening, and
after some profiling I discovered that `Rule`'s `nameCache` was the
main source of contention - causing around 30% of total spent time
waiting for locks (`ulock_wait`) on my tests.

After replacing the synchronization mechanism on `nameCache` with a
more efficient `os_unfair_lock_t` (`pthread_mutex_t` on Linux), the
contention dropped significantly, and parallel mode now outperformed
single threaded mode as expected. As a bonus, these changes also
improved single threaded mode performance, due to the reduced
overhead of using a lock vs a queue.
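A minimal sketch of what such a `Lock` might look like, assuming the platform split described above (`os_unfair_lock` on Darwin, `pthread_mutex_t` elsewhere). The type and method names here are illustrative, not swift-format's actual implementation:

```swift
import Foundation
import Dispatch

#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Illustrative low-overhead mutex: `os_unfair_lock` on Darwin,
// `pthread_mutex_t` on other platforms. Both types must live at a
// stable memory address, hence the manual pointer allocation.
final class Lock {
  #if canImport(Darwin)
  private let mutex: os_unfair_lock_t = {
    let pointer = os_unfair_lock_t.allocate(capacity: 1)
    pointer.initialize(to: os_unfair_lock())
    return pointer
  }()

  func lock() { os_unfair_lock_lock(mutex) }
  func unlock() { os_unfair_lock_unlock(mutex) }

  deinit {
    mutex.deinitialize(count: 1)
    mutex.deallocate()
  }
  #else
  private let mutex: UnsafeMutablePointer<pthread_mutex_t> = {
    let pointer = UnsafeMutablePointer<pthread_mutex_t>.allocate(capacity: 1)
    pthread_mutex_init(pointer, nil)
    return pointer
  }()

  func lock() { pthread_mutex_lock(mutex) }
  func unlock() { pthread_mutex_unlock(mutex) }

  deinit {
    pthread_mutex_destroy(mutex)
    mutex.deallocate()
  }
  #endif

  /// Runs `body` while holding the lock.
  func withLock<T>(_ body: () throws -> T) rethrows -> T {
    lock()
    defer { unlock() }
    return try body()
  }
}
```

Cache reads and writes can then be wrapped in `lock.withLock { ... }` instead of hopping through a serial `DispatchQueue`, which avoids the queue's scheduling overhead on every lookup.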

I then used the same `Lock` approach to serialize access to
`Frontend.configurationLoader` which increased the performance gap
even further.

After these improvements, I was able to obtain quite significant
performance gains using `Lock`:

- serial (non optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
- serial (optimized) vs serial (non optimized): ~1.6x (44s vs 74s)
- serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Sadly, a custom `Lock` implementation is not ideal for `swift-format`
to maintain, and Windows support was not provided. As such, the
alternative would be to use `NSLock`, which, being part of
`Foundation`, is supported on all major platforms.

Using `NSLock` the improvements were not so good, unfortunately:

- serial (non optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
- serial (NSLock) vs serial (non optimized): ~1.4x (52s vs 74s)
- serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)

The best solution was then to try to remove all concurrency contention
from `swift-format`. By flattening the `FileToProcess` array and
making the `Rule`'s `nameCache` static, close-to-optimal performance
should be achievable.
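The "flattened" driver described above can be sketched as follows: gather every file to process up front (serially, so the configuration cache sees no concurrent access), then fan out with `DispatchQueue.concurrentPerform`, each iteration writing to its own result slot so there is no shared mutable state to contend on. `FileToProcess` and `processFile` here are simplified stand-ins for swift-format's real types:

```swift
import Foundation
import Dispatch

// Simplified stand-in for swift-format's per-file work item.
struct FileToProcess {
  let path: String
}

// Placeholder for formatting/linting a single file.
func processFile(_ file: FileToProcess) -> String {
  return file.path.uppercased()
}

// Step 1: compute the full list serially, before any parallelism.
let files = (1...4).map { FileToProcess(path: "file\($0).swift") }

// Step 2: process in parallel. Each iteration writes a distinct
// index, so no synchronization is needed.
var results = [String?](repeating: nil, count: files.count)
results.withUnsafeMutableBufferPointer { buffer in
  DispatchQueue.concurrentPerform(iterations: files.count) { index in
    buffer[index] = processFile(files[index])
  }
}
```

Because `concurrentPerform` blocks until all iterations finish, `results` is fully populated once the call returns.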

Base rules `SyntaxFormatRule` and `SyntaxLintRule` should be kept
in the `SwiftFormatCore` target to eventually support dynamic
discovery of rules (ditching `generate-pipeline`) and only require
rule implementers to import `SwiftFormatCore`.

On the other hand, any statically generated `ruleName` cache must be
in the `SwiftFormatRules` target, because that's where rule
implementations reside to capture their `ObjectIdentifier`s.

So, in order to have a static rule name cache that's accessible from
`SwiftFormatCore` (especially from `SyntaxFormatRule`), the solution
was to pass a `ruleNameCache` dictionary to the `Context` that is
injected into every `Rule`.

This allows using `generate-pipeline` to produce a `ruleNameCache`
in `SwiftFormatRules` that's injected into all `Context`s by each of
the pipelines (which is done in the `SwiftFormat` target).
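The injection described above might look roughly like this. `Rule`, `DemoRule`, and the `ruleNameCache` literal are illustrative stand-ins, not swift-format's real declarations; in swift-format the dictionary is emitted by `generate-pipeline` into `RuleNameCache+Generated.swift`:

```swift
import Foundation

// Simplified stand-in for swift-format's rule protocol.
protocol Rule {
  static var ruleName: String { get }
}

struct DemoRule: Rule {
  static let ruleName = "DemoRule"
}

// Stand-in for the generated cache: rule metatype -> rule name.
let ruleNameCache: [ObjectIdentifier: String] = [
  ObjectIdentifier(DemoRule.self): DemoRule.ruleName
]

struct Context {
  let ruleNameCache: [ObjectIdentifier: String]
  var disabledRules: Set<String> = []

  // Takes the rule *type*, so the name is a plain dictionary read
  // instead of a (slow) protocol-extension lookup.
  func isRuleEnabled(_ rule: Rule.Type) -> Bool {
    guard let name = ruleNameCache[ObjectIdentifier(rule)] else {
      // Mirrors the assertion discussed above: a missing entry
      // likely means generate-pipeline wasn't rerun after adding
      // a new rule.
      assertionFailure("Rule \(rule) missing from name cache")
      return false
    }
    return !disabledRules.contains(name)
  }
}
```

Note that since Swift's `Dictionary` is copy-on-write, handing the cache to each `Context` copies a reference, not the storage, so the per-`Context` cost stays small as long as the cache is never mutated.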

While not as ideal as having the cache accessed in a single place
(this approach incurs the cache being copied for each `Context`), it
achieves a good tradeoff given current constraints. Let's hope the
compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

- serial (non optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
- serial (optimized) vs serial (non optimized): ~1.7x (42.5s vs 74s)
- serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)

Tests were made on my `MacBookPro16,1` (8-core i9 @ 2.4GHz), on a
project with 2135 Swift files, compiling `swift-format` in Release
mode.

## Changes

- Make file preparation serial by calculating all `FileToProcess`
before processing them in parallel.

- Update `Context` to receive a `ruleNameCache` of type
`[ObjectIdentifier: String]` on `init`.

- Update `Context.isRuleEnabled` to receive the `Rule` type instead of
the rule's name, to use the new cache.

- Add new `RuleNameCacheGenerator` to `generate-pipeline`, which
outputs a static `ruleNameCache` dictionary to a
`RuleNameCache+Generated.swift` file in the `SwiftFormatRules`
target with all existing rules.

- Remove the `nameCache` and `nameCacheQueue` from `Rule.swift`.
@p4checo
Contributor Author

p4checo commented Sep 1, 2021

I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems (like generate-pipeline not being run after adding something new), and falling back to the slower implementation would lead to slower performance for those rules. I think it's better not to risk that kind of inconsistency.

Yeah, that was precisely my thought process! 😄

I merged #263, so if you rebase and rerun generate-pipeline again, the cache should pick those up.

Indeed it did! Pushing rebased and updated changes.

@p4checo p4checo force-pushed the fix-and-improve-parallel-mode branch from 39aa5ab to 6d500ea Compare September 1, 2021 14:53
Member

@allevato allevato left a comment


Thanks for your patience! I just did some quick testing and it looks good, so I think we're good to merge now.

@allevato allevato merged commit cc40c36 into swiftlang:main Sep 2, 2021
@p4checo p4checo deleted the fix-and-improve-parallel-mode branch September 3, 2021 09:16
allevato added a commit to allevato/swift-format that referenced this pull request Sep 21, 2021
dylansturg pushed a commit to dylansturg/swift-format that referenced this pull request Apr 16, 2022
Now that we have the `isOptIn` property for rules, this should be
used to disable rules by default instead of removing them from
pipeline generation entirely. This also unblocks
swiftlang#261, since excluded
rules wouldn't show up in the generated name cache, causing their
tests to crash.