Skip to content

Restart sourcekitd and clangd after they have crashed #219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

ahoppen
Copy link
Member

@ahoppen ahoppen commented Jan 31, 2020

Recover from crashes in sourcekitd and clangd by transparently restarting them from SourceKit-LSP.

@ahoppen ahoppen requested a review from benlangmuir January 31, 2020 19:29
@ahoppen
Copy link
Member Author

ahoppen commented Jan 31, 2020

swiftlang/swift#29558
@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Jan 31, 2020

swiftlang/swift#29558
@swift-ci Please test

Copy link
Contributor

@DavidGoldman DavidGoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also make sense to swap FileBuildSettings to be a class since it's commonly shared/cached and immutable, not sure.

@ahoppen
Copy link
Member Author

ahoppen commented Feb 5, 2020

@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Feb 5, 2020

It might also make sense to swap FileBuildSettings to be a class since it's commonly shared/cached and immutable, not sure.

I don’t think so, we explicitly want the value semantics. Plus it’s not immutable, both compilerArguments and workingDirectory are vars.

Copy link
Contributor

@benlangmuir benlangmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sourcekitd part is looking almost there. Mentioned a couple of issues. I still need to look at the clangd side.

@@ -131,6 +148,27 @@ extension SwiftLanguageServer {
guard let self = self else { return }
let notification = SKResponse(notification, sourcekitd: self.sourcekitd)

if notification.value?[self.keys.notification] == self.values.notification_sema_enabled {
self.state = .connected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notification handler queue is not related to queue for this object, so to access state we need to synchronize. Similarly, addStateChangeHandler needs to synchronize the same way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood it, queue is used to synchronise the requests and responses from sourcekitd. Setting the state does not fall into this category, so I did not synchronise it on queue.

Do you want to extend the definition of queue? I don’t think that state really needs to be synchronised, but it wouldn’t hurt either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on queue is misleading and should be updated. We need to protect shared access to mutable state - e.g. we're doing this for currentDiagnostics already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, in that case, it makes total sense to guard the access to state by queue.


// We don't have any open documents anymore after sourcekitd crashed.
// Reset the document manager to reflect that.
self.documentManager = DocumentManager()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutation also needs synchronization. It's kind of suspicious that you're doing this after you call reopenDocuments - couldn't that repopulate it at the wrong time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutation also needs synchronization.

Good point. I wrapped it in queue.async.

It's kind of suspicious that you're doing this after you call reopenDocuments - couldn't that repopulate it at the wrong time?

No, that ordering is correct. When sourcekitd crashes, state is .connected. Hence the second if self.state == .connectionInterrupted is skipped and the third if let error = notification.error, error.code == .connectionInterrupted sets state = .connectionInterrupted, because we receive a notification with error.code == .connectionInterrupted.

When the next notification comes in, the first if self.state == .connectionInterrupted is taken, which then sets state = .semanticFunctionalityDisabled.

Finally, when semantic functionality is restored, we get a notification that satisfies the first if notification.value?[self.keys.notification] == self.values.notification_sema_enabled, setting state back to .connected.

connection.close()
if process.terminationStatus != 0 {
if let self = self {
self.state = .connectionInterrupted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar synchronization issues to the Swift side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as on the Swift side:

As I understood it, queue is used to synchronise the requests and responses from clangd. Setting the state does not fall into this category, so I did not synchronise it on queue.

Do you want to extend the definition of queue? I don’t think that state really needs to be synchronised, but it wouldn’t hurt either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think that state really needs to be synchronised, but it wouldn’t hurt either.

Why not? It's being mutated and read concurrently.

if process.terminationStatus != 0 {
if let self = self {
self.state = .connectionInterrupted
self.restartClangd()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this could crash loop. For sourcekitd, XPC prevents us from re-spawning too fast in a loop. We should implement something here to delay the relaunch if we crash loop (e.g. after the second crash in a short time start delaying a few seconds before trying again).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW VSCode will restart an LSP for up to 5 times within a 3 minute interval. Really depends upon the type of crash - most of which I've seen are persistent or related to a single document.

Might make sense to do something similar here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I have added some super simple crash-loop prevention code: If clangd has crashed in the last 30 seconds, delay any further restart by 10 seconds. I think this should be sufficient to prevent us from going wild im crash loop scenarios. I also added a test case that checks that clangd isn’t restarted twice in a row within 5 seconds.

@ahoppen
Copy link
Member Author

ahoppen commented Feb 26, 2020

@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Feb 26, 2020

@benlangmuir I have aded 2b0cbfd which guards access to state and stateChangeHandlers by queue. Could you take a look at it?

@swift-ci Please test

Copy link
Contributor

@benlangmuir benlangmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running these test under tsan and I'm seeing races like the following

ThreadSanitizer: data race ClangLanguageServer.swift:234
WARNING: ThreadSanitizer: data race (pid=9454)
  Read of size 8 at 0x7b5800030018 by thread T13:
    #0 ClangLanguageServerShim.changeDocument(_:) ClangLanguageServer.swift:234 (SourceKitLSPPackageTests:x86_64+0xc76794)
    #1 protocol witness for ToolchainLanguageServer.changeDocument(_:) in conformance ClangLanguageServerShim <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc766ca)
    #2 SourceKitServer.changeDocument(_:workspace:) SourceKitServer.swift:490 (SourceKitLSPPackageTests:x86_64+0xc455d9)
    #3 partial apply for SourceKitServer.changeDocument(_:workspace:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc63f92)
    #4 closure #1 in SourceKitServer.registerWorkspaceNotfication<A>(_:) SourceKitServer.swift:130 (SourceKitLSPPackageTests:x86_64+0xc4c661)
    #5 partial apply for closure #1 in SourceKitServer.registerWorkspaceNotfication<A>(_:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc59414)
    #6 thunk for @escaping @callee_guaranteed (@guaranteed Notification<A>) -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xac9e8b)
    #7 partial apply for thunk for @escaping @callee_guaranteed (@guaranteed Notification<A>) -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacdd04)
    #8 thunk for @escaping @callee_guaranteed (@in_guaranteed Notification<A>) -> (@out ()) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacb1e0)
    #9 partial apply for thunk for @escaping @callee_guaranteed (@in_guaranteed Notification<A>) -> (@out ()) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacee40)
    #10 closure #1 in LanguageServerEndpoint.handle<A>(_:from:) LanguageServer.swift:168 (SourceKitLSPPackageTests:x86_64+0xacb11a)
    #11 partial apply for closure #1 in LanguageServerEndpoint.handle<A>(_:from:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xace185)
    #12 thunk for @escaping @callee_guaranteed () -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0x17a43)
    #13 __tsan::invoke_and_release_block(void*) <null>:5352656 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x71e5b)
    #14 _dispatch_client_callout object.m:495 (libdispatch.dylib:x86_64+0x350d)

  Previous write of size 8 at 0x7b5800030018 by thread T8:
    #0 ClangLanguageServerShim.initialize() ClangLanguageServer.swift:91 (SourceKitLSPPackageTests:x86_64+0xc6cd87)
    #1 closure #1 in ClangLanguageServerShim.restartClangd() ClangLanguageServer.swift:155 (SourceKitLSPPackageTests:x86_64+0xc6fce2)
    #2 partial apply for closure #1 in ClangLanguageServerShim.restartClangd() <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc7fd65)
    #3 thunk for @escaping @callee_guaranteed () -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0x17a43)
    #4 __tsan::invoke_and_release_block(void*) <null>:5352656 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x71e5b)
    #5 _dispatch_client_callout object.m:495 (libdispatch.dylib:x86_64+0x350d)

  Location is heap block of size 736 at 0x7b5800030000 allocated by thread T7:
    #0 malloc <null>:5352688 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x4f88a)
    #1 swift_slowAlloc <null>:5352688 (libswiftCore.dylib:x86_64+0x2cb6d8)
    #2 makeJSONRPCClangServer(client:toolchain:buildSettings:clangdOptions:reopenDocuments:) ClangLanguageServer.swift:339 (SourceKitLSPPackageTests:x86_64+0xc7d644)
    #3 languageService(for:_:options:client:reopenDocuments:) SourceKitServer.swift:817 (SourceKitLSPPackageTests:x86_64+0xc52bf3)
    #4 closure #1 in SourceKitServer.languageService(for:_:) SourceKitServer.swift:222 (SourceKitLSPPackageTests:x86_64+0xc5028c)
    #5 partial apply for closure #1 in SourceKitServer.languageService(for:_:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc602a2)
    #6 thunk for @callee_guaranteed () -> (@owned ToolchainLanguageServer?, @error @owned Error) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc54276)
    #7 partial apply for thunk for @callee_guaranteed () -> (@owned ToolchainLanguageServer?, @error @owned Error) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc60327)
    #8 orLog<A>(_:level:logger:_:) Logging.swift:58 (SourceKitLSPPackageTests:x86_64+0x590e27)
    #9 SourceKitServer.languageService(for:_:) SourceKitServer.swift:221 (SourceKitLSPPackageTests:x86_64+0xc4fde4)
    #10 SourceKitServer.languageService(for:_:in:) SourceKitServer.swift:256 (SourceKitLSPPackageTests:x86_64+0xc54775)
    #11 SourceKitServer.openDocument(_:workspace:registerForBuildSystemNotifications:) SourceKitServer.swift:471 (SourceKitLSPPackageTests:x86_64+0xc4f679)
    #12 SourceKitServer.openDocument(_:workspace:) SourceKitServer.swift:458 (SourceKitLSPPackageTests:x86_64+0xc44522)
    #13 partial apply for SourceKitServer.openDocument(_:workspace:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc64052)
    #14 closure #1 in SourceKitServer.registerWorkspaceNotfication<A>(_:) SourceKitServer.swift:130 (SourceKitLSPPackageTests:x86_64+0xc4c661)
    #15 partial apply for closure #1 in SourceKitServer.registerWorkspaceNotfication<A>(_:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xc59414)
    #16 thunk for @escaping @callee_guaranteed (@guaranteed Notification<A>) -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xac9e8b)
    #17 partial apply for thunk for @escaping @callee_guaranteed (@guaranteed Notification<A>) -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacdd04)
    #18 thunk for @escaping @callee_guaranteed (@in_guaranteed Notification<A>) -> (@out ()) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacb1e0)
    #19 partial apply for thunk for @escaping @callee_guaranteed (@in_guaranteed Notification<A>) -> (@out ()) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xacee40)
    #20 closure #1 in LanguageServerEndpoint.handle<A>(_:from:) LanguageServer.swift:168 (SourceKitLSPPackageTests:x86_64+0xacb11a)
    #21 partial apply for closure #1 in LanguageServerEndpoint.handle<A>(_:from:) <compiler-generated> (SourceKitLSPPackageTests:x86_64+0xace185)
    #22 thunk for @escaping @callee_guaranteed () -> () <compiler-generated> (SourceKitLSPPackageTests:x86_64+0x17a43)
    #23 __tsan::invoke_and_release_block(void*) <null>:5352688 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x71e5b)
    #24 _dispatch_client_callout object.m:495 (libdispatch.dylib:x86_64+0x350d)

  Thread T13 (tid=4418908, running) is a GCD worker thread

  Thread T8 (tid=4418902, running) is a GCD worker thread

  Thread T7 (tid=4418847, running) is a GCD worker thread

SUMMARY: ThreadSanitizer: data race ClangLanguageServer.swift:234 in ClangLanguageServerShim.changeDocument(_:)

// Do a sanity check and verify that we get the expected result from a hover response before crashing clangd.

let expectedHoverContent = """
Declared in global namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text seems to vary between recent clangd versions - I tried this test locally and it failed because "loc" wasn't included at the bottom. We generally try to keep our tests passing with both the latest toolchain and the most recent released toolchain. Let's change this only compare the range, or if you really want to check the markup content, just check a prefix like "Declared in".

@ahoppen
Copy link
Member Author

ahoppen commented Mar 4, 2020

Hi @benlangmuir, thanks for running the thread sanitizer. Looks like I missed a couple data races, but things should be fixed now. I also changed crash recovery tests to only verify that the range is correct. I think that should suffice for the sanity check that we have correctly re-opened the document.

I have added then new changes as two new commits for an easier review. I’d like to squash them all together before merging the PR though.

@swift-ci Please test

@benlangmuir
Copy link
Contributor

benlangmuir commented Mar 18, 2020

Please regenerate linuxmain and squash as appropriate; otherwise LGTM. Thanks!

@ahoppen
Copy link
Member Author

ahoppen commented Mar 26, 2020

Squashed and regenerated LinuxMain.swift

@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Apr 3, 2020

Test failure seems to have been unrelated

@swift-ci Please test

@benlangmuir
Copy link
Contributor

CrashRecoveryTests.testSourcekitdCrashRecovery : API violation - multiple calls made to XCTestExpectation.fulfill() for clangd crashed.

@ahoppen
Copy link
Member Author

ahoppen commented Apr 11, 2020

assertForOverFulfill seems to not be working on Linux's XCTestExpectation. I have filed SR-12575 for it. I am now manually keeping track if fulfill() has already been called on the expectation.

@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Apr 13, 2020

@swift-ci Please test

@shahmishal shahmishal closed this Oct 6, 2020
@shahmishal
Copy link
Member

The Swift project moved the default branch to main and deleted master branch, so GitHub automatically closed the PR. Please re-create the pull request with main branch.

More detail about the branch update - https://forums.swift.org/t/updating-branch-names/40412

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants