Skip to content

[coro] Lower llvm.coro.await.suspend.handle to resume with tail call #89751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 15, 2024

Conversation

zmodem
Copy link
Collaborator

@zmodem zmodem commented Apr 23, 2024

The C++ standard requires that symmetric transfer from one coroutine to another is performed via a tail call. Failure to do so is a miscompile and often breaks programs by quickly overflowing the stack.

Until now, the coro split pass tried to ensure this in the addMustTailToCoroResumes() function by searching for llvm.coro.resume calls to lower as tail calls if the conditions were right: the right function arguments, attributes, calling convention etc., and if a ret void was sure to be reached after traversal with some ad-hoc constant folding following the call.

This was brittle, as the kind of implicit variants required for a tail call to happen could easily be broken by other passes (e.g. if some instruction got in between the resume and ret), see for example 9d1cb18 and 284da04.

Also the logic seemed backwards: instead of searching for possible tail call candidates and doing them if the circumstances are right, it seems better to start with the intention of making the tail calls we need, and forcing the circumstances to be right.

Now that we have the llvm.coro.await.suspend.handle intrinsic (since f786881) which corresponds exactly to symmetric transfer, change the lowering of that to also include the resume part, always lowered as a tail call.

@zmodem zmodem added the coroutines C++20 coroutines label Apr 23, 2024
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:codegen IR generation bugs: mangling, exceptions, etc. PGO Profile Guided Optimizations llvm:ir llvm:transforms labels Apr 23, 2024
@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-pgo

@llvm/pr-subscribers-coroutines

Author: Hans (zmodem)

Changes

The C++ standard requires that symmetric transfer from one coroutine to another is performed via a tail call. Failure to do so is a miscompile and often breaks programs by quickly overflowing the stack.

Until now, the coro split pass tried to ensure this in the addMustTailToCoroResumes() function by searching for llvm.coro.resume calls to lower as tail calls if the conditions were right: the right function arguments, attributes, calling convention etc., and if a ret void was sure to be reached after traversal with some ad-hoc constant folding following the call.

This was brittle, as the kind of implicit variants required for a tail call to happen could easily be broken by other passes (e.g. if some instruction got in between the resume and ret), see for example 9d1cb18 and 284da04.

Also the logic seemed backwards: instead of searching for possible tail call candidates and doing them if the circumstances are right, it seems better to start with the intention of making the tail calls we need, and forcing the circumstances to be right.

Now that we have the llvm.coro.await.suspend.handle intrinsic (since f786881) which corresponds exactly to symmetric transfer, change the lowering of that to also include the resume part, always lowered as a tail call.


Patch is 92.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89751.diff

37 Files Affected:

  • (modified) clang/lib/CodeGen/CGCoroutine.cpp (+1-4)
  • (modified) clang/test/CodeGenCoroutines/coro-await.cpp (+1-2)
  • (modified) clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp (+2-2)
  • (modified) clang/test/CodeGenCoroutines/coro-symmetric-transfer-02.cpp (+2-4)
  • (modified) llvm/docs/Coroutines.rst (+10-9)
  • (modified) llvm/docs/LangRef.rst (+1-1)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+1-1)
  • (modified) llvm/lib/Transforms/Coroutines/CoroInstr.h (+2-2)
  • (modified) llvm/lib/Transforms/Coroutines/CoroInternal.h (+2-1)
  • (modified) llvm/lib/Transforms/Coroutines/CoroSplit.cpp (+59-175)
  • (modified) llvm/lib/Transforms/Coroutines/Coroutines.cpp (+1-1)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail.ll (-63)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail1.ll (-97)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail10.ll (-55)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail11.ll (-55)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail12.ll (-85)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail13.ll (-76)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail2.ll (-68)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail3.ll (-91)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail4.ll (-66)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail5.ll (-63)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail6.ll (-112)
  • (removed) llvm/test/Instrumentation/InstrProfiling/Coro/coro-split-musttail7.ll (-115)
  • (modified) llvm/test/Transforms/Coroutines/coro-await-suspend-lower-invoke.ll (+2-3)
  • (modified) llvm/test/Transforms/Coroutines/coro-await-suspend-lower.ll (+2-3)
  • (removed) llvm/test/Transforms/Coroutines/coro-preserve-final.ll (-131)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail-chain-pgo-counter-promo.ll (+2-7)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail.ll (+7-10)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail1.ll (+15-17)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail10.ll (+7-6)
  • (removed) llvm/test/Transforms/Coroutines/coro-split-musttail11.ll (-55)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail2.ll (+5-7)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail3.ll (+15-18)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail4.ll (+3-2)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail5.ll (+3-2)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail6.ll (+5-4)
  • (modified) llvm/test/Transforms/Coroutines/coro-split-musttail7.ll (+9-6)
diff --git a/clang/lib/CodeGen/CGCoroutine.cpp b/clang/lib/CodeGen/CGCoroutine.cpp
index 93ca711f716fce..e976734898b9b8 100644
--- a/clang/lib/CodeGen/CGCoroutine.cpp
+++ b/clang/lib/CodeGen/CGCoroutine.cpp
@@ -307,10 +307,7 @@ static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Co
     break;
   }
   case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
-    assert(SuspendRet->getType()->isPointerTy());
-
-    auto ResumeIntrinsic = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_resume);
-    Builder.CreateCall(ResumeIntrinsic, SuspendRet);
+    assert(SuspendRet->getType()->isVoidTy());
     break;
   }
   }
diff --git a/clang/test/CodeGenCoroutines/coro-await.cpp b/clang/test/CodeGenCoroutines/coro-await.cpp
index 75851d8805bb6e..7caaa6351844b2 100644
--- a/clang/test/CodeGenCoroutines/coro-await.cpp
+++ b/clang/test/CodeGenCoroutines/coro-await.cpp
@@ -370,8 +370,7 @@ extern "C" void TestTailcall() {
   // ---------------------------
   // Call coro.await.suspend
   // ---------------------------
-  // CHECK-NEXT: %[[RESUMED:.+]] = call ptr @llvm.coro.await.suspend.handle(ptr %[[AWAITABLE]], ptr %[[FRAME]], ptr @__await_suspend_wrapper_TestTailcall_await)
-  // CHECK-NEXT: call void @llvm.coro.resume(ptr %[[RESUMED]])
+  // CHECK-NEXT: call void @llvm.coro.await.suspend.handle(ptr %[[AWAITABLE]], ptr %[[FRAME]], ptr @__await_suspend_wrapper_TestTailcall_await)
   // CHECK-NEXT: %[[OUTCOME:.+]] = call i8 @llvm.coro.suspend(token %[[SUSPEND_ID]], i1 false)
   // CHECK-NEXT: switch i8 %[[OUTCOME]], label %[[RET_BB:.+]] [
   // CHECK-NEXT:   i8 0, label %[[READY_BB]]
diff --git a/clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp b/clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp
index da30e12c63cffb..0ae672de391c47 100644
--- a/clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp
+++ b/clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp
@@ -48,7 +48,7 @@ detached_task foo() {
 }
 
 // check that the lifetime of the coroutine handle used to obtain the address is contained within single basic block, and hence does not live across suspension points.
+// XXX: not sure this makes sense anymore?
 // CHECK-LABEL: final.suspend:
 // CHECK:         %{{.+}} = call token @llvm.coro.save(ptr null)
-// CHECK:         %[[HDL_TRANSFER:.+]] = call ptr @llvm.coro.await.suspend.handle
-// CHECK:         call void @llvm.coro.resume(ptr %[[HDL_TRANSFER]])
+// CHECK:         call void @llvm.coro.await.suspend.handle
diff --git a/clang/test/CodeGenCoroutines/coro-symmetric-transfer-02.cpp b/clang/test/CodeGenCoroutines/coro-symmetric-transfer-02.cpp
index ca6cf74115a3b1..f36f89926505f3 100644
--- a/clang/test/CodeGenCoroutines/coro-symmetric-transfer-02.cpp
+++ b/clang/test/CodeGenCoroutines/coro-symmetric-transfer-02.cpp
@@ -89,8 +89,7 @@ Task bar() {
 // CHECK:         br i1 %{{.+}}, label %[[CASE1_AWAIT_READY:.+]], label %[[CASE1_AWAIT_SUSPEND:.+]]
 // CHECK:       [[CASE1_AWAIT_SUSPEND]]:
 // CHECK-NEXT:    %{{.+}} = call token @llvm.coro.save(ptr null)
-// CHECK-NEXT:    %[[HANDLE1_PTR:.+]] = call ptr @llvm.coro.await.suspend.handle
-// CHECK-NEXT:    call void @llvm.coro.resume(ptr %[[HANDLE1_PTR]])
+// CHECK-NEXT:    call void @llvm.coro.await.suspend.handle
 // CHECK-NEXT:    %{{.+}} = call i8 @llvm.coro.suspend
 // CHECK-NEXT:    switch i8 %{{.+}}, label %coro.ret [
 // CHECK-NEXT:      i8 0, label %[[CASE1_AWAIT_READY]]
@@ -104,8 +103,7 @@ Task bar() {
 // CHECK:         br i1 %{{.+}}, label %[[CASE2_AWAIT_READY:.+]], label %[[CASE2_AWAIT_SUSPEND:.+]]
 // CHECK:       [[CASE2_AWAIT_SUSPEND]]:
 // CHECK-NEXT:    %{{.+}} = call token @llvm.coro.save(ptr null)
-// CHECK-NEXT:    %[[HANDLE2_PTR:.+]] = call ptr @llvm.coro.await.suspend.handle
-// CHECK-NEXT:    call void @llvm.coro.resume(ptr %[[HANDLE2_PTR]])
+// CHECK-NEXT:    call void @llvm.coro.await.suspend.handle
 // CHECK-NEXT:    %{{.+}} = call i8 @llvm.coro.suspend
 // CHECK-NEXT:    switch i8 %{{.+}}, label %coro.ret [
 // CHECK-NEXT:      i8 0, label %[[CASE2_AWAIT_READY]]
diff --git a/llvm/docs/Coroutines.rst b/llvm/docs/Coroutines.rst
index 83369d93c309a7..36092325e536fb 100644
--- a/llvm/docs/Coroutines.rst
+++ b/llvm/docs/Coroutines.rst
@@ -1922,7 +1922,7 @@ Example:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ::
 
-  declare ptr @llvm.coro.await.suspend.handle(
+  declare void @llvm.coro.await.suspend.handle(
                 ptr <awaiter>,
                 ptr <handle>,
                 ptr <await_suspend_function>)
@@ -1967,7 +1967,9 @@ The intrinsic must be used between corresponding `coro.save`_ and
 `await_suspend_function` call during `CoroSplit`_ pass.
 
 `await_suspend_function` must return a pointer to a valid
-coroutine frame, which is immediately resumed
+coroutine frame. The intrinsic will be lowered to a tail call resuming the
+returned coroutine frame. It will be marked `musttail` on targets that support
+that. Instructions following the intrinsic will become unreachable.
 
 Example:
 """"""""
@@ -1977,11 +1979,10 @@ Example:
   ; before lowering
   await.suspend:
     %save = call token @llvm.coro.save(ptr %hdl)
-    %next = call ptr @llvm.coro.await.suspend.handle(
-                ptr %awaiter,
-                ptr %hdl,
-                ptr @await_suspend_function)
-    call void @llvm.coro.resume(%next)
+    call void @llvm.coro.await.suspend.handle(
+        ptr %awaiter,
+        ptr %hdl,
+        ptr @await_suspend_function)
     %suspend = call i8 @llvm.coro.suspend(token %save, i1 false)
     ...
 
@@ -1992,8 +1993,8 @@ Example:
     %next = call ptr @await_suspend_function(
                 ptr %awaiter,
                 ptr %hdl)
-    call void @llvm.coro.resume(%next)
-    %suspend = call i8 @llvm.coro.suspend(token %save, i1 false)
+    musttail call void @llvm.coro.resume(%next)
+    ret void
     ...
 
   ; wrapper function example
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9592929d79feb4..0e87a8e2ace0e2 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -12517,7 +12517,7 @@ This instruction requires several arguments:
       ``llvm::GuaranteedTailCallOpt`` is ``true``, or the calling convention
       is ``tailcc``
    -  `Platform-specific constraints are
-      met. <CodeGenerator.html#tailcallopt>`_
+      met. <CodeGenerator.html#tail-call-optimization>`_
 
 #. The optional ``notail`` marker indicates that the optimizers should not add
    ``tail`` or ``musttail`` markers to the call. It is used to prevent tail
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 1d20f7e1b19854..2bbdd0e4627c62 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1711,7 +1711,7 @@ def int_coro_await_suspend_bool : Intrinsic<[llvm_i1_ty],
                                             [llvm_ptr_ty, llvm_ptr_ty, llvm_ptr_ty],
                                             [Throws]>;
 
-def int_coro_await_suspend_handle : Intrinsic<[llvm_ptr_ty],
+def int_coro_await_suspend_handle : Intrinsic<[],
                                               [llvm_ptr_ty, llvm_ptr_ty, llvm_ptr_ty],
                                               [Throws]>;
 
diff --git a/llvm/lib/Transforms/Coroutines/CoroInstr.h b/llvm/lib/Transforms/Coroutines/CoroInstr.h
index 79e745bb162cdb..a31703fe01304c 100644
--- a/llvm/lib/Transforms/Coroutines/CoroInstr.h
+++ b/llvm/lib/Transforms/Coroutines/CoroInstr.h
@@ -78,10 +78,10 @@ class LLVM_LIBRARY_VISIBILITY CoroAllocInst : public IntrinsicInst {
   }
 };
 
-/// This represents the llvm.coro.await.suspend instruction.
+/// This represents the llvm.coro.await.suspend.{void,bool,handle} instructions.
 // FIXME: add callback metadata
 // FIXME: make a proper IntrinisicInst. Currently this is not possible,
-// because llvm.coro.await.suspend can be invoked.
+// because llvm.coro.await.suspend.* can be invoked.
 class LLVM_LIBRARY_VISIBILITY CoroAwaitSuspendInst : public CallBase {
   enum { AwaiterArg, FrameArg, WrapperArg };
 
diff --git a/llvm/lib/Transforms/Coroutines/CoroInternal.h b/llvm/lib/Transforms/Coroutines/CoroInternal.h
index 84fd88806154e3..5716fd0ea4ab96 100644
--- a/llvm/lib/Transforms/Coroutines/CoroInternal.h
+++ b/llvm/lib/Transforms/Coroutines/CoroInternal.h
@@ -47,7 +47,7 @@ struct LowererBase {
   ConstantPointerNull *const NullPtr;
 
   LowererBase(Module &M);
-  Value *makeSubFnCall(Value *Arg, int Index, Instruction *InsertPt);
+  CallInst *makeSubFnCall(Value *Arg, int Index, Instruction *InsertPt);
 };
 
 enum class ABI {
@@ -85,6 +85,7 @@ struct LLVM_LIBRARY_VISIBILITY Shape {
   SmallVector<AnyCoroSuspendInst *, 4> CoroSuspends;
   SmallVector<CallInst*, 2> SwiftErrorOps;
   SmallVector<CoroAwaitSuspendInst *, 4> CoroAwaitSuspends;
+  SmallVector<CallInst *, 2> SymmetricTransfers;
 
   // Field indexes for special fields in the switch lowering.
   struct SwitchFieldIndex {
diff --git a/llvm/lib/Transforms/Coroutines/CoroSplit.cpp b/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
index 3a43b1edcaba37..01eb75617e39fe 100644
--- a/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
+++ b/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
@@ -113,21 +113,24 @@ class CoroCloner {
   /// ABIs.
   AnyCoroSuspendInst *ActiveSuspend = nullptr;
 
+  TargetTransformInfo &TTI;
+
 public:
   /// Create a cloner for a switch lowering.
   CoroCloner(Function &OrigF, const Twine &Suffix, coro::Shape &Shape,
-             Kind FKind)
+             Kind FKind, TargetTransformInfo &TTI)
       : OrigF(OrigF), NewF(nullptr), Suffix(Suffix), Shape(Shape), FKind(FKind),
-        Builder(OrigF.getContext()) {
+        Builder(OrigF.getContext()), TTI(TTI) {
     assert(Shape.ABI == coro::ABI::Switch);
   }
 
   /// Create a cloner for a continuation lowering.
   CoroCloner(Function &OrigF, const Twine &Suffix, coro::Shape &Shape,
-             Function *NewF, AnyCoroSuspendInst *ActiveSuspend)
+             Function *NewF, AnyCoroSuspendInst *ActiveSuspend,
+             TargetTransformInfo &TTI)
       : OrigF(OrigF), NewF(NewF), Suffix(Suffix), Shape(Shape),
         FKind(Shape.ABI == coro::ABI::Async ? Kind::Async : Kind::Continuation),
-        Builder(OrigF.getContext()), ActiveSuspend(ActiveSuspend) {
+        Builder(OrigF.getContext()), ActiveSuspend(ActiveSuspend), TTI(TTI) {
     assert(Shape.ABI == coro::ABI::Retcon ||
            Shape.ABI == coro::ABI::RetconOnce || Shape.ABI == coro::ABI::Async);
     assert(NewF && "need existing function for continuation");
@@ -171,7 +174,8 @@ class CoroCloner {
 // Lower the intrinisc in CoroEarly phase if coroutine frame doesn't escape
 // and it is known that other transformations, for example, sanitizers
 // won't lead to incorrect code.
-static void lowerAwaitSuspend(IRBuilder<> &Builder, CoroAwaitSuspendInst *CB) {
+static void lowerAwaitSuspend(IRBuilder<> &Builder, CoroAwaitSuspendInst *CB,
+                              coro::Shape &Shape) {
   auto Wrapper = CB->getWrapperFunction();
   auto Awaiter = CB->getAwaiter();
   auto FramePtr = CB->getFrame();
@@ -206,6 +210,28 @@ static void lowerAwaitSuspend(IRBuilder<> &Builder, CoroAwaitSuspendInst *CB) {
     llvm_unreachable("Unexpected coro_await_suspend invocation method");
   }
 
+  if (CB->getCalledFunction()->getIntrinsicID() ==
+      Intrinsic::coro_await_suspend_handle) {
+    // Follow the await_suspend by a lowered resume call to the returned coroutine.
+    if (auto *Invoke = dyn_cast<InvokeInst>(CB))
+      Builder.SetInsertPoint(Invoke->getNormalDest()->getFirstInsertionPt());
+
+    coro::LowererBase LB(*Wrapper->getParent());
+    auto *ResumeAddr = LB.makeSubFnCall(NewCall, CoroSubFnInst::ResumeIndex,
+                                        &*Builder.GetInsertPoint());
+
+    LLVMContext& Ctx = Builder.getContext();
+    FunctionType *ResumeTy = FunctionType::get(
+        Type::getVoidTy(Ctx), PointerType::getUnqual(Ctx), false);
+    auto *ResumeCall = Builder.CreateCall(ResumeTy, ResumeAddr, {NewCall});
+
+    // We can't insert the 'ret' instruction and adjust the cc until the
+    // function has been split, so remember this for later.
+    Shape.SymmetricTransfers.push_back(ResumeCall);
+
+    NewCall = ResumeCall;
+  }
+
   CB->replaceAllUsesWith(NewCall);
   CB->eraseFromParent();
 }
@@ -213,7 +239,7 @@ static void lowerAwaitSuspend(IRBuilder<> &Builder, CoroAwaitSuspendInst *CB) {
 static void lowerAwaitSuspends(Function &F, coro::Shape &Shape) {
   IRBuilder<> Builder(F.getContext());
   for (auto *AWS : Shape.CoroAwaitSuspends)
-    lowerAwaitSuspend(Builder, AWS);
+    lowerAwaitSuspend(Builder, AWS, Shape);
 }
 
 static void maybeFreeRetconStorage(IRBuilder<> &Builder,
@@ -1056,6 +1082,22 @@ void CoroCloner::create() {
   // Set up the new entry block.
   replaceEntryBlock();
 
+  // Turn symmetric transfers into musttail calls.
+  for (CallInst *ResumeCall : Shape.SymmetricTransfers) {
+    ResumeCall = cast<CallInst>(VMap[ResumeCall]);
+    ResumeCall->setCallingConv(NewF->getCallingConv());
+    if (TTI.supportsTailCallFor(ResumeCall))
+      ResumeCall->setTailCallKind(CallInst::TCK_MustTail);
+
+    // Put a 'ret void' after the call, and split any remaining instructions to
+    // an unreachable block.
+    BasicBlock *BB = ResumeCall->getParent();
+    BB->splitBasicBlock(ResumeCall->getNextNode());
+    Builder.SetInsertPoint(BB->getTerminator());
+    Builder.CreateRetVoid();
+    BB->getTerminator()->eraseFromParent();
+  }
+
   Builder.SetInsertPoint(&NewF->getEntryBlock().front());
   NewFramePtr = deriveNewFramePointer();
 
@@ -1186,130 +1228,6 @@ scanPHIsAndUpdateValueMap(Instruction *Prev, BasicBlock *NewBlock,
   }
 }
 
-// Replace a sequence of branches leading to a ret, with a clone of a ret
-// instruction. Suspend instruction represented by a switch, track the PHI
-// values and select the correct case successor when possible.
-static bool simplifyTerminatorLeadingToRet(Instruction *InitialInst) {
-  // There is nothing to simplify.
-  if (isa<ReturnInst>(InitialInst))
-    return false;
-
-  DenseMap<Value *, Value *> ResolvedValues;
-  assert(InitialInst->getModule());
-  const DataLayout &DL = InitialInst->getModule()->getDataLayout();
-
-  auto TryResolveConstant = [&ResolvedValues](Value *V) {
-    auto It = ResolvedValues.find(V);
-    if (It != ResolvedValues.end())
-      V = It->second;
-    return dyn_cast<ConstantInt>(V);
-  };
-
-  Instruction *I = InitialInst;
-  while (true) {
-    if (isa<ReturnInst>(I)) {
-      assert(!cast<ReturnInst>(I)->getReturnValue());
-      ReplaceInstWithInst(InitialInst, I->clone());
-      return true;
-    }
-
-    if (auto *BR = dyn_cast<BranchInst>(I)) {
-      unsigned SuccIndex = 0;
-      if (BR->isConditional()) {
-        // Handle the case the condition of the conditional branch is constant.
-        // e.g.,
-        //
-        //     br i1 false, label %cleanup, label %CoroEnd
-        //
-        // It is possible during the transformation. We could continue the
-        // simplifying in this case.
-        ConstantInt *Cond = TryResolveConstant(BR->getCondition());
-        if (!Cond)
-          return false;
-
-        SuccIndex = Cond->isOne() ? 0 : 1;
-      }
-
-      BasicBlock *Succ = BR->getSuccessor(SuccIndex);
-      scanPHIsAndUpdateValueMap(I, Succ, ResolvedValues);
-      I = Succ->getFirstNonPHIOrDbgOrLifetime();
-      continue;
-    }
-
-    if (auto *Cmp = dyn_cast<CmpInst>(I)) {
-      // If the case number of suspended switch instruction is reduced to
-      // 1, then it is simplified to CmpInst in llvm::ConstantFoldTerminator.
-      // Try to constant fold it.
-      ConstantInt *Cond0 = TryResolveConstant(Cmp->getOperand(0));
-      ConstantInt *Cond1 = TryResolveConstant(Cmp->getOperand(1));
-      if (Cond0 && Cond1) {
-        ConstantInt *Result =
-            dyn_cast_or_null<ConstantInt>(ConstantFoldCompareInstOperands(
-                Cmp->getPredicate(), Cond0, Cond1, DL));
-        if (Result) {
-          ResolvedValues[Cmp] = Result;
-          I = I->getNextNode();
-          continue;
-        }
-      }
-    }
-
-    if (auto *SI = dyn_cast<SwitchInst>(I)) {
-      ConstantInt *Cond = TryResolveConstant(SI->getCondition());
-      if (!Cond)
-        return false;
-
-      BasicBlock *Succ = SI->findCaseValue(Cond)->getCaseSuccessor();
-      scanPHIsAndUpdateValueMap(I, Succ, ResolvedValues);
-      I = Succ->getFirstNonPHIOrDbgOrLifetime();
-      continue;
-    }
-
-    if (I->isDebugOrPseudoInst() || I->isLifetimeStartOrEnd() ||
-        wouldInstructionBeTriviallyDead(I)) {
-      // We can skip instructions without side effects. If their values are
-      // needed, we'll notice later, e.g. when hitting a conditional branch.
-      I = I->getNextNode();
-      continue;
-    }
-
-    break;
-  }
-
-  return false;
-}
-
-// Check whether CI obeys the rules of musttail attribute.
-static bool shouldBeMustTail(const CallInst &CI, const Function &F) {
-  if (CI.isInlineAsm())
-    return false;
-
-  // Match prototypes and calling conventions of resume function.
-  FunctionType *CalleeTy = CI.getFunctionType();
-  if (!CalleeTy->getReturnType()->isVoidTy() || (CalleeTy->getNumParams() != 1))
-    return false;
-
-  Type *CalleeParmTy = CalleeTy->getParamType(0);
-  if (!CalleeParmTy->isPointerTy() ||
-      (CalleeParmTy->getPointerAddressSpace() != 0))
-    return false;
-
-  if (CI.getCallingConv() != F.getCallingConv())
-    return false;
-
-  // CI should not has any ABI-impacting function attributes.
-  static const Attribute::AttrKind ABIAttrs[] = {
-      Attribute::StructRet,    Attribute::ByVal,     Attribute::InAlloca,
-      Attribute::Preallocated, Attribute::InReg,     Attribute::Returned,
-      Attribute::SwiftSelf,    Attribute::SwiftError};
-  AttributeList Attrs = CI.getAttributes();
-  for (auto AK : ABIAttrs)
-    if (Attrs.hasParamAttr(0, AK))
-      return false;
-
-  return true;
-}
-
 // Coroutine has no suspend points. Remove heap allocation for the coroutine
 // frame if possible.
 static void handleNoSuspendCoroutine(coro::Shape &Shape) {
@@ -1523,24 +1441,16 @@ struct SwitchCoroutineSplitter {
 
     createResumeEntryBlock(F, Shape);
     auto *ResumeClone =
-        createClone(F, ".resume", Shape, CoroCloner::Kind::SwitchResume);
+        createClone(F, ".resume", Shape, CoroCloner::Kind::SwitchResume, TTI);
     auto *DestroyClone =
-        createClone(F, ".destroy", Shape, CoroCloner::Kind::SwitchUnwind);
+        createClone(F, ".destroy", Shape, CoroCloner::Kind::SwitchUnwind, TTI);
     auto *CleanupClone =
-        createClone(F, ".cleanup", Shape, CoroCloner::Kind::SwitchCleanup);
+        createClone(F, ".cleanup", Shape, CoroCloner::Kind::SwitchCleanup, TTI);
 
     postSplitCleanup(*ResumeClone);
     postSplitCleanup(*DestroyClone);
     postSplitCleanup(*CleanupClone);
 
-    // Adding musttail call to support symmetric transfer.
-    // Skip targets which don't support tail call.
-    //
-    // FIXME: Could we support symmetric transfer effectively without musttail
-    // call?
-    if (TTI.supportsTailCalls())
-      addMustTailToCoroResumes(*ResumeClone, TTI);
-
     // Store addresses resume/destroy/cleanup functions in the coroutine frame.
     updateCoroFrame(Shape, ResumeClone, DestroyClone, CleanupClone);
 
@@ -1560,8 +1470,9 @@ struct SwitchCoroutineSplitter {
   // new entry block and replacing coro.suspend an appropriate value to force
   // resume or cleanup pass for every suspend point.
   static Function *createClone(Function &F, const Twine &Suffix,
-                               coro::Shape &Shape, CoroCloner::Kind FKind) {
-    CoroCloner Cloner(F, Suffix, Shape, FKind);
+                               coro::Shape &Shape, CoroCloner::Kind FKind,
+                               TargetTransformInfo &TTI) {
+    CoroCloner Cloner(F, Suffix, Shape, FKind, TTI);
     Cloner.create();
     return Cloner.getFunction();
   }
@@ -1662,34 +1573,6 @@ struct SwitchCoroutineSplitter {
     Shape.SwitchLowering.ResumeEntryBlock = NewEntry;
   }
 
-  // Add musttail to any resume instructions that is immediately followed by a
-  // suspend (i.e. ret). We do this even in -O0 to support guaranteed tail call
-  // for symmetrical coroutine control transfer (C++ Coroutines TS extension).
-  // This transformation is done only in the resume part of the coroutine that
-  // has identical signature and calling convention as the coro.resume c...
[truncated]

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 23, 2024

What do you think?

(The test updates are a bit hacky for now, but I wanted to get feedback before spending more time on them. Also I think many of the tests become uninteresting after this patch and could probably be deleted.)

Copy link

github-actions bot commented Apr 23, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@mtrofin
Copy link
Member

mtrofin commented Apr 23, 2024

Does this address https://discourse.llvm.org/t/coro-pre-split-handling-of-the-suspend-edge/75043? Could you add a note there in that direction - a few folks were looking at going at the direction @jyknight was suggesting there, and it'd be good to have closure on the topic.

@mtrofin mtrofin requested a review from jyknight April 23, 2024 15:07
@zmodem
Copy link
Collaborator Author

zmodem commented Apr 23, 2024

Does this address https://discourse.llvm.org/t/coro-pre-split-handling-of-the-suspend-edge/75043? Could you add a note there in that direction - a few folks were looking at going at the direction @jyknight was suggesting there, and it'd be good to have closure on the topic.

Thanks for the pointer! If I understand things correctly:

  • Yes, my patch addresses the problem of instructions in the suspend block getting in the way of lowering symmetric transfer as a tail call.
  • With my patch, something could still put instructions between the llvm.coro.await.suspend.handle and the suspend, and those would become unreachable. I don't think that's a problem in practice though. The frontend shouldn't do that, and I don't think any transform would move instructions across the intrinsic call.
  • My patch doesn't address the more general concern in that thread about that edge being a strange representation.

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The higher level idea looks fine.

A detail is that, in this patch, we must not mark llvm.coro.await.suspend.handle as nounwind. Previously, llvm.coro.await.suspend.handle may be marked as nounwind if the await_suspend is noexcept. But we can't do it in this patch since llvm.coro.await.suspend.handle will include resume calls, which technically is not nounwind.

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 24, 2024

The higher level idea looks fine.

Thanks!

A detail is that, in this patch, we must not mark llvm.coro.await.suspend.handle as nounwind. Previously, llvm.coro.await.suspend.handle may be marked as nounwind if the await_suspend is noexcept. But we can't do it in this patch since llvm.coro.await.suspend.handle will include resume calls, which technically is not nounwind.

Done.

zmodem added 3 commits April 29, 2024 15:55
It was testing that "the lifetime of the coroutine handle used to obtain
the address is contained within single basic block, and hence does not
live across suspension points"

Since we no longer separately obtain the handle and pass it to
@llvm.coro.resume, this test doesn't really test anything.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was testing that "the lifetime of the coroutine handle used to obtain
the address is contained within single basic block, and hence does not
live across suspension points"

Since we no longer separately obtain the handle and pass it to
@llvm.coro.resume, this test doesn't really test anything.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to figure out what this test does. There is a Clang level test which was added at the same time: llvm/test/Transforms/Coroutines/coro-preserve-final.ll presumable that's enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my patch, some of these tests become less interesting. The class of problems arising from instructions between the resume and suspend doesn't exist with the new lowering. We could probably drop coro-split-musttail{5,6,7}.ll?

@@ -40,10 +38,8 @@ exit:

; Verify that in the resume part resume call is marked with musttail.
; CHECK-LABEL: @f.resume(
; CHECK: %[[addr2:.+]] = call ptr @llvm.coro.subfn.addr(ptr null, i8 0)
; NOPGO-NEXT: musttail call fastcc void %[[addr2]](ptr null)
; PGO: call void @llvm.instrprof
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These llvm.instrprof.value.profile calls between the llvm.coro.subfn.addr call and tail call disappear with my patch. That's because the test runs pgo-instr-gen before coro-split. Is that representative of how the normal pipeline looks?

If the instrumentation is done after coro-split, it will see the indirect call and add the llvm.instrprof.value.profile as expected.

Copy link
Collaborator Author

@zmodem zmodem May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this, and I don't think my patch changes the situation for value profiling. PGO instrumentation is performed early. Before my patch it would see a call to @llvm.coro.resume, and now it will see @llvm.coro.await.suspend.handle. In neither case does it see the lowered indirect call to the resumed function, and so no @llvm.instrprof.value.profile call gets inserted. Supporting that would be a nice future improvement.

(This is the example I tried: https://gist.github.com/zmodem/8f148a9ff992603ee053582f9ac3aceb)

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 29, 2024

I've cleaned up the patch a bit and added some comments. Please take a look.

@zmodem
Copy link
Collaborator Author

zmodem commented May 14, 2024

I think my patch is a significant improvement, both in terms of simplicity and reliability of the codegen for symmetric transfer, and would like to move forward. @ChuanqiXu9 @mtrofin do you have any further comments?

@mtrofin
Copy link
Member

mtrofin commented May 14, 2024

I think my patch is a significant improvement, both in terms of simplicity and reliability of the codegen for symmetric transfer, and would like to move forward. @ChuanqiXu9 @mtrofin do you have any further comments?

Nothing on my side, lgtm for my narrow concern. Please wait for @ChuanqiXu9 's lgtm tho.

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a comment.

@zmodem zmodem merged commit 3bb3969 into llvm:main May 15, 2024
4 of 5 checks passed
@zmodem zmodem deleted the coro_aggressive_musttail.squash2 branch May 15, 2024 13:29
@ChuanqiXu9
Copy link
Member

I received some reports for this patch breaking some codes. I am reproducing it. It looks like related to exceptions.

@zmodem
Copy link
Collaborator Author

zmodem commented May 23, 2024

Thanks, let me know if I can help.

@ChuanqiXu9
Copy link
Member

It is pretty interesting that I can pass the test by:

$git diff -U10
diff --git a/llvm/lib/Transforms/Coroutines/CoroSplit.cpp b/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
index 450ea8234371..003bfbf7138a 100644
--- a/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
+++ b/llvm/lib/Transforms/Coroutines/CoroSplit.cpp
@@ -220,20 +220,21 @@ static void lowerAwaitSuspend(IRBuilder<> &Builder, CoroAwaitSuspendInst *CB,
     }

     coro::LowererBase LB(*Wrapper->getParent());
     auto *ResumeAddr = LB.makeSubFnCall(NewCall, CoroSubFnInst::ResumeIndex,
                                         &*Builder.GetInsertPoint());

     LLVMContext &Ctx = Builder.getContext();
     FunctionType *ResumeTy = FunctionType::get(
         Type::getVoidTy(Ctx), PointerType::getUnqual(Ctx), false);
     auto *ResumeCall = Builder.CreateCall(ResumeTy, ResumeAddr, {NewCall});
+    ResumeCall->setCallingConv(CallingConv::Fast);

     // We can't insert the 'ret' instruction and adjust the cc until the
     // function has been split, so remember this for later.
     Shape.SymmetricTransfers.push_back(ResumeCall);

     NewCall = ResumeCall;
   }

   CB->replaceAllUsesWith(NewCall);
   CB->eraseFromParent();

The breaking only happens with optimizations in a workload.

ChuanqiXu9 added a commit that referenced this pull request May 23, 2024
…call from 'llvm.coro.await.suspend.handle' as fast

See the post commit message in
#89751

We met a regression due to a change of calling convention of this patch.
Previously, the calling convention of indirect resume calls is always
fast.

And in this patch, although we tried to take care of it in the cloner,
we forget the case that we have to update the resuming calls in the
ramp functions. So this is the root cause of the downstream failure.

This patch tries to mark the generated resuming calls as fast
immediately after they got created to make sure the calling convention
is correct.
@ChuanqiXu9
Copy link
Member

I land it directly given it looks straightforward and trivial in 31f1590

See the commit message if you're interested.

@ChuanqiXu9
Copy link
Member

Oh, it looks like some bots are not happy. I send the PR at #93167

ChuanqiXu9 added a commit to ChuanqiXu9/llvm-project that referenced this pull request May 23, 2024
…call from 'llvm.coro.await.suspend.handle' as fast

See the post commit message in
llvm#89751

We met a regression due to a change of calling convention of this patch.
Previously, the calling convention of indirect resume calls is always
fast.

And in this patch, although we tried to take care of it in the cloner,
we forget the case that we have to update the resuming calls in the
ramp functions. So this is the root cause of the downstream failure.

This patch tries to mark the generated resuming calls as fast
immediately after they got created to make sure the calling convention
is correct.
ChuanqiXu9 added a commit that referenced this pull request May 23, 2024
…call from 'llvm.coro.await.suspend.handle' as fast (#93167)

See the post commit message in
#89751

We met a regression due to a change of calling convention of this patch.
Previously, the calling convention of indirect resume calls is always
fast.

And in this patch, although we tried to take care of it in the cloner,
we forget the case that we have to update the resuming calls in the ramp
functions. So this is the root cause of the downstream failure.

This patch tries to mark the generated resuming calls as fast
immediately after they got created to make sure the calling convention
is correct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:codegen IR generation bugs: mangling, exceptions, etc. clang Clang issues not falling into any other category coroutines C++20 coroutines llvm:ir llvm:transforms PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants