Skip to content

[llvm][Support][Windows] Avoid crash calling remove_directories() #118677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 12, 2024

Conversation

slydiman
Copy link
Contributor

@slydiman slydiman commented Dec 4, 2024

We faced an unexpected crash in SHELL32_CallFileCopyHooks() on the buildbot lldb-remote-linux-win. The host is Windows Server 2022 w/o any 3rd party shell extensions. See #118032 for more details.
Based on this article.

We faced an unexpected crash in SHELL32_CallFileCopyHooks() on the buildbot [lldb-remote-linux-win](https://lab.llvm.org/staging/#/builders/197/builds/1066).
The host is Windows Server 2022 w/o any 3rd party shell extensions.
See llvm#118032 for more details.
Based on [this article](https://devblogs.microsoft.com/oldnewthing/20120330-00/?p=7963).
@llvmbot
Copy link
Member

llvmbot commented Dec 4, 2024

@llvm/pr-subscribers-llvm-support

@llvm/pr-subscribers-platform-windows

Author: Dmitry Vasilyev (slydiman)

Changes

We faced an unexpected crash in SHELL32_CallFileCopyHooks() on the buildbot lldb-remote-linux-win. The host is Windows Server 2022 w/o any 3rd party shell extensions. See #118032 for more details.
Based on this article.


Full diff: https://github.com/llvm/llvm-project/pull/118677.diff

1 Files Affected:

  • (modified) llvm/lib/Support/Windows/Path.inc (+24-5)
diff --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index c4bd5e24723517..ecea1efefe0098 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -1387,12 +1387,31 @@ std::error_code remove_directories(const Twine &path, bool IgnoreErrors) {
   Path16.push_back(0);
   Path16.push_back(0);
 
-  SHFILEOPSTRUCTW shfos = {};
-  shfos.wFunc = FO_DELETE;
-  shfos.pFrom = Path16.data();
-  shfos.fFlags = FOF_NO_UI;
+  HRESULT HR = CoInitializeEx(NULL, COINIT_MULTITHREADED);
+  if (SUCCEEDED(HR)) {
+    IFileOperation *FileOp;
+    HR = CoCreateInstance(CLSID_FileOperation, NULL, CLSCTX_ALL,
+                          IID_PPV_ARGS(&FileOp));
+    if (SUCCEEDED(HR)) {
+      HR = FileOp->SetOperationFlags(FOF_NO_UI | FOFX_NOCOPYHOOKS);
+      if (SUCCEEDED(HR)) {
+        IShellItem *ShItem = NULL;
+        HR = SHCreateItemFromParsingName(Path16.data(), NULL,
+                                         IID_PPV_ARGS(&ShItem));
+        if (SUCCEEDED(HR)) {
+          HR = FileOp->DeleteItem(ShItem, NULL);
+          ShItem->Release();
+        }
+        if (SUCCEEDED(HR)) {
+          HR = FileOp->PerformOperations();
+        }
+      }
+      FileOp->Release();
+    }
+  }
+  CoUninitialize();
 
-  int result = ::SHFileOperationW(&shfos);
+  int result = FAILED(HR) ? HRESULT_CODE(HR) : 0;
   if (result != 0 && !IgnoreErrors)
     return mapWindowsError(result);
   return std::error_code();

@slydiman slydiman changed the title [Windows] Avoid crash calling remove_directories() [llvm][Support][Windows] Avoid crash calling remove_directories() Dec 5, 2024
Copy link
Collaborator

@labath labath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really feel qualified to review this, so take this with a large grain of salt, but my thoughts are:

  • I'm not convinced that this is the same problem as in the referenced article. Unlike the article, where the crash was in some third party extension, this appears to be a built-in OS component which (I would hope) doesn't have these kinds of bugs. I'm not saying we shouldn't suppress these hooks (this ties into the second point), but I'm not sure that this will actually resolve the underlying issue (which could be some sort of corruption/race in lldb/llvm which manifests as a crash in this place).
  • is there any advantage in doing this instead of the "manual" iterate-over-subdirectories approach (which is what the unix version of this does)? If not, we could probably even share code with that impl. I'm asking this because SHFileOperationW sounds like its meant for a shell, and we're not trying to be a shell -- we just want to make a directory disappear.

@labath labath requested review from labath and removed request for labath December 5, 2024 08:50
@slydiman
Copy link
Contributor Author

slydiman commented Dec 5, 2024

Unlike the article, where the crash was in some third party extension, this appears to be a built-in OS component which (I would hope) doesn't have these kinds of bugs.

The article refers buggy Contoso shell extensions. But it seems Contoso is an alias of Microsoft. No guarantees.
It is possible that the cause is the built-in antivirus that cannot be completely disabled on Windows Server 2022.
Perhaps the modules created by thу lldb-api::TestModuleLoadedNotifys.py test are particularly attractive to antivirus software.
The only thing we know for sure is that the crash occured in SHELL32_CallFileCopyHooks(), which is not needed for this operation. By disabling CopyHooks we will increase performance and improve stability in any case.

@mstorsjo
Copy link
Member

mstorsjo commented Dec 5, 2024

Unlike the article, where the crash was in some third party extension, this appears to be a built-in OS component which (I would hope) doesn't have these kinds of bugs.

The article refers buggy Contoso shell extensions. But it seems Contoso is an alias of Microsoft.

No, Contoso is not an alias of Microsoft. It's a placeholder name of any random third party company/project, used in these docs to not need to name the actual third parties involved.

@slydiman slydiman requested a review from mstorsjo December 5, 2024 14:46
@labath
Copy link
Collaborator

labath commented Dec 6, 2024

The only thing we know for sure is that the crash occured in SHELL32_CallFileCopyHooks(),

Yes, but I don't think that's a sufficient reason to make this change.

which is not needed for this operation. By disabling CopyHooks we will increase performance and improve stability in any case.

This might be a sufficient reason, if true, which I simply just don't know. @mstorsjo, what do you make of this?

@slydiman
Copy link
Contributor Author

slydiman commented Dec 6, 2024

Can we use std::filesystem::remove_all()? Note it requres C++17.

BTW, look at llvm/utils/lit/lit/TestRunner.py, executeBuiltinRm(). It calls SHFileOperationW() directly from python.

@RKSimon
Copy link
Collaborator

RKSimon commented Dec 6, 2024

We now build llvm for C++17 so I don't see why not. Does it work? And could we then move remove_directories out of the windows/Linux Path.inc custom code?

@slydiman
Copy link
Contributor Author

slydiman commented Dec 7, 2024

std::filesystem::remove_all() works. But it uses std::filesystem::path as the argument. std::filesystem::path uses char on POSIX, wchar_t on Windows. So some custom code is still necessary. But remove_directories() may be simplified and unified very much for both POSIX and Windows. I will try to prepare such PR.

@mstorsjo
Copy link
Member

mstorsjo commented Dec 7, 2024

which is not needed for this operation. By disabling CopyHooks we will increase performance and improve stability in any case.

This might be a sufficient reason, if true, which I simply just don't know. @mstorsjo, what do you make of this?

I'm not entirely sure. Based on the linked blog post, this does seem like a recommended way of doing it, but invoking COM objects just for deleting a set of files/directories does also feel a bit like overkill.

We now build llvm for C++17 so I don't see why not.

Even though the compilers generally support C++17, std::filesystem was generally added quite late in many C++ standard library implementations, in particular on Windows. (In libc++, support for it was merged only a couple years ago.)

So if possible, using common code shared with Unix, for traversing the file tree and recursively deleting files/directories, sounds like my preferred way forward as well, unless we have other arguments for not doing that. (Why was this codepath using shell32 for deleting directories in the first place?)

@compnerd
Copy link
Member

compnerd commented Dec 7, 2024

Manually performing the DFS and doing the removal would require that we properly handle long paths and that we handle some of the permission work (e.g. IIRC backup semantics may sometimes be required to remove the file). Even del in cmd has trouble with long paths, so we should be careful to not re-invent the removal logic if not absolutely necessary. I expect that @rnk would also have some thoughts here.

@slydiman
Copy link
Contributor Author

slydiman commented Dec 8, 2024

but invoking COM objects just for deleting a set of files/directories does also feel a bit like overkill.

Please note my second commit is the result of disassembling SHFileOperationW(). I have figured out the parameters for CoInitializeEx() and used ILCreateFromPathW() and SHCreateItemFromIDList() instead of SHCreateItemFromParsingName().
So SHFileOperationW() does it anyway.

So if possible, using common code shared with Unix, for traversing the file tree and recursively deleting files/directories, sounds like my preferred way forward as well

Note executeBuiltinRm() in llvm/utils/lit/lit/TestRunner.py uses SHFileOperationW() on Windows and just shutil.rmtree() for other OS. Why don't we use shutil.rmtree() for Windows too? I know that shutil.rmtree() cannot delete files marked RO and it is possible to fix it using a custom error handler. But I don't see such handler around shutil.rmtree() in executeBuiltinRm() for non-Windows OS.

@mstorsjo
Copy link
Member

mstorsjo commented Dec 8, 2024

Manually performing the DFS and doing the removal would require that we properly handle long paths and that we handle some of the permission work (e.g. IIRC backup semantics may sometimes be required to remove the file). Even del in cmd has trouble with long paths, so we should be careful to not re-invent the removal logic if not absolutely necessary. I expect that @rnk would also have some thoughts here.

FWIW, I’m not sure if the libc++ implementation of these operations have got any extra logic to take care of those corner cases, at least I don’t remember anything extra relating to long paths.

@slydiman
Copy link
Contributor Author

slydiman commented Dec 8, 2024

FYI
microsoft/STL#1921

@slydiman slydiman force-pushed the remove_directories-NoCopyHooks branch from 973a23b to 2cb0548 Compare December 10, 2024 10:27
@slydiman
Copy link
Contributor Author

Can we make a decision regarding this patch as #119146 seems to be unacceptable due to issues with libstdc++. Thanks.

@RKSimon
Copy link
Collaborator

RKSimon commented Dec 11, 2024

Unless we can find a workaround I don't see how we can move to std::filesystem, which is very, very annoying 😞

@mstorsjo
Copy link
Member

Unless we can find a workaround I don't see how we can move to std::filesystem, which is very, very annoying 😞

I can try to file a bug to libstdc++ about it, so that there at least is some path towards being able to use it at some point in the future.

@mstorsjo
Copy link
Member

Unless we can find a workaround I don't see how we can move to std::filesystem, which is very, very annoying 😞

I can try to file a bug to libstdc++ about it, so that there at least is some path towards being able to use it at some point in the future.

Reported that bug in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118003. That's of course no guarantee that it'll be fixed anytime soon though (and from personal experience, implementing std::filesystem entirely correctly on Windows is a bit fiddly), but at least there is a bug report.

@rnk
Copy link
Collaborator

rnk commented Dec 11, 2024

  • is there any advantage in doing this instead of the "manual" iterate-over-subdirectories approach (which is what the unix version of this does)? If not, we could probably even share code with that impl. I'm asking this because SHFileOperationW sounds like its meant for a shell, and we're not trying to be a shell -- we just want to make a directory disappear.

I think reliably recursively deleting files in a directory is actually a difficult problem (see this 2015 talk), so Zach Turner probably reached for the native all-in-one API that was at hand. LLVM's directory deletion doesn't need to be bullet-proof, it needs to be good enough, especially given that the only clients are LLDB, dsymutil, and some unit tests. So, I think maybe Pavel is right, and maybe we should implement this using generic portable code.

@rnk
Copy link
Collaborator

rnk commented Dec 11, 2024

Manually performing the DFS and doing the removal would require that we properly handle long paths and that we handle some of the permission work (e.g. IIRC backup semantics may sometimes be required to remove the file). Even del in cmd has trouble with long paths, so we should be careful to not re-invent the removal logic if not absolutely necessary. I expect that @rnk would also have some thoughts here.

We should already handle long paths, I think I would be more worried about "backup semantics" and other cases where we empty a directory, but then cannot remove the parent directory because it is not yet empty. Those are the kinds of cases where you just want to ask some Windows system DLL to do it for you.

@slydiman slydiman merged commit 925471e into llvm:main Dec 12, 2024
8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 12, 2024

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 6 "test-openmp".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/10129

Here is the relevant piece of the build log for the reference
Step 6 (test-openmp) failure: test (failure)
******************** TEST 'libomp :: tasking/issue-94260-2.c' FAILED ********************
Exit Code: -11

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp   -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -fno-omit-frame-pointer -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/ompt /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic && /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -fno-omit-frame-pointer -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/ompt /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11

--

********************


@mstorsjo
Copy link
Member

Unfortunately, this implementation caused other build issues; it uses atlbase.h which is not available in mingw environments at all. (Also within MSVC, ATL is an optional install, even if it quite often is included, and LLVM didn't use to require ATL before either.) So now building for mingw targets is entirely broken.

I guess this affects the use of CComPtr here. We don't seem to have all that much COM code within LLVM so far, but we have llvm/include/llvm/Support/COM.h with the InitializeCOMRAII wrapper for the COM init itself.

Is it possible to write a similar COM pointer wrapper that we can use, instead of relying on ATL for this? Covering the uses we have here should be quite trivial I guess?

@slydiman
Copy link
Contributor Author

slydiman commented Dec 13, 2024

I will prepare the path w/o CComPtr and atlbase.h in few hours.

BTW, it seems this patch fixed the issue #118032
https://lab.llvm.org/staging/#/builders/197

slydiman added a commit to slydiman/llvm-project that referenced this pull request Dec 13, 2024
slydiman added a commit that referenced this pull request Dec 13, 2024
…and atlbase.h (#119843)

This is the update of #118677. This patch fixes building with mingw.
@slydiman slydiman deleted the remove_directories-NoCopyHooks branch April 18, 2025 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants