[Support] Avoid a VirtualBox shared folders mmap bug #78597


Merged

Merged 1 commit into llvm:main from the flush-mmap branch on Jan 23, 2024

Conversation

mstorsjo
Member

@mstorsjo mstorsjo commented Jan 18, 2024

In acd8791, a call to FlushFileBuffers was added to work around a rare kernel bug. In 3b9b4d2, the scope of that workaround was limited, for performance reasons, as the flushes are quite expensive.

On VirtualBox shared folders, a memory mapping that has been written to also needs to be flushed explicitly when it is closed, if the output file is renamed before the file handle is closed. Contrary to the kernel bug, this happens consistently on such mounts. In these cases, the output ends up as a file of the right size, but with contents that are all zeros.

Summarized, the sequence that triggers the issue on VirtualBox shared folders is:

file = CreateFile()
mapping = CreateFileMapping(file)
mem = MapViewOfFile()
CloseHandle(mapping)
write(mem)
UnmapViewOfFile(mem)
SetFileInformationByHandle(file, FileRenameInfo)
CloseHandle(file)

With this sequence, the output file always ends up with all zeros. See mstorsjo/llvm-mingw#393 for a full reproduction example.
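
For illustration, a minimal standalone Win32/C++ sketch of that sequence could look as follows (hypothetical file names out.tmp/out.bin, error handling omitted; the linked issue has the complete example):

#include <windows.h>
#include <cstring>
#include <cwchar>

int main() {
  const DWORD Size = 4096;
  // Open the output under a temporary name, with DELETE access so that it
  // can be renamed through the handle later.
  HANDLE File =
      CreateFileW(L"out.tmp", GENERIC_READ | GENERIC_WRITE | DELETE, 0,
                  nullptr, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
  HANDLE Mapping =
      CreateFileMappingW(File, nullptr, PAGE_READWRITE, 0, Size, nullptr);
  void *Mem = MapViewOfFile(Mapping, FILE_MAP_WRITE, 0, 0, Size);
  CloseHandle(Mapping);   // The mapping handle isn't needed after mapping.
  memset(Mem, 'x', Size); // Write through the mapping.
  UnmapViewOfFile(Mem);   // Note: no FlushViewOfFile/FlushFileBuffers here.

  // Rename the still-open file to its final name via the handle.
  wchar_t NewName[MAX_PATH];
  GetFullPathNameW(L"out.bin", MAX_PATH, NewName, nullptr);
  alignas(FILE_RENAME_INFO) char Buf[sizeof(FILE_RENAME_INFO) +
                                     sizeof(NewName)] = {};
  auto *Rename = reinterpret_cast<FILE_RENAME_INFO *>(Buf);
  Rename->ReplaceIfExists = TRUE;
  Rename->FileNameLength = (DWORD)(wcslen(NewName) * sizeof(wchar_t));
  wcscpy(Rename->FileName, NewName);
  SetFileInformationByHandle(File, FileRenameInfo, Rename, (DWORD)sizeof(Buf));
  CloseHandle(File);
  // On a VirtualBox shared folder, out.bin now has the right size but its
  // contents are all zeros.
  return 0;
}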

To avoid this issue, call FlushFileBuffers() when the file may reside on a VirtualBox shared folder. As the flushes are expensive, only do them when the output isn't on a local file system.

The issue with VirtualBox shared folders could also be fixed by calling FlushViewOfFile before UnmapViewOfFile, and doing that could be slightly less expensive than FlushFileBuffers.

Empirically, though, the difference between the two is very small, and as it's not easy to verify whether switching from FlushFileBuffers to FlushViewOfFile still works around the rare kernel bug, keep using FlushFileBuffers for both cases, for code simplicity.
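
For reference, the FlushViewOfFile alternative would amount to roughly the following in unmapImpl (a sketch of the option discussed above, not what this patch does):

    // Flush the dirty pages of the view before tearing it down, instead of
    // flushing via the file handle afterwards.
    ::FlushViewOfFile(Mapping, Size);
    ::UnmapViewOfFile(Mapping);
    // ... then rename via SetFileInformationByHandle and close as before.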

This fixes downstream bug mstorsjo/llvm-mingw#393.

@llvmbot
Member

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-llvm-support
@llvm/pr-subscribers-lld-coff
@llvm/pr-subscribers-platform-windows

@llvm/pr-subscribers-lld

Author: Martin Storsjö (mstorsjo)

Changes

This reverts the functional parts of commit 3b9b4d2.

The conditional call to FlushFileBuffers was added to work around a rare kernel bug, see acd8791 and 3b9b4d2.

However, closing a memory mapping that has been written to can also need a call to FlushFileBuffers in other cases. On a VirtualBox Shared Folders file system, the output file ends up unwritten/unflushed (with the right size, but all zeros) if it is written via a memory mapping and then renamed before the output handle is closed (which is what lld-link does).

Summarized, the sequence that triggers the issue on VirtualBox shared folders is:

file = CreateFile()
mapping = CreateFileMapping(file)
mem = MapViewOfFile()
CloseHandle(mapping)
write(mem)
UnmapViewOfFile(mem)
SetFileInformationByHandle(file, FileRenameInfo)
CloseHandle(file)

With this sequence, the output file always ends up with all zeros. See mstorsjo/llvm-mingw#393 (comment) for a full reproduction example.

The issue can seem elusive, because the FlushFileBuffers() call that works around the rare kernel bug is only made when running on versions of Windows older than 10.0.0.17763.

This fixes downstream bug
mstorsjo/llvm-mingw#393.


Full diff: https://github.com/llvm/llvm-project/pull/78597.diff

1 file affected:

  • (modified) llvm/lib/Support/Windows/Path.inc (+6-20)
diff --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index 2bf68b7972e746f..204dc412972d63d 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -936,30 +936,11 @@ mapped_file_region::mapped_file_region(sys::fs::file_t fd, mapmode mode,
     copyFrom(mapped_file_region());
 }
 
-static bool hasFlushBufferKernelBug() {
-  static bool Ret{GetWindowsOSVersion() < llvm::VersionTuple(10, 0, 0, 17763)};
-  return Ret;
-}
-
-static bool isEXE(StringRef Magic) {
-  static const char PEMagic[] = {'P', 'E', '\0', '\0'};
-  if (Magic.starts_with(StringRef("MZ")) && Magic.size() >= 0x3c + 4) {
-    uint32_t off = read32le(Magic.data() + 0x3c);
-    // PE/COFF file, either EXE or DLL.
-    if (Magic.substr(off).starts_with(StringRef(PEMagic, sizeof(PEMagic))))
-      return true;
-  }
-  return false;
-}
-
 void mapped_file_region::unmapImpl() {
   if (Mapping) {
-
-    bool Exe = isEXE(StringRef((char *)Mapping, Size));
-
     ::UnmapViewOfFile(Mapping);
 
-    if (Mode == mapmode::readwrite && Exe && hasFlushBufferKernelBug()) {
+    if (Mode == mapmode::readwrite) {
       // There is a Windows kernel bug, the exact trigger conditions of which
       // are not well understood.  When triggered, dirty pages are not properly
       // flushed and subsequent process's attempts to read a file can return
@@ -967,6 +948,11 @@ void mapped_file_region::unmapImpl() {
       // sufficient to ensure that this bug is not triggered.
       // The bug only occurs when writing an executable and executing it right
       // after, under high I/O pressure.
+      //
+      // Separately, on VirtualBox Shared Folder mounts, writes via memory maps
+      // always end up unflushed (regardless of version of Windows, unless flushed
+      // with this explicit call, if they are renamed with
+      // SetFileInformationByHandle(FileRenameInfo) before closing the output handle.
       ::FlushFileBuffers(FileHandle);
     }
 

@mstorsjo
Member Author

Also CC @cjacek who might have some more insights into the internal workings of these Windows functions and their interactions.


github-actions bot commented Jan 18, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@aganea
Member

aganea commented Jan 18, 2024

@mstorsjo Are you able to test linking a large PDB by chance? 4GB+ PDB are becoming quite common nowadays, and since the PDB isn’t needed just after link, the previous optimization I did was saving several seconds of link time. Correctness is needed before performance, but this PR would impact cases which don’t need the flush (I assume most of LLD usages on Windows). Can we rather detect problematic cases on LLD startup, like writing to a virtual/network drive, and only disable the optimization there?

@compnerd
Member

In general, I think that if a memory map is involved, then it makes sense to flush the data. I don't think that the default behaviour is to do write-through, for performance reasons.

@mstorsjo
Member Author

@mstorsjo Are you able to test linking a large PDB by chance? 4GB+ PDB are becoming quite common nowadays, and since the PDB isn’t needed just after link, the previous optimization I did was saving several seconds of link time.

I don't have any such scenarios available to test off the bat at least. To understand - what bit is it that takes the extra time in that case? Does FlushFileBuffers() force flushing the large PDB to actual disk, while it otherwise only would keep it in disk cache in memory (and maybe not flush it to disk at all in the end, if it doesn't end up used/needed)?

Correctness is needed before performance, but this PR would impact cases which don’t need the flush (I assume most of LLD usages on Windows). Can we rather detect problematic cases on LLD startup, like writing to a virtual/network drive, and only disable the optimization there?

I'm not sure how we would detect those problematic cases. We could use the same heuristic that we use elsewhere, is_local, which we use internally in setDeleteDisposition. As we saw in the PR about using FileIndex vs hashing the path, is_local can be expensive though, but here we only operate on a few files, not many like clang inputs, so the heuristic would probably still be cheaper than writing a 4 GB PDB to disk. On the file we use in this case in LLD, we will already have called setDeleteDisposition though, but that information isn't available later when doing the unmapping, unless we try to propagate and pass that information through somehow.

@cjacek
Contributor

cjacek commented Jan 18, 2024

I think that instead of FlushFileBuffers, we could use FlushViewOfFile before unmapping. In theory it should be cheaper and doing that with your test case fixes the problem on VirtualBox for me (and hopefully the kernel bug too).

@rnk
Collaborator

rnk commented Jan 18, 2024

I'm pretty hesitant about the idea of flushing files to disk on shutdown. My understanding is that the vast majority of software improvements to filesystem performance are attributable to caching data in memory and deferring disk writes. I can't find a quotation, but I believe I picked this up from Ted Ts'o somewhere (maybe here), but reading that again, I wonder if he's trying to make the opposite point.

Maybe the way forward here is to land the flush operation under a flag so folks can measure performance in different scenarios, since it doesn't seem reasonable to put the burden of proving there is no performance regression on Martin.

@tru
Collaborator

tru commented Jan 18, 2024

What would be the right way to test this? We produce PDB files bigger than 4GB so it's not a huge deal for me to test.

@aganea
Member

aganea commented Jan 18, 2024

@mstorsjo Are you able to test linking a large PDB by chance? 4GB+ PDB are becoming quite common nowadays, and since the PDB isn’t needed just after link, the previous optimization I did was saving several seconds of link time.

I don't have any such scenarios available to test off the bat at least. To understand - what bit is it that takes the extra time in that case? Does FlushFileBuffers() force flushing the large PDB to actual disk, while it otherwise only would keep it in disk cache in memory (and maybe not flush it to disk at all in the end, if it doesn't end up used/needed)?

FlushFileBuffers() does indeed wait until all data is physically written to the storage medium. If I remember correctly, the Windows driver for that medium also asks the ROM on the storage device to flush its internal buffers, which it doesn't normally do, so this can be slow even with SSDs.

What would be the right way to test this? We produce PDB files bigger than 4GB so it's not a huge deal for me to test.

Try lld-link .. /time before/after patch? Or Measure-Command { lld-link .. } with powershell if you want to be more precise.

@aganea
Member

aganea commented Jan 18, 2024

Maybe the way forward here is to land the flush operation under a flag so folks can measure performance in different scenarios, since it doesn't seem reasonable to put the burden of proving there is no performance regression on Martin.

We had the same discussion on this. I'd prefer auto-detection of the problematic target drive, because new flags are not easily discoverable by users and we will end up with this same bug report. But if that's not possible we can also do a flag. In that case, what about making things correct by default and adding a flag for disabling the flush?

@rnk
Collaborator

rnk commented Jan 18, 2024

We had the same discussion on this. I'd prefer auto-detection of the problematic target drive, because new flags are not easily discoverable by users and we will end up with this same bug report. But if that's not possible we can also do a flag. In that case, what about making things correct by default and adding a flag for disabling the flush?

I agree that the flag is not a long term solution for the users affected by the virtualbox issue. It would be a way to commit the code change, allow unaffected users to measure the cost of additional flushing with low effort (flag flips instead of local patches), and then we can refine the auto-detection logic for which devices require the flush, or decide how important it is to have a narrow criteria.

@mstorsjo
Member Author

@mstorsjo Are you able to test linking a large PDB by chance? 4GB+ PDB are becoming quite common nowadays, and since the PDB isn’t needed just after link, the previous optimization I did was saving several seconds of link time.

I don't have any such scenarios available to test off the bat at least. To understand - what bit is it that takes the extra time in that case? Does FlushFileBuffers() force flushing the large PDB to actual disk, while it otherwise only would keep it in disk cache in memory (and maybe not flush it to disk at all in the end, if it doesn't end up used/needed)?

FlushFileBuffers() does indeed wait until all data is physically written to the storage medium. If I remember correctly, the Windows driver for that medium also asks the ROM on the storage device to flush its internal buffers, which it doesn't normally do, so this can be slow even with SSDs.

What would be the right way to test this? We produce PDB files bigger than 4GB so it's not a huge deal for me to test.

Try lld-link .. /time before/after patch? Or Measure-Command { lld-link .. } with powershell if you want to be more precise.

If you can do that, and also measure the performance impact of this change (on top of this one), that'd be great:

diff --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index 204dc412972d..9c75fafc5fdf 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -938,6 +938,7 @@ mapped_file_region::mapped_file_region(sys::fs::file_t fd, mapmode mode,
 
 void mapped_file_region::unmapImpl() {
   if (Mapping) {
+    ::FlushViewOfFile(Mapping, Size);
     ::UnmapViewOfFile(Mapping);
 
     if (Mode == mapmode::readwrite) {
@@ -953,7 +954,7 @@ void mapped_file_region::unmapImpl() {
       // always end up unflushed (regardless of version of Windows, unless flushed
       // with this explicit call, if they are renamed with
       // SetFileInformationByHandle(FileRenameInfo) before closing the output handle.
-      ::FlushFileBuffers(FileHandle);
+//      ::FlushFileBuffers(FileHandle);
     }
 
     ::CloseHandle(FileHandle);

@mstorsjo
Member Author

Try lld-link .. /time before/after patch? Or Measure-Command { lld-link .. } with powershell if you want to be more precise.

If you can do that, and also measure the performance impact of this change (on top of this one), that'd be great:

I took a shot at measuring this now after all. I don't have any case where I build binaries that big, but in building a statically linked clang.exe at 93 MB, I get an 854 MB PDB file, so that's certainly big enough to be measurable.

The difference in speed is indeed notable. Originally (with no flush calls, i.e. running on a new enough version of Windows), linking takes 4.2 seconds. By always calling FlushFileBuffers, this rises to 10.6 seconds, and by using FlushViewOfFile as suggested by @cjacek, it takes 10.4 seconds. (I'm not sure if the difference between the latter two is statistically significant or not, I didn't do a very large number of samples.)

With a breakdown with lld-link /time, I get the following profiles:

No flush
--------
  Input File Reading:             575 ms ( 14.9%)
  GC:                             262 ms (  6.8%)
  Code Layout:                    213 ms (  5.5%)
  Commit Output File:              15 ms (  0.4%)
  PDB Emission (Cumulative):     2736 ms ( 70.7%)
    Add Objects:                 1705 ms ( 44.1%)
      Global Type Hashing:        240 ms (  6.2%)
      GHash Type Merging:         437 ms ( 11.3%)
      Symbol Merging:            1023 ms ( 26.4%)
    Publics Stream Layout:         27 ms (  0.7%)
    TPI Stream Layout:             24 ms (  0.6%)
    Commit to Disk:               719 ms ( 18.6%)
--------------------------------------------------
Total Linking Time:              3871 ms (100.0%)

FlushFileBuffers
----------------

  Input File Reading:             579 ms (  5.5%)
  GC:                             260 ms (  2.5%)
  Code Layout:                    208 ms (  2.0%)
  Commit Output File:             620 ms (  5.8%)
  PDB Emission (Cumulative):     8881 ms ( 83.7%)
    Add Objects:                 1848 ms ( 17.4%)
      Global Type Hashing:        375 ms (  3.5%)
      GHash Type Merging:         444 ms (  4.2%)
      Symbol Merging:            1023 ms (  9.6%)
    Publics Stream Layout:         27 ms (  0.3%)
    TPI Stream Layout:             24 ms (  0.2%)
    Commit to Disk:              6721 ms ( 63.3%)
--------------------------------------------------
Total Linking Time:             10614 ms (100.0%)

FlushViewOfFile
---------------
  Input File Reading:             582 ms (  5.7%)
  GC:                             261 ms (  2.6%)
  Code Layout:                    207 ms (  2.0%)
  Commit Output File:             574 ms (  5.7%)
  PDB Emission (Cumulative):     8448 ms ( 83.4%)
    Add Objects:                 1708 ms ( 16.9%)
      Global Type Hashing:        235 ms (  2.3%)
      GHash Type Merging:         437 ms (  4.3%)
      Symbol Merging:            1030 ms ( 10.2%)
    Publics Stream Layout:         27 ms (  0.3%)
    TPI Stream Layout:             24 ms (  0.2%)
    Commit to Disk:              6428 ms ( 63.4%)
--------------------------------------------------
Total Linking Time:             10134 ms (100.0%)

So this is clearly a performance regression for these cases, while it improves correctness.

I'll try to do a PoC for hooking up heuristics via the is_local function. (I'm not sure if we have any better guess for cases where we'd need to be extra cautious with flushing files?) The heuristic itself will add some runtime cost, but certainly less than the extra overhead in linking large binaries/PDBs.

@mstorsjo mstorsjo changed the title from "[Support] Always call FlushFileBuffers() when unmapping memory on Windows" to "[Support] Avoid a VirtualBox shared folders mmap bug" on Jan 20, 2024
@mstorsjo
Member Author

I updated the patch to call is_local, and do a flush only if that indicates that the file system isn't local, unless we're flushing to avoid the kernel bug anyway.

For the rare kernel bug case, we don't know if calling FlushViewOfFile instead of FlushFileBuffers works. (It probably does, but verifying it is nontrivial.) Therefore, we probably don't want to change that case. For the new heuristic case, I kept calling FlushFileBuffers just for code simplicity reasons, to allow using the same codepath for both cases, because the performance difference between the two flush alternatives seems to be very small in practice.

This keeps the usual much faster performance for writing large PDB files on normal storage, while fixing the correctness in the VirtualBox case.
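
Roughly, the updated gating has the following shape (a sketch only: OutputPath is a hypothetical placeholder, since mapped_file_region only holds the handle, and Exe/hasFlushBufferKernelBug are reused from the earlier workaround; the actual patch may differ in detail):

    if (Mode == mapmode::readwrite) {
      // Flush when the old kernel-bug workaround applies, or when the output
      // doesn't live on a local file system (e.g. VirtualBox shared folders).
      bool DoFlush = Exe && hasFlushBufferKernelBug();
      if (!DoFlush) {
        bool IsLocal = false;
        if (sys::fs::is_local(OutputPath, IsLocal))
          DoFlush = true; // Couldn't tell; err on the side of flushing.
        else
          DoFlush = !IsLocal;
      }
      if (DoFlush)
        ::FlushFileBuffers(FileHandle);
    }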

@mstorsjo
Member Author

I updated the patch to call is_local, and do a flush only if that indicates that the file system isn't local, unless we're flushing to avoid the kernel bug anyway.

For the rare kernel bug case, we don't know if calling FlushViewOfFile instead of FlushFileBuffers works. (It probably does, but verifying it is nontrivial.) Therefore, we probably don't want to change that case. For the new heuristic case, I kept calling FlushFileBuffers just for code simplicity reasons, to allow using the same codepath for both cases, because the performance difference between the two flush alternatives seems to be very small in practice.

This keeps the usual much faster performance for writing large PDB files on normal storage, while fixing the correctness in the VirtualBox case.

@aganea Does this seem like a reasonable compromise, performance wise?

@rnk
Collaborator

rnk commented Jan 22, 2024

I think flushing network output files is actually reasonable behavior in the general case. For local files, it seems reasonable to trust the kernel to flush the file contents to disk "soon", but network filesystems are less reliable, and it is probably better to wait for the bytes to get pushed out over the network.

@aganea
Member

aganea commented Jan 22, 2024

I updated the patch to call is_local, and do a flush only if that indicates that the file system isn't local, unless we're flushing to avoid the kernel bug anyway.
For the rare kernel bug case, we don't know if calling FlushViewOfFile instead of FlushFileBuffers works. (It probably does, but verifying it is nontrivial.) Therefore, we probably don't want to change that case. For the new heuristic case, I kept calling FlushFileBuffers just for code simplicity reasons, to allow using the same codepath for both cases, because the performance difference between the two flush alternatives seems to be very small in practice.
This keeps the usual much faster performance for writing large PDB files on normal storage, while fixing the correctness in the VirtualBox case.

@aganea Does this seem like a reasonable compromise, performance wise?

Yes, thank you for doing this! I would assume the is_local check shouldn't affect the current timings?

Member

@aganea aganea left a comment

LGTM

@mstorsjo
Member Author

I updated the patch to call is_local, and do a flush only if that indicates that the file system isn't local, unless we're flushing to avoid the kernel bug anyway.
For the rare kernel bug case, we don't know if calling FlushViewOfFile instead of FlushFileBuffers works. (It probably does, but verifying it is nontrivial.) Therefore, we probably don't want to change that case. For the new heuristic case, I kept calling FlushFileBuffers just for code simplicity reasons, to allow using the same codepath for both cases, because the performance difference between the two flush alternatives seems to be very small in practice.
This keeps the usual much faster performance for writing large PDB files on normal storage, while fixing the correctness in the VirtualBox case.

@aganea Does this seem like a reasonable compromise, performance wise?

Yes, thank you for doing this! I would assume the is_local check shouldn't affect the current timings?

It shouldn't affect the timings notably, no.

In https://reviews.llvm.org/D155579 we concluded that is_local, and in particular, GetVolumePathNameW, is slow, but that was from the point of view of essentially stat()ing thousands of files. From the point of view of the linker, where we close 1-2 files that we've written, the effect of this is well below the level of noise in the benchmark numbers anyway.
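
(For context, on Windows the is_local check boils down to roughly the following; a simplified sketch, not the exact LLVM implementation. GetVolumePathNameW, which resolves the volume mount point, is the expensive part.)

    static bool isPathLocal(const wchar_t *Path) {
      // Resolve the volume mount point for the path, then classify the drive.
      wchar_t VolumeRoot[MAX_PATH];
      if (!::GetVolumePathNameW(Path, VolumeRoot, MAX_PATH))
        return false; // Unknown; treat as not local.
      UINT Type = ::GetDriveTypeW(VolumeRoot);
      // Network mounts (including VirtualBox shared folders) report
      // DRIVE_REMOTE; fixed disks and RAM disks count as local.
      return Type == DRIVE_FIXED || Type == DRIVE_RAMDISK;
    }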

@mstorsjo mstorsjo merged commit 7ec078e into llvm:main Jan 23, 2024
@mstorsjo mstorsjo deleted the flush-mmap branch January 23, 2024 11:40