Skip to content

[lldb][test] Fix flaky DIL array subscript test by reducing array indexes #141738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 28, 2025

Conversation

DavidSpickett
Copy link
Collaborator

@DavidSpickett DavidSpickett commented May 28, 2025

This test has been flaky on Linaro's Windows on Arm bot and I was
able to reproduce it within 10 or so runs locally.

When it fails it's because we failed to read the value of int_arr[100].
When that happens the memory looks like this:

[0x0000006bf88fd000-0x0000006bf8900000) rw-      <-- sp (0x0000006bf88ffe20)
[0x0000006bf8900000-0x0000025fec900000) ---      <-- int_arr[100] (0x0000006bf8900070)

The first region is the stack and the stack pointer is pointing
within that region, as expected.

The second region is where we are trying to read int_arr[100] from
and this is not mapped because we're trying to read above the start
of the stack.

Sometimes the test passes I think because ASLR / DYNAMICBASE moves
the start of the stack down enough so there is some readable memory
at the top.

https://learn.microsoft.com/en-us/cpp/build/reference/dynamicbase?view=msvc-170

Note "Because ASLR can't be disabled on ARM, ARM64, or ARM64EC architectures,
/DYNAMICBASE:NO isn't supported for these targets.". Which means on this bot,
the layout is definitely being randomised.

We don't need to be testing indexes this large. So I've changed the two
test indexes to 3 (1 beyond the end) and 10 (a larger distance beyond the end).
We know that index 42 always worked on the bot, so 10 should be fine, and
did not fail locally.

@llvmbot
Copy link
Member

llvmbot commented May 28, 2025

@llvm/pr-subscribers-lldb

Author: David Spickett (DavidSpickett)

Changes

This test has been flaky on Linaro's Windows on Arm bot and I was able to reproduce it within 10 or so runs locally.

When it fails it's because we failed to read the value of int_arr[100]. When that happens the memory looks like this:

[0x0000006bf88fd000-0x0000006bf8900000) rw-      &lt;-- sp (0x0000006bf88ffe20)
[0x0000006bf8900000-0x0000025fec900000) ---      &lt;-- int_arr[100] (0x0000006bf8900070)

The first region is the stack and the stack pointer is pointing within that region, as expected.

The second region is where we are trying to read int_arr[100] from and this is not mapped because we're trying to read above the start of the stack.

Sometimes the test passes I think because ASLR / DYNAMICBASE moves the start of the stack down enough so there is some readable memory at the top.

https://learn.microsoft.com/en-us/cpp/build/reference/dynamicbase?view=msvc-170

Note "Because ASLR can't be disabled on ARM, ARM64, or ARM64EC architectures, /DYNAMICBASE:NO isn't supported for these targets.". Which means on this bot, the layout is definitely being randomised.

To fix this, I've setup main to pad its stack frame with 100 integers, then called a new function where we declare all the variables that used to be in main.

We have to pad this way because if we pad around int_arr, in the same function, the compiler could decide to move int_arr to the end of the frame, and defeat the padding.

By padding main, it doesn't matter how the compiler treats that as long as it makes main's frame larger.

We saw this failure on Windows only but I wouldn't be surprised if it can happen on Linux as well.


Full diff: https://github.com/llvm/llvm-project/pull/141738.diff

2 Files Affected:

  • (modified) lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/TestFrameVarDILArraySubscript.py (-2)
  • (modified) lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/main.cpp (+21-1)
diff --git a/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/TestFrameVarDILArraySubscript.py b/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/TestFrameVarDILArraySubscript.py
index 66f4c62761004..740a64f1477ff 100644
--- a/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/TestFrameVarDILArraySubscript.py
+++ b/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/TestFrameVarDILArraySubscript.py
@@ -19,8 +19,6 @@ def expect_var_path(self, expr, compare_to_framevar=False, value=None, type=None
             self.runCmd("settings set target.experimental.use-DIL true")
             self.assertEqual(value_dil.GetValue(), value_frv.GetValue())
 
-    # int_arr[100] sometimes points to above the stack region, fix coming soon.
-    @skipIfWindows
     def test_subscript(self):
         self.build()
         lldbutil.run_to_source_breakpoint(
diff --git a/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/main.cpp b/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/main.cpp
index 485666ae46c20..5efbefeea684b 100644
--- a/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/main.cpp
+++ b/lldb/test/API/commands/frame/var-dil/basics/ArraySubscript/main.cpp
@@ -1,6 +1,21 @@
 #include <vector>
 
-int main(int argc, char **argv) {
+// The variables being tested here are not delcared in main for a specific
+// reason. This is because during testing we access some of them out of their
+// bounds, such as int_arr[100]. If we put all these variables in main, the
+// compiler could decide to arrange them in such a way, that when combined with
+// how the stack offset was randomised by the OS, and how much was already used
+// by the libc's startup code, int_arr[100] would point to somewhere beyond the
+// top of the stack. Unmapped memory.
+//
+// To fix this we could pad around int_arr, but the compiler is free to shuffle
+// the order of stack variables. So instead, we pad main's frame so that the
+// stack pointer during fn will be far enough down that int_arr[100] will still
+// point to stack memory. It does not matter what the compiler does with the
+// padding in main as far as ordering, as long as it increases main's stack
+// frame size.
+
+int fn() {
   int int_arr[] = {1, 2, 3};
   int *int_ptr = int_arr;
   int(&int_arr_ref)[3] = int_arr;
@@ -30,3 +45,8 @@ int main(int argc, char **argv) {
 
   return 0; // Set a breakpoint here
 }
+
+int main(int argc, char **argv) {
+  int stack_frame_padding[100];
+  return fn();
+}

@DavidSpickett
Copy link
Collaborator Author

DavidSpickett commented May 28, 2025

This does not fix the test that is xfailed, where you can't read the first element of a vector. That's something entirely different that I don't plan to look into myself.

@@ -19,8 +19,6 @@ def expect_var_path(self, expr, compare_to_framevar=False, value=None, type=None
self.runCmd("settings set target.experimental.use-DIL true")
self.assertEqual(value_dil.GetValue(), value_frv.GetValue())

# int_arr[100] sometimes points to above the stack region, fix coming soon.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does frame var suffer from the same problem? If not, should DIL handle things differently?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean it should be able to say:

int_arr[100] = <unavailable>;

Like registers can do?

I'll check what frame var does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I'm not even sure what DIL is :) So feel free to educate me.

Copy link
Member

@Michael137 Michael137 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summoning @labath @kuilpd who've been much more involved in DIL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the point of this test is to check that the expression really allows you to go beyond the array bounds, so fixing this by ensuring the memory at that address exists is the right thing to do. However, I think the index 100 is unnecessarily big. I think we could just change the test to use indexes 3 and 10 (a "one past" and "reasonably beyond" value), and then you don't need to write the very long comment.

Printing <unavailable> or similar when this indexing lands you into unaddressable memory would be nice (maybe it happens already -- I don't know), but that's a different test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh but the long comments are the most fun :)

I agree though, I'll find a smaller out of bounds that is consistently within the stack.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does frame var suffer from the same problem?

(lldb) frame var int_arr[200]
(int) int_arr[200] = <read memory from 0xc812500160 failed (0 of 4 bytes read)>

With DIL on:

(lldb) settings set target.experimental.use-DIL true
(lldb) frame var int_arr[200]
(int) int_arr[200] = <read memory from 0xc812500160 failed (0 of 4 bytes read)>

Which I think means both are reporting it the same way, and that way seems fine to me.

…exes

This test has been flaky on Linaro's Windows on Arm bot and I was
able to reproduce it within 10 or so runs locally.

When it fails it's because we failed to read the value of int_arr[100].
When that happens the memory looks like this:
```
[0x0000006bf88fd000-0x0000006bf8900000) rw-      <-- sp (0x0000006bf88ffe20)
[0x0000006bf8900000-0x0000025fec900000) ---      <-- int_arr[100] (0x0000006bf8900070)
```
The first region is the stack and the stack pointer is pointing
within that region, as expected.

The second region is where we are trying to read int_arr[100] from
and this is not mapped because we're trying to read above the start
of the stack.

Sometimes the test passes I think because ASLR / DYNAMICBASE moves
the start of the stack down enough so there is some readable memory
at the top.

https://learn.microsoft.com/en-us/cpp/build/reference/dynamicbase?view=msvc-170

Note "Because ASLR can't be disabled on ARM, ARM64, or ARM64EC architectures,
/DYNAMICBASE:NO isn't supported for these targets.". Which means on this bot,
the layout is definitely being randomised.

We don't need to be testing indexes this large. So I've changed the two
test indexes to 3 (1 beyond the end) and 10 (a larger distance beyond the end).
We know that index 42 always worked on the bot, so 10 should be fine, and
did not fail locally.
@DavidSpickett DavidSpickett changed the title [lldb][test] Add stack frame padding to fix flaky DIL array subscript test [lldb][test] Fix flaky DIL array subscript test by recuding array indexes May 28, 2025
@DavidSpickett DavidSpickett changed the title [lldb][test] Fix flaky DIL array subscript test by recuding array indexes [lldb][test] Fix flaky DIL array subscript test by reducing array indexes May 28, 2025
@DavidSpickett DavidSpickett merged commit 0ac3f5e into llvm:main May 28, 2025
8 of 9 checks passed
@DavidSpickett DavidSpickett deleted the lldb-winstack branch May 28, 2025 15:29
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025
…exes (llvm#141738)

This test has been flaky on Linaro's Windows on Arm bot and I was 
able to reproduce it within 10 or so runs locally.

When it fails it's because we failed to read the value of int_arr[100].
When that happens the memory looks like this:
```
[0x0000006bf88fd000-0x0000006bf8900000) rw-      <-- sp (0x0000006bf88ffe20)
[0x0000006bf8900000-0x0000025fec900000) ---      <-- int_arr[100] (0x0000006bf8900070)
```
The first region is the stack and the stack pointer is pointing
within that region, as expected.

The second region is where we are trying to read int_arr[100] from
and this is not mapped because we're trying to read above the start
of the stack.

Sometimes the test passes I think because ASLR / DYNAMICBASE moves
the start of the stack down enough so there is some readable memory
at the top.

https://learn.microsoft.com/en-us/cpp/build/reference/dynamicbase?view=msvc-170

Note "Because ASLR can't be disabled on ARM, ARM64, or ARM64EC
architectures,
/DYNAMICBASE:NO isn't supported for these targets.". Which means on this
bot,
the layout is definitely being randomised.

We don't need to be testing indexes this large. So I've changed the two 
test indexes to 3 (1 beyond the end) and 10 (a larger distance beyond
the end).
We know that index 42 always worked on the bot, so 10 should be fine,
and
did not fail locally.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants