Skip to content

[analyzer] Improve bug report hashing, merge similar reports #98621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 22, 2024

Conversation

NagyDonat
Copy link
Contributor

@NagyDonat NagyDonat commented Jul 12, 2024

Previously there were certain situations where alpha.security.ArrayBoundV2 produced lots of very similar and redundant reports that only differed in their full Description that contained the (negative) byte offset value. (See #86969 for details.)

This change updates the Profile() method of PathSensitiveBugReport to ensure that it uses getShortDescription() instead of the full Description so the standard report deduplication eliminates most of these redundant reports.

Note that the effects of this change are very limited because there are very few checkers that specify a separate short description, and so getShortDescription() practically always defaults to returning the full Description.

For the sake of consistency BasicBugReport::Profile() is also updated to use the short description. (If I recall correctly there are no checkers that use BasicBugReport with separate short and long descriptions.)

This commit also includes some small code quality improvements in ArrayBoundV2 that are IMO too trivial to be moved into a separate commit.

Previously alpha.security.ArrayBoundV2 displayed the (negative) offset
value when it reported an underflow, but this produced lots of very
similar and redundant reports in certain situations.

After this commit the offset won't be printed so the usual deduplication
will handle these reports as equivalent (and print only one of them).

See llvm#86969 for background.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:static analyzer labels Jul 12, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 12, 2024

@llvm/pr-subscribers-clang

Author: Donát Nagy (NagyDonat)

Changes

Previously alpha.security.ArrayBoundV2 displayed the (negative) offset value when it reported an underflow, but this produced lots of very similar and redundant reports in certain situations.

After this commit the offset won't be printed so the usual deduplication will handle these reports as equivalent (and print only one of them).

See #86969 for background.


Full diff: https://github.com/llvm/llvm-project/pull/98621.diff

2 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp (+8-7)
  • (modified) clang/test/Analysis/out-of-bounds-diagnostics.c (+24-2)
diff --git a/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp b/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
index f82288f1099e8..f177041abc411 100644
--- a/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
@@ -374,13 +374,14 @@ static std::optional<int64_t> getConcreteValue(std::optional<NonLoc> SV) {
 
 static Messages getPrecedesMsgs(const SubRegion *Region, NonLoc Offset) {
   std::string RegName = getRegionName(Region);
-  SmallString<128> Buf;
-  llvm::raw_svector_ostream Out(Buf);
-  Out << "Access of " << RegName << " at negative byte offset";
-  if (auto ConcreteIdx = Offset.getAs<nonloc::ConcreteInt>())
-    Out << ' ' << ConcreteIdx->getValue();
+
+  // We're not reporting the Offset, because we don't want to spam the user
+  // with similar reports that differ only in different offset values.
+  // See https://github.com/llvm/llvm-project/issues/86969 for details.
+  (void)Offset;
+
   return {formatv("Out of bound access to memory preceding {0}", RegName),
-          std::string(Buf)};
+          formatv("Access of {0} at negative byte offset", RegName)};
 }
 
 /// Try to divide `Val1` and `Val2` (in place) by `Divisor` and return true if
@@ -609,7 +610,7 @@ void ArrayBoundCheckerV2::performCheck(const Expr *E, CheckerContext &C) const {
   // CHECK UPPER BOUND
   DefinedOrUnknownSVal Size = getDynamicExtent(State, Reg, SVB);
   if (auto KnownSize = Size.getAs<NonLoc>()) {
-    // In a situation where both overflow and overflow are possible (but the
+    // In a situation where both underflow and overflow are possible (but the
     // index is either tainted or known to be invalid), the logic of this
     // checker will first assume that the offset is non-negative, and then
     // (with this additional assumption) it will detect an overflow error.
diff --git a/clang/test/Analysis/out-of-bounds-diagnostics.c b/clang/test/Analysis/out-of-bounds-diagnostics.c
index 92f983d8b1561..10cfe95a79a55 100644
--- a/clang/test/Analysis/out-of-bounds-diagnostics.c
+++ b/clang/test/Analysis/out-of-bounds-diagnostics.c
@@ -6,7 +6,7 @@ int TenElements[10];
 void arrayUnderflow(void) {
   TenElements[-3] = 5;
   // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
-  // expected-note@-2 {{Access of 'TenElements' at negative byte offset -12}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
 }
 
 int underflowWithDeref(void) {
@@ -14,7 +14,29 @@ int underflowWithDeref(void) {
   --p;
   return *p;
   // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
-  // expected-note@-2 {{Access of 'TenElements' at negative byte offset -4}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
+}
+
+int rng(void);
+int getIndex(void) {
+  switch (rng()) {
+    case 1: return -152;
+    case 2: return -160;
+    case 3: return -168;
+    default: return -172;
+  }
+}
+
+void gh86959(void) {
+  // Previously code like this produced many almost-identical bug reports that
+  // only differed in the byte offset value (which was reported by the checker
+  // at that time). Verify that now we only see one report.
+
+  // expected-note@+1 {{Entering loop body}}
+  while (rng())
+    TenElements[getIndex()] = 10;
+  // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
 }
 
 int scanf(const char *restrict fmt, ...);

@llvmbot
Copy link
Member

llvmbot commented Jul 12, 2024

@llvm/pr-subscribers-clang-static-analyzer-1

Author: Donát Nagy (NagyDonat)

Changes

Previously alpha.security.ArrayBoundV2 displayed the (negative) offset value when it reported an underflow, but this produced lots of very similar and redundant reports in certain situations.

After this commit the offset won't be printed so the usual deduplication will handle these reports as equivalent (and print only one of them).

See #86969 for background.


Full diff: https://github.com/llvm/llvm-project/pull/98621.diff

2 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp (+8-7)
  • (modified) clang/test/Analysis/out-of-bounds-diagnostics.c (+24-2)
diff --git a/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp b/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
index f82288f1099e8..f177041abc411 100644
--- a/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
@@ -374,13 +374,14 @@ static std::optional<int64_t> getConcreteValue(std::optional<NonLoc> SV) {
 
 static Messages getPrecedesMsgs(const SubRegion *Region, NonLoc Offset) {
   std::string RegName = getRegionName(Region);
-  SmallString<128> Buf;
-  llvm::raw_svector_ostream Out(Buf);
-  Out << "Access of " << RegName << " at negative byte offset";
-  if (auto ConcreteIdx = Offset.getAs<nonloc::ConcreteInt>())
-    Out << ' ' << ConcreteIdx->getValue();
+
+  // We're not reporting the Offset, because we don't want to spam the user
+  // with similar reports that differ only in different offset values.
+  // See https://github.com/llvm/llvm-project/issues/86969 for details.
+  (void)Offset;
+
   return {formatv("Out of bound access to memory preceding {0}", RegName),
-          std::string(Buf)};
+          formatv("Access of {0} at negative byte offset", RegName)};
 }
 
 /// Try to divide `Val1` and `Val2` (in place) by `Divisor` and return true if
@@ -609,7 +610,7 @@ void ArrayBoundCheckerV2::performCheck(const Expr *E, CheckerContext &C) const {
   // CHECK UPPER BOUND
   DefinedOrUnknownSVal Size = getDynamicExtent(State, Reg, SVB);
   if (auto KnownSize = Size.getAs<NonLoc>()) {
-    // In a situation where both overflow and overflow are possible (but the
+    // In a situation where both underflow and overflow are possible (but the
     // index is either tainted or known to be invalid), the logic of this
     // checker will first assume that the offset is non-negative, and then
     // (with this additional assumption) it will detect an overflow error.
diff --git a/clang/test/Analysis/out-of-bounds-diagnostics.c b/clang/test/Analysis/out-of-bounds-diagnostics.c
index 92f983d8b1561..10cfe95a79a55 100644
--- a/clang/test/Analysis/out-of-bounds-diagnostics.c
+++ b/clang/test/Analysis/out-of-bounds-diagnostics.c
@@ -6,7 +6,7 @@ int TenElements[10];
 void arrayUnderflow(void) {
   TenElements[-3] = 5;
   // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
-  // expected-note@-2 {{Access of 'TenElements' at negative byte offset -12}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
 }
 
 int underflowWithDeref(void) {
@@ -14,7 +14,29 @@ int underflowWithDeref(void) {
   --p;
   return *p;
   // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
-  // expected-note@-2 {{Access of 'TenElements' at negative byte offset -4}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
+}
+
+int rng(void);
+int getIndex(void) {
+  switch (rng()) {
+    case 1: return -152;
+    case 2: return -160;
+    case 3: return -168;
+    default: return -172;
+  }
+}
+
+void gh86959(void) {
+  // Previously code like this produced many almost-identical bug reports that
+  // only differed in the byte offset value (which was reported by the checker
+  // at that time). Verify that now we only see one report.
+
+  // expected-note@+1 {{Entering loop body}}
+  while (rng())
+    TenElements[getIndex()] = 10;
+  // expected-warning@-1 {{Out of bound access to memory preceding 'TenElements'}}
+  // expected-note@-2 {{Access of 'TenElements' at negative byte offset}}
 }
 
 int scanf(const char *restrict fmt, ...);

Copy link
Contributor

@steakhal steakhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense.
I wonder if we could have something in between. I'm thinking of having the concrete offset as a separate note, instead of having it part of the primary message.
That way after BR selection, we would still deterministically pick the shortest parh, and also have the offset in the path notes.

I find this better than dropping the concrete offset you worked so hard acquiring. WDYT?

This commit re-adds the concrete offset value at the end of the (long)
`Description` of the underflow report, but ensures that the `Profile()`
method of `PathSensitiveBugReport` only uses the _short_ description (as
returned by `getShortDescription()`) instead of the the `Description`
field.

For the sake of consistency, the same modification is also applied to
`BasicBugReport::Profile()`.

This modification of `Profile()` is a no-op for most of the checkers,
because there are very few checkers that set a separate short
description for their bug reports and `getShortDescription()` defaults
to returning the long `Description` when the field `ShortDescription` is
an empty string (i.e. unspecified).

I'd say that it was a bug that the short description (which is arguably
_the_ human-readable "hash" of the report) wasn't included in the hash
calculations performed by `Profile()`.

On the other hand, I think it'll be useful that `Profile()` ignores the
long description, because then we'll have a nice place where we can
print the "nice to mention, but not enough to create a fundamentally
different report" secondary information.

(Note that the long description is essentially the "final note on the
bug path" and the other notes are also ignored by `Profile()` -- because
they are calculated later by the visitors.)
@NagyDonat
Copy link
Contributor Author

I wonder if we could have something in between. I'm thinking of having the concrete offset as a separate note, instead of having it part of the primary message. That way after BR selection, we would still deterministically pick the shortest parh, and also have the offset in the path notes.

Actually I had a better solution for achieving this: during my earlier experimentation I noticed that the Profile() methods of the BugReporter use the (long) Description field (which is "Access of <region> at negative byte offset -124", the final note on the bug path), but ignore the ShortDescription field (which is "Out of bound access to memory preceding <region>", the stand-alone warning message). In the commit that I just pushed, I tweaked Profile() to ensure that it uses the short description and ignores the long one -- which yields the right deduplication behavior.

I'd guess that ShortDescription was missing from the Profile() method because it was introduced at a later point in the development, and there are very few checkers that actually set different short and long descriptions.

(For more details see the commit message of the commit that I just pushed onto this PR.)

@NagyDonat
Copy link
Contributor Author

NagyDonat commented Jul 12, 2024

(Technical detail: I'll be on vacation during the next week, so I won't see updates on this PR until the 22nd of July. If you want to merge this PR, feel free to do so.)

@NagyDonat NagyDonat changed the title [analyzer] Don't display the offset value in underflows [analyzer] Improve bug report hashing, merge similar reports Jul 12, 2024
@steakhal
Copy link
Contributor

(Technical detail: I'll be on vacation during the next week, so I won't see updates on this PR until the 22nd of July. If you want to merge this PR, feel free to do so.)

No need to rush. Have a nice one!

@NagyDonat
Copy link
Contributor Author

NagyDonat commented Jul 22, 2024

I'm back from the vacation, so I would like to restart and conclude this review process. @steakhal or anybody else please have a look at the PR when it's convenient for you.

@NagyDonat
Copy link
Contributor Author

(Btw it seems that I accidentally added everyone as an "assignee" instead of a reviewer...)

Copy link
Contributor

@steakhal steakhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I only had one minor question inline.

hash.AddString(Description);
hash.AddString(getShortDescription());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that you use a member function here instead of the member variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's intentional, because the method call getShortDescription() is equivalent to ShortDescripton.empty() ? Description : ShortDescription. (The common case when the short and full descriptions are identical is represented internally by an empty ShortDescription and the shared value stored in Description.)

@NagyDonat NagyDonat merged commit 2bc38dc into llvm:main Jul 22, 2024
4 of 7 checks passed
@NagyDonat NagyDonat deleted the arrayboundv2_simplify_underflow_report branch July 22, 2024 09:44
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
Previously there were certain situations where
alpha.security.ArrayBoundV2 produced lots of very similar and redundant
reports that only differed in their full `Description` that contained
the (negative) byte offset value. (See
#86969 for details.)

This change updates the `Profile()` method of `PathSensitiveBugReport`
to ensure that it uses `getShortDescription()` instead of the full
`Description` so the standard report deduplication eliminates most of
these redundant reports.

Note that the effects of this change are very limited because there are
very few checkers that specify a separate short description, and so
`getShortDescription()` practically always defaults to returning the
full `Description`.

For the sake of consistency `BasicBugReport::Profile()` is also updated
to use the short description. (Right now there are no checkers that use
`BasicBugReport` with separate long and short descriptions.)

This commit also includes some small code quality improvements in
`ArrayBoundV2` that are IMO too trivial to be moved into a separate
commit.

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60251171
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants