Skip to content

[CDRIVER-5611] Refactor of validation fuzz test to expand coverage and speed #1822

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

vector-of-bool
Copy link
Contributor

Summary

The existing fuzz_test_validate suffered from two issues that made fuzzing less than optimal:

  1. We unconditionally pass all validation flags to the bson_validate, which will reduce the branch coverage that is reachable by the fuzzer.
  2. The fuzz input is treated as a full BSON document, and would immediately reject if the header/trailer was invalid. This can massively slow down the fuzzer's ability to discover valid inputs, since inserting or removing any bytes from the test data would immediately reject in bson_init_static until the fuzzer would coincidentally discover the need to update the document header to match the new data size. Fuzzing would stagnate at around 475 corpus items.

This changeset addresses both issues:

  1. The first byte of test data is treated as a set of flags that control the behavior of bson_validate. This allows the fuzzer to test multiple branches within validation.
  2. The remaining bytes of test data are treated as the inner contents of a BSON object. A new valid header and trailer is constructed on-the-fly to wrap the data, ensuring that bson_init_static will always succeed and we can actually get to testing bson_validate. This allows the fuzzer to generate over 500 valid inputs in the first 30 seconds of running (with -fork=32).

Additionally:

  1. This changeset introduces a libFuzzer dictionary of valid/invalid BSON byte snippets that the fuzzer can use to insert into test data. Because generating these byte sequences by hand is non-trivial, a small Python script generates this file.
  2. Because test input is no longer a simple BSON doc, an extra exe is added that can be used to execute bson_validate on a corpus item directly. This can be used to repro any issue discovered by the fuzz tester.

- Fuzz test accepts the inner body of a document as
  the test input, rather than the full document.
  This greatly accelerates fuzz corpus discovery.
- Treat the first byte of the test input as a set of
  flags to control behavior. This allows the fuzzer
  to discover different branches that are hidden
  behind the presence/absence of certain validation
  flags.
- Add a separate exe that reads a corpus item and
  runs the same validation code on it (for issue repro)
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor feedback; otherwise, LGTM.

@@ -5,8 +5,10 @@ set_property(TARGET fuzz_test_properties APPEND PROPERTY INTERFACE_LINK_OPTIONS
add_executable(fuzz_test_init_from_json EXCLUDE_FROM_ALL fuzz_test_init_from_json.c)
target_link_libraries(fuzz_test_init_from_json PRIVATE fuzz_test_properties)

add_executable(fuzz_test_validate EXCLUDE_FROM_ALL fuzz_test_validate.c)
add_executable(fuzz_test_validate EXCLUDE_FROM_ALL fuzz_test_validate.cpp)
Copy link
Contributor

@eramongodb eramongodb Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend updating the description of the MONGO_FUZZ CMake option to include "(requires C++)" to document the additional toolchain dependency required to build fuzz testing targets. Also consider guarding these targets with if (CMAKE_CXX_COMPILER) as done for the mongoc-cxx-check target.

@@ -5,8 +5,10 @@ set_property(TARGET fuzz_test_properties APPEND PROPERTY INTERFACE_LINK_OPTIONS
add_executable(fuzz_test_init_from_json EXCLUDE_FROM_ALL fuzz_test_init_from_json.c)
target_link_libraries(fuzz_test_init_from_json PRIVATE fuzz_test_properties)

add_executable(fuzz_test_validate EXCLUDE_FROM_ALL fuzz_test_validate.c)
add_executable(fuzz_test_validate EXCLUDE_FROM_ALL fuzz_test_validate.cpp)
add_executable(validate-repro EXCLUDE_FROM_ALL validate-repro.cpp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
add_executable(validate-repro EXCLUDE_FROM_ALL validate-repro.cpp)
add_executable(validate_repro EXCLUDE_FROM_ALL validate-repro.cpp)

Consistent casing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to switch everything else to be hyphenated? Most components are hyphenated currently.

Copy link
Contributor

@eramongodb eramongodb Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

Aside: a quick grep reveals the list of user-defined CMake targets (which are --target-able, excluding those generated by the build system) which use _ is currently limited to the following list:

bson_shared
bson_static
hello_mongoc
mcd_rpc
mongo_c_driver_examples
mongo_c_driver_tests
mongoc_shared
mongoc_static

@@ -0,0 +1,9 @@
#include <stdint.h>

#include "./fuzz_test_validate.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "./fuzz_test_validate.hpp"
#include "fuzz_test_validate.hpp"

The #include "" form is already relative to current file's directory (?).

#include <fstream>
#include <sstream>

#include "./fuzz_test_validate.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "./fuzz_test_validate.hpp"
#include "fuzz_test_validate.hpp"

The #include "" form is already relative to current file's directory (?).

// Treat the first byte as the flags, and the BSON data is what remains:
const uint8_t *bson_doc_data = data + 1;
const size_t bson_data_size = size - 1;
if (bson_data_size > INT32_MAX) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (bson_data_size > INT32_MAX) {
if (bson_data_size > static_cast<std::size_t>(INT32_MAX)) {

Signedness consistency.



def len_prefix(b: BytesIter) -> bytes:
"""Prepend a u32le byte-length prefix to a set of bytes"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Prepend a u32le byte-length prefix to a set of bytes"""
"""Prepend a i32le byte-length prefix to a set of bytes"""

I expect this is encoded as a signed int32 (which matches the BSON spec for encoded lengths).

Comment on lines 283 to 297
def escape(b: bytes) -> Iterable[str]:
s = b.decode("ascii", "backslashreplace")
for u8 in b:
try:
s = chr(u8)
s.encode("ascii") # Try to encode as ASCII
except ValueError:
# Byte is not valid UTF-8, or not valid ASCII
pass
else:
if s.isprintable():
yield s
continue
# Byte is not valid ASCII, or is not a printable char
yield f"\\x{u8:0>2x}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested simplification:

Suggested change
def escape(b: bytes) -> Iterable[str]:
s = b.decode("ascii", "backslashreplace")
for u8 in b:
try:
s = chr(u8)
s.encode("ascii") # Try to encode as ASCII
except ValueError:
# Byte is not valid UTF-8, or not valid ASCII
pass
else:
if s.isprintable():
yield s
continue
# Byte is not valid ASCII, or is not a printable char
yield f"\\x{u8:0>2x}"
def escape(b: bytes) -> Iterable[str]:
for u8 in b:
s = chr(u8)
if s.isascii() and s.isprintable():
yield s
continue
# Byte is not valid ASCII, or is not a printable char
yield f"\\x{u8:0>2x}"

if (rc == 0) {
std::fprintf (stderr, "Validation returned normally (no bug?)\n");
} else if (rc == -1) {
std::fprintf (stderr, "Test case is was rejected by the fuzzer (not a valid corpus item?)\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::fprintf (stderr, "Test case is was rejected by the fuzzer (not a valid corpus item?)\n");
std::fprintf (stderr, "Test case is/was rejected by the fuzzer (not a valid corpus item?)\n");

- `fuzz_test_` prefix is redundant in a directory
  for fuzz tests.
- Use `kebab-case` for target and file names.
@vector-of-bool vector-of-bool merged commit 1e4d7fb into mongodb:master Jan 16, 2025
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants