Skip to content

Commit b6bfc61

Browse files
committed
[llvm-profdata] Remove MD5 collision check in D147740
This is the patch at https://reviews.llvm.org/D153692, migrating to Github After testing D147740 with multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found. In perfect hashing, the probability of collision for N symbols over K possible hash value is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is 3*10^-8, when N is 10 million the probability is 3*10^-6, so we are probably not going to find an actual case in real world application. (However if K is 2^32, the probability of collision is almost 1, this is indeed a problem, if anyone still use a large profile on 32-bit machine, as hash_code is tied to size_t). Furthermore, when a collision happens we can't do anything to recover it, unless using a multi-map, but that is significantly slower, which contradicts the purpose of optimizing the profile reader. One more thing, since we have been using profiles with MD5 names, and they have to be coming from non-MD5 sources, so if hash collision is to happen, it already happened when we convert a non-MD5 profile to a MD5 one, so there's no point to check for that in the reader, and this feature can be removed.
1 parent 401296f commit b6bfc61

File tree

3 files changed

+10
-225
lines changed

3 files changed

+10
-225
lines changed

llvm/include/llvm/ProfileData/SampleProf.h

Lines changed: 10 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -1299,19 +1299,13 @@ raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);
12991299
/// performance of insert and query operations especially when hash values of
13001300
/// keys are available a priori, and reduces memory usage if KeyT has a large
13011301
/// size.
1302-
/// When performing any action, if an existing entry with a given key is found,
1303-
/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's
1304-
/// original key exists, this class checks if the given key actually matches
1305-
/// the existing entry's original key. If they do not match, this class behaves
1306-
/// as if the entry did not exist (for insertion, this means the new value will
1307-
/// replace the existing entry's value, as if it is newly inserted). If
1308-
/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
1309-
/// are considered equivalent (i.e. hash collision is silently ignored). Given
1310-
/// such feature this class should only be used where it does not affect
1311-
/// compilation correctness, for example, when loading a sample profile.
1302+
/// All keys with the same hash value are considered equivalent (i.e. hash
1303+
/// collision is silently ignored). Given such feature this class should only be
1304+
/// used where it does not affect compilation correctness, for example, when
1305+
/// loading a sample profile.
13121306
/// Assuming the hashing algorithm is uniform, the probability of hash collision
13131307
/// with 1,000,000 entries is
1314-
/// (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
1308+
/// 1 - (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
13151309
template <template <typename, typename, typename...> typename MapT,
13161310
typename KeyT, typename ValueT, typename... MapTArgs>
13171311
class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
@@ -1325,42 +1319,12 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
13251319
using iterator = typename base_type::iterator;
13261320
using const_iterator = typename base_type::const_iterator;
13271321

1328-
private:
1329-
// If the value type has getKey(), retrieve its original key for comparison.
1330-
template <typename U = mapped_type,
1331-
typename = decltype(U().template getKey<original_key_type>())>
1332-
static bool
1333-
CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
1334-
original_key_type *ExistingKeyIfDifferent = nullptr) {
1335-
const original_key_type &ExistingKey =
1336-
ExistingValue.template getKey<original_key_type>();
1337-
bool Result = (Key == ExistingKey);
1338-
if (!Result && ExistingKeyIfDifferent)
1339-
*ExistingKeyIfDifferent = ExistingKey;
1340-
return Result;
1341-
}
1342-
1343-
// If getKey() does not exist, this overload is selected, which assumes all
1344-
// keys with the same hash are equivalent.
1345-
static bool CheckKeyMatch(...) { return true; }
1346-
1347-
public:
13481322
template <typename... Ts>
13491323
std::pair<iterator, bool> try_emplace(const key_type &Hash,
13501324
const original_key_type &Key,
13511325
Ts &&...Args) {
13521326
assert(Hash == hash_value(Key));
1353-
auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
1354-
if (!Ret.second) {
1355-
original_key_type ExistingKey;
1356-
if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
1357-
dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
1358-
<< " has same hash value " << Hash << "\n";
1359-
Ret.second = true;
1360-
Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
1361-
}
1362-
}
1363-
return Ret;
1327+
return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
13641328
}
13651329

13661330
template <typename... Ts>
@@ -1382,17 +1346,15 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
13821346
key_type Hash = hash_value(Key);
13831347
auto It = base_type::find(Hash);
13841348
if (It != base_type::end())
1385-
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
1386-
return It;
1349+
return It;
13871350
return base_type::end();
13881351
}
13891352

13901353
const_iterator find(const original_key_type &Key) const {
13911354
key_type Hash = hash_value(Key);
13921355
auto It = base_type::find(Hash);
13931356
if (It != base_type::end())
1394-
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
1395-
return It;
1357+
return It;
13961358
return base_type::end();
13971359
}
13981360

@@ -1433,19 +1395,9 @@ class SampleProfileMap
14331395
Ctx);
14341396
}
14351397

1436-
// Overloaded find() to lookup a function by name. This is called by IPO
1437-
// passes with an actual function name, and it is possible that the profile
1438-
// reader converted function names in the profile to MD5 strings, so we need
1439-
// to check if either representation matches.
1398+
// Overloaded find() to lookup a function by name.
14401399
iterator find(StringRef Fname) {
1441-
uint64_t Hash = hashFuncName(Fname);
1442-
auto It = base_type::find(hash_code(Hash));
1443-
if (It != end()) {
1444-
StringRef CtxName = It->second.getContext().getName();
1445-
if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
1446-
return It;
1447-
}
1448-
return end();
1400+
return base_type::find(hashFuncName(Fname));
14491401
}
14501402

14511403
size_t erase(const SampleContext &Ctx) {

llvm/unittests/tools/llvm-profdata/CMakeLists.txt

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,6 @@ set(LLVM_LINK_COMPONENTS
66

77
add_llvm_unittest(LLVMProfdataTests
88
OutputSizeLimitTest.cpp
9-
MD5CollisionTest.cpp
109
)
1110

1211
target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)

llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

Lines changed: 0 additions & 166 deletions
This file was deleted.

0 commit comments

Comments
 (0)