Skip to content

Commit f4f85e0

Browse files
authored
[llvm-profdata] Remove MD5 collision check in D147740 (#66544)
This is the patch at https://reviews.llvm.org/D153692, migrating to Github After testing D147740 with multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found. In perfect hashing, the probability of collision for N symbols over K possible hash value is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is 3*10^-8, when N is 10 million the probability is 3*10^-6, so we are probably not going to find an actual case in real world application. (However if K is 2^32, the probability of collision is almost 1, this is indeed a problem, if anyone still use a large profile on 32-bit machine, as hash_code is tied to size_t). Furthermore, when a collision happens we can't do anything to recover it, unless using a multi-map, but that is significantly slower, which contradicts the purpose of optimizing the profile reader. One more thing, since we have been using profiles with MD5 names, and they have to be coming from non-MD5 sources, so if hash collision is to happen, it already happened when we convert a non-MD5 profile to a MD5 one, so there's no point to check for that in the reader, and this feature can be removed.
1 parent 80e810f commit f4f85e0

File tree

3 files changed

+15
-227
lines changed

3 files changed

+15
-227
lines changed

llvm/include/llvm/ProfileData/SampleProf.h

Lines changed: 15 additions & 60 deletions
Original file line numberDiff line numberDiff line change
@@ -1299,19 +1299,16 @@ raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);
12991299
/// performance of insert and query operations especially when hash values of
13001300
/// keys are available a priori, and reduces memory usage if KeyT has a large
13011301
/// size.
1302-
/// When performing any action, if an existing entry with a given key is found,
1303-
/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's
1304-
/// original key exists, this class checks if the given key actually matches
1305-
/// the existing entry's original key. If they do not match, this class behaves
1306-
/// as if the entry did not exist (for insertion, this means the new value will
1307-
/// replace the existing entry's value, as if it is newly inserted). If
1308-
/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
1309-
/// are considered equivalent (i.e. hash collision is silently ignored). Given
1310-
/// such feature this class should only be used where it does not affect
1311-
/// compilation correctness, for example, when loading a sample profile.
1312-
/// Assuming the hashing algorithm is uniform, the probability of hash collision
1313-
/// with 1,000,000 entries is
1314-
/// (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
1302+
/// All keys with the same hash value are considered equivalent (i.e. hash
1303+
/// collision is silently ignored). Given such feature this class should only be
1304+
/// used where it does not affect compilation correctness, for example, when
1305+
/// loading a sample profile.
1306+
/// Assuming the hashing algorithm is uniform, we use the formula
1307+
/// 1 - Permute(n, k) / n ^ k where n is the universe size and k is number of
1308+
/// elements chosen at random to calculate the probability of collision. With
1309+
/// 1,000,000 entries the probability is negligible:
1310+
/// 1 - (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
1311+
/// Source: https://en.wikipedia.org/wiki/Birthday_problem
13151312
template <template <typename, typename, typename...> typename MapT,
13161313
typename KeyT, typename ValueT, typename... MapTArgs>
13171314
class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
@@ -1325,42 +1322,12 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
13251322
using iterator = typename base_type::iterator;
13261323
using const_iterator = typename base_type::const_iterator;
13271324

1328-
private:
1329-
// If the value type has getKey(), retrieve its original key for comparison.
1330-
template <typename U = mapped_type,
1331-
typename = decltype(U().template getKey<original_key_type>())>
1332-
static bool
1333-
CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
1334-
original_key_type *ExistingKeyIfDifferent = nullptr) {
1335-
const original_key_type &ExistingKey =
1336-
ExistingValue.template getKey<original_key_type>();
1337-
bool Result = (Key == ExistingKey);
1338-
if (!Result && ExistingKeyIfDifferent)
1339-
*ExistingKeyIfDifferent = ExistingKey;
1340-
return Result;
1341-
}
1342-
1343-
// If getKey() does not exist, this overload is selected, which assumes all
1344-
// keys with the same hash are equivalent.
1345-
static bool CheckKeyMatch(...) { return true; }
1346-
1347-
public:
13481325
template <typename... Ts>
13491326
std::pair<iterator, bool> try_emplace(const key_type &Hash,
13501327
const original_key_type &Key,
13511328
Ts &&...Args) {
13521329
assert(Hash == hash_value(Key));
1353-
auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
1354-
if (!Ret.second) {
1355-
original_key_type ExistingKey;
1356-
if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
1357-
dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
1358-
<< " has same hash value " << Hash << "\n";
1359-
Ret.second = true;
1360-
Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
1361-
}
1362-
}
1363-
return Ret;
1330+
return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
13641331
}
13651332

13661333
template <typename... Ts>
@@ -1382,17 +1349,15 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
13821349
key_type Hash = hash_value(Key);
13831350
auto It = base_type::find(Hash);
13841351
if (It != base_type::end())
1385-
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
1386-
return It;
1352+
return It;
13871353
return base_type::end();
13881354
}
13891355

13901356
const_iterator find(const original_key_type &Key) const {
13911357
key_type Hash = hash_value(Key);
13921358
auto It = base_type::find(Hash);
13931359
if (It != base_type::end())
1394-
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
1395-
return It;
1360+
return It;
13961361
return base_type::end();
13971362
}
13981363

@@ -1433,19 +1398,9 @@ class SampleProfileMap
14331398
Ctx);
14341399
}
14351400

1436-
// Overloaded find() to lookup a function by name. This is called by IPO
1437-
// passes with an actual function name, and it is possible that the profile
1438-
// reader converted function names in the profile to MD5 strings, so we need
1439-
// to check if either representation matches.
1401+
// Overloaded find() to lookup a function by name.
14401402
iterator find(StringRef Fname) {
1441-
uint64_t Hash = hashFuncName(Fname);
1442-
auto It = base_type::find(hash_code(Hash));
1443-
if (It != end()) {
1444-
StringRef CtxName = It->second.getContext().getName();
1445-
if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
1446-
return It;
1447-
}
1448-
return end();
1403+
return base_type::find(hashFuncName(Fname));
14491404
}
14501405

14511406
size_t erase(const SampleContext &Ctx) {

llvm/unittests/tools/llvm-profdata/CMakeLists.txt

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,6 @@ set(LLVM_LINK_COMPONENTS
66

77
add_llvm_unittest(LLVMProfdataTests
88
OutputSizeLimitTest.cpp
9-
MD5CollisionTest.cpp
109
)
1110

1211
target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)

llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

Lines changed: 0 additions & 166 deletions
This file was deleted.

0 commit comments

Comments
 (0)