You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all its matches and dividing by the universe count. Such enumeration is costly, especially in RAM. On a real world dataset, such as used in
it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such estimate can be very wrong though, leading to under or over subsampling, thus innacurracies or RAM explosions.
Solutions
Improve the subsampling/bootstrapping mechanism, maybe auto-tuned via binary search, etc.
Introduce a dedicated pattern matcher callback that takes less memory, maybe only saving the atom hashes rather than the atoms themselves, or maybe saving nothing at all but still somehow guarantying not to recount matches.
The text was updated successfully, but these errors were encountered:
Boostrapping seems to work fairly well, however it's still too slow for large data sets, thus introducing a dedicated pattern matcher callback could be welcome.
ngeiswei
added a commit
to ngeiswei/miner
that referenced
this issue
Jan 28, 2021
Problem
Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all its matches and dividing by the universe count. Such enumeration is costly, especially in RAM. On a real world dataset, such as used in
https://github.com/opencog/miner/tree/master/examples/miner/mozi-ai
or
https://github.com/ngeiswei/reasoning-bio-as-xp
it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such estimate can be very wrong though, leading to under or over subsampling, thus innacurracies or RAM explosions.
Solutions
The text was updated successfully, but these errors were encountered: