Optimize pattern frequency calculation #26

ngeiswei · 2020-01-22T08:52:32Z

Problem

Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all matches of the pattern and dividing their count by the universe count. Such enumeration is costly, especially in RAM. On a real-world dataset, such as those used in

https://github.com/opencog/miner/tree/master/examples/miner/mozi-ai

or

https://github.com/ngeiswei/reasoning-bio-as-xp

it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such an estimate can be very wrong though, leading to under- or over-subsampling, and thus to inaccuracies or RAM explosions.
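To make the trade-off concrete, here is a minimal Python sketch (not the miner's actual C++ code). It assumes a toy setting where the universe is a plain list and a pattern is a boolean predicate over single items; `exact_frequency` and `subsampled_frequency` are hypothetical names introduced only for illustration.

```python
import random

def exact_frequency(pattern, universe):
    """Enumerate every match, then divide by the universe count.

    Storing all matches is what makes this costly in RAM.
    Assumes a non-empty universe.
    """
    matches = [x for x in universe if pattern(x)]
    return len(matches) / len(universe)

def subsampled_frequency(pattern, universe, sample_size, seed=0):
    """Estimate the frequency from a random subsample of the universe.

    A sample that is too small gives an inaccurate estimate; one that is
    too large brings back the memory cost of the exact computation.
    """
    rng = random.Random(seed)
    sample = rng.sample(universe, min(sample_size, len(universe)))
    hits = sum(1 for x in sample if pattern(x))
    return hits / len(sample)

if __name__ == "__main__":
    universe = list(range(1000))
    is_multiple_of_4 = lambda x: x % 4 == 0
    print(exact_frequency(is_multiple_of_4, universe))
    print(subsampled_frequency(is_multiple_of_4, universe, sample_size=100))
```

In the real miner a pattern can have several conjuncts and the universe count grows accordingly, so the gap between the two approaches is much larger than in this toy case.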

Solutions

Improve the subsampling/bootstrapping mechanism, maybe auto-tuned via binary search, etc.
Introduce a dedicated pattern matcher callback that takes less memory, maybe only saving the atom hashes rather than the atoms themselves, or maybe saving nothing at all but still somehow guaranteeing not to recount matches.
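Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pattern matcher's real callback API: `HashCountingCallback` and `largest_safe_sample_size` are hypothetical names, a match is assumed to have a stable `repr`, and the match count is assumed to be non-decreasing in the subsample size (which holds if subsamples are nested prefixes of a shuffled dataset and the pattern has no negation).

```python
import hashlib

class HashCountingCallback:
    """Count distinct matches while keeping only a short hash of each one."""

    def __init__(self):
        self._seen = set()

    def __call__(self, match):
        # Keep an 8-byte digest instead of the match itself. Hash collisions
        # are possible in principle and would cause a slight undercount.
        data = repr(match).encode("utf-8")
        self._seen.add(hashlib.blake2b(data, digest_size=8).digest())

    @property
    def count(self):
        return len(self._seen)

def largest_safe_sample_size(count_matches, n_total, max_matches):
    """Binary search for the largest subsample size within a match budget.

    count_matches(n) is a hypothetical function returning the number of
    matches on the first n items; it must be non-decreasing in n.
    """
    lo, hi = 0, n_total
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_matches(mid) <= max_matches:
            lo = mid       # mid fits the budget, try larger sizes
        else:
            hi = mid - 1   # mid exceeds the budget, try smaller sizes
    return lo

if __name__ == "__main__":
    callback = HashCountingCallback()
    for match in [("A", "B"), ("A", "B"), ("C", "D")]:
        callback(match)    # the repeated match is not recounted
    print(callback.count)
```

The binary search needs only about log2(n_total) calls to `count_matches`, but each call is itself a matching run, so in practice it would have to abort as soon as the budget is exceeded.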

ngeiswei · 2020-03-30T12:06:26Z

Bootstrapping seems to work fairly well; however, it's still too slow for large data sets, so introducing a dedicated pattern matcher callback could be welcome.

ngeiswei self-assigned this Jan 22, 2020

ngeiswei added the enhancement (New feature or request) label Jan 22, 2020

ngeiswei added a commit to ngeiswei/miner that referenced this issue Jan 28, 2021

Merge pull request opencog#26 from ngeiswei/disable-ignore-var-utest

96bbd42

Ignore test_ignore_var_2() till it gets fixed
