
Speed up connectedness by doing it on a smaller hashtable #603

Closed
michael-schwarz wants to merge 3 commits into master

Conversation

michael-schwarz
Member

For FFmpeg, the CFG computation phase previously did not terminate in a reasonable amount of time.
With these changes, a separate, smaller per-function hashtable is used to check connectedness, which seems to work (a rough sketch follows the TODO list below). This also drastically reduces RAM usage, which makes me slightly suspicious.

TODO:

  • Check this is actually correct
  • Cleanup
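
Roughly, the idea looks like this (a minimal, self-contained sketch, not the actual Goblint code: H stands for the polymorphic Hashtbl, the node/edge types are placeholders, and addEdge/finish_function are made-up helpers; only the table names come from the diff hunks discussed below):

    (* Sketch only: Goblint's real tables are keyed by Node.t via a functorial
       hashtable; plain ints and the polymorphic Hashtbl stand in here. *)
    module H = Hashtbl

    type node = int          (* placeholder for Node.t *)
    type edge = string       (* placeholder for edge labels *)

    (* global tables shared across all functions, as before *)
    let cfgF : (node, edge * node) Hashtbl.t = H.create 113
    let cfgB : (node, edge * node) Hashtbl.t = H.create 113
    (* small per-function tables, cleared after every function *)
    let fun_cfgF : (node, edge * node) Hashtbl.t = H.create 113
    let fun_cfgB : (node, edge * node) Hashtbl.t = H.create 113

    let addEdge fromNode edge toNode =
      (* mirror every edge into the per-function tables *)
      H.add cfgB toNode (edge, fromNode); H.add fun_cfgB toNode (edge, fromNode);
      H.add cfgF fromNode (edge, toNode); H.add fun_cfgF fromNode (edge, toNode)

    let finish_function () =
      (* the connectedness check walks fun_cfgF/fun_cfgB (via H.find_all)
         instead of the global cfgF/cfgB, and the small tables are then reset *)
      H.clear fun_cfgB;
      H.clear fun_cfgF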

@sim642
Member

sim642 commented Feb 18, 2022

The diff is impossible to follow; where's the hashtable that you made smaller?

@sim642 added the performance (Analysis time, memory usage) label on Feb 18, 2022
@michael-schwarz
Member Author

Most of what is marked as different is due to wrapping things in calls to Stats.time. I'll comment on the relevant changes.

Comment on lines 166 to 167
let fun_cfgF = H.create 113 in
let fun_cfgB = H.create 113 in
Member Author


Hashtables that are reset after every function, used for the connectedness check.

Comment on lines 177 to 178
H.add cfgB toNode (edges,fromNode); H.add fun_cfgB toNode (edges,fromNode);
H.add cfgF fromNode (edges,toNode); H.add fun_cfgF fromNode (edges,toNode);
Member Author


Also add edges to the per-function tables.

Comment on lines 252 to 253
H.clear fun_cfgB;
H.clear fun_cfgF;
Member Author


Reset the per-function tables.

Comment on lines 401 to 402
let next = H.find_all fun_cfgF
let prev = H.find_all fun_cfgB
Member Author


Use the per-function table here instead of the global table.

@sim642
Member

sim642 commented Feb 18, 2022

But the connectedness check is already done using only the current function's nodes via fd_nodes anyway, which follows the exact same clearing logic:

let (sccs, node_scc) = computeSCCs (module TmpCfg) (NH.keys fd_nodes |> BatList.of_enum) in

@michael-schwarz
Member Author

For FFmpeg, timing goes from aborting with:

    makeCFG                         1.047 s
    connect                        171.697 s

to finishing this phase with

    makeCFG                         7.298 s
    connect                        55.658 s

@michael-schwarz
Member Author

But the connectedness check is already done using only the current function's nodes via fd_nodes anyway

Yes, but it still uses the large CFG hash table of all functions. We could be seeing effects where the per-function hashtable fits into a cache, and the huge one for all functions does not...

@sim642
Member

sim642 commented Feb 18, 2022

Yes, but it still uses the large CFG hash table of all functions. We could be seeing effects where the per-function hashtable fits into a cache, and the huge one for all functions does not...

But that's far from specific to the connectedness check: Goblint has been using a single hashtable for all CFGs of all functions since the very beginning of time. And that affects every single transfer function evaluation as well. If that's really the entirety of the bottleneck, then we have far greater problems and might have to reengineer our whole CFG representation throughout Goblint and all the downstream tools.

@sim642
Member

sim642 commented Feb 18, 2022

I would really love it if there were a profilable reproduction of the issue (~10s runtime) that already reveals the bottleneck, because that would show whether the true problem is cache locality (which no part of Goblint ever considers), unsuitable hashing or something else.

@sim642
Member

sim642 commented Feb 18, 2022

This also drastically reduces RAM usage, which makes me slightly suspicious

Indeed, this doesn't fit with the rest of the story. The huge hashtable is still there, so no memory is saved through this change. Thus it would be especially useful to profile.

@sim642
Member

sim642 commented Feb 18, 2022

Maybe H.stats reveals something insightful about the hashtable?
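
For reference, a tiny sketch of what that would report (using the polymorphic Hashtbl here; the functorial H.stats returns the same Hashtbl.statistics record):

    let () =
      let tbl = Hashtbl.create 113 in
      List.iter (fun i -> Hashtbl.add tbl i (string_of_int i)) [1; 2; 3];
      let s = Hashtbl.stats tbl in
      (* num_bindings, num_buckets and max_bucket_length; bucket_histogram
         additionally gives the full chain-length distribution, which is what
         would expose pathological clustering *)
      Printf.printf "bindings=%d buckets=%d max bucket length=%d\n"
        s.Hashtbl.num_bindings s.Hashtbl.num_buckets s.Hashtbl.max_bucket_length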

@michael-schwarz
Member Author

I would really love it if there were a profilable reproduction of the issue (~10s runtime) that already reveals the bottleneck, because that would show whether the true problem is cache locality (which no part of Goblint ever considers), unsuitable hashing or something else.

I tried with zstd, but there is no observable difference between the two there.

@michael-schwarz
Member Author

Same for git, no difference there.

@sim642
Member

sim642 commented Feb 18, 2022

Here's a wild guess: the FFmpeg code by coincidence contains functions and statements whose IDs interact particularly poorly with the hash function and the size of the large hashtable, such that unreasonably many nodes fall into the same bucket. Although the hashes for nodes (i.e. statements and functions) are based on sequentially generated IDs, those IDs taken modulo the number of buckets might still collide particularly badly.

This in turn causes find_all on the large hashtable to be particularly inefficient, because it always has to traverse the entire bucket unconditionally. Moreover, that bucket traversal is recursive, but not tail-recursive (ocaml/ocaml#8676), so memory usage skyrockets as well.

Of course, this wouldn't then be limited to the large hashtable lookups in CFG construction; it would also affect transfer functions, but we haven't gotten far enough to investigate the bottleneck there. It would also be far less obvious, since transfer functions are already the most expensive part, so it would be hard to judge what is unreasonably slow for them.
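
To make the guess concrete, here's a small self-contained illustration (assumption: the node hash reduces to the raw sequential ID, which is a simplification of Goblint's actual hashing). IDs that are all congruent modulo the bucket count end up in a single chain, and every find_all then walks that entire chain:

    module IdHash = struct
      type t = int
      let equal = (=)
      let hash id = id                 (* the "hash" is just the sequential id *)
    end
    module IH = Hashtbl.Make (IdHash)

    let () =
      let tbl = IH.create 113 in
      (* 1000 ids that are all multiples of 4096: with the stdlib's
         power-of-two bucket counts (well below 4096 for this table),
         every one of them lands in the same bucket *)
      for i = 0 to 999 do IH.add tbl (i * 4096) () done;
      let s = IH.stats tbl in
      (* reports a max bucket length of 1000, i.e. one chain holds everything *)
      Printf.printf "bindings=%d buckets=%d max bucket length=%d\n"
        s.Hashtbl.num_bindings s.Hashtbl.num_buckets s.Hashtbl.max_bucket_length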

@michael-schwarz
Member Author

Yes, that seems to make sense, but still doesn't quite explain the memory usage...

@michael-schwarz
Member Author

michael-schwarz commented Feb 18, 2022

I ran it again on the server for the OpenSSL benchmark (on what CIL dynamically combined from the compiledb), as opposed to a single already-combined file (what I had done before), and there I observed the same gigantic speed difference again:

Without the optimization, I ran out of patience after a while:

    makeCFG                         2.576 s
    connect                        633.225 s
    
    max=61843.65MB

with it, it finished blazingly fast:

    makeCFG                         4.165 s
    connect                        14.605 s
    
    max=20216.80MB

Maybe this is related to the already-combined files having been built before we disabled merge_inline, and to us having more functions now 🤔

  • I also noticed some functions have names such as ossl_check_ERR_STRING_DATA_lh_doallfunc_type___NNN with NNN some number.
  • In total we have >290k functions, so that seems fishy too.

I might want to benchmark this suspicion on Monday.

Base automatically changed from cil_universial_character_names to master February 21, 2022 09:20
@sim642 mentioned this pull request on Feb 23, 2022
@sim642
Member

sim642 commented Feb 28, 2022

I've spent an extensive amount of time profiling the counterintuitive memory usage reduction and have finally come to a conclusion.
The reason isn't anything about OCaml's memory management, but a blatantly simple one: this change alters the amount of computation (and allocation) done by computeSCCs! If Cfg.next is based on cfgF, then it sometimes returns more elements than Cfg.next based on fun_cfgF when looking up the same node! How is this possible?

Function entry (and return) nodes are the culprit: their identity (equal and hash) is based only on their underlying variable (svar). So the difference appears in the form of a single function entry node having multiple successors with different sids but semantically identical content. These conflicts appeared after merging of inlines was disabled in goblint/cil#72. Turning that back on fixes both the time and the memory issue.
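
A minimal, self-contained illustration of that effect (the record type, module names and string values are made up for the example; in Goblint the identity comes from the entry node's svar):

    module Entry = struct
      type t = { svar : string; sid : int }
      let equal a b = String.equal a.svar b.svar   (* identity ignores sid *)
      let hash a = Hashtbl.hash a.svar
    end
    module EH = Hashtbl.Make (Entry)

    let () =
      let cfgF = EH.create 113 in
      (* two unmerged copies of the same function: same svar, different sids *)
      let entry_copy1 = { Entry.svar = "f"; sid = 1 } in
      let entry_copy2 = { Entry.svar = "f"; sid = 2 } in
      EH.add cfgF entry_copy1 "successor in copy 1";
      EH.add cfgF entry_copy2 "successor in copy 2";
      (* looking up either copy's entry node returns the successors of both
         copies, so an SCC computation on the global table sees extra edges *)
      assert (List.length (EH.find_all cfgF entry_copy1) = 2)

With the per-function table, only the current copy's binding exists at lookup time, which is why both the work and the allocations in computeSCCs shrink.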

@sim642
Member

sim642 commented Feb 28, 2022

Thus, I believe that this PR alone isn't what we need. Even if we use this optimization just during the CFG construction, it does not and cannot do anything to avoid the issue in transfer functions during solving. There the same conflict would probably cause a slowdown of a similar factor (the number of unmerged copies of a function, for each unmerged copy of that function), but to the analysis itself, not to a lightweight graph algorithm.

@sim642 closed this on Feb 28, 2022