tools/flow: high memory usage #1795
Note: we are experiencing a version of this within a Dagger user's config.
This reminds me of "Abuse of CUE": https://github.com/hofstadter-io/cuetorials.com/tree/main/code/puzzles/advent-of-code (I used lists, this is a struct, but the same idea applies). There is more than likely a non-flow reproducer.

I believe what happens is that each successive entry refers to the previous one, so you end up with references (in the ADT) which need to be walked / evaluated. The last element essentially has a link to every element, and every element effectively links to every element before it. You can imagine this hits the runtime complexity pretty hard.

I'm 99% sure the ideas @mpvl has to fix performance issues should cover this case, in particular the structural or reference sharing (forgetting the name a.t.m.) [edit: I believe the idea is called "structural sharing"].

As a dagger user, I've run into this situation and another (where I cannot control parallelism and end up with too many tasks running at once). One thought I had, which may address both cases, is a "task list" that runs sequentially, without parallelism.
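As an illustration of that shape, here is a minimal CUE sketch (field names are made up, this is not the Dagger config itself):

```cue
// Each entry refers to the previous one, so computing tN means walking
// references back through tN-1, tN-2, ... all the way to t0.
tasks: {
	t0: {output: "0"}
	t1: {input: tasks.t0.output, output: input + "-1"}
	t2: {input: tasks.t1.output, output: input + "-2"}
	t3: {input: tasks.t2.output, output: input + "-3"}
	// ... every extra entry lengthens the chain the evaluator has to walk.
}
```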
Dagger delegates parallelism to buildkit. It's correct that there's currently no friendly way to configure it through Dagger. Having said that, it can be configured in buildkit through the following configuration: https://github.com/moby/buildkit/blob/8700be396100c7e8bdde1aaec5b590d5338948b0/docs/buildkitd.toml.md?plain=1#L59
There is task parallelism in CUE/flow that cannot be controlled when more than one task is ready to run. This is the issue I am referring to, which lives above buildkit. Dagger has some customization around CUE/flow and task discovery, which is where it could be controlled, by introducing a task list and managing the dependency ordering outside of the config.
Correct. Since all Dagger tasks ultimately translate to buildkit ops, in Dagger this parallelism is being "limited" by buildkit's
I think there is a misunderstanding: the issue I am describing is in no way related to buildkit. The issue lives in the Go code that implements cue/flow.

To work around this, we do things like the original description, i.e. inject artificial dependencies between tasks so they are not released from dagger & cue/flow to buildkit. This, however, has unintended consequences: the hidden, rapid growth of references between sequential elements, which results in long runtimes in the CUE evaluator. (Structural sharing should alleviate much of this reference/performance issue, but it is still t.b.d. It does not, however, address controlling parallelism and needing this workaround in the first place.)

The idea I am suggesting is to introduce a "task list" which iterates over the tasks, marking them ready to run to the cue/flow engine. Again, all of this is separate from anything buildkit related. I have the same issue in
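A rough sketch of the artificial-dependency workaround described above; the `$id` convention and `waitFor` field are illustrative only, not Dagger's actual task API:

```cue
// taskB has no real data dependency on taskA, but referencing taskA's
// result makes cue/flow treat it as a dependency, so the two tasks are
// never marked ready to run at the same time.
taskA: {
	$id:    "exec"        // hypothetical task identifier
	output: string
}
taskB: {
	$id:     "exec"
	waitFor: taskA.output // artificial dependency; the value itself is unused
	output:  string
}
```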
Pinning versions is super important. `alpine/helm:3.9.1` in this case means limiting potential `alpine/helm:latest` breakages. Including a link that shows the source & other available tags makes future bumps easier. Another option would be to "dependabot" it. This matches an emerging best practice for actions: dagger#2781. It also avoids using cueflow, which makes this CUE high memory issue 2x worse: cue-lang/cue#1795

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
I don't think there's a misunderstanding. I was talking about dagger specifics, and I understand that you're referring to a different scenario when using cue/flow. In any case, let's move on from that topic since it doesn't help this discussion.

One thing that I'd like to reinforce is that this high memory usage gets exponentially worse when using disjunctions in the cue/flow workflow. So the number of tasks and their dependencies is not the only issue: as observed in the attached heap dumps, reducing the task count while increasing the number of disjunctions in the CUE tree (which dagger uses/abuses) still shows unusable memory usage.
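As a sketch of the kind of shape being described (made-up definitions, not Dagger's actual ones): a reference chain where every node also carries a disjunction.

```cue
#Action: {
	kind:   *"copy" | "exec" | "fetch" // one disjunction per task
	input:  string
	output: string
}

t0: #Action & {input: "seed", output: input}
t1: #Action & {input: t0.output, output: input}
t2: #Action & {input: t1.output, output: input}
// Each link in the chain now drags the disjunction along with the
// reference, multiplying the evaluator's work per element.
```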
This issue is not specific to cue/flow; it's in the evaluator, which cue/flow also uses, though cue/flow may exacerbate it due to re-evaluation every time a task completes. There are several issues with similar reproducers: https://github.com/cue-lang/cue/issues?q=is%3Aopen+is%3Aissue+label%3Aroadmap%2Fperformance This is probably a duplicate of one of those.
A simpler reproducer without cue/flow
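The reproducer itself is not preserved in this extract; a minimal sketch of the same idea, assuming a plain chain of struct fields with no flow tasks involved:

```cue
// Each entry embeds the previous one and derives a value from it, so
// computing the last entry requires walking the entire chain.
s0: {n: 0}
s1: {prev: s0, n: prev.n + 1}
s2: {prev: s1, n: prev.n + 1}
s3: {prev: s2, n: prev.n + 1}
// ... extend this to a few hundred entries and `cue eval` slows down
// far faster than linearly.
```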
awesome, thx for sharing!
Thanks for the report, @jlongtine. Thanks also for the analysis @verdverm and @marcosnils. Taking a look now.
* feat: helm.#Upgrade
* chore: allow to override the image in the helm install action
  Signed-off-by: Nico Braun <rainbowstack@gmail.com>
* chore: Add make target to create kind cluster
  This is the simplest thing for now. The next step is to add it to ci.cue.
  Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
* chore: Add make target that installs the correct CUE version
  This is a follow-up to #2801
  Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
* Simplify the helm.#Image definition
  Pinning versions is super important. `alpine/helm:3.9.1` in this case means limiting potential `alpine/helm:latest` breakages. This matches an emerging best practice for actions: #2781
  It also avoids using cueflow which makes this CUE high memory issue 2x worse: cue-lang/cue#1795
  Next step: restore Kubernetes tests. These have been temporarily disabled in #936. In other words, CI does not test these packages. It took me a while to track this down and understand what the simplest thing to do would be.
  Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Drive-by response for this comment whilst I'm investigating.
I'll update here with any further insights on the performance issue.
@samalba - @mpvl and I are just working through some remaining work on cycle and related changes, at which point we will be doing a deep dive on this and all open issues that affect Dagger, picking up on my investigation from a couple of weeks ago. I pinged @jlongtine for confirmation from your side on what that list looks like from your perspective.
@jlongtine is the example in this description actual Dagger code, or is this a reproducer? i.e. are you constructing CUE via cue/ast in this way?
Hi Paul, it's a reproducer. The main reason to use the latter here is the convenience to generate the cueflow definitions and their references.
Thanks @marcosnils. I just want to make sure, as we start to dig deep on this, that we don't miss a critical piece of the puzzle. Creating reduced versions of the actual problem is often great, but I want to be sure that we don't "miss" the actual scenario along the way, because that's entirely possible when creating repros.

So to summarise my understanding of the context of the original problem: the problem as you/your users are experiencing it essentially boils down to

That said, as a first step I'm pulling together some analysis which breaks down your example (or at least an example that captures the essence of your example), showing, from the ground up, the explosion in the number of conjunctions/disjunctions that are triggered within the evaluation engine with each step.
I'd also add to this that the use of conjunctions in those deep nested tasks makes the issue exponentially worse.
yes, that's correct.
awesome! really appreciate that. LMK if there's anything we can do to help out 🙏
There are several issues going on here:
Fixing the last would, of course, give by far the biggest win. It is also the hardest to fix.

In principle, CUE does the right thing: in general, the expression of the referenced field may evaluate differently in the new context, as it may be unified with new structure. However, in the vast majority of cases it would be fine to just unify the resulting value of the referenced field, rather than recomputing the expression. An experiment with a test implementation of such an optimization shows that it would speed up this case here by about 1000x (the depth of the references), thus confirming the hypothesis.

The idea is that one can only not reuse the result if the referenced subtree contains a FieldReference (among some others) directly to the root of the node. References to ancestors or to internal nodes do not change the result.

In practice it is a little harder: the CloseInfo of the conjuncts needs to be rewritten to handle cycles (easy) but also to reflect the closedness info itself. The current algorithm is not prepared for that and thus will need to be changed. Investigating this also uncovered some other issues with the closedness algorithm, so a revisit is in order either way.

Seems doable, but will require some work.
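To illustrate that distinction with a small example (this is a reading of the comment above, not code from it):

```cue
// Not safe to reuse: y's expression refers to x, a field at the root of
// the referenced struct, so it can evaluate differently in a new context.
b: {
	x: int
	y: x + 1
}
a: b & {x: 2} // a.y must be recomputed in a's context, giving 3

// Safe to reuse: c contains no reference back to its own root, so its
// evaluated value can simply be unified into d without re-evaluation.
c: {x: 1, y: 2}
d: c & {z: 3}
```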
This test is used for evaluation: https://gist.github.com/mpvl/ddd37202677606ab7edf80066e7f74fe
Adding to @mpvl's comment with some detail. Thanks for your patience here. Firstly, here is a rewritten version of @jlongtine's repro (the original didn't run for me). Joel, perhaps you can confirm this is faithful to your original?
In the original example the

Note that I am using the current tip version of CUE:

For reference with the following performance results, I'm running VMWare Fusion on top of an M1 Max. The VM has 10 CPUs available, with 48GB of RAM.

Firstly looking at the output of the above to give rough evaluation time results:
I can also roughly reproduce the heap size differences you have seen:
The multiple is ~3.7x, but of the same order. @jlongtine, please let me know if these numbers are off from what you were seeing?

Whilst the memory usage is very high, the evaluation time is even more strange. We look at that first, because it's possible/likely that redundant computation, or a CPU "leak", is the cause of the memory "leak". Looking at the timings above, what's interesting to note is that whilst the number of tasks between each run increases linearly, the increase in evaluation time is most definitely super-linear.

But before diving into the detail to analyse timings, we first look at whether we can strip down the example. @jlongtine's example uses

As a first step, we looked to analyse a single evaluation run with varying sizes of "chain". We simplified the "chain" to references between fields, as follows:
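The simplified example itself is not preserved in this extract; a sketch of what such a chain of plain field references might look like:

```cue
// A "chain" of length N is N fields, each one simply a reference to the
// previous field, which in turn references the one before it, and so on.
x1: {a: 1}
x2: x1
x3: x2
x4: x3
```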
This creates the same effect as the chain between fields in the original example. Looking at "chain" lengths of 1, 10, 100 and 1000, we see broadly the same CPU and memory increases at each step (more details below). We took a CPU profile of the length-1000 "chain" to see where time was being spent:
Adding some crude instrumentation to provide key measures gave the following, where
This shows the number of unifications, disjunctions and calls to

The number of unifications and disjunctions is entirely as expected. The number of calls to

Whilst the current implementation is correct, this analysis seems to point to an algorithmic anomaly when it comes to evaluating "chains" of references. This manifests as a CPU "leak" which very likely has a significant impact on memory usage.

We believe we have identified a number of places where the current memory pressure can be alleviated and are working on solutions for each. These are shorter-term fixes. Per @mpvl, we also believe we have identified a different algorithm/approach which reduces the complexity of such chains by a factor of

Once we have made progress on either/both fronts, we will be in a position to re-evaluate using the updated version of the reproducer in my comment. Specifically we will:
As before, we will post updates to this issue.
This loop was showing up in benchmarks, so do a fairly trivial strength reduction:
- comparing two identical interface types is faster than comparing different interface types
- we can exit the loop early when we find a match

A benchmark designed to highlight the issue shows a reasonable performance improvement:

name            old time/op  new time/op  delta
/chain.txtar-8  2.87s ± 3%   2.01s ± 1%   -30.00%  (p=0.008 n=5+5)

Issue #1795.

Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Change-Id: Ib2c723d30bf33576f4050c96c8211e74727d8fdb
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/542604
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Unity-Result: CUEcueckoo <cueckoo@cuelang.org>
Reviewed-by: Marcel van Lohuizen <mpvl@gmail.com>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>
Quick pass looks spot on. The basic
That looks about right, yes. I'll read back through these comments and reply with additional thoughts tomorrow. Thanks for looking into it!
See comments in code. This coincidentally also tightens some structural cycles, which tends to have a positive effect on performance.

Issue #1795

Signed-off-by: Marcel van Lohuizen <mpvl@gmail.com>
Change-Id: Ida714373e549595c656b3033c7807d6c74d2fdbe
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/542678
Reviewed-by: Roger Peppe <rogpeppe@gmail.com>
Unity-Result: CUEcueckoo <cueckoo@cuelang.org>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>
This is a useful metric for tracking some non-linear behavior.

Issue #1795

Signed-off-by: Marcel van Lohuizen <mpvl@gmail.com>
Change-Id: Ic444887de591d68281bdb3b1c333bfc391c49615
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/542748
Reviewed-by: Paul Jolly <paul@myitcv.io>
Unity-Result: CUEcueckoo <cueckoo@cuelang.org>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>
The functionality is hoisted from adt, but the package should for all practical purposes be seen as a new implementation. Note that some of the methods are now by value. This will make it easier to make things thread-safe in the future. We may also opt to hide the struct fields later on. But for now, as this has been working for a long time, we expect this to be sufficient for a while to come.

The stats have been aggregated as experimental functionality in the tools/flow package.

Issue #1795

Signed-off-by: Marcel van Lohuizen <mpvl@gmail.com>
Change-Id: Ifd3458d0fa03f3b2d91c50385e34e24b5a93bc70
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/542778
Reviewed-by: Paul Jolly <paul@myitcv.io>
Unity-Result: CUEcueckoo <cueckoo@cuelang.org>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>
These tests expose cases where structure sharing is currently not working, and where we would like it to work. Fixing these can be a significant performance boost, especially for chained references.

Issue #2854
Issue #2850
Issue #1795

Signed-off-by: Marcel van Lohuizen <mpvl@gmail.com>
Change-Id: I723b2fbd4c0e44431f1cce47023e535b1d81043a
Reviewed-on: https://review.gerrithub.io/c/cue-lang/cue/+/1202208
Reviewed-by: Matthew Sackman <matthew@cue.works>
Unity-Result: CUE porcuepine <cue.porcuepine@gmail.com>
TryBot-Result: CUEcueckoo <cueckoo@cuelang.org>
What version of CUE are you using (`cue version`)?

v0.4.3

Does this issue reproduce with the latest release?

Yes.

What did you do?

What did you expect to see?

Light memory usage. This `tools/flow` run is literally copying `input` -> `output`.

What did you see instead?

4-5GB of memory usage over the course of this run. The executable will dump pprof files (named `heap*`). There is a fairly different memory usage pattern for `-disjunctions` (2.5x more).

Dependencies