[graph] visualization to indicate where a dataflow is stuck #14

utaal · 2020-01-01T09:58:35Z

It is hard to diagnose a "stuck" timely dataflow computation, in which some capability (or perhaps message) in the system prevents forward progress. The progress tracking already holds fairly clear information about which pointstamps have non-zero accumulation, and although this is perhaps not strictly speaking a "visualization", we could imagine extracting and presenting this information.
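
To make "pointstamps with non-zero accumulation" concrete, here is a minimal sketch using only the standard library. It is a toy model, not timely's progress-tracking API: the `Pointstamp` struct, the `outstanding` function, and the `ps` helper are all hypothetical names, and locations are simplified to operator-name strings with `u64` timestamps.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a pointstamp: a location in the dataflow graph
/// (simplified here to an operator name) paired with a timestamp.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Pointstamp {
    location: String,
    time: u64,
}

/// Convenience constructor for the toy model (an assumption, not a timely function).
fn ps(location: &str, time: u64) -> Pointstamp {
    Pointstamp { location: location.to_string(), time }
}

/// Accumulates signed count updates and returns every pointstamp whose
/// net count is non-zero, i.e. the candidates for "what is holding things up".
fn outstanding(updates: &[(Pointstamp, i64)]) -> Vec<(Pointstamp, i64)> {
    let mut counts: BTreeMap<Pointstamp, i64> = BTreeMap::new();
    for (pointstamp, delta) in updates {
        *counts.entry(pointstamp.clone()).or_insert(0) += delta;
    }
    // Keep only entries that did not cancel out.
    counts.into_iter().filter(|(_, count)| *count != 0).collect()
}
```

Fed an update stream in which one input's capability is created but never dropped, `outstanding` returns just that input's pointstamp, which is the kind of report that would point at a forgotten input.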

@antiguru recently had a similar issue, in which he wanted to "complete" a dataflow without simply exiting the worker (to take some measurements), but when he attempted this the dataflow never reported completion. The root cause was ultimately a forgotten input that had been left un-closed.

One idiom that seemed helpful here was to imagine a version of the dataflow graph that reports, e.g., whether each operator has been tombstoned or not (closed completely, memory reclaimed). This would reveal who was keeping a dataflow open, which is a rougher version of what is holding a dataflow back. We might also look for similar idioms that allow people to ask, for a given timestamp/frontier, which operators have moved past that frontier and which have not, revealing where in the dataflow graph a time is "stuck".
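
The two queries described here could be sketched roughly as follows. This is a toy model rather than anything timely exposes: `OperatorStatus`, `still_open`, and `not_past` are hypothetical names, and the sketch assumes totally ordered `u64` timestamps with one frontier value per operator.

```rust
/// Hypothetical per-operator report; none of these fields are timely's own types.
#[derive(Debug, Clone, PartialEq)]
struct OperatorStatus {
    name: String,
    /// True once the operator has closed completely and its memory was reclaimed.
    tombstoned: bool,
    /// Smallest timestamp the operator might still produce; `None` if it holds nothing.
    frontier: Option<u64>,
}

/// Operators that have not been tombstoned: these are keeping the dataflow open.
fn still_open(ops: &[OperatorStatus]) -> Vec<&str> {
    ops.iter()
        .filter(|op| !op.tombstoned)
        .map(|op| op.name.as_str())
        .collect()
}

/// Operators whose frontier has not moved past `time`: this is where `time` is "stuck".
fn not_past(ops: &[OperatorStatus], time: u64) -> Vec<&str> {
    ops.iter()
        .filter(|op| matches!(op.frontier, Some(f) if f <= time))
        .map(|op| op.name.as_str())
        .collect()
}
```

A graph rendering could then colour the operators returned by `still_open` or `not_past(ops, t)`, so the earliest highlighted operator in dataflow order is the first place to look.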

quentusrex · 2021-01-19T15:33:00Z

Any more thoughts here?

frankmcsherry · 2021-01-19T15:54:58Z

The closest thing is an open issue in the timely repo for logging the progress computation, which (ideally) would allow one to track where outstanding work remains (which usually incriminates some operators). It is languishing a bit for lack of requirements: it was formed in support of a research project that wanted lots of information, but should we actually aim at minimizing the information to, e.g., the frontier of available work?
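
As an illustration of what "minimizing to the frontier" could mean, here is a small sketch that reduces a set of outstanding timestamps to its minimal elements. It is an assumption-laden toy: timestamps are pairs under the product partial order, and `Time`, `less_equal`, and `frontier` are hypothetical names rather than timely's own frontier types.

```rust
/// Toy two-dimensional timestamp, e.g. (epoch, iteration); an assumption for illustration.
type Time = (u64, u64);

/// Product partial order: `a` precedes `b` only if it does so in both coordinates.
fn less_equal(a: &Time, b: &Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

/// Reduces outstanding times to their minimal elements (an antichain).
/// Every input time is greater than or equal to some element of the result.
fn frontier(times: &[Time]) -> Vec<Time> {
    let mut result: Vec<Time> = Vec::new();
    for t in times {
        // Skip `t` if an element already in the frontier precedes (or equals) it.
        if result.iter().any(|r| less_equal(r, t)) {
            continue;
        }
        // Otherwise `t` is minimal so far: drop anything it precedes, then add it.
        result.retain(|r| !less_equal(t, r));
        result.push(*t);
    }
    result.sort();
    result
}
```

Logging only this reduced set, rather than every count update, is one possible reading of the "minimal information" option; whether that is enough to incriminate an operator is the open question.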

quentusrex · 2021-04-04T22:58:51Z

When I think about the points in time at which I'd want to use a feature like this, two groups of situations come to mind: 1. a computation that does complete, or at least makes progress, but is behaving unexpectedly (too fast or too slow) for the current data set; 2. a computation that, as the author mentions, just gets 'stuck' unexpectedly, usually when a larger input set is used than the development data set.

For the first group of cases, it has generally been a computation that has been through many development iterations and is in need of a refactor or a holistic review. Being able to visualize the computational flow, along with some info about the relative memory and CPU resources for each stage, would be both useful and actionable. In my experience the cause has often been an incorrect variable being used, where the variable names are too similar for the mistake to be obvious in a code review.

For the second case, I'm not sure what would be most helpful. Being able to see something about the available work seems like it would be actionable in tracking down the issue, but so would being able to see how much of the computation remains (if there are 10 stages, and only the operators for stages 2 and 3 are available, that would help narrow things down). I get the sense that seeing some approximation of the resources needed for each operator, based on a sample (or profiled) data set, would clearly point out where there is an unexpected order-of-magnitude difference.

Not sure if the above are possible.

utaal added the "enhancement" label on Jan 1, 2020
