add more observability to Execution#1422
add more observability to Execution#1422declerambaul wants to merge 1 commit intotwitter:developfrom
Conversation
| .foreach(_ => flow.setFlowStepStrategy(ReducerEstimatorStepStrategy)) | ||
|
|
||
| config.get(Config.FlowListeners).foreach { clsNames: String => | ||
| println(s"XXX adding FlowListeners $clsNames") |
There was a problem hiding this comment.
we should use proper logging here (same as cascading, not sure which framework they use).
|
This seems reasonable to me, but it is limited in that you can't give any arguments to your flowlistener. Does this cover most of the cases you have? The good part about this API is that it doesn't clutter up the Execution type, but the downside is that it uses reflection and has the limitation mentioned above. |
| /** | ||
| * configure flow listeneres for observability | ||
| */ | ||
| def addFlowListener[T](cls: Class[T]): Config = |
There was a problem hiding this comment.
Can we put a better type on this? Is there no interface/root class for flow listeners?
|
What do we think about maybe having the type be (Config, Mode) => FlowListener, and then we externalize and base64 the closure we get so we can store it inside the config. It would keep it configurable. |
|
@ianoc we could add that later if we needed. The class based one could always be a version of the above: (c: Config, m: Mode) => clazz.newInstance()so, we have a weaker API that we could lift into the new API as needed. What do you think about starting more restricted then adding if needed? |
|
I had a look at #1426 which makes sense to me - I meant to keep it as simple as possible. In terms of Oscar's comment about covering use cases, I see how this could lead to some confusion&frustration - especially since in the case of the flowstep listeners we are sort of abusing the Config class. I.e. the serialized Using my current usecase, getting ambrose to work nicely with scalding, one way to do it would require adding the same instance of a class that extends both FlowListener and FlowStepListener to the Flow. With the current approach, that doesn't work. I tried getting around this using something like but strangely (or not), that only works if I add the exact same instance of HelperClass to the Flow (i.e. the current PR does not work in that scenario). If we are going that route, what prevents us from generalizing this more and provide a config method that modifies a Flow directly? And then in the ExecutionContext just fold the flow configurations into a single one |
|
One reason to avoid generalizing for better or worse is the less scope to shoot yourself in the foot, tightly controlled API's around use cases tend to lead to less leakage/less restrictive on future changes. Ideally i think we wouldn't expose these API's beyond platform helpers too I think. Its exposing a lot of internal cascading stuff we'd rather keep hidden. Which doesn't work with ambrose? the other PR works anyway, i fired up your ambrose code with it and it worked fine? |
|
Agreed about limiting scope. Though my argument is more about dry code, as it every new config will require a couple new methods that pretty much do the same thing (and since the serialized config itself is not human readable, I don't see much value in having separate hadoop options for the different flow configurations). My main point is - this PR would allow you to configure a flowlistener using a hadoop option - e.g. Dscalding.obs.flowlistener=com.twitter.xx.AmbroseListener - but as soon as you move to closures (your pr), (Mode,Config) => T, you are forced to do the config in code anyways. So to limit scope, you could also make the suggested |
|
Re: it generally seems to work well - sadly it is not that easy to tell (also I don't know ambrose well enough to know it should look). I was talking to @MansurAshraf about this - there are changes that need to done to ambrose. E.g. make it work with Execution, currently only an Execution resulting in a single FlowDef works out of the box; and also I suspect there are some bugs with how state is kept in ambrose. In addition, this PR seems to work for the FlowListener only, not the FlowStepListener - but if you change ExecutionContext to add the same instance as a FlowStepListener (which is only true for Ambrose) - individual job progress events (aka step X is 24% done) also start working. |
|
You do loose setting a flow listener on the command line, but on the flip side you don't need to have a fully pre-configured class for everything. You could attach a basic ambrose to everything, then enable it, or indeed configure it from those same hadoop options. So you can kind of do more with them per say. Which PR do you mean only works with the flow listener vs flow step listener? These comments are you on your PR not mine, i'm just a little confused. |
|
I meant that my PR (the one this thread is on) does not work for the FlowStepListener - see the snippet about the If it works using the closures in your PR, then hooray and even more reason to go with that approach. |
|
Closing this in favor of #1426 - let's also move the discussion about scope to that pr. |
This branch would need to be cleaned up, though I would be interested in hearing whether this would be a feasible approach to add more observability to Execution based jobs.
The pr adds support for attaching cascading flowlisteners to Flows created using Execution (this can already be done when using Job with https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L304). I am not a fan of Config in general, though other options seems sparse - this code replicates the approach taken for configuring reducer estimators.
@MansurAshraf