STORM-1632 Disable event logging by default #1217
Conversation
|
The actual event logging happens only if the user starts it from the topology page, and only then would the performance degradation occur, isn't it? |
|
@roshannaik |
|
@roshannaik The event logging happens only when they enable it, and we don't expect them to turn it on all the time. The actual overhead otherwise is checking whether a few flags are turned on. Can you post more details on the topology you ran and the results you observed when you set topology.eventlogger.executors to 0 vs setting it to |
|
I agree with @arunmahadevan. You could also show an alert message in the UI when the user turns on event logging for a topology, e.g. "Event logging may degrade the performance" or something similar. |
|
@abhishekagarwal87 @arunmahadevan Topology detail: 1 spout, 1 bolt, 1 acker. The spout generates random numbers; the bolt does this math on each tuple: (value+1)*2. In effect this is just a speed-of-light topology. Cause for the perf hit: the perf hit I noted is actually due to that very same checking of the flag. Specifically, the problematic lookup on this code path is 'storm-component->debug-atom'. I agree with @arunmahadevan's concern that this will confuse users when they don't see logs after enabling it in the UI. The alternative fix for this is to change the manner in which this flag is made available to the code, basically making it more efficient. There are some other lookups in the critical path that are also causing perf hits... which I plan to address in a separate jira. |
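The cost difference being described can be illustrated with a small, self-contained sketch (all names here are illustrative, not Storm's actual API): resolving a per-component flag through a map lookup on every tuple versus resolving it once when the executor starts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two ways a per-component "event logging
// enabled?" flag can be read on the per-tuple hot path.
public class FlagLookupSketch {

    // Slow path: resolve the flag through a map lookup on every tuple,
    // analogous to the 'storm-component->debug-atom' lookup described above.
    static boolean isDebugEnabledViaLookup(Map<String, Boolean> componentDebugFlags,
                                           String componentId) {
        Boolean flag = componentDebugFlags.get(componentId);
        return flag != null && flag;
    }

    // Faster path: resolve once at executor start-up and cache the result,
    // so the per-tuple check is a plain field read.
    static final class Executor {
        private final boolean debugEnabled; // resolved once, outside the loop

        Executor(Map<String, Boolean> componentDebugFlags, String componentId) {
            Boolean flag = componentDebugFlags.get(componentId);
            this.debugEnabled = flag != null && flag;
        }

        long processTuples(long n) {
            long emitted = 0;
            for (long i = 0; i < n; i++) {
                if (debugEnabled) {
                    // send_to_eventlogger would be invoked here
                }
                emitted++;
            }
            return emitted;
        }
    }

    public static void main(String[] args) {
        Map<String, Boolean> flags = new HashMap<>();
        flags.put("bolt-1", false);
        Executor e = new Executor(flags, "bolt-1");
        System.out.println(e.processTuples(1_000_000)); // prints 1000000
    }
}
```

One caveat: in Storm the flag lives behind a Clojure atom precisely so the UI can toggle it at runtime, so a `final` field as above would lose that behavior; something closer to a `volatile` field updated when the config changes would be needed, which is part of why the fix is not trivial.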
|
Thanks @roshannaik for the detailed explanation. I wasn't expecting this. We should fix the perf hit in the check. |
|
@roshannaik Thanks for the details. I will try and run a similar topology to see the difference and see if there is something we can do. Do you have any suggestions on how we can improve the performance of the check in Clojure? |
|
@roshannaik I profiled "org.apache.storm.starter.ExclamationTopology" after setting the log level to ERROR and the |
|
@arunmahadevan I think your VisualVM output is not presenting the data in the manner you are expecting. Here is the JProfiler output: |
|
@roshannaik I ran the Exclamation topology with the TestWordSpout emitting a tuple every 10 ms with a spout parallelism of 10 and measured the throughput with and without the event logger, and I observed almost the same results. I am also attaching the JProfiler output after tracing the topology for about 1.36 million tuples, where send_to_eventlogger is consuming only around 0.2% of the time. |
|
@arunmahadevan |
|
I retried the throughput measurements with ackers=0. The impact is even greater: it's 25% faster when eventlogger=0. |
|
In my opinion, since this performance hit can be quite large, we should definitely set eventLoggers=0 by default, and additionally the UI should disable the 'debug' button if eventLoggers=0. I tried switching the Clojure lookups to a Java implementation, but that didn't help, so I don't see a way to speed up these lookups. |
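For reference, the change being proposed amounts to a one-line default (the key name is taken from the discussion above; the exact defaults.yaml layout is an assumption):

```yaml
# defaults.yaml: ship with event logging disabled cluster-wide
topology.eventlogger.executors: 0
```

A user who still wants the debug feature could then opt in per topology, e.g. via the standard per-topology config override when submitting (such as `storm jar`'s `-c topology.eventlogger.executors=1`), assuming the usual CLI override mechanism applies to this key.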
|
@roshannaik |
|
Re-ran the exclamation topology with,
Throughput I observed: I then ran JProfiler and saw 1.3% of the time being spent in send_to_eventlogger. The profiler output itself might be offset due to the instrumentation overhead. JProfiler detects the following with send_to_eventlogger
Are you taking the measurements while the profiler is running, or with the debug flag turned on? I don't see this happening otherwise. Are you using the latest JProfiler (9.1.1)? Are you using any extra plugins to instrument the hashmap lookups (since I see the hashmap keys in your screenshot)? If so, that itself might be skewing your results. To avoid the PersistentMap lookups, I also tried passing the storm-id and debug-atom values as args to send_to_eventlogger, and the percentage reduced from 1.3% to 1%. You could try this change and see how it impacts your topology. I agree with @HeartSaVioR and don't think we need to set eventlogger to 0. |
|
@arunmahadevan @roshannaik |
|
@arunmahadevan :-) ... I am not taking the throughput measurements while the profiler is attached. @HeartSaVioR I am a bit puzzled to see an 8% or 25% perf diff (for a given topology) being referred to as a micro optimization. This is a case of potentially significant overhead being imposed upon the common code path by an infrequently used code path. Quite the contrary, I feel one should have a very good justification to leave this turned on. It is not feasible to do a full-fledged Yahoo-style benchmark to identify and fix all such issues; micro-benchmarking is essential. Here we are looking at a simple case of the emit() call dominating most of the time within nextTuple(); the spout computation itself takes a negligible % of the time. I have deliberately separated out #1242 from this, as this PR is about simply disabling a DEBUG config setting, as opposed to modifying code to avoid repetitive lookups. Seeking and testing an alternative implementation for event logging (unless it's trivial), I felt, might be tricky at this late stage of 1.x. |
|
@arunmahadevan @HeartSaVioR Looking at the perf analysis from @roshannaik, there looks to be enough evidence to consider this a serious performance issue. Given that event logging is a new feature and we do have evidence it's causing a perf issue, I am +1 on disabling it by default. I understand that once it's disabled users can't enable it in a running topology, and that is OK; for most use cases this will be used in a dev cluster rather than a production cluster. Also, this is a blocker for the 1.0 release, so let's get this merged in, and if there is a better way to enable it by default we can do that in the 1.1 release. |
|
I did a quick benchmark on a real cluster (albeit on a VMware cluster) and found that there was a throughput hit, but it was small -- about 0.4%. I'm okay with leaving the defaults as is, and documenting how to disable the feature. If there's a better solution, I'm okay with waiting for a post 1.0 release. |
|
@roshannaik Btw, I guess @ptgoetz got me. And I guess this overhead (0.006720889 ms = 6720.889 ns per tuple spent in send_to_eventlogger, as @arunmahadevan posted) is relatively small compared to what Storm has to do to process a tuple - enqueue and dequeue, finding the task id to send to, serde, transfer - where we may find spots to improve. Anyway, I agree that it's inside the critical path, so we may want to find an alternative way that doesn't touch functionality. |
|
We're debating six of one versus half a dozen of the other. Do we disable it by default and explicitly tell users they have to turn it on for the UI functionality to work? Or do we enable it by default and tell users to disable it per topology to realize a small performance gain? I could go either way, but the latter seems like a better experience for users new to the feature. Also, the minor performance hit is eclipsed by the performance improvements in 1.0, and it can be easily turned off. It just needs to be documented clearly, IMO. |
|
Perhaps we can wait a bit before concluding on this. Some of you feel confident that there is a minor or no hit based on whatever topology/setup you have used, whereas it is quite significant on the one I used. |
|
OK, I'm assuming this is a valid performance hit, whether it is small or huge. As for choosing six or half a dozen, I think whether it is on or off by default should be decided by whether its use cases are valid for production. |
|
@roshannaik did you try passing the map values as arguments and taking the measurements? Based on your earlier results, it appeared that the PersistentMap lookup was causing the hit (I still think it could very well be due to the profiler overhead). Here are the changes I made - arunmahadevan@7eae5ec. I would like to see how it affects your profiling. I don't think a 0.4 to 0.5% increase in throughput should be a reason to completely disable a feature, and spouts that emit tuples in a tight loop are not a very common use case. I am for documenting this feature so that users can adjust the config values based on their needs, rather than turning it off. |
|
@arunmahadevan @ptgoetz we are not worried about a 0.4 to 0.5% effect on throughput; in most cases no one is going to notice that. Let's wait for @roshannaik's topology; you can run it, and if it's still 0.4% then we can ignore this.
Can you also take a look at the results that I observed, where the throughput difference is negligible? I am for disabling it if there's a consensus on the results and it really affects performance. |
|
@arunmahadevan I saw the earlier comment. Was that topology run with the 10 ms sleep in the spout? |
|
@harshach I had removed the sleep in the latest run to match what @roshannaik was evaluating. |
|
@arunmahadevan, like I already said earlier, I did not take the throughput measurements with the profiler attached; it makes no sense to do so. Profiling was done separately, and it correlated with the throughput measurements. |
|
@harshach my numbers for this rewritten topology were: approx 12% improvement when EL=0 |
|
@HeartSaVioR the topology internally overrides some of those values, like the worker count. You will need to comment that out if you'd like to vary them from the cmd line. |
|
@roshannaik |
|
I guess @arunmahadevan removed not only the instantiation of Long but also Random.randInt(), which makes sense to me. @roshannaik @arunmahadevan
I pasted in the functionality for periodically printing metrics from the cluster information. (It came from FastWordCountTopology, and I modified it somewhat.) You may want to make your benchmark period long enough (by modifying the loop count in main) to see when its speed stabilizes. You may also want to make the period an argument. |
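The sampling idea described above can be sketched in a self-contained form (the worker thread here is a stand-in for a running topology; in a real benchmark the counter would come from the cluster metrics, and the loop count and period would be tuned until the rate stabilizes):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: sample a shared counter at fixed intervals and report the rate,
// repeating until the observed throughput settles.
public class ThroughputSampler {

    // Tuples per second computed from two counter samples taken
    // periodMillis apart.
    static double rate(long countBefore, long countAfter, long periodMillis) {
        return (countAfter - countBefore) * 1000.0 / periodMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong emitted = new AtomicLong();

        // Stand-in for the topology: a thread that "emits" as fast as it can.
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                emitted.incrementAndGet();
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Take several samples; in a real run you'd keep going until
        // consecutive samples agree within some tolerance.
        for (int i = 0; i < 5; i++) {
            long before = emitted.get();
            Thread.sleep(1000);
            System.out.printf("sample %d: %.0f tuples/s%n",
                    i, rate(before, emitted.get(), 1000));
        }
        worker.interrupt();
    }
}
```

This mirrors the point in the comment above: early samples are noisy (JIT warm-up, queue fill), so the benchmark period has to be long enough for the reported rate to become stable before comparing configurations.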
|
@arunmahadevan @roshannaik |
|
@HeartSaVioR I think there is only so much effort I can keep putting into providing credible evidence that there is an issue, or into hand-holding other people. IMO there is enough already shared here. I prefer not to try to convince people against their own will. Like I mentioned before, @arunmahadevan's theory is incorrect. On a side note, any spout that generates data from will do something similar. |
|
It should be fairly easy to alter the UI to make it clear to users when this is disabled, as well as how to enable it. If we can do that, I would support disabling it by default (i.e. I would be +1). My main objection with this patch is that it will make the UI appear broken by default and confuse users, especially new users. |
|
@ptgoetz, I fully support that line of thought. |
|
While I'd like to merge @arunmahadevan's patch, since it just improves performance with no change in functionality, I'm also with @ptgoetz's suggestion. |
…how tooltip on Debug button
|
@ptgoetz I have made the changes to the UI as discussed. |
|
@roshannaik, thank you for your patience, persistence, and dedication throughout the lifecycle of this patch. I hate to vote -1 on anything. I'm really glad we're working toward a solution that everyone supports. I have yet to review and test, but definitely plan to approve if all works out. |
|
@roshannaik I'm also taking a look into #1272, but it's for improving performance when topology eventlogger > 0, not for blocking this. |
|
1. The tooltip takes some time to show up, whereas the tooltips on other elements, like the title elements under Topology summary, show up immediately. I played around a bit and found a way to show the tooltip on the disabled button immediately using the same style as the other tooltips. You need to wrap the input element in a span like below: <span style="display:inline-block;" data-toggle="tooltip" title="" data-original-title="To debug set topology.eventlogger.executors to a value > 0 or nil">
<input disabled="" onclick="confirmAction('test-topology-1-1459402415', 'test-topology', 'debug/enable', true, 10, 'sampling percentage', 'debug')" type="button" value="Debug" class="btn btn-default">
</span>
2. The tooltip is shown even if the button is enabled. You could add the |
Update: I notice that your solution differs slightly from what I tried in the below specification for the enclosing span element.
On retrying, I see that it works out better. Thanks, will make the fix. 2 & 3: Although it would be nice to not show the tooltip when enabled, I thought it's OK to show the tooltip in both the enabled and disabled states of the button. Given the desire for a quick turnaround, I felt it was wise not to pursue implementing/testing this nice-to-have but non-critical feature. Shall give it a quick shot and make it part of this PR if it works out. 4. Let me look into that. |
…are disabled. Also enable/disable of the Debug button along with tooltip on the component page
|
Update: Have fixes for pts 1, 2, 3 & 4 raised by Arun. |
|
+1 |
|
Thanks for the last-minute effort @roshannaik. +1. I typically like to wait 24 hrs after the last commit on a PR before merging, even though it's not required by our bylaws. I intend to merge this earlier in the interest of getting the 1.0 release out. If there are any objections after the fact, there will be plenty of time to cancel the VOTE, revert, etc. |
|
@ptgoetz +1 for quick merge |
|
@roshannaik thanks for fixing it in the right way. I have run some quick tests and the patch works as expected. +1 |
|
Thanks to each of you for lending your time and effort with this. |
  jsonData["rebalanceStatus"] = (status === "ACTIVE" || status === "INACTIVE") ? "enabled" : "disabled";
  jsonData["killStatus"] = (status !== "KILLED") ? "enabled" : "disabled";
- jsonData["startDebugStatus"] = (status === "ACTIVE" && !debug) ? "enabled" : "disabled";
+ jsonData["startDebugStatus"] = (status === "ACTIVE" && loggersTotal!=null && loggersTotal!=0 && !debug) ? "enabled" : "disabled";
Wouldn't it disable the debug option even when topology.eventlogger.executors is set to null?
Good catch. I think we should remove loggersTotal != null so that "debug" is enabled if the value is set to null, since one event logger task is created per worker when the value is null. This interpretation of "null" is not very intuitive, but it's consistent with what "null" means for other variables like "topology.acker.executors".
Since the intent is to disable event logging by default, we should disable logging if topology.eventlogger.executors is null; otherwise it's confusing.
|
On further thought, I feel it may be OK to just remove the loggersTotal != null check here. Shall I just update this PR? |
|
@roshannaik yes, it will make the behavior consistent with "ackers". You may also want to update the tooltip. Since this PR is already merged, you may have to raise another one, and I'm not sure if that would make it into the 1.0 release. |
|
@roshannaik can you also disable the events link in the component-summary-template if loggers are disabled? It is clickable and points to a blank page. If you want, I can put up a fix for this. |
|
@abhishekagarwal87 @arunmahadevan |




updating setting in defaults.yaml