[SPARK-15083] [Web UI] Limit total tasks displayed on Web UI to avoid History Server OOM #12990
```scala
private def trimTasksIfNecessary(taskData: HashMap[Long, TaskUIData]) = synchronized {
  if (taskData.size > retainedTasks) {
    val toRemove = math.max(retainedTasks / 10, 1)
    val oldIds = taskData.map(_._2.taskInfo.taskId).toList.sorted.take(toRemove)
```
Could you explain your reasoning behind doing this differently from the equivalent stages and jobs functions below? This just seems a bit redundant by comparison (doing all this to get oldIds and then going through a loop, rather than using trimStart), or I may just be missing some key Scala understanding.
Also @tgravescs, before I make the new PR, what's your opinion on this bit of code? Should I leave it as is or change it to match how jobs and stages are trimmed? I don't see a problem with this code, but I don't know why he decided to implement it differently from the precedent.
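To make the trade-off in this exchange concrete, here is a hedged, self-contained sketch (hypothetical `TaskLike` stands in for `TaskUIData`; this is not Spark's actual code) contrasting the id-based trimming in this PR with the trimStart style the reviewer says jobs and stages use:

```scala
import scala.collection.mutable

// Hypothetical stand-in for TaskUIData; only the task id matters here.
case class TaskLike(taskId: Long)

object TrimSketch {
  val retainedTasks = 4

  // Style in this PR: tasks live in a HashMap keyed by id, so the oldest
  // entries must be found by sorting ids, then removed one by one.
  def trimByIds(taskData: mutable.HashMap[Long, TaskLike]): Unit = {
    if (taskData.size > retainedTasks) {
      val toRemove = math.max(retainedTasks / 10, 1)
      val oldIds = taskData.map(_._2.taskId).toList.sorted.take(toRemove)
      oldIds.foreach(taskData.remove)
    }
  }

  // Style the reviewer references for jobs/stages: data kept in insertion
  // order, so the oldest entries can be dropped in one call with trimStart.
  def trimByPosition(taskData: mutable.ListBuffer[TaskLike]): Unit = {
    if (taskData.size > retainedTasks) {
      val toRemove = math.max(retainedTasks / 10, 1)
      taskData.trimStart(toRemove)
    }
  }
}
```

The extra work in `trimByIds` buys O(1) lookup by task id between trims; `trimByPosition` is simpler but assumes an ordered collection.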
Some of our customers are hitting this issue; any chance you could take a look at this old PR with the fix? @srowen @tgravescs @JoshRosen
Also @tankkyo, are you still around? Your GitHub has been silent since this PR. If necessary I can reopen a PR based on this.
ok to test
Test build #63390 has finished for PR 12990 at commit
Sorry, I haven't had time to look at this in great detail, but I have some concerns. The way I read this, it will only ever show 1000 tasks on the task table (by default, at least)? I don't see a way for it to dynamically load this data, or for the user to be informed that they hit this limit, so I'm concerned it will be confusing.

It looks like it applies both to an active application and to one re-read by the History Server, and I'm assuming the History Server is using the application's setting for this? It seems like it would make more sense for the History Server to have its own setting that overrides the application's, if running out of memory is what you are trying to protect against. I guess the other settings work that way, so perhaps that is a separate JIRA.

We have a ton of jobs that have stages with over 1000 tasks, and if I can't see the data I need, this page is useless. 1000 seems way too low to me. I know the same thing exists for stages, but that limit is much less likely to be hit, and when you do hit it you either have all the data from a stage or none of it. Here you have partial stage data, which could be confusing if we happen to remove the "interesting" task.

If I read the heap-dump screenshot properly, 65000 tasks were taking about 250MB. On a running application that doesn't seem too bad. On the History Server, where you can have thousands of applications, that can obviously add up, but personally I would prefer we do something smart like not loading the data until someone clicks through and needs it. I'm ok with adding a config for this, but I would prefer to see it default to all (or a very high number) and let those who want it smaller decrease it.
@tgravescs I will look into a different solution for this issue (it seems @tankkyo has vanished). If I'm understanding you correctly, you're suggesting modifying the way the History Server caches applications so that task data is not loaded until the application UI for a specific application is opened?
Yeah. That might be a much larger change, though. Ideally the History Server is smart and can quickly load just metadata to list applications, then load everything else only when an application is actually clicked. If this is currently an issue and we need to limit the number of tasks shown, I'm fine with a config; I just think it should be off or very large by default. If a bunch of others disagree because their users hit it and 1000 is a reasonable default, I guess it's ok, but I would like to hear from others. I think you should open a new PR if he hasn't responded on this one, so we aren't blocked waiting.
I'll create a new PR based on this one with a higher default max tasks (maybe 10,000?) and an on/off config when I have time this week.
## What changes were proposed in this pull request?

Based on #12990 by tankkyo. Since the History Server currently loads all applications' data, it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000). (This is a "quick fix" to help those running into the problem until an update of how the History Server loads app data can be done.)

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14673 from ajbozarth/spark15083.
What changes were proposed in this pull request?
The History Server loads all task UI data, which can cause an OOM if there are too many tasks.
We should allow users to trim them via `spark.ui.retainedTasks` (default: 1000).
How was this patch tested?
Ran `bin/run-example SparkPi 100000` and opened the stages page on the Web UI.
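Tying the two descriptions together, a hedged example of enabling the trimming as described above (`spark.ui.trimTasks` and `spark.ui.retainedTasks` are the property names from the follow-up PR #14673; the example class and jar path are illustrative assumptions, not taken from this thread):

```shell
# Sketch: run SparkPi with task trimming enabled, assuming a Spark 2.x layout.
# spark.ui.trimTasks turns the feature on; spark.ui.retainedTasks caps how many
# task rows are kept per stage before the oldest are dropped.
spark-submit \
  --conf spark.ui.trimTasks=true \
  --conf spark.ui.retainedTasks=10000 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples.jar 100000
```

With trimming off (the default in #14673), behavior is unchanged and all tasks are retained.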