[SPARK-20923] turn tracking of TaskMetrics._updatedBlockStatuses off #18162
Conversation
I'm having flashbacks to how tricky it was to figure out this logic when refactoring the block manager code (this used to be a lot messier). Really happy to see this get removed.
The only thing that maybe gives me pause here is compatibility when reading Spark logs produced by a new version of Spark in an old version of the History Server: if we remove a key which was present from day 0, then we might run into problems when code assumes it will be present. That said, it looks like the read of this key down on line 842 was already using Utils.jsonOption, so maybe this was added later and is thus already handled as an option in the existing History Server code.
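The optional-read pattern being referenced works roughly like this (a simplified stand-in, not the actual Spark Utils.jsonOption implementation, which operates on json4s values; the map and field names here are illustrative):

```scala
// Simplified stand-in for Spark's Utils.jsonOption: an absent key becomes
// None instead of an error, so a History Server can read event logs that
// no longer contain the "Updated Blocks" field.
def jsonOption[A](fields: Map[String, A], key: String): Option[A] =
  fields.get(key)

val oldLog = Map("Task ID" -> "2", "Updated Blocks" -> "[...]")
val newLog = Map("Task ID" -> "2") // field absent from newer logs
```

Because the read goes through an Option, both the old log (key present) and the new log (key absent) parse without error.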
Good point, I'll look through some older versions and test out a few things with this.
If we aren't sure, I can also just leave it here but let it be empty.
I looked at this some more and I don't see any issue with this. As you said below, on 842 it's an option, and I tested this both ways with the history server as well just to be sure: a new history file (without the "Updated blocks" entry) with an old history server, and an old file (with the "Updated blocks" entry) with a new history server. Both work fine.
If you think we should leave it, I can put it back with an empty value?
Unfortunately TaskMetrics is a public class (via SparkListenerTaskEnd), so this is a breaking change.
But I do agree we should not track the updated block status. How about we still keep this method and make it always return Nil? We can mark it as deprecated and add comments to say that it's only here for binary compatibility.
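The compatibility shim being suggested might look something like this (a hypothetical sketch, not the actual Spark source; BlockId and BlockStatus are simplified stand-ins for the real types):

```scala
// Simplified stand-ins for Spark's BlockId and BlockStatus types
// (assumptions, not the real definitions).
case class BlockId(name: String)
case class BlockStatus(memSize: Long, diskSize: Long)

class TaskMetrics {
  // Kept only for binary compatibility: tracking is disabled, so this
  // always returns an empty Seq.
  @deprecated("updated block status tracking is disabled", "2.2.0")
  def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = Nil
}
```

Existing listeners compiled against the old API keep linking, but now see an empty sequence instead of per-block statuses.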
yeah I'll update it
Test build #77594 has finished for PR 18162 at commit
@cloud-fan has a good point about backwards-compatibility. In case any folks are actually relying on this behavior, I wonder whether we could mark it as deprecated and have a flag for disabling it instead of complete removal.
Updated, I put the TaskMetrics API back with a deprecation marking and just had it return Nil. @JoshRosen Were you thinking of adding more back?
@tgravescs, I guess the question is whether any user has a SparkListener which actually uses the value of
TaskMetrics doesn't take the SparkConf or anything to get at a config, so we would have to add a config check everywhere it's incrementing or adding things. I think that wouldn't be too hard. I'll put everything back and just add config checks around them, with the default off.
Test build #77603 has finished for PR 18162 at commit
```scala
  .createWithDefaultString("200m")

private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES =
  ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
```
Not sure if we want to document this somewhere? If so, what do you suggest? There isn't any other config quite like this right now. We could either put it in the configuration doc or the monitoring doc.
You can document it in configuration.md.
Actually, can we turn it off by default? I think this feature is useless for most users. cc @JoshRosen
Right, this is why I originally had this off by default and you requested it on. I will turn it back off, and if a user finds they need it because they somehow extended the class, they can turn it on. Turning this off will be most beneficial for the majority of users.
```scala
 * By default it is configured to not actually save the block statuses via config
 * TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES.
 */
def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = {
```
Shall we deprecate it? Considering the cost of collecting this information, I think we should not encourage users to use it...
I do love the cleanup you did by removing cc @srowen
Test build #77625 has finished for PR 18162 at commit
Yeah, I was figuring I would file another JIRA to remove it later. I can add the deprecated flag here if you guys agree.
It would be nice to get this into Spark 2.2 if we can.
@JoshRosen what do you think, should we add the deprecation?
Yea I think we should merge this to 2.2, but we need to change the default value of the new config to
ok, I'll update the default.
I turned it on by default for backwards compatibility, but I don't really agree with it. We should make it more stable/usable for people by turning it off. I'm assuming anyone using this would be a very small minority, but to be safe we can leave it on and I'll file a separate JIRA to turn it off and deprecate it. Note I'm out of office next week, so if there are comments I might not respond for a bit.
Test build #77682 has finished for PR 18162 at commit
remove this line as it's not correct now.
When we reach here, I think we have already stored updatedBlockStatus in memory; filtering them out here doesn't help.
I'm not sure what you mean by it's already stored? It gets stored into the TaskMetrics when the call below to TaskMetrics.fromAccumulatorInfo is made. The task metrics UpdatedBlockStatuses it returns aren't really ever used by this function in updateAggregateMetrics or updateTaskMetric, but I didn't see any reason to set it since it's not used.
The accumUpdates were sent from the executors, so if we have already stopped tracking updatedBlockStatus, we don't need to do the filter here.
ping @tgravescs, any ideas? If it's not doable, we can start a vote and decide whether to remove updatedBlockStatus entirely.
To be more clear, I think we should just do an assert here to make sure there are no UPDATED_BLOCK_STATUSES accumulator updates, instead of doing a filter.
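An assert along those lines might look like this (hypothetical sketch: the accumulator update is modeled as a simple name/value pair rather than Spark's AccumulableInfo, and the metric name string is an assumption based on Spark's internal naming convention):

```scala
// Name of the internal accumulator for block statuses (assumed to follow
// Spark's internal.metrics.* naming convention).
val UPDATED_BLOCK_STATUSES = "internal.metrics.updatedBlockStatuses"

// With tracking off, executors should never send this accumulator update,
// so assert that none arrive instead of silently filtering them out.
def checkAccumUpdates(accumUpdates: Seq[(String, Any)]): Unit = {
  assert(!accumUpdates.exists(_._1 == UPDATED_BLOCK_STATUSES),
    s"unexpected $UPDATED_BLOCK_STATUSES accumulator update")
}
```

The advantage over a filter is that a regression (some code path still sending block statuses) fails loudly rather than being silently dropped.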
Yes, that is correct; as I said, it shouldn't really ever get there, so I can add an assert. Sorry for the delay, I was out of office for a while. I'll update it.
How about I just revert this? I'm not sure it's worth an assert here. The updated block statuses are just going to be empty, so setting it to empty isn't going to hurt anything.
that also works
I'm wondering if it's doable to stop tracking the
I'm not sure what you mean by it's not doable? What places do you see that update the block statuses that I haven't covered here? Most of it was done by the BlockManager. Maybe I'm missing something in my IntelliJ search or greps, but I don't think so. Look to see where incUpdatedBlockStatuses and setUpdatedBlockStatuses (2 versions) are used.
Test build #77684 has finished for PR 18162 at commit
Test build #77687 has finished for PR 18162 at commit
@tgravescs, I deployed this to our production environment (based on 2.0.0) a few days ago and haven't hit any problems with it. I think this is good to go, unless something has been added recently that uses the block statuses. +1
Jenkins, test this please
Test build #78300 has finished for PR 18162 at commit
@tgravescs can you address #18162 (comment)?
Sorry, I missed that you had commented. Yes, we can change that.
Will upmerge shortly, since there are conflicts.
Upmerged to master, updated the default, and removed unneeded changes.
Test build #78466 has finished for PR 18162 at commit
The failure is from a previous push of code.
Jenkins, test this please
Test build #78467 has finished for PR 18162 at commit
Test build #78460 has finished for PR 18162 at commit
Jenkins, test this please
Test build #78468 has finished for PR 18162 at commit
thanks, merging to master!
thanks for the reviews @cloud-fan
What changes were proposed in this pull request?
Turn tracking of TaskMetrics._updatedBlockStatuses off by default. As far as I can see it's not used by anything, and it uses a lot of memory when caching and processing a lot of blocks. In my case it was taking 5GB of a 10GB heap, and even when I went up to a 50GB heap the job still ran out of memory. With this change in place, the same job easily runs in less than 10GB of heap.
We leave the API there, as well as a config to turn it back on, just in case anyone is using it. TaskMetrics is exposed via SparkListenerTaskEnd, so if users are relying on it they can turn it back on.
How was this patch tested?
Ran the unit tests that were modified, and manually tested on a couple of jobs (with and without caching). Clicked through the UI and didn't see anything missing.
Ran my very large Hive query job with 200,000 small tasks and 1000 executors, caching 6+TB of data; it runs fine now, whereas without this change it would go into full GCs and eventually die.
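For anyone who does rely on the old behavior, opting back in would look something like this (a sketch assuming a standard SparkConf setup; requires spark-core on the classpath, and the app name is illustrative):

```scala
import org.apache.spark.SparkConf

// Opt back in to updated-block-status tracking via the config added in
// this PR; TaskMetrics.updatedBlockStatuses will then be populated again
// for SparkListenerTaskEnd consumers. Expect the old memory overhead when
// caching many blocks.
val conf = new SparkConf()
  .setAppName("block-status-example") // hypothetical app name
  .set("spark.taskMetrics.trackUpdatedBlockStatuses", "true")
```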