
Conversation

@tgravescs (Contributor) commented May 31, 2017

What changes were proposed in this pull request?

Turn tracking of TaskMetrics._updatedBlockStatuses off by default. As far as I can see it's not used by anything, and it uses a lot of memory when caching and processing a lot of blocks. In my case it was taking 5GB of a 10GB heap, and even when I went up to a 50GB heap the job still ran out of memory. With this change in place the same job easily runs in less than 10GB of heap.

We leave the API in place, along with a config to turn tracking back on, in case anyone is using it. TaskMetrics is exposed via SparkListenerTaskEnd, so users who rely on it can turn it back on.

How was this patch tested?

Ran the unit tests that were modified and manually tested a couple of jobs (with and without caching). Clicked through the UI and didn't see anything missing.
Also ran my very large Hive query job with 200,000 small tasks and 1,000 executors, caching 6+TB of data. It runs fine now, whereas without this change it would go into full GCs and eventually die.
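For anyone who does rely on the values, re-enabling the tracking is just a matter of setting the new flag (key name as added in this PR), e.g. in spark-defaults.conf:

```properties
# Re-enable tracking of updated block statuses in TaskMetrics
spark.taskMetrics.trackUpdatedBlockStatuses  true
```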

@tgravescs (Contributor Author)

@JoshRosen

Contributor

I'm having flashbacks to how tricky it was to figure out this logic when refactoring the block manager code (this used to be a lot messier). Really happy to see this get removed.

Contributor

The only thing that maybe gives me pause here is compatibility when reading Spark logs produced by a new version of Spark in an old version of the History Server: if we remove a key that has been present from day 0, code that assumes it exists might break. That said, the read of this key down on line 842 was already using Utils.jsonOption, so maybe this key was added later and is already handled as an option in the existing History Server code.

Contributor Author

Good point, I'll look through some older versions and test out a few things with this.

Contributor Author

If we aren't sure, I can also just leave it here but let it be empty.

Contributor Author

I looked at this some more and I don't see any issue. As you said, on line 842 it's read as an option, and I tested both directions with the History Server just to be sure: a new history file (without the "Updated blocks" entry) against an old History Server, and an old file (with the "Updated blocks" entry) against a new History Server. Both work fine.

If you think we should keep it, I can put it back with an empty value?

Contributor

Unfortunately TaskMetrics is a public class (via SparkListenerTaskEnd), so this is a breaking change.

But I do agree we should not track the updated block statuses. How about we keep this method and make it always return Nil? We can mark it as deprecated and add comments saying that it's only there for binary compatibility.
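A minimal, self-contained sketch of that shape (the class name, member types, and version string here are simplified illustrations, not Spark's actual TaskMetrics source):

```scala
// Stand-in for the real TaskMetrics: keep the accessor so existing code still
// compiles and links, mark it deprecated, and always return Nil.
class TaskMetricsSketch {
  @deprecated("updated block statuses are no longer tracked by default; " +
    "kept only for binary compatibility", "2.2.0")
  def updatedBlockStatuses: Seq[(String, Long)] = Nil
}
```

Existing listeners keep compiling; they just observe an empty sequence unless tracking is re-enabled.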

Contributor Author

Yeah, I'll update it.

@SparkQA commented May 31, 2017

Test build #77594 has finished for PR 18162 at commit 5327ba5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

@cloud-fan has a good point about backwards compatibility. In case any folks are actually relying on this behavior, I wonder whether we could mark it as deprecated and add a flag for disabling it, instead of removing it completely.

@tgravescs (Contributor Author)

Updated: I put the TaskMetrics API back with a deprecation marking and just had it return Nil. @JoshRosen, were you thinking of adding more back?

@JoshRosen (Contributor)

@tgravescs, I guess the question is whether any user has a SparkListener that actually uses the value of updatedBlockStatuses for some monitoring application or something similar. In that case returning Nil will preserve binary compatibility but will break semantics for their app. I don't know of any use cases like this offhand, so I don't have a personal stake in this, but I could imagine problems if someone relies on the return value (hence the suggestion of a flag if we want to be really conservative).

@tgravescs (Contributor Author)

TaskMetrics doesn't take the SparkConf or anything else it could read a config from, so we would have to put the config check everywhere it's incrementing or adding things. I think that wouldn't be too hard. I'll put everything back and just add the config checks around those spots, with the default off.

@SparkQA commented May 31, 2017

Test build #77603 has finished for PR 18162 at commit 41ed774.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs changed the title from "[SPARK-20923] Remove TaskMetrics._updatedBlockStatuses" to "[SPARK-20923] turn tracking of TaskMetrics._updatedBlockStatuses off" on Jun 1, 2017
.createWithDefaultString("200m")

private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES =
ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
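For context, the full config entry presumably reads along these lines. Only the val name and the key string come from the diff above; the .doc text, .booleanConf, and the default value are assumptions (the default was debated back and forth in this thread):

```scala
private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES =
  ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
    .doc("Enable tracking of updatedBlockStatuses in TaskMetrics. Tracking " +
      "can use significant memory when many blocks are cached.")
    .booleanConf
    .createWithDefault(true)
```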
Contributor Author

Not sure if we want to document this somewhere? If so, what do you suggest? There isn't any other config quite like this right now. We could put it in either the configuration doc or the monitoring doc.

@cloud-fan (Contributor) commented Jun 22, 2017

You can document it in configuration.md.

Actually, can we turn it off by default? I think this feature is useless for most users. cc @JoshRosen

Contributor Author

Right, this is why I originally had it off by default, and you requested it on. I will turn it back off, and if a user finds they need it because they somehow extended the class, they can turn it on. Turning this off will be most beneficial for the majority of users.

* By default it is configured to not actually save the block statuses via config
* TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES.
*/
def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = {
Contributor

Shall we deprecate it? Considering the cost of collecting this information, I think we should not encourage users to use it...

@cloud-fan (Contributor) commented Jun 1, 2017

I do love the cleanup you did by removing updatedBlockStatuses entirely... Since SparkListenerTaskEnd is marked as a developer API, is it acceptable to make TaskMetrics.updatedBlockStatuses always return Nil in Spark 2.3? Maybe we can send an email to the dev list to ask about this.

cc @srowen

@SparkQA commented Jun 1, 2017

Test build #77625 has finished for PR 18162 at commit ec3c29d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor Author)

Yeah, I was figuring I would file another JIRA to remove it later. I can add the deprecation flag here if you all agree.

@tgravescs (Contributor Author)

It would be nice to get this into Spark 2.2 if we can.

@tgravescs (Contributor Author)

@JoshRosen, what do you think? Should we add the deprecation?

@cloud-fan (Contributor)

Yeah, I think we should merge this to 2.2, but we need to change the default value of the new config to true so as not to surprise users.

@tgravescs (Contributor Author)

OK, I'll update the default.

@tgravescs (Contributor Author)

Turned it on by default for backwards compatibility, though I don't really agree with that. We should make this more stable/usable for people by turning it off. I'm assuming anyone using this is in a small minority, but to be safe we can leave it on, and I'll file a separate JIRA to turn it off and deprecate it. Note I'm out of the office next week, so I might not respond to comments for a bit.

@SparkQA commented Jun 2, 2017

Test build #77682 has finished for PR 18162 at commit fbe30cf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Remove this line, as it's no longer correct.

Contributor

By the time we reach here, I think we have already stored the updatedBlockStatus in memory, so filtering it out here doesn't help.

Contributor Author

I'm not sure what you mean by "it's already stored". It gets stored into the TaskMetrics when the call below to TaskMetrics.fromAccumulatorInfo is made. The updatedBlockStatuses the task metrics return aren't ever actually used by updateAggregateMetrics or updateTaskMetric, but I didn't see any reason to set the value, since it's not used.

Contributor

The accumUpdates were sent from the executors, so if we have already stopped tracking updatedBlockStatus, we don't need to filter here.

Contributor

Ping @tgravescs, any ideas? If it's not doable, we can start a vote and decide whether to remove updatedBlockStatus entirely.

@cloud-fan (Contributor) commented Jun 16, 2017

To be more clear: I think we should just do an assert here to make sure there are no UPDATED_BLOCK_STATUSES accumulator updates, instead of doing a filter.
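A self-contained sketch of the two options under discussion. The object, function names, and the (name, value) pair representation are illustrative simplifications, not Spark's real accumulator types; the key string mirrors Spark's internal accumulator naming:

```scala
// Hypothetical simplified form of the accumulator-update handling.
object AccumCheck {
  val UpdatedBlockStatuses = "internal.metrics.updatedBlockStatuses"

  // Option 1: quietly filter the entry out.
  def filtered(updates: Seq[(String, Any)]): Seq[(String, Any)] =
    updates.filterNot(_._1 == UpdatedBlockStatuses)

  // Option 2 (cloud-fan's suggestion): executors should no longer send it
  // at all, so fail loudly if one ever shows up.
  def checked(updates: Seq[(String, Any)]): Seq[(String, Any)] = {
    assert(!updates.exists(_._1 == UpdatedBlockStatuses),
      "unexpected updated-block-status accumulator update")
    updates
  }
}
```

The assert turns a silent inconsistency into an immediate failure, which is why it was preferred over the filter here.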

Contributor Author

Yes, that is correct; as I said, it shouldn't ever get there, so I can add an assert. Sorry for the delay, I was out of the office for a while. I'll update it.

Contributor Author

How about I just revert this? I'm not sure it's worth an assert here. The updated block statuses are just going to be empty, so setting them to empty isn't going to hurt anything.

Contributor

That also works.

@cloud-fan (Contributor)

I'm wondering whether it's doable to stop tracking updatedBlockStatus according to a config... there are many places that update updatedBlockStatus, right?

@tgravescs (Contributor Author)

I'm not sure what you mean by "not doable". What places do you see updating the block statuses that I haven't covered here? Most of it was done by the BlockManager. Maybe I'm missing something in my IntelliJ search or greps, but I don't think so. Look at where incUpdatedBlockStatuses and setUpdatedBlockStatuses (two versions) are used.

@SparkQA commented Jun 2, 2017

Test build #77684 has finished for PR 18162 at commit 018955a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 2, 2017

Test build #77687 has finished for PR 18162 at commit a875aac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Jun 8, 2017

@tgravescs, I deployed this to our production environment (based on 2.0.0) a few days ago and haven't hit any problems with it. I think this is good to go, unless something has been added recently that uses the block statuses.

+1

@tgravescs (Contributor Author)

Jenkins, test this please

@SparkQA commented Jun 20, 2017

Test build #78300 has finished for PR 18162 at commit a875aac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

@tgravescs can you address #18162 (comment) ?

@tgravescs (Contributor Author)

Sorry, I missed that you had commented; yes, we can change that.

@tgravescs (Contributor Author)

Will upmerge shortly, since there are conflicts.

@tgravescs (Contributor Author)

Upmerged to master, updated the default, and removed unneeded changes.

@SparkQA commented Jun 22, 2017

Test build #78466 has finished for PR 18162 at commit 68fbd9f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor Author)

The failure is from the previous push of code.

@tgravescs (Contributor Author)

Jenkins, test this please

@SparkQA commented Jun 22, 2017

Test build #78467 has finished for PR 18162 at commit 68fbd9f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 22, 2017

Test build #78460 has finished for PR 18162 at commit c9bf058.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor Author)

Jenkins, test this please

@SparkQA commented Jun 22, 2017

Test build #78468 has finished for PR 18162 at commit 289e993.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Thanks, merging to master!

@asfgit closed this in 5b5a69b on Jun 23, 2017
@tgravescs (Contributor Author)

Thanks for the reviews, @cloud-fan.

robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017.
MikhailErofeev pushed a commit to MikhailErofeev/spark that referenced this pull request Dec 17, 2017 (cherry picked from commit 5b5a69b).
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018.