Conversation

@Sephiroth-Lin
Contributor

Currently, when we kill an application on YARN, sc.stop() is called from the YARN application state monitor thread. YarnClientSchedulerBackend.stop() then calls interrupt on that same thread, which prevents the SparkContext from stopping fully because we are still waiting for the executors to exit.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39318 has finished for PR 7846 at commit 243d2c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 1, 2015

This feels too hacky to be a good solution, relying on a flag to pass around who should interrupt a thread. Why not close the sc in the finally block of the monitor thread and be done?

@Sephiroth-Lin
Contributor Author

@srowen We need to call interrupt in YarnClientSchedulerBackend.stop() (for details see PR #5305 and PR #3143), so even if we call sc.stop() in the finally block of the monitor thread, it still cannot stop successfully.

@srowen
Member

srowen commented Aug 1, 2015

Is the sequence that sc.stop causes the backend to stop which may interrupt the monitor thread, which may be the thing causing it to stop? This change doesn't stop this sequence from happening though; there's still a race condition. Why would the thread interrupt itself as the last thing it does?

This is cleaner if it's entirely local to the monitor thread. The backend doesn't need a new field for this. The thread can have a "stop" method that interrupts it only if it's blocked in monitorApplication.

@Sephiroth-Lin
Contributor Author

Yes, this change doesn't stop this sequence from happening. As the monitor thread is a daemon thread, we don't need to call interrupt after sc.stop().
I am not very clear on two points below:

  1. there's still a race condition
  2. the thread can have a "stop" method that interrupts it only if it's blocked in monitorApplication

Thank you!

@srowen
Member

srowen commented Aug 1, 2015

If you're asking what I mean, I mean that the monitor thread itself can have the flag, like isMonitoring, that is true when it starts the blocking call to monitorApplication and false immediately after. Then expose another method like stop() or something, that interrupts the thread only if isMonitoring is true. This means that if the thread itself initiates sc.stop(), it won't get interrupted, but can still be interrupted in the blocking call to the library method.
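
A minimal Java sketch of the pattern described above (not Spark's actual Scala code; names like MonitorThread and stopMonitor are hypothetical stand-ins): the monitor thread owns the isMonitoring flag, and its stop method interrupts the thread only while it is blocked in the monitoring call.

```java
// Sketch of srowen's suggestion: the flag lives entirely inside the
// monitor thread, so the backend needs no new field.
class MonitorThread extends Thread {
    private volatile boolean isMonitoring = false;

    MonitorThread() {
        // as a named class it can set its own name and daemon status
        setName("YARN application state monitor");
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            isMonitoring = true;
            monitorApplication();   // blocking call; may be interrupted
        } catch (InterruptedException e) {
            // interrupted via stopMonitor(): sc.stop() is already in progress
        } finally {
            isMonitoring = false;
        }
        // if we get here uninterrupted, the YARN app finished first and
        // this thread would initiate sc.stop() itself without being
        // interrupted by the backend
    }

    // Interrupt only while blocked in monitorApplication(); once the
    // thread is past that point, this becomes a no-op.
    public void stopMonitor() {
        if (isMonitoring) {
            interrupt();
        }
    }

    // Stand-in for Client.monitorApplication(): blocks until interrupted.
    private void monitorApplication() throws InterruptedException {
        Thread.sleep(60_000);
    }
}
```

The key property is that the interrupt can only land while the thread is inside the blocking monitoring call, so a thread that has moved on to initiating sc.stop() is never interrupted by its own shutdown path.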

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39354 has finished for PR 7846 at commit ad0e23b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Yeah I think that's tidier. Now that it's its own named class, name and daemon status can be set by the class itself I think.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #1278 has finished for PR 7846 at commit ad0e23b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 4, 2015

Any more thoughts on this one? Matching the keyword 'yarn', I will reference @sryza and @vanzin.

Contributor

I'd call this allowInterrupt.

So it took me a bit to understand why this code is like this. Basically, when you interrupt, it's because the SparkContext is being shut down (sc.stop() called by user code), and you do not want sc.stop() to be called again here. Now if monitorApplication() returns, it means the YARN app finished before sc.stop() was called, which means this code should call sc.stop(). Could you write a small comment explaining that, so that in the future people know what's going on here?
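
The two exit paths being described can be sketched as follows (a hypothetical Java stand-in, not Spark's real code; stopSparkContext is a placeholder for sc.stop()):

```java
// Sketch of the two cases the requested comment should explain.
class AppStateMonitor extends Thread {
    volatile boolean stoppedContext = false;

    @Override
    public void run() {
        try {
            monitorApplication();
        } catch (InterruptedException e) {
            // Case 1: interrupted because user code already called sc.stop(),
            // which stopped the backend; do NOT call sc.stop() again here.
            return;
        }
        // Case 2: monitorApplication() returned, meaning the YARN app
        // finished (e.g. it was killed) before sc.stop() was called, so
        // the monitor thread must initiate the shutdown itself.
        stopSparkContext();
    }

    // Stand-in that returns immediately, simulating the app finishing.
    void monitorApplication() throws InterruptedException { }

    // Stand-in for sc.stop().
    void stopSparkContext() { stoppedContext = true; }
}
```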

@vanzin
Contributor

vanzin commented Aug 4, 2015

Looks good, I just think we need a comment explaining the code for future readers.

@Sephiroth-Lin
Contributor Author

@vanzin @srowen Updated, thank you!

Contributor

nit: "for SPARK-9519".

@vanzin
Contributor

vanzin commented Aug 5, 2015

LGTM, I'll leave it here to see if anyone else has comments, otherwise I'll merge in the morning.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39798 has finished for PR 7846 at commit 2e8e365.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39809 has finished for PR 7846 at commit 1ae736d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 5, 2015
… killed

Currently, when we kill an application on YARN, sc.stop() is called from the YARN application state monitor thread. YarnClientSchedulerBackend.stop() then calls interrupt on that same thread, which prevents the SparkContext from stopping fully because we are still waiting for the executors to exit.

Author: linweizhong <linweizhong@huawei.com>

Closes #7846 from Sephiroth-Lin/SPARK-9519 and squashes the following commits:

1ae736d [linweizhong] Update comments
2e8e365 [linweizhong] Add comment explaining the code
ad0e23b [linweizhong] Update
243d2c7 [linweizhong] Confirm stop sc successfully when application was killed

(cherry picked from commit 7a969a6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@vanzin
Contributor

vanzin commented Aug 5, 2015

Merged to master and 1.5, thanks!

@asfgit asfgit closed this in 7a969a6 Aug 5, 2015
@Sephiroth-Lin Sephiroth-Lin deleted the SPARK-9519 branch May 15, 2016 10:10