[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon #25138
Conversation
srowen left a comment:
Seems reasonable to me. Maybe CC @JoshRosen
Test build #107602 has finished for PR 25138 at commit
Good catch. This seems reasonable to me. One question, though: is it possible to add a regression test for this? Here's some brainstorming on how we might do that:
Just thinking aloud here; let me know if you can think of a cleaner way to test this (or whether we can regression-test this via some other means).
Yup, actually I suggested this approach in the first place but didn't have enough time to verify the details. +1 for @JoshRosen's comment.
@JoshRosen @srowen @HyukjinKwon See spark/python/pyspark/daemon.py, line 58 (at commit 8ecbb67): if we merely close stdin, the dup'ed sock file created at line 58 will be allocated file descriptor 0. So I updated my code. If this is OK, I will update the test.
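To make the descriptor-reuse hazard concrete, here is a minimal illustration (a sketch, not the daemon's actual code; open(os.devnull) stands in for the sock file dup'ed at daemon.py line 58):

    import os

    # Once fd 0 is closed, the next file this process opens is assigned
    # descriptor 0, so the dup'ed sock file would silently become "stdin".
    os.close(0)
    f = open(os.devnull)   # stands in for the dup'ed sock file
    print(f.fileno())      # prints 0: this file now occupies the stdin slot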
Test build #107689 has finished for PR 25138 at commit
Jenkins, retest this please.
Test build #107691 has finished for PR 25138 at commit
Test build #107709 has finished for PR 25138 at commit
Force-pushed from b6ceb30 to be73d73.
Test build #107710 has finished for PR 25138 at commit
    res = sys.stdin.read()
    # Because stdin is replaced with '/dev/null',
    # reading from it returns EOF immediately.
    assert res == '', "Expect read EOF from stdin."
This verifies that reading from stdin returns EOF immediately.
Should we add more tests, such as verifying that the worker process actually exits?
But I think the current test is enough: the fact that we can only read EOF from stdin shows that stdin is a dummy, safe file descriptor; it won't influence other file descriptors in the daemon.
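For context, the assertion above runs inside a Spark task; a hypothetical shape of such a regression test (assuming an active SparkContext named sc) might look like:

    def assert_stdin_is_devnull(_):
        import sys
        # With stdin redirected to /dev/null, read() returns '' (EOF)
        # immediately instead of blocking on the daemon's socket.
        res = sys.stdin.read()
        assert res == '', "Expect read EOF from stdin."
        return res

    # sc is an active SparkContext (an assumption for this sketch).
    sc.parallelize(range(1), 1).map(assert_stdin_is_devnull).collect()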
WeichenXu123 left a comment:
My thoughts about the test.
Test build #107714 has finished for PR 25138 at commit
Gently ping @JoshRosen
Gently ping @HyukjinKwon @ueshin
I don't know this part well, but given the analysis and test, seems OK? @HyukjinKwon @JoshRosen
Seems a-okay from a cursory look, but I will take a closer look since this is PySpark's core path; I have been stuck on some other work. I'll take a look within one day and leave some comments. Also, I hope @JoshRosen can have a chance to take a look as well.
Looks good to me. @JoshRosen, mind double-checking when you have a chance?
@WeichenXu123, can you update the PR description as well when you address the comments?
Force-pushed from 22a4a2c to 51b1a66.
Test build #108392 has finished for PR 25138 at commit
Test build #108390 has finished for PR 25138 at commit
retest this please
Test build #108401 has finished for PR 25138 at commit
I took a quick cursory look and this seems reasonable to me. /cc @GregOwen @srinathshankar as FYI
Merged to master. |
What changes were proposed in this pull request?
The PySpark worker daemon reads from stdin the PIDs of the workers to kill (see spark/python/pyspark/daemon.py, line 127 at commit 1bb60ab).
However, the worker process is forked from the worker daemon process, and we didn't close stdin in the child after the fork. This means the child and the user program can read from stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper might detect that the task was not terminated and eventually kill the JVM.
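A minimal, self-contained illustration of the inherited-stdin problem (a sketch, not Spark's code; a pipe stands in for the daemon's stdin socket):

    import os

    # Parent and forked child share file descriptor 0, so a read in the
    # child consumes bytes the parent expected to receive.
    r, w = os.pipe()
    os.write(w, b"PID1")      # stands in for a "worker PID to kill" message
    os.close(w)
    os.dup2(r, 0)             # make the pipe this process's stdin
    os.close(r)

    pid = os.fork()
    if pid == 0:
        os.read(0, 4)         # the child steals the bytes meant for the parent
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        print(os.read(0, 4))  # b'': the parent only sees EOF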
This PR fixes this by redirecting the standard input of the forked child to devnull.
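A minimal sketch of that redirection (simplified; the actual daemon.py change may differ in detail):

    import os

    # In the forked child, point fd 0 at /dev/null instead of closing it:
    # the descriptor stays occupied (so no later file is reassigned to 0),
    # and any read from stdin returns EOF immediately.
    devnull = os.open(os.devnull, os.O_RDONLY)
    os.dup2(devnull, 0)
    if devnull != 0:
        os.close(devnull)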
How was this patch tested?
Manually tested.
In pyspark, run a job whose task spawns a "cat" subprocess that reads from stdin (a sketch of such a snippet follows below).

Before: the job gets stuck; pressing Ctrl+C exits the job, but the Python worker process does not exit.

After: the job finishes correctly, the "cat" prints nothing (because the dummy stdin is "/dev/null"), and the Python worker process exits normally.
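The repro snippet itself is not preserved above; this is a plausible reconstruction given the "cat" mention (assumes the pyspark shell's SparkContext sc):

    import subprocess

    # Each task runs `cat`, which reads its stdin until EOF. Before the fix,
    # the worker inherits the daemon's stdin, so `cat` blocks forever; after
    # the fix, stdin is /dev/null, so `cat` sees immediate EOF and outputs
    # nothing.
    sc.parallelize(range(1), 1) \
      .map(lambda x: subprocess.check_output(["cat"])) \
      .collect()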