Commit dbc478b
Reduce job failures during decommissioning
This PR reduces the chance of job failures during decommissioning. It
fixes two holes in the current decommissioning framework:
- (a) Loss of a decommissioned executor is not treated as a job failure:
We know that the decom'd executor will die soon, so its death is
clearly not caused by the application.
- (b) Shuffle files on the decommissioned host are cleared when the
first fetch failure is detected from a decommissioned host: The tricky
part is deciding when to clear the shuffle state. Ideally you want to
clear it the millisecond before the shuffle service on the node dies
(or the executor dies when there is no external shuffle service).
Clearing too soon could lead to some wastage, and clearing too late
would lead to fetch failures.
(These two things are a bit intertwined, so it is easier to do them in a
single commit.)
The approach here is to do this clearing when the very first fetch
failure is observed on the decom'd block manager, without waiting for
other blocks to also signal a failure.
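
The two rules can be illustrated with a minimal, self-contained Scala
sketch. This is a toy model, not the code in this commit: `ToyScheduler`,
`ExecutorId`, `maxCountedFailures`, and every method name below are
hypothetical, and the handling of fetch failures from healthy executors
(dropping only the failed output) is a simplifying assumption.

```scala
import scala.collection.mutable

// Hypothetical toy model; none of these names are Spark's actual API.
final case class ExecutorId(value: String)

final class ToyScheduler(maxCountedFailures: Int) {
  private val decommissioned = mutable.Set.empty[ExecutorId]
  // Executor -> ids of shuffles that have map output stored on it.
  private val shuffleOutputs = mutable.Map.empty[ExecutorId, Set[Int]]
  private var countedFailures = 0

  def outputsOn(exec: ExecutorId): Set[Int] =
    shuffleOutputs.getOrElse(exec, Set.empty[Int])

  def registerOutput(exec: ExecutorId, shuffleId: Int): Unit = {
    shuffleOutputs(exec) = outputsOn(exec) + shuffleId
  }

  def decommission(exec: ExecutorId): Unit = {
    decommissioned += exec
  }

  // Rule (a): losing a decommissioned executor is expected, so it does
  // not count towards the failures that abort the job.
  def executorLost(exec: ExecutorId): Unit = {
    if (!decommissioned.contains(exec)) {
      countedFailures += 1
    }
    shuffleOutputs -= exec
  }

  // Rule (b): the first fetch failure from a decommissioned executor
  // clears all of its shuffle state at once. For a healthy executor this
  // sketch assumes only the failed output is dropped.
  def fetchFailed(exec: ExecutorId, shuffleId: Int): Unit = {
    if (decommissioned.contains(exec)) {
      shuffleOutputs -= exec
    } else {
      shuffleOutputs(exec) = outputsOn(exec) - shuffleId
    }
  }

  def jobAborted: Boolean = countedFailures >= maxCountedFailures
}
```

In this model, calling `decommission` on an executor and then reporting a
single `fetchFailed` for it empties `outputsOn` for that executor, and a
later `executorLost` for it leaves `jobAborted` unchanged.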
Added a new unit test `DecommissionWorkerSuite` to test the new
behavior.
1 parent a3d8056 commit dbc478b
File tree
9 files changed: +539 -7 lines changed
- core/src
  - main/scala/org/apache/spark/scheduler
  - test/scala/org/apache/spark
    - deploy
    - scheduler
Lines changed: 14 additions & 5 deletions
Lines changed: 6 additions & 1 deletion
Lines changed: 5 additions & 0 deletions
Lines changed: 36 additions & 1 deletion
Lines changed: 1 addition & 0 deletions