-
Notifications
You must be signed in to change notification settings - Fork 549
Conversation
|
||
[[constraint]] | ||
name = "gopkg.in/yaml.v2" | ||
version = "2.2.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use built-in yaml #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't find the build-in yaml lib. Do I miss something?
In reply to: 321699635 [](ancestors = 321699635)
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
||
{% for spec_item in cluster_cfg['k8s-job-exit-spec']['spec'] %} | ||
{%- if spec_item['code'] > 0 %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have code to add in spec yaml. If so also include it in this PR, so that you can test it. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{%- for str_val in spec_item['solution'] %} | ||
- "{{ str_val }}" | ||
{%- endfor %} | ||
{%- endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why runtime need to read static reason, solution? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/kube-runtime/deploy/refresh.sh
Outdated
|
||
pushd $(dirname "$0") > /dev/null | ||
|
||
kubectl create configmap user-exit-spec-configuration --from-file=user-exit-spec.yaml --dry-run -o yaml | kubectl replace -f - || exit $? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create [](start = 8, length = 6)
create [](start = 8, length = 6)
The refresh is update? If so, use update operation instead of create #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#kubectl-apply.
Replace will first delete then create, not so good.
In reply to: 322079743 [](ancestors = 322079743,322045283)
var log *logger.Logger | ||
|
||
const defaultExitCode = 255 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Keep exitcode in your file the same as the process's exit code. If it is hard, remove the one in file, and just use process's exit code.
-
If there is error in your runtime itself, you should not exit with 255. 255 means runtime unrecognized the code, for runtime itself failure, it can exit anything other than the code in spec. So it is to be:
- code: 256
phrase: PAIRuntimeExitAbnormally
(So I think you can just panic on any critical runtime itself error) #Closed
src/kube-runtime/src/init
Outdated
@@ -56,6 +57,11 @@ mkdir -p ${PAI_LOG_DIR} | |||
${PAI_INIT_DIR}/frameworkbarrier > ${PAI_LOG_DIR}/barrier.log 2>&1 | |||
echo "barrier returns $?" >> ${PAI_INIT_LOG_FILE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are code: 250, 251, 252? #Closed
} | ||
log.Info("finish generating the exit summary") | ||
fmt.Print(exitInfo.Exitcode) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do you exit with exitInfo.Exitcode?
Pls do simple test before send PR. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code exit at:
pai/src/kube-runtime/src/runtime
Line 46 in 2061dd7
exit ${CONTAINER_EXIT_CODE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems too tricky, can we unify the way to expose exitcode outside?
In reply to: 322047090 [](ancestors = 322047090)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I write like this for 2 reasons:
- Currently, it run as a program. So, I want to use exit code to identify if the program run correctly.
- I want to copy the aggerated info to log folder for debugging purpose. Since we set
set -o errexit
. Direct exit will terminate the main process
Ideally, this part should be executed as a function, then we can return the spec exit code and let main program exit with this code. But the main process is a shell script, it's not easy to do this unless we change the shell to a golang program.
A short term solution: exit with spec code. unset the set -o errexit
in main shell. Then we can do some following actions such as log something, copy the aggregated exitInfo to log folder
A long term: use golang to rewrite the shell and let exitHandler be a function.
In reply to: 322048329 [](ancestors = 322048329,322047090)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
maxRuntimeLogLines: 10, | ||
defaulExitCode: 255, | ||
maxSearchLogSize: 100 * 1024 * 1024, // 100MB | ||
aggExitInfoBegin: "[PAI_RUNTIME_ERROR_START]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test the worst time that exithandler need? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depeneds on many conditions. The log size, the complexity of the pattern, the number of patterns. Currently, the max number of patterns is 32 220-255 excludes 250,251,252,255. The worse case will happens when all these patterns with same exit code, and complex regex pattern.
Can do a test with useSize larger than 100MB, all patterns with the same exit code. And use simple regex pattern such as: command not found
for the test
In reply to: 322047160 [](ancestors = 322047160)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, pls benchmark it, make sure the default setting is not too time consuming
In reply to: 322054168 [](ancestors = 322054168,322047160)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems 100MB may also introduce false alarm, limit to 10MB?
In reply to: 322054707 [](ancestors = 322054707,322054168,322047160)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to 10MB. Not do performance test yet
In reply to: 323139216 [](ancestors = 323139216,322054707,322054168,322047160)
|
||
// Use zero to present undefined exitCode in error spec | ||
const undefinedExitCode = 0 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 is so confusing, why not use 256? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -30,6 +30,23 @@ PAI_WORK_DIR=/usr/local/pai | |||
PAI_RUNTIME_DIR=${PAI_WORK_DIR}/runtime.d | |||
PAI_LOG_DIR=${PAI_WORK_DIR}/logs/${FC_POD_UID} | |||
PAI_USER_LOG_FILE=${PAI_LOG_DIR}/user.pai.all | |||
PATTERN_FILE=${PAI_RUNTIME_DIR}/runtime-exit-spec.yaml | |||
TERMINATE_MESSAGE_PATH=/tmp/pai-termination-log | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crossref with restserver #Closed
|
||
{% for spec_item in cluster_cfg['k8s-job-exit-spec']['spec'] %} | ||
{%- if spec_item['code'] > 0 %} | ||
- containerExitCode: {{ spec_item['code'] }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also filter codes that does not have runtimeContainerPatterns? #Closed
@@ -736,6 +736,25 @@ spec: | |||
# Description: User Container issued failures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
solution: | ||
- "Check job logs" | ||
- "Correct the command and retry the job" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unknown, pls refer previous spec on yarn :
- code: 255
phrase: PAIRuntimeUnknownFailed
issuer: PAI_RUNTIME
causer: UNKNOWN
type: UNKNOWN_FAILURE
stage: COMPLETING
behavior: UNKNOWN
reaction: RETRY_TO_MAX
reason: "Container failed but the failure cannot be recognized by PAI Runtime"
repro:- "User program directly exits with exitcode 1"
solution: - "Check container log and find root cause"
- "Wait result from next retry" #Closed
- "User program directly exits with exitcode 1"
Do NOT make assumption on user image, it may not have python #Closed Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False) |
Change the exitcode 200 to another dedicated code in spec #Closed Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False) |
Seems runtime also need to handle these codes, such as ContainerSigSegvReceived, I mean the pattern should be directly forward these code outside, instead of converting them to be 255. And change the Owner: to be Owner: PAI_FC and PAI_RUNTIME #Closed Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:586 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False) |
Do not add > >(tee -a ${PAI_LOG_DIR}/runtime.pai.stdout) 2> >(tee -a ${PAI_LOG_DIR}/runtime.pai.stderr >&2) on each cmd, just split to 2 runtime script #Closed Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False) |
44fb8f3
to
c8198a1
Compare
@@ -780,10 +884,40 @@ spec: | |||
solution: | |||
- "Contact PAI Dev to fix the PAI Runtime bug" | |||
|
|||
- code: 253 | |||
phrase: PortConflictCheckFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PortConflictCheckFailed [](start = 10, length = 23)
PortConflictCheckFailed [](start = 10, length = 23)
Just "ContainerPortConflict"? #Closed
c8198a1
to
91c58ea
Compare
src/kube-runtime/src/init
Outdated
@@ -56,6 +86,11 @@ mkdir -p ${PAI_LOG_DIR} | |||
${PAI_INIT_DIR}/frameworkbarrier > ${PAI_LOG_DIR}/barrier.log 2>&1 | |||
echo "barrier returns $?" >> ${PAI_INIT_LOG_FILE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tee all runtime logs to docker/k8s stdout/err for easy debugging? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to redirect lgos to stdout & stderr, I think it's better to wrap the init container or change the init container start command.
I prefer change the docker start command
In reply to: 323551280 [](ancestors = 323551280)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
if [[ $EXIT_CODE -eq 210 ]]; then | ||
exit 252 | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should only be used to revise bairrier code, not for global #Closed
if [[ $EXIT_CODE -eq 10 ]]; then | ||
exit 253 | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should only be used to revise port checker code, not for global #Closed
|
||
pushd $(dirname "$0") > /dev/null | ||
|
||
kubectl create configmap runtime-exit-spec-configuration --from-file=runtime-exit-spec.yaml --dry-run -o yaml | kubectl apply --overwrite=true -f - || exit $? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kubectl create [](start = 0, length = 14)
Why need create? Just apply will create?
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure apply will create, we use create here to generate configmap spec, but not apply to k8s. Please notice the --dry-run
option. Then we apply the generated spec.
In reply to: 324500792 [](ancestors = 324500792)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does kubectl apply can use "configmap runtime-exit-spec-configuration --from-file=runtime-exit-spec.yaml" directly?
In reply to: 324954531 [](ancestors = 324954531,324500792)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't find such usage for kubectl apply. Refer to https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#create-a-configmap and https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply. kubectl apply
only accept the resource configuration. We can use kubectl create configmap
or use kustomization.yaml
to create the configmap. kubectl create configmap
is more straightforward.
In reply to: 325117384 [](ancestors = 325117384,324954531,324500792)
And change the Owner: to be Owner: PAI_FC and PAI_RUNTIME In reply to: 529900585 [](ancestors = 529900585) Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:586 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False) |
change init script
a68a678
to
30cdc51
Compare
81b4e1f
to
a0da0d2
Compare
echo "frameworkbarrier start" | ||
${PAI_INIT_DIR}/frameworkbarrier 2>&1 | tee ${PAI_LOG_DIR}/barrier.log | ||
echo "frameworkbarrier returns $?" | ||
echo "barrier returns $?" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always be returns 0, so why echo it? #Closed
# priority=1 | ||
CHILD_PROCESS="ERROR_SPEC" | ||
cp ${PAI_CONFIG_DIR}/runtime-exit-spec.yaml ${PAI_RUNTIME_DIR} | ||
echo "copy exit spec returns $?" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always be returns 0, so why echo it? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
start-script: start.sh | ||
delete-script: delete.sh | ||
refresh-script: refresh.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Binyang2014 stop-script
is missing here.
ErrorSpec runtime part. Most of files are golang lib and do not need to review.
Some files are used for test.