Skip to content
This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

ErrorSpec: runtime part #3585

Merged
merged 29 commits into from
Sep 18, 2019
Merged

ErrorSpec: runtime part #3585

merged 29 commits into from
Sep 18, 2019

Conversation

Binyang2014
Copy link
Contributor

@Binyang2014 Binyang2014 commented Sep 6, 2019

ErrorSpec runtime part. Most of files are golang lib and do not need to review.
Some files are used for test.

@Binyang2014 Binyang2014 marked this pull request as ready for review September 6, 2019 11:30

[[constraint]]
name = "gopkg.in/yaml.v2"
version = "2.2.2"
Copy link
Member

@yqwang-ms yqwang-ms Sep 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use built-in yaml #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't find the build-in yaml lib. Do I miss something?


In reply to: 321699635 [](ancestors = 321699635)

# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

{% for spec_item in cluster_cfg['k8s-job-exit-spec']['spec'] %}
{%- if spec_item['code'] > 0 %}
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have code to add in spec yaml. If so also include it in this PR, so that you can test it. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already do some local test for this. Will add more


In reply to: 322044998 [](ancestors = 322044998)

{%- for str_val in spec_item['solution'] %}
- "{{ str_val }}"
{%- endfor %}
{%- endif %}
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why runtime need to read static reason, solution? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove these fields


In reply to: 322045079 [](ancestors = 322045079)


pushd $(dirname "$0") > /dev/null

kubectl create configmap user-exit-spec-configuration --from-file=user-exit-spec.yaml --dry-run -o yaml | kubectl replace -f - || exit $?
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create [](start = 8, length = 6)

create [](start = 8, length = 6)

The refresh is update? If so, use update operation instead of create #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we use kubectl replace -f - for the update refer to update . If we use in-place update operation such as patch, it will make the command complex.


In reply to: 322045283 [](ancestors = 322045283)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#kubectl-apply.

Replace will first delete then create, not so good.


In reply to: 322079743 [](ancestors = 322079743,322045283)

var log *logger.Logger

const defaultExitCode = 255

Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Keep exitcode in your file the same as the process's exit code. If it is hard, remove the one in file, and just use process's exit code.

  2. If there is error in your runtime itself, you should not exit with 255. 255 means runtime unrecognized the code, for runtime itself failure, it can exit anything other than the code in spec. So it is to be:

  • code: 256
    phrase: PAIRuntimeExitAbnormally
    (So I think you can just panic on any critical runtime itself error) #Closed

@@ -56,6 +57,11 @@ mkdir -p ${PAI_LOG_DIR}
${PAI_INIT_DIR}/frameworkbarrier > ${PAI_LOG_DIR}/barrier.log 2>&1
echo "barrier returns $?" >> ${PAI_INIT_LOG_FILE}
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are code: 250, 251, 252? #Closed

}
log.Info("finish generating the exit summary")
fmt.Print(exitInfo.Exitcode)
}
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do you exit with exitInfo.Exitcode?
Pls do simple test before send PR. #Closed

Copy link
Contributor Author

@Binyang2014 Binyang2014 Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code exit at:

exit ${CONTAINER_EXIT_CODE}
#Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems too tricky, can we unify the way to expose exitcode outside?


In reply to: 322047090 [](ancestors = 322047090)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I write like this for 2 reasons:

  1. Currently, it run as a program. So, I want to use exit code to identify if the program run correctly.
  2. I want to copy the aggerated info to log folder for debugging purpose. Since we set set -o errexit. Direct exit will terminate the main process

Ideally, this part should be executed as a function, then we can return the spec exit code and let main program exit with this code. But the main process is a shell script, it's not easy to do this unless we change the shell to a golang program.

A short term solution: exit with spec code. unset the set -o errexit in main shell. Then we can do some following actions such as log something, copy the aggregated exitInfo to log folder
A long term: use golang to rewrite the shell and let exitHandler be a function.


In reply to: 322048329 [](ancestors = 322048329,322047090)

Copy link
Member

@yqwang-ms yqwang-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

maxRuntimeLogLines: 10,
defaulExitCode: 255,
maxSearchLogSize: 100 * 1024 * 1024, // 100MB
aggExitInfoBegin: "[PAI_RUNTIME_ERROR_START]",
Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test the worst time that exithandler need? #Closed

Copy link
Contributor Author

@Binyang2014 Binyang2014 Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depeneds on many conditions. The log size, the complexity of the pattern, the number of patterns. Currently, the max number of patterns is 32 220-255 excludes 250,251,252,255. The worse case will happens when all these patterns with same exit code, and complex regex pattern.
Can do a test with useSize larger than 100MB, all patterns with the same exit code. And use simple regex pattern such as: command not found for the test


In reply to: 322047160 [](ancestors = 322047160)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, pls benchmark it, make sure the default setting is not too time consuming


In reply to: 322054168 [](ancestors = 322054168,322047160)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems 100MB may also introduce false alarm, limit to 10MB?


In reply to: 322054707 [](ancestors = 322054707,322054168,322047160)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to 10MB. Not do performance test yet


In reply to: 323139216 [](ancestors = 323139216,322054707,322054168,322047160)


// Use zero to present undefined exitCode in error spec
const undefinedExitCode = 0

Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 is so confusing, why not use 256? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will fix


In reply to: 322048988 [](ancestors = 322048988)

@@ -30,6 +30,23 @@ PAI_WORK_DIR=/usr/local/pai
PAI_RUNTIME_DIR=${PAI_WORK_DIR}/runtime.d
PAI_LOG_DIR=${PAI_WORK_DIR}/logs/${FC_POD_UID}
PAI_USER_LOG_FILE=${PAI_LOG_DIR}/user.pai.all
PATTERN_FILE=${PAI_RUNTIME_DIR}/runtime-exit-spec.yaml
TERMINATE_MESSAGE_PATH=/tmp/pai-termination-log

Copy link
Member

@yqwang-ms yqwang-ms Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crossref with restserver #Closed


{% for spec_item in cluster_cfg['k8s-job-exit-spec']['spec'] %}
{%- if spec_item['code'] > 0 %}
- containerExitCode: {{ spec_item['code'] }}
Copy link
Member

@yqwang-ms yqwang-ms Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also filter codes that does not have runtimeContainerPatterns? #Closed

@@ -736,6 +736,25 @@ spec:
# Description: User Container issued failures:
Copy link
Member

@yqwang-ms yqwang-ms Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solution:
- "Check job logs"
- "Correct the command and retry the job"

Copy link
Member

@yqwang-ms yqwang-ms Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unknown, pls refer previous spec on yarn :

  • code: 255
    phrase: PAIRuntimeUnknownFailed
    issuer: PAI_RUNTIME
    causer: UNKNOWN
    type: UNKNOWN_FAILURE
    stage: COMPLETING
    behavior: UNKNOWN
    reaction: RETRY_TO_MAX
    reason: "Container failed but the failure cannot be recognized by PAI Runtime"
    repro:
    • "User program directly exits with exitcode 1"
      solution:
    • "Check container log and find root cause"
    • "Wait result from next retry" #Closed

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

python ${PAI_RUNTIME_DIR}/port.py > >(tee -a ${PAI_LOG_DIR}/runtime.pai.stdout) 2> >(tee -a ${PAI_LOG_DIR}/runtime.pai.stderr >&2)

Do NOT make assumption on user image, it may not have python #Closed


Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False)

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

python ${PAI_RUNTIME_DIR}/port.py > >(tee -a ${PAI_LOG_DIR}/runtime.pai.stdout) 2> >(tee -a ${PAI_LOG_DIR}/runtime.pai.stderr >&2)

Change the exitcode 200 to another dedicated code in spec #Closed


Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False)

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

###########################################################################

Seems runtime also need to handle these codes, such as ContainerSigSegvReceived, I mean the pattern should be directly forward these code outside, instead of converting them to be 255.

And change the Owner: to be

Owner: PAI_FC and PAI_RUNTIME #Closed


Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:586 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False)

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

Owner: PAI_LAUNCHER

Change to Owner: PAI_FC #Closed


Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:820 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False)

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

python ${PAI_RUNTIME_DIR}/port.py > >(tee -a ${PAI_LOG_DIR}/runtime.pai.stdout) 2> >(tee -a ${PAI_LOG_DIR}/runtime.pai.stderr >&2)

Do not add > >(tee -a ${PAI_LOG_DIR}/runtime.pai.stdout) 2> >(tee -a ${PAI_LOG_DIR}/runtime.pai.stderr >&2) on each cmd, just split to 2 runtime script #Closed


Refers to: src/kube-runtime/src/runtime:78 in 7a361c7. [](commit_id = 7a361c7, deletion_comment = False)

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 10, 2019

  - USER_CONTAINER

Add PAI_HW #Closed


Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:55 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False)

@Binyang2014 Binyang2014 force-pushed the binyli/runtime_error branch 2 times, most recently from 44fb8f3 to c8198a1 Compare September 11, 2019 09:10
@@ -780,10 +884,40 @@ spec:
solution:
- "Contact PAI Dev to fix the PAI Runtime bug"

- code: 253
phrase: PortConflictCheckFailed
Copy link
Member

@yqwang-ms yqwang-ms Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PortConflictCheckFailed [](start = 10, length = 23)

PortConflictCheckFailed [](start = 10, length = 23)

Just "ContainerPortConflict"? #Closed

@@ -56,6 +86,11 @@ mkdir -p ${PAI_LOG_DIR}
${PAI_INIT_DIR}/frameworkbarrier > ${PAI_LOG_DIR}/barrier.log 2>&1
echo "barrier returns $?" >> ${PAI_INIT_LOG_FILE}
Copy link
Member

@yqwang-ms yqwang-ms Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tee all runtime logs to docker/k8s stdout/err for easy debugging? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to redirect lgos to stdout & stderr, I think it's better to wrap the init container or change the init container start command.

I prefer change the docker start command


In reply to: 323551280 [](ancestors = 323551280)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by this PR: #3620


In reply to: 323602539 [](ancestors = 323602539,323551280)


if [[ $EXIT_CODE -eq 210 ]]; then
exit 252
fi
Copy link
Member

@yqwang-ms yqwang-ms Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should only be used to revise bairrier code, not for global #Closed

if [[ $EXIT_CODE -eq 10 ]]; then
exit 253
fi

Copy link
Member

@yqwang-ms yqwang-ms Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should only be used to revise port checker code, not for global #Closed


pushd $(dirname "$0") > /dev/null

kubectl create configmap runtime-exit-spec-configuration --from-file=runtime-exit-spec.yaml --dry-run -o yaml | kubectl apply --overwrite=true -f - || exit $?
Copy link
Member

@yqwang-ms yqwang-ms Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubectl create [](start = 0, length = 14)

Why need create? Just apply will create?
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure apply will create, we use create here to generate configmap spec, but not apply to k8s. Please notice the --dry-run option. Then we apply the generated spec.


In reply to: 324500792 [](ancestors = 324500792)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does kubectl apply can use "configmap runtime-exit-spec-configuration --from-file=runtime-exit-spec.yaml" directly?


In reply to: 324954531 [](ancestors = 324954531,324500792)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't find such usage for kubectl apply. Refer to https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#create-a-configmap and https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply. kubectl apply only accept the resource configuration. We can use kubectl create configmap or use kustomization.yaml to create the configmap. kubectl create configmap is more straightforward.


In reply to: 325117384 [](ancestors = 325117384,324954531,324500792)

@yqwang-ms
Copy link
Member

###########################################################################

And change the Owner: to be

Owner: PAI_FC and PAI_RUNTIME


In reply to: 529900585 [](ancestors = 529900585)


Refers to: src/k8s-job-exit-spec/config/k8s-job-exit-spec.yaml:586 in 93f4b96. [](commit_id = 93f4b96, deletion_comment = False)

@Binyang2014 Binyang2014 force-pushed the binyli/runtime_error branch 2 times, most recently from a68a678 to 30cdc51 Compare September 17, 2019 05:29
echo "frameworkbarrier start"
${PAI_INIT_DIR}/frameworkbarrier 2>&1 | tee ${PAI_LOG_DIR}/barrier.log
echo "frameworkbarrier returns $?"
echo "barrier returns $?"

Copy link
Member

@yqwang-ms yqwang-ms Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always be returns 0, so why echo it? #Closed

# priority=1
CHILD_PROCESS="ERROR_SPEC"
cp ${PAI_CONFIG_DIR}/runtime-exit-spec.yaml ${PAI_RUNTIME_DIR}
echo "copy exit spec returns $?"

Copy link
Member

@yqwang-ms yqwang-ms Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always be returns 0, so why echo it? #Closed

Copy link
Member

@yqwang-ms yqwang-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Binyang2014 Binyang2014 merged commit 6b003a1 into master Sep 18, 2019
@Binyang2014 Binyang2014 deleted the binyli/runtime_error branch September 18, 2019 08:39

start-script: start.sh
delete-script: delete.sh
refresh-script: refresh.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Binyang2014 stop-script is missing here.

@hzy46 hzy46 mentioned this pull request Apr 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants