This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Expose Task History #62

Merged
merged 4 commits into from
Oct 19, 2020
21 changes: 20 additions & 1 deletion Gopkg.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

14 changes: 8 additions & 6 deletions doc/user-manual.md
@@ -3,7 +3,7 @@
## <a name="Index">Index</a>
- [Framework Interop](#FrameworkInterop)
- [Framework ExecutionType](#FrameworkExecutionType)
- [Container EnvironmentVariable](#ContainerEnvironmentVariable)
- [Predefined Container EnvironmentVariable](#PredefinedContainerEnvironmentVariable)
- [Pod Failure Classification](#PodFailureClassification)
- [Predefined CompletionCode](#PredefinedCompletionCode)
- [CompletionStatus](#CompletionStatus)
@@ -475,8 +475,10 @@ spec:
3. [Get Framework](#GET_Framework), and archive it into a DataBase first.
4. [Delete Framework](#DELETE_Framework), then the Framework will be deleted.

## <a name="ContainerEnvironmentVariable">Container EnvironmentVariable</a>
[Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)
## <a name="PredefinedContainerEnvironmentVariable">Predefined Container EnvironmentVariable</a>
[Predefined Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)

[Framework Example](../example/framework/basic/batchstatefulfailed.yaml)

## <a name="PodFailureClassification">Pod Failure Classification</a>
You can specify how to classify and summarize Pod failures by the [PodFailureSpec](../pkg/apis/frameworkcontroller/v1/config.go).
@@ -842,7 +844,7 @@ Besides these general [Framework ConsistencyGuarantees](#ConsistencyGuarantees),
To safely run large scale Framework, i.e. the total task number in a single Framework is greater than 300, you just need to enable the [LargeFrameworkCompression](../pkg/apis/frameworkcontroller/v1/config.go). However, you may also need to decompress the Framework by yourself.

## <a name="FrameworkPodHistory">Framework and Pod History</a>
By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework, Task and Pod history snapshots even if it was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
### <a name="FrameworkStateMachine">Framework State Machine</a>
@@ -894,7 +896,7 @@ The default behavior is to achieve all the [ConsistencyGuarantees](#ConsistencyG

For example, [draining the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before deleting it is acceptable.

*The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*
*The Task running instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*

*To prevent the Pod from being stuck in deletion forever, for example when its Node is down permanently, use the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods), either manually or through your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*

@@ -911,7 +913,7 @@ The default behavior is to achieve all the [ConsistencyGuarantees](#ConsistencyG

4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods.

*The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*
*The Framework running instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*

### <a name="FrameworkAvailability">Framework Availability</a>
According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability).
35 changes: 22 additions & 13 deletions example/framework/basic/batchstatefulfailed.yaml
@@ -27,21 +27,30 @@ spec:
- name: ubuntu
image: ubuntu:trusty
# To locate a specific Task during its whole lifecycle regardless of
# any retry:
# any retry and rescale:
# Consistent Identity:
# PodNamespace = {FrameworkNamespace}
# PodName = {FrameworkName}-{TaskRoleName}-{TaskIndex}
# PodNamespace = {FrameworkNamespace}
# PodName = {FrameworkName}-{TaskRoleName}-{TaskIndex}
# Consistent Environment Variable Value:
# ${FC_FRAMEWORK_NAMESPACE},
# ${FC_FRAMEWORK_NAME}, ${FC_TASKROLE_NAME}, ${FC_TASK_INDEX},
# ${FC_CONFIGMAP_NAME}, ${FC_POD_NAME}
# ${FC_FRAMEWORK_NAMESPACE}
# ${FC_FRAMEWORK_NAME}
# ${FC_TASKROLE_NAME}
# ${FC_TASK_INDEX}
#
# To locate a specific execution attempt of a specific Task:
# Attempt Specific Environment Variable Value:
# ${FC_FRAMEWORK_ATTEMPT_ID}, ${FC_TASK_ATTEMPT_ID}
# To locate a specific Task instance, in case the Task is deleted then
# added by rescale with a different Task instance:
# Environment Variable Value:
# ${FC_TASK_UID}
#
# To locate a specific execution attempt instance of a specific Task:
# Attempt Instance Specific Environment Variable Value:
# ${FC_FRAMEWORK_ATTEMPT_INSTANCE_UID}, ${FC_CONFIGMAP_UID}
# ${FC_TASK_ATTEMPT_INSTANCE_UID}, ${FC_POD_UID}
# To locate a specific execution attempt of a specific Task instance:
# Environment Variable Value:
# ${FC_TASK_UID}
# ${FC_TASK_ATTEMPT_ID}
#
# To locate a specific execution attempt instance of a specific Task
# instance, in case the attempt instance, i.e. the Pod instance is
# created but not observed by FrameworkController, then it is deleted
# and created later with a different attempt instance:
# Environment Variable Value:
# ${FC_TASK_ATTEMPT_INSTANCE_UID}
command: ["sh", "-c", "printenv && sleep 60 && exit 1"]
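
Reviewer note: the comment block in this example groups the `FC_*` variables by the scope they identify. The Go sketch below shows one way a program running in the container might read them and build keys for each scope. The `TaskIdentity` type, `LoadTaskIdentity`, and the `/`-joined key format are hypothetical names and choices for illustration; only the environment variable names and the `{FrameworkName}-{TaskRoleName}-{TaskIndex}` naming rule come from the diff.

```go
package main

import (
	"fmt"
	"os"
)

// TaskIdentity is a hypothetical grouping of the predefined FC_* variables.
type TaskIdentity struct {
	// Stable across any retry and rescale.
	Namespace, Framework, TaskRole, TaskIndex string
	// Changes when the Task is deleted and then added again by rescale.
	TaskUID string
	// Identifies one execution attempt of a Task instance.
	TaskAttemptID string
	// Identifies one attempt instance, i.e. one Pod instance.
	TaskAttemptInstanceUID string
}

// LoadTaskIdentity reads the variables through getenv so that callers can
// pass os.Getenv in a container or a map lookup elsewhere.
func LoadTaskIdentity(getenv func(string) string) TaskIdentity {
	return TaskIdentity{
		Namespace:              getenv("FC_FRAMEWORK_NAMESPACE"),
		Framework:              getenv("FC_FRAMEWORK_NAME"),
		TaskRole:               getenv("FC_TASKROLE_NAME"),
		TaskIndex:              getenv("FC_TASK_INDEX"),
		TaskUID:                getenv("FC_TASK_UID"),
		TaskAttemptID:          getenv("FC_TASK_ATTEMPT_ID"),
		TaskAttemptInstanceUID: getenv("FC_TASK_ATTEMPT_INSTANCE_UID"),
	}
}

// PodName rebuilds {FrameworkName}-{TaskRoleName}-{TaskIndex}.
func (id TaskIdentity) PodName() string {
	return id.Framework + "-" + id.TaskRole + "-" + id.TaskIndex
}

// AttemptKey combines the Task instance and attempt ID (format is an assumption).
func (id TaskIdentity) AttemptKey() string {
	return id.TaskUID + "/" + id.TaskAttemptID
}

func main() {
	id := LoadTaskIdentity(os.Getenv)
	fmt.Printf("pod=%s/%s attempt=%s instance=%s\n",
		id.Namespace, id.PodName(), id.AttemptKey(), id.TaskAttemptInstanceUID)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the helper usable outside a Pod.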
22 changes: 13 additions & 9 deletions pkg/apis/frameworkcontroller/v1/config.go
@@ -117,7 +117,7 @@ type Config struct {
// analysis, etc.
// Notes:
// 1. The snapshot is logged to stderr and can be extracted by the regular
// expression ": ObjectSnapshot: (.+)".
// expression ": ObjectSnapshot: (.+)", see LogMarkerObjectSnapshot.
// 2. To determine the type of the snapshot, using object.apiVersion and
// object.kind.
// 3. The same snapshot may be logged more than once in some rare cases, so
@@ -149,16 +149,20 @@ type Config struct {

type LogObjectSnapshot struct {
Framework LogFrameworkSnapshot `yaml:"framework"`
Task LogTaskSnapshot `yaml:"task"`
Pod LogPodSnapshot `yaml:"pod"`
}

type LogFrameworkSnapshot struct {
OnTaskRetry *bool `yaml:"onTaskRetry"`
OnFrameworkRetry *bool `yaml:"onFrameworkRetry"`
OnFrameworkRescale *bool `yaml:"onFrameworkRescale"`
OnFrameworkDeletion *bool `yaml:"onFrameworkDeletion"`
}

type LogTaskSnapshot struct {
OnTaskRetry *bool `yaml:"onTaskRetry"`
OnTaskDeletion *bool `yaml:"onTaskDeletion"`
}

type LogPodSnapshot struct {
OnPodDeletion *bool `yaml:"onPodDeletion"`
}
@@ -254,18 +258,18 @@ func NewConfig() *Config {
if c.FrameworkMaxRetryDelaySecForTransientConflictFailed == nil {
c.FrameworkMaxRetryDelaySecForTransientConflictFailed = common.PtrInt64(15 * 60)
}
if c.LogObjectSnapshot.Framework.OnTaskRetry == nil {
c.LogObjectSnapshot.Framework.OnTaskRetry = common.PtrBool(true)
}
if c.LogObjectSnapshot.Framework.OnFrameworkRetry == nil {
c.LogObjectSnapshot.Framework.OnFrameworkRetry = common.PtrBool(true)
}
if c.LogObjectSnapshot.Framework.OnFrameworkRescale == nil {
c.LogObjectSnapshot.Framework.OnFrameworkRescale = common.PtrBool(true)
}
if c.LogObjectSnapshot.Framework.OnFrameworkDeletion == nil {
c.LogObjectSnapshot.Framework.OnFrameworkDeletion = common.PtrBool(true)
}
if c.LogObjectSnapshot.Task.OnTaskRetry == nil {
c.LogObjectSnapshot.Task.OnTaskRetry = common.PtrBool(true)
}
if c.LogObjectSnapshot.Task.OnTaskDeletion == nil {
c.LogObjectSnapshot.Task.OnTaskDeletion = common.PtrBool(true)
}
if c.LogObjectSnapshot.Pod.OnPodDeletion == nil {
c.LogObjectSnapshot.Pod.OnPodDeletion = common.PtrBool(true)
}
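
Reviewer note: the snapshot switches are `*bool` rather than `bool`, which lets `NewConfig` tell an option left unset (nil, defaulted to true) from one explicitly set to false. The standalone sketch below reproduces that pattern with a local copy of `LogTaskSnapshot`. The `ptrBool` and `applyDefaults` helpers are stand-ins written for this note, not the repository's `common.PtrBool` or `NewConfig`.

```go
package main

import "fmt"

// LogTaskSnapshot is a local copy of the struct added in this diff,
// without the yaml tags so that the sketch has no external dependencies.
type LogTaskSnapshot struct {
	OnTaskRetry    *bool
	OnTaskDeletion *bool
}

// ptrBool is a stand-in for a pointer helper such as common.PtrBool.
func ptrBool(b bool) *bool { return &b }

// applyDefaults fills only the fields the user left unset (nil).
func applyDefaults(s *LogTaskSnapshot) {
	if s.OnTaskRetry == nil {
		s.OnTaskRetry = ptrBool(true)
	}
	if s.OnTaskDeletion == nil {
		s.OnTaskDeletion = ptrBool(true)
	}
}

func main() {
	// The user disabled OnTaskRetry and said nothing about OnTaskDeletion.
	s := LogTaskSnapshot{OnTaskRetry: ptrBool(false)}
	applyDefaults(&s)
	fmt.Println(*s.OnTaskRetry, *s.OnTaskDeletion) // prints "false true"
}
```

With plain `bool` fields, the explicit `false` would be indistinguishable from the zero value and would be overwritten by the default.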
20 changes: 20 additions & 0 deletions pkg/apis/frameworkcontroller/v1/constants.go
@@ -38,6 +38,7 @@ const (
FrameworkPlural = "frameworks"
FrameworkCRDName = FrameworkPlural + "." + GroupName
FrameworkKind = "Framework"
TaskKind = "Task"
ConfigMapKind = "ConfigMap"
PodKind = "Pod"
ObjectUIDFieldPath = "metadata.uid"
@@ -56,9 +57,12 @@ const (
AnnotationKeyConfigMapName = "FC_CONFIGMAP_NAME"
AnnotationKeyPodName = "FC_POD_NAME"

AnnotationKeyFrameworkUID = "FC_FRAMEWORK_UID"
AnnotationKeyFrameworkAttemptID = "FC_FRAMEWORK_ATTEMPT_ID"
AnnotationKeyFrameworkAttemptInstanceUID = "FC_FRAMEWORK_ATTEMPT_INSTANCE_UID"
AnnotationKeyConfigMapUID = "FC_CONFIGMAP_UID"
AnnotationKeyTaskRoleUID = "FC_TASKROLE_UID"
AnnotationKeyTaskUID = "FC_TASK_UID"
AnnotationKeyTaskAttemptID = "FC_TASK_ATTEMPT_ID"

// Predefined Labels
@@ -79,9 +83,12 @@ const (
EnvNameConfigMapName = AnnotationKeyConfigMapName
EnvNamePodName = AnnotationKeyPodName

EnvNameFrameworkUID = AnnotationKeyFrameworkUID
EnvNameFrameworkAttemptID = AnnotationKeyFrameworkAttemptID
EnvNameFrameworkAttemptInstanceUID = AnnotationKeyFrameworkAttemptInstanceUID
EnvNameConfigMapUID = AnnotationKeyConfigMapUID
EnvNameTaskRoleUID = AnnotationKeyTaskRoleUID
EnvNameTaskUID = AnnotationKeyTaskUID
EnvNameTaskAttemptID = AnnotationKeyTaskAttemptID
EnvNameTaskAttemptInstanceUID = "FC_TASK_ATTEMPT_INSTANCE_UID"
EnvNamePodUID = "FC_POD_UID"
@@ -98,9 +105,22 @@ const (
PlaceholderTaskIndex = AnnotationKeyTaskIndex
PlaceholderConfigMapName = AnnotationKeyConfigMapName
PlaceholderPodName = AnnotationKeyPodName

// For LogObjectSnapshot
// All snapshots are logged in format:
// {AnyLogMessage}{ObjectSnapshotTrigger}{LogMarkerObjectSnapshot}{JsonObjectSnapshot}
LogMarkerObjectSnapshot = ": ObjectSnapshot: "
LogMarkerOnFrameworkRetry ObjectSnapshotTrigger = ": OnFrameworkRetry"
LogMarkerOnFrameworkDeletion ObjectSnapshotTrigger = ": OnFrameworkDeletion"
LogMarkerOnTaskRetry ObjectSnapshotTrigger = ": OnTaskRetry"
LogMarkerOnTaskDeletion ObjectSnapshotTrigger = ": OnTaskDeletion"
LogMarkerOnPodDeletion ObjectSnapshotTrigger = ": OnPodDeletion"
)

type ObjectSnapshotTrigger string

var FrameworkGroupVersionKind = SchemeGroupVersion.WithKind(FrameworkKind)
var TaskGroupVersionKind = SchemeGroupVersion.WithKind(TaskKind)
var ConfigMapGroupVersionKind = core.SchemeGroupVersion.WithKind(ConfigMapKind)
var PodGroupVersionKind = core.SchemeGroupVersion.WithKind(PodKind)
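
Reviewer note: the log format `{AnyLogMessage}{ObjectSnapshotTrigger}{LogMarkerObjectSnapshot}{JsonObjectSnapshot}` and the regular expression `: ObjectSnapshot: (.+)` suggest how an external collector could pull snapshots out of stderr. The sketch below is one possible consumer, not code from this repository. It assumes that `apiVersion` and `kind` are top-level fields of the JSON payload and that each snapshot sits on a single line. `Snapshot` and `ParseSnapshot` are hypothetical names, and the sample log line in `main` is invented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// snapshotRe is the regular expression quoted in the config.go comment.
var snapshotRe = regexp.MustCompile(`: ObjectSnapshot: (.+)`)

// triggerMarkers mirrors the ObjectSnapshotTrigger constants in this diff.
var triggerMarkers = []string{
	": OnFrameworkRetry",
	": OnFrameworkDeletion",
	": OnTaskRetry",
	": OnTaskDeletion",
	": OnPodDeletion",
}

// Snapshot is a hypothetical container for one extracted log entry.
type Snapshot struct {
	Trigger    string // e.g. "OnTaskRetry"; empty if no known trigger precedes the marker
	APIVersion string
	Kind       string
	Raw        json.RawMessage
}

// ParseSnapshot returns (nil, nil) when the line carries no snapshot.
func ParseSnapshot(line string) (*Snapshot, error) {
	loc := snapshotRe.FindStringSubmatchIndex(line)
	if loc == nil {
		return nil, nil
	}
	// Text before the marker, and the JSON captured after it.
	prefix, payload := line[:loc[0]], line[loc[2]:loc[3]]

	var meta struct {
		APIVersion string `json:"apiVersion"`
		Kind       string `json:"kind"`
	}
	if err := json.Unmarshal([]byte(payload), &meta); err != nil {
		return nil, fmt.Errorf("decode snapshot: %w", err)
	}

	s := &Snapshot{APIVersion: meta.APIVersion, Kind: meta.Kind, Raw: json.RawMessage(payload)}
	// The trigger, if present, is the text immediately before the marker.
	for _, m := range triggerMarkers {
		if strings.HasSuffix(prefix, m) {
			s.Trigger = strings.TrimPrefix(m, ": ")
			break
		}
	}
	return s, nil
}

func main() {
	// Invented sample line; real log prefixes will differ.
	line := `demo log message: OnTaskRetry: ObjectSnapshot: {"apiVersion":"example/v1","kind":"Task"}`
	s, err := ParseSnapshot(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Trigger, s.Kind)
}
```

Because the config.go notes say the same snapshot may be logged more than once, a real consumer would also need to deduplicate entries, which this sketch leaves out.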
