Enable insertId generation, and update Stackdriver Logging Agent image to 0.5-1.5.36-1-k8s. #68920
Conversation
Hi, could someone help with adding an ok-to-test label?
/ok-to-test
/lgtm
/assign @MaciekPytel
Just added some tolerations for Metadata Agent to fix the issue where Metadata Agents are not scheduled. PTAL @x13n @MaciekPytel
@@ -56,6 +56,13 @@ spec:
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
tolerations:
- key: "node.alpha.kubernetes.io/ismaster"
Below you are adding a toleration for all keys with NoSchedule effect, so this one doesn't seem to be doing anything.
/lgtm
- operator: "Exists"
  effect: "NoExecute"
- operator: "Exists"
  effect: "NoSchedule"
Do we really want such strong tolerations? This will make it able to schedule anywhere, including, for example, a node that is currently being deleted or a super-expensive node with a GPU.
This is what we currently have for fluentd and they should really go together. If this is more restricted, then fluentd should be more restricted, too.
Ah, so this is meant to run on every node? In this case it makes much more sense.
What about metadata-agent-cluster-level below, though? If it's one-per-cluster, then presumably it should schedule on a node without taints, no?
Good point. Yes, I think the cluster level one should respect taints.
We do want all cluster-level Metadata Agents to be scheduled, though. Otherwise we lose insight into the metadata. Disregarding the taints is intentional. Note that this will only apply to customers who enable the Metadata Agent addon.
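To make the trade-off being discussed concrete, a hypothetical sketch (names and images are placeholders, not the manifests in this PR): a per-node agent keeps the blanket tolerations so it lands on every node, while a cluster-level agent that should respect taints simply omits them.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent          # illustrative name
spec:
  selector:
    matchLabels: {app: example-node-agent}
  template:
    metadata:
      labels: {app: example-node-agent}
    spec:
      # Blanket tolerations: schedules onto every node, taints or not.
      tolerations:
      - operator: "Exists"
        effect: "NoSchedule"
      - operator: "Exists"
        effect: "NoExecute"
      containers:
      - name: agent
        image: registry.example.com/agent:latest   # placeholder image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cluster-agent       # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels: {app: example-cluster-agent}
  template:
    metadata:
      labels: {app: example-cluster-agent}
    spec:
      # No tolerations: node taints are respected by the scheduler.
      containers:
      - name: agent
        image: registry.example.com/agent:latest   # placeholder image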
@x13n - Yeah, Heapster does use CriticalAddonsOnly:
kubernetes/cluster/addons/cluster-monitoring/stackdriver/heapster-controller.yaml
Line 126 in 2e0e168
- key: "CriticalAddonsOnly"
I read the guaranteed-scheduling-critical-addon-pods doc to understand a bit more about what it does, and it sounds like CriticalAddonsOnly will be deprecated soon. Instead, it recommends priorityClass.
And I did find a priorityClassName spec defined for Heapster:
kubernetes/cluster/addons/cluster-monitoring/stackdriver/heapster-controller.yaml
Line 48 in 2e0e168
priorityClassName: system-cluster-critical
We probably should just use priorityClassName: system-cluster-critical for Logging and Metadata Agents then?
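Roughly, the switch would look like this in a pod spec (an illustrative fragment, not the exact manifests touched here):

spec:
  # Older approach, expected to be deprecated: the rescheduler's
  # CriticalAddonsOnly taint/toleration mechanism.
  # tolerations:
  # - key: "CriticalAddonsOnly"
  #   operator: "Exists"
  #
  # Recommended approach: mark the pod with a built-in priority class.
  priorityClassName: system-cluster-critical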
@x13n @MaciekPytel Friendly ping.
https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ says this feature is in beta starting from 1.11. You can add it, but don't cherry-pick it to the 1.10 branch later.
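For context, a sketch of the PriorityClass API that is beta from 1.11 (a made-up custom class for illustration; the built-in system-cluster-critical and system-node-critical classes already exist and don't need to be created):

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: example-high-priority       # illustrative name
value: 1000000                      # larger value = higher scheduling priority
globalDefault: false
description: "Example class for illustration only."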
Cool. Updated to use priorityClassName: system-cluster-critical instead. PTAL
PTAL
/retest
@@ -28,6 +28,7 @@ spec:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
serviceAccountName: metadata-agent
priorityClassName: system-cluster-critical
This one is per-node, so it should probably be system-node-critical (which is what fluentd-gcp uses already). @bsalamat can you help with getting this right?
Changed to system-node-critical.
@bsalamat - Does this look right to you?
Yes, as @x13n said, system-node-critical is the right one to use here.
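In short, the convention the thread converged on (an illustrative fragment with placeholder comments, not a full manifest): per-node DaemonSets get system-node-critical, while cluster-level singletons get system-cluster-critical.

# Per-node agents (DaemonSets, e.g. fluentd-gcp, node-level metadata-agent):
spec:
  priorityClassName: system-node-critical
---
# Cluster-level agents (single-replica Deployments):
spec:
  priorityClassName: system-cluster-critical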
Thanks for confirming!
… 0.5-1.5.36-1-k8s and add priorityClassName for Metadata Agent.
/retest
/lgtm
Thanks for lgtm. @MaciekPytel Does this look good to you too?
Btw, I understand this change as correcting the existing behavior, so:
/approve
On the other hand, we may need such tolerations for the per-node metadata agent. Without them the agent won't schedule on a node with a GPU or a node that has been explicitly dedicated to a specific job by the user. If we need the agent on every node, then it probably needs the tolerations. I'd suggest adding back the tolerations to the per-node metadata agent if we do need it running on every node.
/hold
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MaciekPytel, qingling128, x13n
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@MaciekPytel According to the documentation, critical pods should be marked with the appropriate priorityClassName.
@x13n I understand this part, but AFAIK priorityClass just allows the pod to preempt a lower priority pod if there is no space in the cluster. It doesn't allow bypassing other scheduling constraints (such as taints).
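Put differently (an illustrative fragment, not this PR's final manifest), the two settings solve different problems, so a per-node agent that has to run on every node ends up carrying both:

spec:
  # Preemption: lets the pod evict lower-priority pods when a node is full.
  priorityClassName: system-node-critical
  # Taints: lets the scheduler place the pod on tainted nodes
  # (e.g. dedicated or GPU nodes); priority alone does not do this.
  tolerations:
  - operator: "Exists"
    effect: "NoSchedule"
  - operator: "Exists"
    effect: "NoExecute"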
For Metadata Agent, we expect the behavior to be consistent between the cluster-level agent and the node-level agent. The current implementation sounds like what we want.
/hold cancel
Created cherry-pick PRs to release branches:
…68920-upstream-release-1.12 Automated cherry pick of #68920: Enable insertId generation, update Stackdriver Logging Agent
…68920-upstream-release-1.11 Automated cherry pick of #68920: Enable insertId generation, update Stackdriver Logging Agent
What this PR does / why we need it:
Enable insertId generation, and update Stackdriver Logging Agent image to 0.5-1.5.36-1-k8s. This helps reduce log duplication and guarantee log order.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
Release note: