Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. #62076

qingling128 · 2018-04-03T19:07:15Z

What this PR does / why we need it:

Which issue(s) this PR fixes
Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.
Also, Fluentd v0.14 has memory leak issues that made the e2e tests flaky, so this PR downgrades to v0.12.

Special notes for your reviewer:
We never released a version with Fluentd v0.14; the upgrade was made only very recently. So this downgrade is not visible to users.

Release note:

Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.

k8s-ci-robot · 2018-04-03T19:07:17Z

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: helpdesk@rt.linuxfoundation.org

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

qingling128 · 2018-04-03T19:32:12Z

Just signed the CLA.

qingling128 · 2018-04-03T19:52:51Z

CC: @piosz

qingling128 · 2018-04-03T20:28:55Z

Just for reference, the original PR that updated to an image using Fluentd v0.14 and adjusted the parser section of the config file is #59128.

As part of downgrading to v0.12, we had to revert the parser section config change.

qingling128 · 2018-04-03T20:32:48Z

/assign @piosz

qingling128 · 2018-04-04T02:35:05Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

+        # resource. The format is:
+        # 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
+        "logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
+        # Rename the field 'log' to a more generic field 'message'. This way the


For the gke_container resource, we send the log field as textPayload after time/severity/stream are extracted. For every other resource, we send the message field as textPayload if it is the only field.

To ensure the same behavior for k8s_container, we can either rename the log field to message in the configuration, or hardcode an exception in the fluent-plugin-google-cloud gem next to the gke_container resource.

I'm slightly leaning towards handling it via configuration instead of hardcoding it in the gem. The only issue is that when there are additional fields (other than time/severity/stream/log), the behavior might be different. For the gke_container resource, we seem to be dropping all other fields (that feels like a bug to me, doesn't it?). For the k8s_container resource, we will keep the jsonPayload structure and retain the additional fields.

WDYT? @bmoyles0117 @igorpeshansky

For gke_container, the implementation and the configuration on the Kubernetes side were carefully crafted to work together. This, as you observed, was very fragile.
The semantics you propose are simple and straightforward. We can document that the logging agent expects the text payload in a message field, and rename log to message in the configuration.

Sounds good.

+1 to what Igor said.
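
For readers following the thread, the semantics agreed on above can be sketched roughly as follows. This is an illustrative Ruby sketch only, not the code in the fluent-plugin-google-cloud gem: the helper name `to_payload` is hypothetical, and it assumes time/severity/stream have already been extracted from the record.

```ruby
# Hypothetical helper, not part of any real plugin API.
# Assumes time/severity/stream were already removed from the record.
def to_payload(record)
  record = record.dup
  # Rename the Docker-specific 'log' field to the generic 'message' field.
  record['message'] = record.delete('log') if record.key?('log')

  if record.keys == ['message']
    # 'message' is the only field: send it as plain text.
    { 'textPayload' => record['message'] }
  else
    # Additional fields are present: keep the structured payload.
    { 'jsonPayload' => record }
  end
end

p to_payload('log' => 'hello')
p to_payload('log' => 'hello', 'request_id' => 'abc')
```

The second call keeps `request_id` alongside `message`, which is the difference from the gke_container behavior noted in the discussion.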

x13n · 2018-04-04T09:31:32Z

/ok-to-test

x13n · 2018-04-04T11:25:25Z

cluster/addons/fluentd-gcp/fluentd-gcp-ds.yaml

+        args:
+        - /bin/sh
+        - -c
+        - STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd


Can't STACKDRIVER_METADATA_AGENT_URL be set via env?

Also, double-checking - is this the right port?

Changed STACKDRIVER_METADATA_AGENT_URL to be configurable via an env. PTAL.

Nice catch regarding the port. We use 8000 in our testing cluster. The actual metadata agent yaml uses 8799 instead. Fixed.

x13n · 2018-04-04T11:28:42Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

@@ -366,6 +366,11 @@ data:
      </record>
    </filter>

+    # Do not collect fluentd's own logs to avoid infinite loops.
+    <match fluent.**>


Does that mean no logs from fluentd will be exported?

Yes. This is also the behavior of our default packaging.

It's on our roadmap to support exporting Fluentd logs. But it has to be handled carefully.

The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own log, that triggers infinite loops and leads to agent crashes.

Ling, have you confirmed that this section does what it tries to do? As far as I'm aware, fluent logs are emitted to stdout/stderr, making this tag meaningless. If you've confirmed this behavior works as intended, great, but if it doesn't add any value, I would suggest removing it.

Yes, this indeed takes effect. I encountered infinite loops and this change fixed it.

A sample ingested fluentd log looks like below:

{
  insertId: "9ppb4lg2hn1rtn"
  logName: "projects/stackdriver-kubernetes-1337/logs/fluent.info"
  receiveTimestamp: "2018-04-04T01:27:05.638928314Z"
  resource: {
    labels: {
      cluster_name: "stackdriver-metadata-e2e"
      location: "us-central1-a"
      node_name: "gke-stackdriver-metadata-default-pool-c622577e-vlmk"
      project_id: "stackdriver-kubernetes-1337"
    }
    type: "k8s_node"
  }
  textPayload: "following tail of /var/log/containers/metadata-agent-api-availability-test-p469t_metadata-agent-api-availability-test_metadata-agent-api-availability-test-ef878dbc383e0e59b558103f2acfa446c4ca8d8520695460f5764e5a5719b468.log"
  timestamp: "2018-04-04T01:27:04Z"
}

As we can see, Fluentd logs are tagged as fluent.info, fluent.error and fluent.warn.

This is because Fluentd logs are emitted as Fluentd events directly.
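
To illustrate which tags the `<match fluent.**>` block would catch, here is a simplified Ruby stand-in for the pattern. It is not Fluentd's actual matcher, and the helper name `fluentd_own_log?` is hypothetical. It relies on the documented rule that `**` matches zero or more tag parts, and it assumes the match block discards whatever it catches, since the body of the block is not shown in the hunk above.

```ruby
# Simplified stand-in for the `fluent.**` match pattern (hypothetical helper).
# `**` matches zero or more tag parts, so both "fluent" and "fluent.<anything>" match.
def fluentd_own_log?(tag)
  tag == 'fluent' || tag.start_with?('fluent.')
end

%w[fluent.info fluent.warn fluent.error reform.var.log.containers.foo].each do |tag|
  # Assumption: matched events are discarded rather than re-ingested.
  puts "#{tag}: #{fluentd_own_log?(tag) ? 'dropped' : 'kept'}"
end
```

Container log tags such as `reform.var.log.containers.foo` do not match, so only Fluentd's own events are affected.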

qingling128

All feedback addressed.

Grouped fluentd config setups into a setup-fluentd method.
Made metadata_agent_url configurable via env var.
Added an env var to control whether to ingest logs against old or new resources.

The contributor guide seems to suggest not squashing unless reviewers ask for it, so I'll leave these commits separate for now.

qingling128 · 2018-04-04T20:19:16Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap-old.yaml

@@ -0,0 +1,425 @@
+# This ConfigMap is used to ingest logs against old resources like


NOTE: This file is roughly a copy of fluentd-gcp-configmap.yaml. The only difference is whether to set up local_resource_id and enable talking to Metadata Agent. With those, logs are ingested against k8s_container and k8s_node. Without those, logs are ingested against gke_container and gce_instance.

+1 to what you've done here.

qingling128 · 2018-04-04T20:20:57Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

@@ -51,26 +57,18 @@ data:
      pos_file /var/log/gcp-containers.log.pos
      tag reform.*
      read_from_head true
-      format none
+      format multi_format


NOTE: This entire format change is to revert the changes we made in https://github.com/kubernetes/kubernetes/pull/59128/files#diff-dec77c261fefaa453d67b0c26b4b07c2 to adapt to Fluentd v0.14 syntax. As we downgrade back to v0.12 in this PR, the syntax needs to be reverted as well.

qingling128 · 2018-04-04T20:22:43Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

@@ -83,7 +81,19 @@ data:
    <match reform.**>
      @type record_reformer


NOTE: This whole record_reformer section sets up the local_resource_id for k8s_container resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).
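
For illustration, the string operations in the `local_resource_id` expression can be unpacked as in the Ruby sketch below. The helper name `local_resource_id` and the sample file name are hypothetical. The sketch assumes that `tag_suffix[4]` evaluates to the container log file name in the form `<pod_name>_<namespace_name>_<container_name>-<container_id>.log`, as suggested by the sample path earlier in this thread.

```ruby
# Hypothetical helper mirroring the expression in the config; the input is
# assumed to be what tag_suffix[4] evaluates to.
def local_resource_id(file_name)
  base = file_name.rpartition('.')[0]                 # drop the ".log" extension
  pod, namespace, container_and_id = base.split('_')  # the three parts are '_'-separated
  container = container_and_id.rpartition('-')[0]     # drop the trailing "-<container_id>"
  "k8s_container.#{namespace}.#{pod}.#{container}"
end

puts local_resource_id('my-pod-p469t_my-namespace_my-container-ef878dbc.log')
# prints "k8s_container.my-namespace.my-pod-p469t.my-container"
```

Splitting on `_` works here because Kubernetes pod, namespace, and container names cannot contain underscores, while `rpartition('-')` strips only the last hyphen-separated segment, so hyphens inside the container name are preserved.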

qingling128 · 2018-04-04T20:23:18Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

    </match>

+    # Attach local_resource_id for 'k8s_node' monitored resource.


NOTE: This entire record_reformer section sets up the local_resource_id for k8s_node resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

qingling128 · 2018-04-04T20:23:46Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

-        # the necessary resource types when this label is set.
-        "logging.googleapis.com/k8s_compatibility": "true"
-      }
+      # Use Metadata Agent to get monitored resource.


NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

qingling128 · 2018-04-04T20:23:55Z

cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml

-        "logging.googleapis.com/k8s_compatibility": "true"
-      }
+      # Use Metadata Agent to get monitored resource.
+      enable_metadata_agent true


NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

qingling128 · 2018-04-04T20:27:28Z

cluster/addons/fluentd-gcp/fluentd-gcp-ds.yaml

@@ -37,6 +37,10 @@ spec:
          readOnly: true
        - name: config-volume
          mountPath: /etc/google-fluentd/config.d
+        args:


NOTE: STACKDRIVER_METADATA_AGENT_URL is set up as an environment variable regardless of whether we are ingesting logs via old resources (in which case STACKDRIVER_METADATA_AGENT_URL is not needed) or new resources, simply because there seems to be no need to distinguish the two cases.

Thoughts?

I'm ok with setting this unconditionally, however, why aren't we setting this as a normal environment variable? Injecting it at this point is actually problematic as it harms the ability to shut down cleanly.

- env:
  - name: STACKDRIVER_METADATA_AGENT_URL
    value: {{ stackdriver_metadata_agent_url }}

Is what I'm suggesting.

Good point. Changed.

qingling128 · 2018-04-04T21:01:03Z

/retest

MaciekPytel · 2018-04-06T12:32:08Z

@qingling128 Please update release note, so it's clear this is specific to Stackdriver and not some general change in k8s logging (current version is way too specific for the target audience of Kubernetes users).

/approve

qingling128 · 2018-04-06T12:40:35Z

@MaciekPytel - Changed the PR title and release note. Should I squash everything into one commit now?

MaciekPytel · 2018-04-06T12:45:10Z

Oh yes, I completely missed the number of commits. Please squash them.

/hold
So it doesn't merge before squashing.

qingling128 · 2018-04-06T12:49:01Z

@MaciekPytel - Squashed.

kawych · 2018-04-06T13:06:57Z

/lgtm
@x13n is out of office, so approving instead of him as he lgtmed before squashing commits.

k8s-ci-robot · 2018-04-06T13:07:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, MaciekPytel, qingling128, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster/addons/fluentd-gcp/OWNERS~~ [x13n]
~~cluster/gce/OWNERS~~ [MaciekPytel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

qingling128 · 2018-04-06T13:12:34Z

Thanks @kawych!

@MaciekPytel - PTAL.

MaciekPytel · 2018-04-06T13:44:47Z

/hold cancel

k8s-ci-robot · 2018-04-06T15:01:54Z

@qingling128: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-gce-100-performance	4ac1e9a8ee79b787a2513f82e031b29b7893c6f6	link	`/test pull-kubernetes-e2e-gce-100-performance`
pull-kubernetes-kubemark-e2e-gce-big	4ac1e9a8ee79b787a2513f82e031b29b7893c6f6	link	`/test pull-kubernetes-kubemark-e2e-gce-big`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

qingling128 · 2018-04-06T15:04:18Z

/test pull-kubernetes-e2e-kops-aws

k8s-github-robot · 2018-04-06T16:51:32Z

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

qingling128 · 2018-04-06T17:10:26Z

Finally made it into the queue (the tests were somewhat flaky) and merged. A cherrypick PR has been created as well.

StevenACoffman · 2018-04-11T10:11:59Z

@qingling128 Thank you for your work on this. What is the issue tracking the fluentd memory leak so we can know when to try to upgrade again?

…62076-origin-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #62076: Add support to ingest log entries to Stackdriver against new Cherry pick of #62076 on release-1.10. #62076: Add support to ingest log entries to Stackdriver against new

StevenACoffman · 2018-04-12T12:42:01Z

@monotek are you aware of any issue tracking the fluentd memory leak referred to here?

monotek · 2018-04-12T14:26:22Z

No.
Is there a bug report at fluentd about it?

qingling128 · 2018-04-12T19:18:53Z

@StevenACoffman @monotek - I've created an internal ticket to track it. Will report it with Fluentd as well and link it here. (Meant to do it earlier, but got distracted by the cherrypick issues)

qingling128 · 2018-04-12T22:06:36Z

Created fluent/fluentd#1941 to track the memory leak.

StevenACoffman · 2018-04-12T22:59:21Z

Hmmm... I am not seeing that with latest (only running for a few days) and there have been several leak fixes in versions in between:

see ChangeLog

qingling128 · 2018-04-13T18:16:21Z

@StevenACoffman - That sounds promising! We are also setting up soak tests for various Fluentd versions, including the latest (v1.1). Some additional testing work is needed on our side to bump a major Fluentd version. We might not have resources for it this quarter, but it's definitely on our roadmap.

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 3, 2018

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 3, 2018

k8s-ci-robot requested review from MaciekPytel and mwielgus April 3, 2018 19:07

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 3, 2018

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 3, 2018

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 3, 2018

k8s-ci-robot assigned piosz Apr 3, 2018

qingling128 changed the title ~~Update fluentd_gcp_version to 0.2-1.5.30-1-k8s to include downgrading Fluentd to v0.12.~~ Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12) and remove k8s_compatibility labels. Apr 3, 2018

qingling128 commented Apr 4, 2018

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 4, 2018

x13n reviewed Apr 4, 2018

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 4, 2018

qingling128 commented Apr 4, 2018

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 4, 2018

qingling128 changed the title ~~Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12), remove k8s_compatibility labels and let Logging Agent talk to Metadata Agent.~~ Add support to ingest log entries against new "k8s_container" and "k8s_node" resources. Apr 4, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2018

qingling128 changed the title ~~Add support to ingest log entries against new "k8s_container" and "k8s_node" resources.~~ Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. Apr 6, 2018

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2018

Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. (cbec62a)

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2018

k8s-ci-robot assigned kawych Apr 6, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2018

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2018

k8s-github-robot merged commit 4009cb3 into kubernetes:master Apr 6, 2018

qingling128 mentioned this pull request Apr 6, 2018

Automated cherry pick of #62076: Add support to ingest log entries to Stackdriver against new #62201

Merged
