Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-v1.10] SRVKE-958: Cache init offsets results #817

Conversation

openshift-cherrypick-robot

This is an automated cherry-pick of #815

/assign pierDipi

When there is high load and multiple consumer group schedule calls,
we get many `dial tcp 10.130.4.8:9092: i/o timeout` errors when
trying to connect to Kafka.
This leads to increased "time to readiness" for consumer groups.

The downside of caching is that, in the case, partitions
increase while the result is cached we won't initialize
the offsets of the new partitions.

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@openshift-ci-robot
Copy link

@openshift-cherrypick-robot: Could not make automatic cherrypick of Jira Issue SRVKE-958 for this PR as the target version is not set for this branch in the jira plugin config. Running refresh:
/jira refresh

In response to this:

This is an automated cherry-pick of #815

/assign pierDipi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

@openshift-ci-robot: This pull request references SRVKE-958 which is a valid jira issue.

In response to this:

@openshift-cherrypick-robot: Could not make automatic cherrypick of Jira Issue SRVKE-958 for this PR as the target version is not set for this branch in the jira plugin config. Running refresh:
/jira refresh

In response to this:

This is an automated cherry-pick of #815

/assign pierDipi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from matzew and pierDipi September 7, 2023 07:31
@matzew
Copy link
Member

matzew commented Sep 7, 2023

/test 410-test-reconciler-aws-ocp-410

1 similar comment
@pierDipi
Copy link
Member

pierDipi commented Sep 7, 2023

/test 410-test-reconciler-aws-ocp-410

Copy link
Member

@pierDipi pierDipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Sep 7, 2023
@openshift-ci
Copy link

openshift-ci bot commented Sep 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-cherrypick-robot, pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Sep 7, 2023
@openshift-merge-robot openshift-merge-robot merged commit 220d898 into openshift-knative:release-v1.10 Sep 7, 2023
openshift-merge-robot pushed a commit that referenced this pull request Sep 20, 2023
* Make SecretSpec field of consumers Auth omitempty (#780)

* Expose init offset and schedule metrics for ConsumerGroup reconciler (#790) (#791)

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Fix channel finalizer logic (knative-extensions#3295) (#795)

Signed-off-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* [release-v1.10] SRVKE-958: Cache init offsets results (#817)

* Cache init offsets results

When there is high load and multiple consumer group schedule calls,
we get many `dial tcp 10.130.4.8:9092: i/o timeout` errors when
trying to connect to Kafka.
This leads to increased "time to readiness" for consumer groups.

The downside of caching is that, in the case, partitions
increase while the result is cached we won't initialize
the offsets of the new partitions.

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Add autoscaler leader log patch

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>

---------

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Scheduler handle overcommitted pods (#820)

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Set consumer and consumergroups finalizers when creating them (#823)

It is possible that a delete consumer or consumergroup might
be reconciled and never finalized when it is deleted before
the finalizer is set.
This happens because the Knative generated reconciler uses
patch (as opposed to using update) for setting the finalizer
and patch doesn't have any optimistic concurrency controls.

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Clean up reserved from resources that have been scheduled (#830)

In a recent testing run, we've noticed we have have a scheduled
`ConsumerGroup` [1] (see placements) being considered having
reserved replicas in a different pod [2].

That makes the scheduler think that there is no space but the
autoscaler says we have enough space to hold every virtual
replica.

[1]
```
$ k describe consumergroups -n ks-multi-ksvc-0 c9ee3490-5b4b-4d11-87af-8cb2219d9fe3
Name:         c9ee3490-5b4b-4d11-87af-8cb2219d9fe3
Namespace:    ks-multi-ksvc-0
...
Status:
  Conditions:
    Last Transition Time:  2023-09-06T19:58:27Z
    Reason:                Autoscaler is disabled
    Status:                True
    Type:                  Autoscaler
    Last Transition Time:  2023-09-06T21:41:13Z
    Status:                True
    Type:                  Consumers
    Last Transition Time:  2023-09-06T19:58:27Z
    Status:                True
    Type:                  ConsumersScheduled
    Last Transition Time:  2023-09-06T21:41:13Z
    Status:                True
    Type:                  Ready
  Observed Generation:     1
  Placements:
    Pod Name:      kafka-source-dispatcher-6
    Vreplicas:     4
    Pod Name:      kafka-source-dispatcher-7
    Vreplicas:     4
  Replicas:        8
  Subscriber Uri:  http://receiver5-2.ks-multi-ksvc-0.svc.cluster.local
Events:            <none>
```

[2]
```
    "ks-multi-ksvc-0/c9ee3490-5b4b-4d11-87af-8cb2219d9fe3": {
      "kafka-source-dispatcher-3": 8
    },
```

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>

* Ignore unknown fields in data plane contract (knative-extensions#3335) (#828)

Signed-off-by: Calum Murray <cmurray@redhat.com>

---------

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: Martin Gencur <mgencur@redhat.com>
Co-authored-by: Matthias Wessendorf <mwessend@redhat.com>
Co-authored-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: OpenShift Cherrypick Robot <openshift-cherrypick-robot@redhat.com>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/eventing-kafka-broker that referenced this pull request Jan 5, 2024
…s#3354)

When there is high load and multiple consumer group schedule calls,
we get many `dial tcp 10.130.4.8:9092: i/o timeout` errors when
trying to connect to Kafka.
This leads to increased "time to readiness" for consumer groups.

The downside of caching is that, in the case, partitions
increase while the result is cached we won't initialize
the offsets of the new partitions.

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants