Increase consumer group test timeout #187
base: master
Conversation
In my opinion, it's slower than it used to be. If you re-run the tests they eventually pass, but it's still strange. I suspect something internal to GitHub's CI changed, but I need to review things further.
I started to observe this behavior after merging #184, but I can't imagine anything in there being causally related. Maybe I should rethink how brokers are spun up for each CI test.
The test failed again after increasing the timeout. This time I reverted …
Hmm, seems like the revert didn't fix it. Thanks for checking, though. I'm baffled that this is now an issue.
I have a suspicion that the test failure might be related to the CPU time available to the GitHub workflow runners. So I started an Ubuntu 24.04 VM on GCP using machine type n2d-standard-4 (4 vCPU, 2 cores, 16 GB RAM), and all tests pass (master branch). But if I limit the CPU to 0.1 core (using cgroups), the following tests fail.
I am still investigating and don't have enough evidence at the moment. I have also started testing on e2-micro (0.25–2 vCPU, 1 shared core, 1 GB RAM) and will update the results here later. (A rough sketch of the cgroup-based CPU cap follows the run list below.)
- n2d-standard-4 limited to 1 core
- n2d-standard-4 limited to 0.5 core: passed twice
- e2-micro, 1st run
- e2-micro, 2nd run
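A minimal sketch of this kind of CPU cap, assuming cgroup v2, root privileges, and the cpu controller enabled in the parent cgroup; the commenter's exact setup is not shown, so the cgroup name, quota values, and tox invocation below are illustrative:

```python
# Cap CPU for the test run by placing this process (and its children) in a
# cgroup with a ~0.1-core quota, then launching the suite.
import os
import subprocess

CGROUP = "/sys/fs/cgroup/kafka-ci"  # hypothetical cgroup name

os.makedirs(CGROUP, exist_ok=True)

# cpu.max takes "<quota_us> <period_us>": 10 ms of CPU per 100 ms period ≈ 0.1 core.
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write("10000 100000")

# Move the current process into the cgroup so the child inherits the limit.
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))

subprocess.run(["tox"], check=False)  # run the suite under the CPU cap
```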
Most of them raised Result: Inconclusive. I'm not familiar with tox, but is there a way to run a single test? I tried something like …
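(For reference, not from the thread: when a tox environment forwards `{posargs}` to pytest, a single test can usually be selected with something like `tox -e py312 -- test/test_consumer_group.py::test_group`; the environment name here is illustrative.)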
I am using this PR as a testing ground, but I have submitted #192, which I assume will fix part of the problem.
This PR still fails because …
I haven't figured out why Java dies in … I guess the GitHub runner may not have enough memory, and I can reproduce the slowness on resource-constrained VMs when there are too many Kafka instances running.
I agree. As a solution, my goal has been to run one Kafka instance per test in order to conserve memory. I was also planning to tinker with concurrency groups (https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency) to improve this.
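(For context, not from the thread: GitHub Actions lets a workflow or job declare a `concurrency` group, and only one run or job sharing the same `group` key executes at a time, with others queued; something like `group: kafka-integration-${{ github.ref }}` is one illustrative key. That would keep multiple broker-heavy jobs from running concurrently.)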
This reverts commit f76b6d4.
Force-pushed from 7d8bfac to 1608037.
^ I cannot reproduce the timeout in my environment (Kafka 0.8.2.2, Python 3.12, after test/test_partitioner.py, before test/test_producer.py). I have updated this branch to trigger the test to find out... |
Thank you for your meticulous investigation; I really do appreciate it. It's been troubling me why this has become an issue over the past month. Possibly Microsoft scaling down runner resources as a cost-cutting measure?
No worries. I am not sure. On paper, public repo runners have plenty of resources, but I am not sure what's behind the scenes, and I have never tried to benchmark it... Note: …
Test test/test_consumer_group.py::test_group failed. Increase the timeout to find out if it is just slow or a real failure.
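A minimal sketch of the kind of change described, assuming a polling-style wait in the test; the real code lives in test/test_consumer_group.py::test_group, and the function name and values below are hypothetical, not the repository's code:

```python
# Hypothetical helper illustrating the shape of the change: wait longer for the
# consumer group to stabilize before declaring the test a failure.
import time

def wait_for_stable_group(all_consumers_assigned, timeout=60):  # e.g. raised from 30
    """Poll until every consumer reports a partition assignment, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all_consumers_assigned():
            return
        time.sleep(1)
    raise AssertionError(f"consumer group did not stabilize within {timeout}s")
```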