Skip to content

Conversation

@C0urante
Copy link
Contributor

Implements embedded end-to-end integration tests for KIP-618, and brings together previously-decoupled logic from upstream PRs.

Relies on changes from:

@C0urante
Copy link
Contributor Author

Converting to draft until upstream PRs are reviewed.

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 2 times, most recently from acb8fb3 to 3d65e79 Compare March 3, 2022 17:50
@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 2 times, most recently from 9e017b8 to 8756ee2 Compare June 13, 2022 00:02
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this line to hide the newly-introduced zombie fencing API from the OpenAPI spec that we started generating with #12067.

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 3 times, most recently from c520f26 to 26c6879 Compare June 14, 2022 13:17
@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 3 times, most recently from 0b0d4c2 to a751225 Compare June 18, 2022 19:25
@C0urante
Copy link
Contributor Author

Given that all merge conflicts have been resolved and #11781 has already been approved, marking this as ready for review.

@C0urante C0urante marked this pull request as ready for review June 18, 2022 19:31
@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 2 times, most recently from 4efd374 to d322e99 Compare June 21, 2022 12:55
Copy link
Contributor Author

@C0urante C0urante Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this check to clean up the flood of ERROR-level log messages that occurs when the task's producer is closed while there are still in-flight messages.

This issue was not specific to these integration tests or to KIP-618, but it clogged up the logs for these tests badly enough that a small tweak in the code base to address it seemed warranted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was way too noisy at INFO-level.

@C0urante
Copy link
Contributor Author

@showuon @tombentley @mimaison anybody got time for this one? Just this and #11783 left before everything for KIP-618 is merged! 🎉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the missing Javadoc on this one!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, fair. Added a few paragraphs; hope it's enough

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really explain why this second call is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, updated to something less snarky.

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch from 8ed7592 to 2f4b1cc Compare June 28, 2022 13:27
Copy link
Member

@mimaison mimaison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @C0urante for the PR. I left a few comments. My main concern is about checking the exact equality between what the connector produces and what ends up in the topic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a bit confused looking at this. I wonder if we should make these custom config names as constants and maybe also name them similarly. I think custom.exactly.once.support may be clearer than exactly.once.support.level as I initially did not notice it was not the real configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, did both (constants and renaming property).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm boring but I'm not quite sure if we want to keep the emojis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can live with that 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be fail instead of null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch--yes, it should

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to produce a precise number of records and check we get that exact number. Here, if I understand correctly, the connector could have created duplicates and we would not notice it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true; we don't check for duplicates here. However, the test cases don't exactly do a lot to induce them, either. Even if exactly-once support were disabled for these, the conditions that they're run under are still green-path and shouldn't result in any duplicates.

We perform hard and soft rolling bounces in the connect_distributed_test.py::test_exactly_once_source system test introduced in #11783 and check for duplicates and record content there, which seems appropriate if we're aiming to test how resilient this feature is against real-world and sometimes-suboptimal conditions.

We could possibly add a utility method to assert that a collection of records produced by a source connector with n tasks has consecutive, non-duplicated sequence numbers for each unique task, but I'm not sure it's really worth the effort considering how much coverage we already get from the system test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree with @mimaison. It shouldn't be that much effort to verify the sequence numbers, and because most people are not in the habit of running the system tests having the assertion would provide much earlier warning of regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually is a surprising amount of work to verify sequence numbers since records can be spread across partitions and, as a result, polled out of order by a consumer. We could reduce the number of topic partitions that we write to in these cases to 1, or place an arbitrary limitation on the number of records produced by each MonitorableSourceTask instance before it stops producing, or shut down the connector and do a poll for the end offsets of the topic and read from beginning up to that point to get absolutely everything that's in the topic. But the first option would not be worth the decrease in coverage IMO and the last two are pretty complex and come with their own edge cases.

I do realize that system tests are pretty uncommon (I hardly run them myself), but do we really think there's risk of a regression that would be caught by these integration tests that wouldn't be caught by all the accompanying unit tests that we have?

If there's a simpler way to do things let me know, too; I could be missing an easy win here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, one long weekend later, I've taken a shot at this. I don't know if I love how much complexity it's introduced into the tests here since it's unfortunately non-trivial to do a "read all" for a Kafka topic, but at least if the newly-introduced EmbeddedKafkaCluster::consumeAll method is accurate, we can reuse that logic elsewhere in our tests and not have to worry about solving this problem more than once.

There's still no assertions for in-order delivery, but we now have checks to make sure that no records are dropped or duplicated. LMKWYT

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch 2 times, most recently from 533eae1 to bba0d8e Compare June 29, 2022 22:13
Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple more comments, but otherwise this LGTM. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree with @mimaison. It shouldn't be that much effort to verify the sequence numbers, and because most people are not in the habit of running the system tests having the assertion would provide much earlier warning of regressions.

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch from bba0d8e to a6eb895 Compare July 1, 2022 20:35
Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a pass. LGTM. Left some minor comments. Thanks.

Comment on lines +239 to +251
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention something about Fibonacci numbers in comment here? I think it's not obvious what we're trying to achieve here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, done 👍

@C0urante C0urante force-pushed the kafka-10000-integration-tests branch from a6eb895 to c267490 Compare July 5, 2022 18:15
@C0urante C0urante force-pushed the kafka-10000-integration-tests branch from c267490 to 54e443e Compare July 5, 2022 18:20
Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the update.

@showuon
Copy link
Member

showuon commented Jul 6, 2022

Failed tests are unrelated:

Build / JDK 11 and Scala 2.13 / kafka.network.DynamicConnectionQuotaTest.testDynamicListenerConnectionCreationRateQuota()

@showuon showuon merged commit 3ae1afa into apache:trunk Jul 6, 2022
@C0urante C0urante deleted the kafka-10000-integration-tests branch August 3, 2024 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants