Add `fetch_in_order` field to `schema_registry` input #2898

mihaitodor · 2024-09-25T15:29:01Z

No description provided.

mihaitodor · 2024-09-25T15:31:10Z

internal/impl/kafka/enterprise/integration_test.go

+ }
+
+ if test.schemaWithReference != "" {
+ body, err := json.Marshal(payload{Schema: test.schemaWithReference, References: []map[string]any{{"name": "foo", "subject": subject, "version": 1}}})


A bit janky, but I can clean it up to use the name from t.schema if need be...

mihaitodor · 2024-09-25T15:34:25Z

internal/impl/kafka/enterprise/redpanda_migrator_bundle_output.tmpl.yaml

+ if ["topic", "key", "partition", "partitioner", "timestamp"].any(f -> this.redpanda_migrator.keys().contains(f)) {
+ root = throw("The topic, key, partition, partitioner and timestamp fields of the redpanda_migrator output must be left empty")
+ }
+ let rpMigratorMaxInFlight = this.redpanda_migrator.max_in_flight.or(1)
+ let redpandaMigrator = this.redpanda_migrator.assign(
 {
 "topic": "${! metadata(\"kafka_topic\").or(throw(\"missing kafka_topic metadata\")) }",
 "key": "${! metadata(\"kafka_key\") }",
 "partition": "${! metadata(\"kafka_partition\").or(throw(\"missing kafka_partition metadata\")) }",
 "partitioner": "manual",
- "timestamp": "${! metadata(\"kafka_timestamp_unix\").or(timestamp_unix()) }"
+ "timestamp": "${! metadata(\"kafka_timestamp_unix\").or(timestamp_unix()) }",
+ "max_in_flight": $rpMigratorMaxInFlight
 }
 )

+ let redpandaMigratorOffsets = this.redpanda_migrator.with("seed_brokers", "consumer_group", "client_id", "rack_id", "max_message_bytes", "broker_write_max_bytes", "tls", "sasl")
+
 if this.schema_registry.keys().contains("subject") {
 root = throw("The subject field of the schema_registry output must not be set")
 }
- let schema_registry = if this.schema_registry.length() > 0 { this.schema_registry.assign({"subject": "${! @schema_registry_subject }"}) }
- if ["topic", "key", "partition", "partitioner", "timestamp"].any(f -> this.redpanda_migrator.keys().contains(f)) {
- root = throw("The topic, key, partition, partitioner and timestamp fields of the redpanda_migrator output must be left empty")
+
+ let srMaxInFlight = this.schema_registry.max_in_flight.or(1)
+ let schemaRegistry = if this.schema_registry.length() > 0 {
+ this.schema_registry.assign({
+ "subject": "${! @schema_registry_subject }",
+ "max_in_flight": $srMaxInFlight
+ })
 }


Just making sure that we set max_in_flight to 1 unless users set it to something else explicitly (and hopefully are aware they'll get out-of-order messages in that case).

rockwotj · 2024-09-25T15:37:27Z

docs/modules/components/pages/inputs/schema_registry.adoc

+=== `fetch_all_schemas_on_connect`
+
+Fetch all schemas on connect. Set this to `true` when schema references are used.
+
+
+*Type*: `bool`
+
+*Default*: `true`
+


This name is not very intuitive for me.

Can we name it:

enforce_ordering: bool If true all schemas will be fetched and sorted, so that schemas are inserted in the same order. This is required to correctly resolve schema references. default: true

On another note: do we even need this option? Is this only false if we would OOM reading all the schemas?

Or maybe fetch_in_order since they can still mess up the order on the output with max_in_flight

Naming things is hard 😅 I like fetch_in_order, thanks!

Is this only false if we would OOM reading all the schemas?

I'm not super-concerned about OOM unless the schemas are huge, but it could take a long time to fetch them all if there are many of them, so it might be handy to stream them if people wish to stick a cache in the middle. Or maybe I'm overthinking it... The tests cover both flows now, so I'm happy to leave it there.

Signed-off-by: Mihai Todor <todormihai@gmail.com>

rockwotj

Way cleaner - I did find one issue, but otherwise LGTM

rockwotj · 2024-09-26T15:44:06Z

internal/impl/kafka/enterprise/schema_registry_input.go

+ if !tlsEnabled {
+ tlsConf = nil
+ }
+ if i.client, err = sr.NewClient(srURL.String(), reqSigner, tlsConf, mgr); err != nil {


Thank you for that suggestion! The code is much cleaner indeed! ❤️

rockwotj · 2024-09-26T15:46:36Z

internal/impl/kafka/enterprise/schema_registry_input.go

+ schemas[schema.ID] = schemaInfo{
+ SchemaInfo: schema,
+ subject: subject,
+ version: version,
+ }


This will overwrite if there is a schema stored in multiple different subjects.

For example schema ID 2 can be registered in 3 different subjects, and this way we will drop two others. I think we need to collect all of them without deduplication, then post them all in sorted order. The order between same id keys doesn't matter, but we still need to populate the subject -> schema ID relation.

Thanks for spotting that! Indeed, this is a corner case that I was trying to get my head around and it looks like you're right and the schemas will get duplicate IDs :( The trouble is that I can't get Redpanda to import schemas with a fixed ID but different subjects:

$ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"id": 1, "version": 1, "schema": "{\"name\": \"Foo\", \"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"int\"}]}"}' http://localhost:8081/subjects/foo/versions {"id":1}% $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"id": 1, "version": 1, "schema": "{\"name\": \"Foo\", \"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"int\"}]}"}' http://localhost:8081/subjects/bar/versions {"error_code":42207,"message":"Overwrite new schema with id 1 is not permitted."}%

It does work, however, if I drop the id field or both the id and version` fields:

$ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"name\": \"Foo\", \"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"int\"}]}"}' http://localhost:8081/subjects/bar/versions {"id":1}% $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"version": 1, "schema": "{\"name\": \"Foo\", \"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"int\"}]}"}' http://localhost:8081/subjects/baz/versions {"id":1}%

However, if I do this, then it feels kind of brittle because I'd have to make sure I first send the schema which does have the id field set...

This should be fixed now. I'll defer adding a test until Redpanda lifts the above constraint as part of
CORE-7734.

Schemas can have the same ID and be registered under different subjects. I can't add a test for now because Redpanda doesn't allow us to POST schemas with the same ID under different subjects for now. This limitation will be resolved as part of a follow-up task. Signed-off-by: Mihai Todor <todormihai@gmail.com>

rockwotj · 2024-09-26T19:11:20Z

internal/impl/kafka/enterprise/schema_registry_input.go

+ // Omit `subject` from the schema since the `/subjects/<subject>/versions` endpoint doesn't allow it as part of
+ // the payload we pass it along as metadata.
+ schema := map[string]any{
+ "id": si.ID,


Don't we have to drop the ID if there are multiple subjects with the same ID?

Hopefully CORE-7734 gets addressed in Redpanda so we won’t have to make this logic even more convoluted.

rockwotj

LGTM

mihaitodor requested review from rockwotj and Jeffail September 25, 2024 15:29

mihaitodor commented Sep 25, 2024

View reviewed changes

mihaitodor force-pushed the mihaitodor-fetch-all-schemas-upfront branch from ee05f61 to 4a21635 Compare September 25, 2024 15:32

mihaitodor commented Sep 25, 2024

View reviewed changes

rockwotj reviewed Sep 25, 2024

View reviewed changes

mihaitodor force-pushed the mihaitodor-fetch-all-schemas-upfront branch 2 times, most recently from 5217677 to 141ec00 Compare September 26, 2024 15:33

Add fetch_in_order field to schema_registry input

ff11032

Signed-off-by: Mihai Todor <todormihai@gmail.com>

mihaitodor force-pushed the mihaitodor-fetch-all-schemas-upfront branch from 141ec00 to ff11032 Compare September 26, 2024 15:35

mihaitodor changed the title ~~Add fetch_all_schemas_on_connect field to schema_registry input~~ Add Version field to schema_registry input Sep 26, 2024

rockwotj reviewed Sep 26, 2024

View reviewed changes

mihaitodor changed the title ~~Add Version field to schema_registry input~~ Add fetch_in_order field to schema_registry input Sep 26, 2024

rockwotj reviewed Sep 26, 2024

View reviewed changes

rockwotj approved these changes Sep 26, 2024

View reviewed changes

Jeffail approved these changes Sep 26, 2024

View reviewed changes

Jeffail merged commit d124570 into main Sep 26, 2024
3 checks passed

Jeffail deleted the mihaitodor-fetch-all-schemas-upfront branch September 26, 2024 19:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `fetch_in_order` field to `schema_registry` input #2898

Add `fetch_in_order` field to `schema_registry` input #2898

mihaitodor commented Sep 25, 2024

mihaitodor Sep 25, 2024

mihaitodor Sep 25, 2024

rockwotj Sep 25, 2024

rockwotj Sep 25, 2024

mihaitodor Sep 25, 2024

mihaitodor Sep 25, 2024

rockwotj left a comment

rockwotj Sep 26, 2024

mihaitodor Sep 26, 2024

rockwotj Sep 26, 2024

mihaitodor Sep 26, 2024

mihaitodor Sep 26, 2024

rockwotj Sep 26, 2024

mihaitodor Sep 26, 2024 •

edited

Loading

rockwotj left a comment

Add fetch_in_order field to schema_registry input #2898

Add fetch_in_order field to schema_registry input #2898

Conversation

mihaitodor commented Sep 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rockwotj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mihaitodor Sep 26, 2024 • edited Loading

Choose a reason for hiding this comment

rockwotj left a comment

Choose a reason for hiding this comment

Add `fetch_in_order` field to `schema_registry` input #2898

Add `fetch_in_order` field to `schema_registry` input #2898

mihaitodor Sep 26, 2024 •

edited

Loading