
return record name for AVRO union #785

Closed

Conversation

@Dieken commented Feb 22, 2020

@ghost commented Feb 22, 2020

It looks like @Dieken hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement here.

Once you've signed, reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@Dieken (Author) commented Feb 22, 2020

[clabot:check]

@ghost commented Feb 22, 2020

@confluentinc It looks like @Dieken just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@scottbelden commented

I don't think this is something you want on by default. It changes the format of the data that the user gets back and could therefore break things for people who aren't expecting it. It should probably be an option that a user can choose to turn on.

@Dieken (Author) commented Feb 25, 2020

> I don't think this is something you want on by default. It changes the format of the data that the user gets back and could therefore break things for people who aren't expecting it. It should probably be an option that a user can choose to turn on.

@scottbelden Yes, this will break backward compatibility, but the rewrite of the AVRO encoding and decoding in the "FirstClassFastAvro" branch may be a good chance to make that break.

I'm not sure how to properly add this option to the AvroConsumer class; actually, it would be better as standard behaviour, always returning the record name for an AVRO union, just like NodeJS avsc and LinkedIn's goavro do:

Currently in avsc and goavro, "union {Foo, Bar} msg" is deserialized into:

"msg": {"Foo": some-value}

Currently in confluent-kafka-python (fastavro without the "return_record_name" option):

"msg": some-value

Expected with "return_record_name=True":

"msg": ("Foo", some-value)

Unfortunately, the "return_record_name" option also makes fastavro turn a normal AVRO record into a tuple, which is very ugly.
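
For reference, a minimal sketch of the behaviour under discussion, calling fastavro directly (the schema and field names here are illustrative, not from this PR):

from io import BytesIO

from fastavro import schemaless_reader, schemaless_writer

schema = {
    "type": "record",
    "name": "Msg",
    "fields": [
        {
            "name": "msg",
            "type": [
                {"type": "record", "name": "Foo", "fields": [{"name": "x", "type": "int"}]},
                {"type": "record", "name": "Bar", "fields": [{"name": "y", "type": "string"}]},
            ],
        }
    ],
}

buf = BytesIO()
schemaless_writer(buf, schema, {"msg": {"x": 1}})

buf.seek(0)
# Default behaviour: the union branch name is lost on read.
print(schemaless_reader(buf, schema))  # {'msg': {'x': 1}}

buf.seek(0)
# With return_record_name=True, the branch comes back as a (name, value) tuple.
print(schemaless_reader(buf, schema, return_record_name=True))  # {'msg': ('Foo', {'x': 1})}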

@rnpridgeon (Contributor) commented

We can't enable this by default, but we could expose the option. We are actually deprecating the AvroProducer/AvroConsumer entirely in favor of generic serialization. Let me push some of those changes and we can adjust accordingly.

@rnpridgeon force-pushed the FirstClassFastAvro branch 5 times, most recently from fef1440 to b4f2481, on February 28, 2020 at 09:58
@rnpridgeon force-pushed the FirstClassFastAvro branch 5 times, most recently from 558cdb0 to 52520d5, on March 17, 2020 at 22:32
@rnpridgeon force-pushed the FirstClassFastAvro branch 10 times, most recently from b936017 to 40fe906, on March 30, 2020 at 14:09
@rnpridgeon force-pushed the FirstClassFastAvro branch 2 times, most recently from 2d1b742 to 4d318bc, on March 30, 2020 at 22:26
@slominskir (Contributor) commented

This is about consuming AVRO unions, and there is a related topic: #656 (producing AVRO unions). Am I understanding correctly that, as of right now, there is no way to properly handle AVRO unions ("properly" as described in the AVRO docs: https://avro.apache.org/docs/current/spec.html#json_encoding)?

@scottbelden commented

@slominskir The JSON encoding that you referenced does exist in fastavro. Here's an example that uses unions:

from fastavro import json_writer
from io import StringIO

schema = [
    {
        "type": "record",
        "name": "A",
        "fields": [{"name": "foo", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "B",
        "fields": [{"name": "bar", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "AOrB",
        "fields": [{"name": "entity", "type": ["A", "B"]}],
    },
]

records = [
    {"entity": {"foo": "FOO"}},  # This is an AOrB record that has an internal A record
    {"entity": {"bar": "BAR"}},  # This is an AOrB record that has an internal B record
]

sio = StringIO()
json_writer(sio, schema, records)

print(sio.getvalue())

That will print out the following:

{"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}
{"AOrB": {"entity": {"A": {"foo": null}}}}

However, it should be noted that the form above is the serialized form. If one were to read that back into memory, it would look more like the original records list, where there is no type information. The return_record_name option mentioned in this issue only affects the in-memory object (where it optionally gives the type information as a tuple).

There is currently no way to have the in-memory representation be like {"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

@slominskir (Contributor) commented

@scottbelden

> There is currently no way to have the in-memory representation be like {"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

This sounds very bad for interoperability between the Java and Python APIs, as the Java API does use the JSON AVRO format with union type information. If a Java client exports topic messages to a file, the contents are formatted such that the Python client cannot import them back in (the Java-backed Confluent kafka-avro-console-consumer can, though).

In your example you printed out the "correct" format, so it seems it is possible to print it at least?

@scottbelden commented

@slominskir I don't think I'm doing a very good job of explaining myself.

The following form:

{"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

Is the serialized JSON Avro format. fastavro supports reading and writing this format. For example, let's create a file called aorb.json and write that single line to it. It is possible to read that file with the following code:

from fastavro import json_reader

schema = [
    {
        "type": "record",
        "name": "A",
        "fields": [{"name": "foo", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "B",
        "fields": [{"name": "bar", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "AOrB",
        "fields": [{"name": "entity", "type": ["A", "B"]}],
    },
]

with open("aorb.json") as fp:
    for record in json_reader(fp, schema):
        print(record)

If you run that, what gets printed out is {'entity': {'foo': 'FOO'}}. The dictionary printed out is what I'm calling the "in-memory" representation.

On disk, in the serialized form, both Java and fastavro conform to the specification and include the type information. However, the in-memory representations differ: on the Java side, when you interact with the record you are probably doing so via some class, while on the Python side you have a dictionary, and that dictionary is different from the serialized one.

So on the Python side, if you have already read in the JSON file and you want to access the FOO value, you would do something like record["entity"]["foo"], not record["AOrB"]["entity"]["A"]["foo"]["string"]. Hopefully that is clearer, but if not please let me know.

@slominskir (Contributor) commented

@scottbelden One point of confusion is that I'm using the confluent-kafka-python API, which hides a lot of the details that happen under the hood (sometimes a good thing). It sounds like the fastavro library does everything, but by the time the DeserializingConsumer from the confluent-kafka-python library returns a message with poll(), it seems it's already too late, as I get a dict object back (one that does not include the union type info). You've given examples using fastavro, but do you have read/write examples using the confluent-kafka-python API? I assume that's what this ticket is about?

@slominskir (Contributor) commented

By the way, one "hack" to get this to work is to edit the file under site-packages directly, changing:

obj_dict = schemaless_reader(payload,
                             writer_schema,
                             self._reader_schema)

to:

obj_dict = schemaless_reader(payload,
                             writer_schema,
                             self._reader_schema,
                             True)  # return_record_name=True

On my system the full file path is /usr/local/lib/python3.7/site-packages/confluent_kafka/schema_registry/avro.py.

This is obviously not a permanent solution. I assume that, buried in the 42 files changed in this pull request, a flag is exposed so that True is added to that function call?
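
For anyone wanting the same effect without editing site-packages, here is a rough sketch of applying the hack outside the library, assuming the standard Confluent wire format (a zero magic byte plus a 4-byte schema id before the Avro payload); decode_with_record_names is a made-up helper name, and the writer schema is assumed to be known already:

import struct
from io import BytesIO

from fastavro import schemaless_reader

def decode_with_record_names(raw_value, writer_schema):
    buf = BytesIO(raw_value)
    # Confluent wire format: magic byte 0, then a big-endian 4-byte schema id.
    magic, schema_id = struct.unpack(">bI", buf.read(5))
    if magic != 0:
        raise ValueError("not Confluent wire format")
    # Same call as the edited line above, but with return_record_name=True
    # so union branches come back as (record_name, value) tuples.
    return schemaless_reader(buf, writer_schema, None, True)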

@scottbelden commented

> You've given examples using fastavro, but do you have read/write examples using the confluent-kafka-python API? I assume that's what this ticket is about?

Correct, my examples all just use fastavro directly because I am not familiar with the python kafka API. Unfortunately I probably can't help there.

> This is obviously not a permanent solution. I assume that, buried in the 42 files changed in this pull request, a flag is exposed so that True is added to that function call?

Correct, I think the original PR was just this single commit: f958879. But then a refactor commit got added to it that made it massive.

But my original comment from above still stands. This would need to be added as an optional argument somewhere in this library because it changes the data structure of what is being returned to the user. Instead of getting {'entity': {'foo': 'FOO'}}, they get {'entity': ('A', {'foo': 'FOO'})}.
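
To illustrate the shape such an optional argument could take, here is a hypothetical wrapper that threads the flag through to fastavro; UnionAwareDeserializer and its parameters are invented for this sketch and are not the library's actual API:

from io import BytesIO

from fastavro import schemaless_reader

class UnionAwareDeserializer:
    def __init__(self, writer_schema, reader_schema=None, return_record_name=False):
        self._writer_schema = writer_schema
        self._reader_schema = reader_schema
        # Off by default, so existing consumers keep getting plain dicts.
        self._return_record_name = return_record_name

    def __call__(self, payload):
        return schemaless_reader(BytesIO(payload),
                                 self._writer_schema,
                                 self._reader_schema,
                                 self._return_record_name)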

slominskir added a commit to slominskir/confluent-kafka-python that referenced this pull request Feb 4, 2021
@slominskir (Contributor) commented

Replacement pull: #1028

slominskir added four more commits to slominskir/confluent-kafka-python that referenced this pull request Feb 4, 2021
@edenhill added the serdes (Avro, JSON, Protobuf, Schema-registry) label on Mar 9, 2021
mhowlett pushed a commit that referenced this pull request on Mar 10, 2021: "Replaced stale pull #785; add config bool return_record_name"
@edenhill closed this on Mar 17, 2021
Labels: serdes (Avro, JSON, Protobuf, Schema-registry)