
return record name for AVRO union #785

Closed

Conversation

@Dieken commented Feb 22, 2020

@ghost commented Feb 22, 2020

It looks like @Dieken hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement here.

Once you've signed, reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@Dieken (Author) commented Feb 22, 2020

[clabot:check]

@ghost commented Feb 22, 2020

@confluentinc It looks like @Dieken just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@scottbelden commented

I don't think this is something you want on by default. It changes the format of the data that the user gets back and could therefore break things for people who aren't expecting it. It should probably be an option that a user can choose to turn on.

@Dieken (Author) commented Feb 25, 2020

> I don't think this is something you want on by default. It changes the format of the data that the user gets back and could therefore break things for people who aren't expecting it. It should probably be an option that a user can choose to turn on.

@scottbelden Yes, this will break backward compatibility, but the rewrite of the AVRO encoding and decoding in the "FirstClassFastAvro" branch may be a good chance to make that break.

I'm not sure how to properly add this option to the AvroConsumer class; actually, it would be better as standard behaviour, always returning the record name for an AVRO union, just like NodeJS avsc and LinkedIn's goavro do:

Currently in avsc and goavro, "union {Foo, Bar} msg" is deserialized into:

"msg": {"Foo": some-value}

Currently in confluent-kafka-python (fastavro without the "return_record_name" option):

"msg": some-value

Expected with "return_record_name=True":

"msg": ("Foo", some-value)

Unfortunately, the "return_record_name" option also makes fastavro turn a normal AVRO record into a tuple, which is very ugly.
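
For reference, a minimal sketch of the behaviour under discussion, calling fastavro directly (the schema and field names here are illustrative, not from this PR):

from io import BytesIO

from fastavro import schemaless_reader, schemaless_writer

schema = {
    "type": "record",
    "name": "Msg",
    "fields": [
        {
            "name": "msg",
            "type": [
                {"type": "record", "name": "Foo", "fields": [{"name": "x", "type": "int"}]},
                {"type": "record", "name": "Bar", "fields": [{"name": "y", "type": "string"}]},
            ],
        }
    ],
}

buf = BytesIO()
schemaless_writer(buf, schema, {"msg": {"x": 1}})

buf.seek(0)
# Default behaviour: the union branch name is lost on read.
print(schemaless_reader(buf, schema))  # {'msg': {'x': 1}}

buf.seek(0)
# With return_record_name=True, the branch comes back as a (name, value) tuple.
print(schemaless_reader(buf, schema, return_record_name=True))  # {'msg': ('Foo', {'x': 1})}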

@rnpridgeon (Contributor) commented

We can't enable this by default, but we could expose the option. We are actually deprecating the AvroProducer/AvroConsumer entirely in favor of generic serialization. Let me push some of those changes and we can adjust accordingly.

@rnpridgeon force-pushed the FirstClassFastAvro branch 5 times, most recently from fef1440 to b4f2481, on February 28, 2020 at 09:58
@rnpridgeon force-pushed the FirstClassFastAvro branch 5 times, most recently from 558cdb0 to 52520d5, on March 17, 2020 at 22:32
@rnpridgeon force-pushed the FirstClassFastAvro branch 10 times, most recently from b936017 to 40fe906, on March 30, 2020 at 14:09
@rnpridgeon force-pushed the FirstClassFastAvro branch 2 times, most recently from 2d1b742 to 4d318bc, on March 30, 2020 at 22:26
@slominskir (Contributor) commented

This is about consuming AVRO unions, and there is a related topic: #656 (producing AVRO unions). Am I understanding correctly that, as of right now, there is no way to properly handle AVRO unions ("properly" as described in the AVRO docs: https://avro.apache.org/docs/current/spec.html#json_encoding)?

@scottbelden commented

@slominskir The JSON encoding that you referenced does exist in fastavro. Here's an example that uses unions:

from fastavro import json_writer
from io import StringIO

schema = [
    {
        "type": "record",
        "name": "A",
        "fields": [{"name": "foo", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "B",
        "fields": [{"name": "bar", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "AOrB",
        "fields": [{"name": "entity", "type": ["A", "B"]}],
    },
]

records = [
    {"entity": {"foo": "FOO"}},  # This is an AOrB record that has an internal A record
    {"entity": {"bar": "BAR"}},  # This is an AOrB record that has an internal B record
]

sio = StringIO()
json_writer(sio, schema, records)

print(sio.getvalue())

That will print out the following:

{"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}
{"AOrB": {"entity": {"A": {"foo": null}}}}

However, it should be noted that the form above is the serialized form. If one were to read that back into memory, it would look more like the original records list, where there is no type information. The return_record_name option mentioned in this issue only affects the in-memory object (where it optionally gives the type information as a tuple).

There is currently no way to have the in-memory representation be like {"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

@slominskir (Contributor) commented

@scottbelden

> There is currently no way to have the in-memory representation be like {"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

This sounds very bad for interoperability between the Java and Python APIs, as the Java API does use the JSON AVRO format with union type information. If a Java client exports topic messages to a file, the contents are formatted such that the Python client cannot import them back in (the Java-backed Confluent kafka-avro-console-consumer can, though).

In your example you printed out the "correct" format, so it seems it is possible to print it at least?

@scottbelden commented

@slominskir I don't think I'm doing a very good job of explaining myself.

The following form:

{"AOrB": {"entity": {"A": {"foo": {"string": "FOO"}}}}}

Is the serialized JSON Avro format. fastavro supports reading and writing this format. For example, let's create a file called aorb.json and write that single line to it. It is possible to read that file with the following code:

from fastavro import json_reader

schema = [
    {
        "type": "record",
        "name": "A",
        "fields": [{"name": "foo", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "B",
        "fields": [{"name": "bar", "type": ["string", "null"]}],
    },
    {
        "type": "record",
        "name": "AOrB",
        "fields": [{"name": "entity", "type": ["A", "B"]}],
    },
]

with open("aorb.json") as fp:
    for record in json_reader(fp, schema):
        print(record)

If you run that, what gets printed out is {'entity': {'foo': 'FOO'}}. The dictionary printed out is what I'm calling the "in-memory" representation.

On disk, in the serialized form, both Java and fastavro conform to the specification and include the type information. However, the in-memory representations differ: on the Java side, when you interact with the record you are probably doing so via some class, while on the Python side you have a dictionary, and that dictionary is different from the serialized one.

So on the Python side, if you have already read in the JSON file and you want to access the FOO value, you would do something like record["entity"]["foo"], not record["AOrB"]["entity"]["A"]["foo"]["string"]. Hopefully that is clearer, but if not please let me know.

@slominskir (Contributor) commented

@scottbelden One point of confusion is that I'm using the confluent-kafka-python API, which hides a lot of the details that happen under the hood (sometimes a good thing). It sounds like the fastavro library does everything, but by the time the DeserializingConsumer from the confluent-kafka-python library returns a message with poll(), it seems it's already too late, as I get a dict object back (one that does not include the union type info). You've given examples using fastavro, but do you have read/write examples using the confluent-kafka-python API? I assume that's what this ticket is about?

@slominskir (Contributor) commented

By the way, one "hack" to get this to work is to edit the file under site-packages directly, changing:

obj_dict = schemaless_reader(payload,
                             writer_schema,
                             self._reader_schema)

to:

obj_dict = schemaless_reader(payload,
                             writer_schema,
                             self._reader_schema,
                             True)  # return_record_name=True

On my system the full file path is /usr/local/lib/python3.7/site-packages/confluent_kafka/schema_registry/avro.py.

This is obviously not a permanent solution. I assume that, buried in the 42 files changed in this pull request, a flag is exposed so that True is added to that function call?
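
For anyone wanting the same effect without editing site-packages, here is a rough sketch of applying the hack outside the library, assuming the standard Confluent wire format (a zero magic byte plus a 4-byte schema id before the Avro payload); decode_with_record_names is a made-up helper name, and the writer schema is assumed to be known already:

import struct
from io import BytesIO

from fastavro import schemaless_reader

def decode_with_record_names(raw_value, writer_schema):
    buf = BytesIO(raw_value)
    # Confluent wire format: magic byte 0, then a big-endian 4-byte schema id.
    magic, schema_id = struct.unpack(">bI", buf.read(5))
    if magic != 0:
        raise ValueError("not Confluent wire format")
    # Same call as the edited line above, but with return_record_name=True
    # so union branches come back as (record_name, value) tuples.
    return schemaless_reader(buf, writer_schema, None, True)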

@scottbelden commented

> You've given examples using fastavro, but do you have read/write examples using the confluent-kafka-python API? I assume that's what this ticket is about?

Correct, my examples all just use fastavro directly because I am not familiar with the python kafka API. Unfortunately I probably can't help there.

> This is obviously not a permanent solution. I assume that, buried in the 42 files changed in this pull request, a flag is exposed so that True is added to that function call?

Correct, I think the original PR was just this single commit: f958879. But then a refactor commit got added to it that made it massive.

But my original comment from above still stands. This would need to be added as an optional argument somewhere in this library because it changes the data structure of what is being returned to the user. Instead of getting {'entity': {'foo': 'FOO'}}, they get {'entity': ('A', {'foo': 'FOO'})}.
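
To illustrate the shape such an optional argument could take, here is a hypothetical wrapper that threads the flag through to fastavro; UnionAwareDeserializer and its parameters are invented for this sketch and are not the library's actual API:

from io import BytesIO

from fastavro import schemaless_reader

class UnionAwareDeserializer:
    def __init__(self, writer_schema, reader_schema=None, return_record_name=False):
        self._writer_schema = writer_schema
        self._reader_schema = reader_schema
        # Off by default, so existing consumers keep getting plain dicts.
        self._return_record_name = return_record_name

    def __call__(self, payload):
        return schemaless_reader(BytesIO(payload),
                                 self._writer_schema,
                                 self._reader_schema,
                                 self._return_record_name)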

slominskir added a commit to slominskir/confluent-kafka-python that referenced this pull request Feb 4, 2021
@slominskir (Contributor) commented

Replacement pull: #1028

slominskir added four more commits to slominskir/confluent-kafka-python that referenced this pull request Feb 4, 2021
@edenhill added the serdes (Avro, JSON, Protobuf, Schema-registry) label on Mar 9, 2021
mhowlett pushed a commit that referenced this pull request on Mar 10, 2021: "Replaced stale pull #785; add config bool return_record_name"
@edenhill closed this on Mar 17, 2021
Labels: serdes (Avro, JSON, Protobuf, Schema-registry)