Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 0.63.2 introduced serialization problems #763

Open
cristianmatache opened this issue Sep 30, 2024 · 4 comments
Open

Version 0.63.2 introduced serialization problems #763

cristianmatache opened this issue Sep 30, 2024 · 4 comments

Comments

@cristianmatache
Copy link
Contributor

cristianmatache commented Sep 30, 2024

Describe the bug
Changes in version 0.63.2 introduced a serialization problem.

To Reproduce

from dataclasses import dataclass

from dataclasses_avroschema import AvroModel


@dataclass
class A(AvroModel):
    raw: int


@dataclass
class B(AvroModel):
    a: A | None


@dataclass
class C(AvroModel):
    b: B | None

obj = C(B(A(1)))
assert obj.asdict() == {'b': {'a': {'raw': 1, '-type': 'A'}, '-type': 'B'}}
assert obj.avro_schema() == '{"type": "record", "name": "C", "fields": [{"name": "b", "type": [{"type": "record", "name": "B", "fields": [{"name": "a", "type": [{"type": "record", "name": "A", "fields": [{"name": "raw", "type": "long"}]}, "null"]}]}, "null"]}]}'
obj.serialize()
# ValueError: {'a': {'raw': 1, '-type': 'A'}, '-type': 'B'} (type <class 'dict'>) do not match [{'type': 'record', 'name': 'B', 'fields': [{'name': 'a', 'type': [{'type': 'record', 'name': 'A', 'fields': [{'name': 'raw', 'type': 'long'}]}, 'null']}]}, 'null'] on field b

Expected behavior
Not erroring.

@mauro-palsgraaf
Copy link

mauro-palsgraaf commented Oct 1, 2024

I can confirm that it's also the case for Pydantic:

To Reproduce

from typing import Optional

from dataclasses_avroschema.pydantic import AvroBaseModel
from pydantic import Field

class A(AvroBaseModel):
    x: str = Field(...)

class B(AvroBaseModel):
    a: Optional[A] = Field(...)
    
class C(AvroBaseModel):
    b: Optional[B] = Field(...)
    
event = C(b=B(a=A(x="hello world")))

print(event.serialize())

Investigation

It seems to be a problem with the double optional (in this particular case, but my guess would be that it's for every nested union). This case goes wrong in the _validate_record function in _validation_py.py:

def _validate_record(datum, schema, named_schemas, parent_ns, raise_errors, options):
    """
    Check that the data is a Mapping type with all schema defined fields
    validated as True.
    """
    _, fullname = schema_name(schema, parent_ns)
    return (
        isinstance(datum, Mapping)
        and not ("-type" in datum and datum["-type"] != fullname)
        and all(
            _validate(
                datum=datum.get(f["name"], f.get("default", NoValue)),
                schema=f["type"],
                named_schemas=named_schemas,
                field=f"{fullname}.{f['name']}",
                raise_errors=raise_errors,
                options=options,
            )
            for f in schema["fields"]
        )
    )

With the pydantic classes above, the fullname variable is: B.a.A whereas the -type is A. This results in the validation to fail. This validation is used in the write_union in _write_py.py of fastavro when determining the best match. Since the validation fails, no best match can be found and the ValueError` is thrown.

Extra: Not sure if this helps, but what's interesting is that the parent_ns in the validation seems to be relative to the other optional.

In the following example:

from typing import Optional

from dataclasses_avroschema.pydantic import AvroBaseModel
from pydantic import Field

class A(AvroBaseModel):
    x: str = Field(...)

class B(AvroBaseModel):
    a: Optional[A] = Field(...)
    
class C(AvroBaseModel):
    b: Optional[B] = Field(...)
    
class D(AvroBaseModel):
    c: C = Field(...)
    
class E(AvroBaseModel):
  d: D = Field(...)
  
class F(AvroBaseModel):
  e: E = Field(...)
    
event = F(e=E(d=D(c=C(b=B(a=A(x="hello world"))))))

print(event.serialize())

Interesting this is that if I inspect the fullname variable in _validate_record, it is B.a.A. However, if i change

class F(AvroBaseModel):
  e: Optional[E] = Field(...) # note this one is optional now

Then the fullname variable is E.d.D.c.C.b.B. So i guess the fullname there must be relative to the outer union

@marcosschroh
Copy link
Owner

Hallo, checking it.

@marcosschroh
Copy link
Owner

I have created this issue on fastavro to check it. For simple use cases it is working but not for multiple levels with unions.

@marcosschroh
Copy link
Owner

marcosschroh commented Oct 9, 2024

I have made some refactor to support better serialization for nested unions, however still I am not able to make it work with schemas that has NOT a namespace. With the patch that I am working on this will work (with a namespace):

from typing import Optional

from dataclasses_avroschema.pydantic import AvroBaseModel
from pydantic import Field

class A(AvroBaseModel):
    x: str = Field(...)

    class Meta:
        namespace = "a_namescpace"  # NAMESPACE ADDED

class B(AvroBaseModel):
    a: Optional[A] = Field(...)
    
class C(AvroBaseModel):
    b: Optional[B] = Field(...)
    
event = C(b=B(a=A(x="hello world")))

print(event.serialize())
# >>> b'\x00\x00\x16hello world'

If I inspected the __named_schemas then I can see: dict_keys(['C', 'B', 'a_namescpace.A']). Then the type 'a_namescpace.A is used in asdict:

from fastavro import parse_schema

parsed_schema = parse_schema(C.avro_schema_to_python())
print(parsed_schema["__named_schemas"].keys())
# >>> dict_keys(['C', 'B', 'a_namescpace.A'])

print(event.asdict())
# >>> {'b': {'a': {'x': 'hello world', '-type': 'a_namescpace.A'}, '-type': 'B'}}

If we remove the namespace then the __named_schemas results in dict_keys(['C', 'B', 'A']) which is the correct one, but then the serialization fails. Still I am wondering whether it is a fastavro bug or the types should be indicated in a different way. If I specify the type as B.a.A does not work either. My understanding is that the full name (in fastavro __named_schemas) should be the one specified by the Name specification

Any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants