[SchemaRegistry] investigate Apache Avro dependency #20708

Closed
4 tasks done
swathipil opened this issue Sep 15, 2021 · 6 comments
Labels
blocking-release Blocks release Client This issue points to a problem in the data-plane of the library. Messaging Messaging crew Schema Registry

Comments

@swathipil
Member

swathipil commented Sep 15, 2021

  • check underlying dependencies
    • neither avro nor fastavro has any dependencies on Python 3+
    • fastavro==0.24.2 for Python 2.7 has a dependency on pytz
  • check if Apache Avro has good community support
    • does not have an Issues tab on GitHub
    • does accept external contributions through PRs
  • compare with fastavro
    • fastavro is expected to be much faster
  • supporting Python 2.7/3.10
    • fastavro==0.24.2 is used for Python 2.7 and installs the pytz dependency
    • fastavro does not yet support Python 3.10
@swathipil swathipil added Client This issue points to a problem in the data-plane of the library. Schema Registry Messaging Messaging crew labels Sep 15, 2021
@swathipil swathipil added this to the [2021] October milestone Sep 15, 2021
@swathipil swathipil self-assigned this Sep 15, 2021
@swathipil swathipil added the blocking-release Blocks release label Sep 15, 2021
@swathipil
Member Author

swathipil commented Sep 16, 2021

concerns about fastavro:

  • does not support 2.7
    • Laurent: probably will not be a problem, since we're phasing out 2.7 and schema registry has not been around long enough for a lot of people to depend on it with 2.7
    • fastavro>=0.23.0,<1.0 DOES SUPPORT 2.7
  • does not support 3.10 yet
  • is it faster than apache? by how much? (include stats)
    • yes, by roughly 8 times (timings in the Sep 21 comment below)
  • is there good community support? (include indicators)
    • pros:
      • has an Issues tab on GitHub
      • all of the issues that are created get a response within a day or two
      • accepts PR submissions/community fixes, so we could probably submit a PR if necessary
    • cons:
      • only one person is actively responding to the issues
      • addressing/fixing issues seems to be pretty slow (since only one person looks active)

other notes about fastavro:

  • apache avro does not return fullname on primitive types, and does not have an Issues section in their GH repo.
  • FASTAVRO HAS AN ISSUE CREATED TO RESOLVE THIS! (although don't know when it will be implemented)
  • fastavro writer requires passing in the schema each time so it does not need caching at the ObjectSerializer level (Apache Avro allows for specifying schema when initializing writer, then caching writer for future use)

@swathipil
Member Author

swathipil commented Sep 20, 2021

other libraries that use Apache Avro (either in the library or in examples):


article comparing avro packages:
https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/

@swathipil
Member Author

swathipil commented Sep 21, 2021

  • average time (s) to serialize 100,000 records:
    • fastavro/bytes schema: 1.1712
    • avro/bytes schema: 8.9221
    • fastavro/str schema: 1.0727
    • avro/str schema: 8.7702
  • average time (s) to deserialize 100,000 records:
    • fastavro/bytes schema: 0.9307
    • avro/bytes schema: 8.1456
    • fastavro/str schema: 0.9554
    • avro/str schema: 7.9060
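
To put a number on the gap, the averages above can be turned into speedup ratios. This is a minimal sketch that uses only the figures reported in this comment; the dictionary layout and the variable names are illustrative and do not come from the original benchmark script.

```python
# Average seconds per 100,000 records, copied from the list above.
timings = {
    ("serialize", "bytes schema"): {"fastavro": 1.1712, "avro": 8.9221},
    ("serialize", "str schema"): {"fastavro": 1.0727, "avro": 8.7702},
    ("deserialize", "bytes schema"): {"fastavro": 0.9307, "avro": 8.1456},
    ("deserialize", "str schema"): {"fastavro": 0.9554, "avro": 7.9060},
}

# Ratio of avro time to fastavro time for each case.
speedups = {case: t["avro"] / t["fastavro"] for case, t in timings.items()}

for (operation, schema_kind), ratio in sorted(speedups.items()):
    print("%s / %s: avro takes %.1fx as long as fastavro" % (operation, schema_kind, ratio))
```

On these figures fastavro comes out roughly 7.6x to 8.8x faster, depending on the case.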

@swathipil
Member Author

swathipil commented Sep 21, 2021

SERIALIZE WITH FASTAVRO

import json
from io import BytesIO

from fastavro import parse_schema, schemaless_writer

# to_dict is not shown in this thread; assumed here to be a plain JSON decode
to_dict = json.loads

def serialize_fastavro(schema, value):
    # type: (Union[bytes, str], dict) -> bytes
    # schemas passed in to fastavro must be converted to dict first
    json_schema = to_dict(schema)
    parsed_schema = parse_schema(json_schema, _write_hint=False)

    stream = BytesIO()
    with stream:
        schemaless_writer(stream, parsed_schema, value)
        # read the buffer before the with block closes the stream
        encoded_data = stream.getvalue()
    return encoded_data

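SERIALIZE WITH AVRO

import avro.schema
from avro.io import BinaryEncoder, DatumWriter
from io import BytesIO

# DatumWriter instances, keyed by the string form of their schema
avro_schema_cache = {}

def serialize_avro(schema, value):
    if not isinstance(schema, avro.schema.Schema):
        schema = avro.schema.parse(schema)
    # reuse the writer already built for this schema, if there is one
    cache_key = str(schema)
    try:
        writer = avro_schema_cache[cache_key]
    except KeyError:
        writer = DatumWriter(schema)
        avro_schema_cache[cache_key] = writer

    stream = BytesIO()
    with stream:
        writer.write(value, BinaryEncoder(stream))
        # read the buffer before the with block closes the stream
        encoded_data = stream.getvalue()
    return encoded_data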

DESERIALIZE WITH FASTAVRO

import json
from io import BytesIO

from fastavro import parse_schema, schemaless_reader

# to_dict is not shown in this thread; assumed here to be a plain JSON decode
to_dict = json.loads

def deserialize_fastavro(schema, data):
    # type: (Union[bytes, str], bytes) -> dict
    # (a record schema decodes to a dict)
    # schemas passed in to fastavro must be converted to dict first
    json_schema = to_dict(schema)
    parsed_schema = parse_schema(json_schema, _write_hint=False)

    stream = BytesIO(data)
    with stream:
        decoded_data = schemaless_reader(stream, parsed_schema)
    return decoded_data

DESERIALIZE WITH AVRO

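import avro.schema
from avro.io import BinaryDecoder, DatumReader
from io import BytesIO

# DatumReader instances, keyed by the string form of their schema
avro_reader_cache = {}

def deserialize_avro(schema, data):
    # accept either raw bytes or an already-open binary stream
    if not hasattr(data, 'read'):
        data = BytesIO(data)

    if not isinstance(schema, avro.schema.Schema):
        schema = avro.schema.parse(schema)

    # reuse the reader already built for this schema, if there is one
    cache_key = str(schema)
    try:
        reader = avro_reader_cache[cache_key]
    except KeyError:
        reader = DatumReader(writers_schema=schema)
        avro_reader_cache[cache_key] = reader

    # note: this closes the stream, including one passed in by the caller
    with data:
        bin_decoder = BinaryDecoder(data)
        decoded_data = reader.read(bin_decoder)
    return decoded_data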
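
The script that produced the averages reported earlier in this thread is not shown. Below is a minimal sketch of how such averages could be collected with only the standard library. `average_seconds` and `stand_in_serialize` are hypothetical names, the repeat count and record shape are assumptions, and `json.dumps` merely takes the place of the functions above so the sketch runs without avro or fastavro installed.

```python
import json
import time


def average_seconds(func, records, repeats=3):
    """Average wall-clock seconds to call func on every record, over several repeats."""
    totals = []
    for _ in range(repeats):
        start = time.perf_counter()
        for record in records:
            func(record)
        totals.append(time.perf_counter() - start)
    return sum(totals) / len(totals)


def stand_in_serialize(record):
    # Placeholder workload; swap in e.g. lambda r: serialize_fastavro(schema, r).
    return json.dumps(record).encode("utf-8")


# Hypothetical record shape; the thread used 100,000 records per run.
records = [{"name": "user%d" % i, "favorite_number": i} for i in range(1000)]
average = average_seconds(stand_in_serialize, records)
print("average seconds for %d records: %.4f" % (len(records), average))
```

Running the same harness once per library and per schema form (bytes or str) would give the four figures needed for each comparison.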

@swathipil
Member Author

sample loading into a pandas DataFrame

@swathipil
Member Author

ACTION ITEMS:

  • keep Python 2.7 for GA
  • test Avro exceptions/serialization/deserialization so that no Avro types are leaked. If any are, raise Azure Core exceptions instead.
  • create backlog issue: avro with comment in README if users want to install fastavro instead.
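
The second action item (keeping Avro types from leaking) could be approached by wrapping the serializer call. This is a rough sketch only: `SchemaSerializationError` is a hypothetical stand-in for whichever Azure Core exception is chosen, `FakeAvroTypeError` stands in for an error raised inside Avro, and `raise ... from` is Python 3 syntax, so a 2.7-compatible version would need a different re-raise.

```python
class SchemaSerializationError(Exception):
    """Hypothetical stand-in for the Azure Core exception that would be raised."""


def serialize_without_leaks(serialize, schema, value):
    """Call serialize(schema, value); re-raise any failure as the wrapper type."""
    try:
        return serialize(schema, value)
    except Exception as exc:
        # Callers only need to catch SchemaSerializationError; the original
        # error stays reachable through __cause__ for debugging.
        raise SchemaSerializationError(str(exc)) from exc


class FakeAvroTypeError(Exception):
    """Stand-in for an exception type raised inside the Avro library."""


def failing_serializer(schema, value):
    raise FakeAvroTypeError("datum does not match schema")


try:
    serialize_without_leaks(failing_serializer, "{}", {"name": 1})
except SchemaSerializationError as err:
    caught = err
    print("caught %s: %s" % (type(err).__name__, err))
```

A test for the action item could then assert that only the wrapper type ever reaches the caller, for both serialization and deserialization.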
