DM-17058: Initial PR #1

Merged

jonathansick merged 32 commits into master from tickets/DM-17058 on Jan 30, 2019

Conversation

jonathansick (Member)

This PR sets up the package, implements a RegistryAPI client that works with the Confluent Schema Registry (and caches schemas locally when possible), and adds Avro serializers and deserializers that work with schemas in a Confluent Schema Registry.

The package is pretty young still, but there are API docs at https://kafkit.lsst.io/v/DM-17058/. Those docs explain the scope of the work and the aim of the package.

This includes all the standard practices used in other SQuaRE Python
packages. The encrypted variables for PyPI and LTD deployments are real.

This is a generic function that wraps uritemplate, and also adds the
ability to prepend a hostname. This will be a fairly common pattern in this
library because relative paths will be well-known, but the hostname
needs to be configured.
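For illustration, here's a rough sketch of what that helper could look like (the name, signature, and behavior here are assumptions, not necessarily what kafkit ships):

# Hypothetical sketch of a uritemplate wrapper that also prepends a hostname;
# kafkit's actual helper may be named and shaped differently.
from uritemplate import expand

def format_url(*, host, url, url_vars):
    """Expand a URI template and prepend the configured registry hostname."""
    return host.rstrip("/") + expand(url, var_dict=url_vars)

# format_url(host="http://registry:8081", url="/schemas/ids{/id}",
#            url_vars={"id": "5"}) -> "http://registry:8081/schemas/ids/5"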

This is based on gidgethub. I've added the license for gidgethub
(licenses/gidgethub.txt) along with attribution in the module docstring
to give credit to gidgethub for the sans-io design and basis for this
implementation.

I'm trying out the idea of creating a sans-io layer for the registry
interface, inspired by gidgethub (https://github.com/brettcannon/gidgethub)
and https://sans-io.readthedocs.io.

This client is based on the sans-io architecture
(https://sans-io.readthedocs.io) and is heavily influenced by the gidgethub
implementation. The idea is that the sansio.RegistryApi client knows how
to format requests and process responses for the Confluent Schema
Registry API, but doesn't include an HTTP client itself. This makes it
easy to subclass RegistryApi with a mock implementation for unit
testing, and with other subclasses that integrate with specific HTTP
clients.
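To illustrate the pattern (a hedged sketch, not the exact kafkit API): the base client owns request formatting and response parsing, while subclasses supply the I/O, so a test-only subclass can just return canned responses:

# Illustrative sketch of the sans-io pattern; names and signatures here are
# assumptions, not kafkit's actual API.
class RegistryApi:
    def __init__(self, *, host):
        self.host = host

    async def _request(self, method, url, headers, body):
        # Supplied by an HTTP-client-specific subclass (e.g. one built on aiohttp).
        raise NotImplementedError

    async def get(self, url, url_vars=None):
        # Expand the URL template, call self._request, and decode the JSON body.
        ...

class MockRegistryApi(RegistryApi):
    """Test double that returns a canned response instead of doing real HTTP."""

    def __init__(self, *, host, body=b""):
        super().__init__(host=host)
        self.response_body = body

    async def _request(self, method, url, headers, body):
        response_headers = {"content-type": "application/vnd.schemaregistry.v1+json"}
        return 200, response_headers, self.response_body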

Some things that can be improved from this starting point:

- More tests (I've got cursory tests set up)
- A clear exception hierarchy for the different failure modes of the
  schema registry.

This also adds pytest-asyncio, which adds a useful decorator for testing
coroutines.

This client subclass integrates with aiohttp itself and is what I expect
most users to actually use. There's not a lot to it given that the
sansio base client has all the logic.

I've added an aiohttp setuptools extra to make it easy to install
aiohttp alongside kafkit, though because the session is created by the
user, there isn't (yet) an import dependency on aiohttp inside kafkit.
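A usage sketch (the constructor arguments and the get() call below are assumptions about the API; check the docs for the real signatures):

# Hypothetical usage of the aiohttp-backed client.
import asyncio

import aiohttp

from kafkit.registry.aiohttp import RegistryApi

async def main():
    async with aiohttp.ClientSession() as session:
        registry = RegistryApi(session=session, url="http://registry.example.com:8081")
        subjects = await registry.get("/subjects")  # list the registered subjects
        print(subjects)

asyncio.run(main())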
- decipher_message automatically raises these errors for users.
- the classes correspond to 3xx, 4xx, and 5xx errors.
- the exceptions include information from the error message produced by
  the registry's response, if possible.
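A sketch of that kind of status-code-keyed hierarchy (the class and function names here are assumptions, not kafkit's actual exceptions):

# Illustrative exception hierarchy keyed on the response status code.
class RegistryHttpError(Exception):
    """Base class for errors raised from Schema Registry responses."""

    def __init__(self, status_code, message=None):
        self.status_code = status_code
        super().__init__(message or "Registry error (HTTP {})".format(status_code))

class RegistryRedirectionError(RegistryHttpError):
    """Raised for 3xx responses."""

class RegistryBadRequestError(RegistryHttpError):
    """Raised for 4xx responses, carrying the registry's error message if present."""

class RegistryBrokenError(RegistryHttpError):
    """Raised for 5xx responses."""

def raise_for_status(status_code, error_message=None):
    if 300 <= status_code < 400:
        raise RegistryRedirectionError(status_code, error_message)
    elif 400 <= status_code < 500:
        raise RegistryBadRequestError(status_code, error_message)
    elif status_code >= 500:
        raise RegistryBrokenError(status_code, error_message)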
This cache specifically maps schemas to their IDs in the registry. Given
a schema, you can look up its ID, and given an ID you can look up the
schema. This is expected to be the main usage pattern, rather than
referring to schemas by their subject name or version number.
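A rough sketch of such a bidirectional cache (illustrative; not the exact kafkit class):

# Illustrative bidirectional schema <-> ID cache.
import json

def _fingerprint(schema):
    # Canonical-ish key for a schema dict; a real implementation would
    # normalize the schema (e.g. with fastavro) before keying on it.
    return json.dumps(schema, sort_keys=True)

class SchemaCache:
    def __init__(self):
        self._id_to_schema = {}
        self._key_to_id = {}

    def insert(self, schema, schema_id):
        self._id_to_schema[schema_id] = schema
        self._key_to_id[_fingerprint(schema)] = schema_id

    def __getitem__(self, key):
        # Accept either an integer ID or a schema, and return the other.
        if isinstance(key, int):
            return self._id_to_schema[key]
        return self._key_to_id[_fingerprint(key)]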
This is an important method because it lets us register schemas in the
first place, and also lets us get schema IDs for already registered
schemas based on the schema itself.

By integrating with the SchemaCache, this method should be fast.
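Roughly, the method looks like this (a sketch of a cache-aware method body; the post() helper and exact names are assumptions, not the literal implementation):

# Sketch of a cache-aware register method; the endpoint is the registry's
# POST /subjects/(subject)/versions.
import json

async def register_schema(self, schema, subject):
    # Cache hit: the schema was registered (or looked up) before, so skip HTTP.
    try:
        return self.schema_cache[schema]
    except KeyError:
        pass
    # The registry responds with {"id": <schema ID>}.
    result = await self.post(
        "/subjects{/subject}/versions",
        url_vars={"subject": subject},
        data={"schema": json.dumps(schema)},
    )
    self.schema_cache.insert(schema, result["id"])
    return result["id"]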
fastavro is the 'blessed' package for parsing and normalizing Avro schemas, as
well as encoding and decoding messages.
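For reference, the fastavro calls involved look like this (standard fastavro API):

# Parse/normalize a schema, then encode and decode a message with fastavro.
import io

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Example",
    "fields": [{"name": "value", "type": "int"}],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {"value": 42})  # encode
buf.seek(0)
message = fastavro.schemaless_reader(buf, schema)       # decode -> {"value": 42}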
This makes it easier to use the mock client in many different test
contexts.
This wraps the registry's GET /schemas/ids/{int: id} endpoint and is
used frequently by deserializers, since the schema ID is included in the
Confluent Wire Format.

The method is cached.
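A cache-first sketch of what that looks like (illustrative names for a method body, not the literal implementation; note the registry returns the schema as a JSON-encoded string):

# Illustrative cache-first lookup for GET /schemas/ids/{id}.
import json

import fastavro

async def get_schema_by_id(self, schema_id):
    try:
        return self.schema_cache[schema_id]
    except KeyError:
        pass
    result = await self.get("/schemas/ids{/id}", url_vars={"id": str(schema_id)})
    schema = fastavro.parse_schema(json.loads(result["schema"]))
    self.schema_cache.insert(schema, schema_id)
    return schema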
This adds Serializer and Deserializer classes that can be used by
producers and consumers to send and receive messages in the Confluent
Wire Format (https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html#wire-format).
This means that the schema's ID in the schema registry is included as a
prefix to the Avro-encoded message.
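The wire format itself is simple: one zero "magic" byte, a 4-byte big-endian schema ID, then the Avro-encoded body. A minimal sketch of packing and unpacking that prefix (helper names are mine, not kafkit's):

# Pack/unpack the Confluent Wire Format prefix.
import struct

def pack_wire_format(schema_id, avro_body):
    return struct.pack(">bI", 0, schema_id) + avro_body

def unpack_wire_format(data):
    magic, schema_id = struct.unpack(">bI", data[:5])
    if magic != 0:
        raise ValueError("Unexpected magic byte: {}".format(magic))
    return schema_id, data[5:]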

The Serializer is dedicated to working with a specific schema. This way
it can be used as a value_serializer or key_serializer by the Producer
(see
https://aiokafka.readthedocs.io/en/stable/examples/serialize_and_compress.html#serialization-and-compression).

The Deserializer needs to react to whatever schema ID gets sent to it.
For that reason, it can't implement a __call__ (it'd need to be a
coroutine), and thus must always be used externally to the consumer.

Whereas Serializer serializes for a single schema (so it can be a
callable), the PolySerializer can serialize messages in any schema.
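A combined usage sketch of the three classes (the constructors and method names below are assumptions; see the API docs for the real signatures):

# Hedged usage sketch; exact signatures may differ from kafkit's API.
from kafkit.registry import Deserializer, PolySerializer, Serializer

async def demo(registry, schema, other_schema, message):
    # Serializer: bound to one schema, so it's a plain callable suitable as a
    # key_serializer or value_serializer.
    serializer = await Serializer.register(
        registry=registry, schema=schema, subject="example-subject")
    data = serializer({"value": 42})  # bytes in the Confluent Wire Format

    # PolySerializer: the schema is supplied per message, so serialization is a coroutine.
    poly = PolySerializer(registry=registry)
    other_data = await poly.serialize(message, schema=other_schema)

    # Deserializer: looks up whatever schema ID is embedded in the message,
    # so it must be awaited outside the consumer rather than used as a callable.
    deserializer = Deserializer(registry=registry)
    result = await deserializer.deserialize(data)
    return data, other_data, result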
I think this is more accurate, because it's expected to have the schema,
the hostname, and the post.
This method is a high-level interface to the GET
/subjects{/subject}/versions{/version} endpoint.

This makes it possible to cache subject version information. Schemas
themselves are still stored in the SchemaCache; the new cache associates
a subject-version tuple with a schema ID, which makes it possible to
cache the RegistryApi.get_schema_by_subject() method.

This is required to easily get the full set of schema information,
rather than calling both get_id and get_schema. Now the
get_schema_by_subject method gets caching.
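A sketch of that layering (illustrative; names are assumptions, not the exact kafkit class):

# Illustrative subject-version cache layered on the schema cache.
class SubjectCache:
    def __init__(self, schema_cache):
        self._schema_cache = schema_cache
        self._versions = {}  # (subject, version) -> schema ID

    def insert(self, subject, version, schema_id):
        self._versions[(subject, version)] = schema_id

    def get(self, subject, version):
        schema_id = self._versions[(subject, version)]
        return {
            "subject": subject,
            "version": version,
            "id": schema_id,
            "schema": self._schema_cache[schema_id],  # reuses the existing schema cache
        }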
This adds properties to RegistryApi for the SchemaCache and
SubjectCache, rather than having general attributes.

This effectively renames RegistryApi.schemas ->
RegistryApi.schema_cache.

Endpoints like GET /schemas/ids/{int: id} and
GET /subjects/(string: subject)/versions/(versionId: version) provide
the schemas in a serialized string form, rather than as a JSON object.

The tests misrepresented this, but now it's all fixed up :)

API documentation powered by numpydoc and automodapi.

This will get shortened and details moved to conceptual documentation
pages around serializers, deserializers, and the registry client.

Put common core APIs in the main kafkit.registry namespace. Leave sansio
stuff in kafkit.registry.sansio so it's clear that that code is sans-IO.
And leave kafkit.registry.aiohttp in its own namespace because of the
optional aiohttp dependency.
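So imports end up looking like this (the exported names below are the ones mentioned above; the exact set may differ):

# Resulting import layout.
from kafkit.registry import Deserializer, PolySerializer, Serializer   # core APIs
from kafkit.registry.sansio import RegistryApi                          # sans-IO base client
from kafkit.registry.aiohttp import RegistryApi as AIOHTTPRegistryApi   # aiohttp-backed client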
@jonathansick (Member, Author)

@cbanek care to look at this (maybe just scan the docs, https://kafkit.lsst.io/v/DM-17058/, and code) and let me know if this looks like a reasonable architecture to you?

except Exception:
    # If the schema couldn't be parsed, it's not going to be a
    # valid key anyhow.
    raise KeyError

You may want to print out the serialization error, or at least some message with this, as it might be hard for the user to tell the difference between a key not existing and something going wrong in the cache.


It looks like the schema registry catches KeyError and then just continues as if it isn't in the cache, and might add it back in or something?

@jonathansick (Member, Author)

Good idea. The client should recover mostly by doing the actual HTTP request, but it doesn't seem like there's any harm in forming an error message for anyone who's using the cache directly.

if schema is None:
    raise ValueError(
        'Trying to cache the schema ID for subject '
        f'{subject!r}, version {version}, but it\'s schema ID '

its / it's? Then you can get rid of the backslash escape? ;)


@cbanek left a comment

Looks amazing!

Just wondering, and this is somewhat of an out of band question, but what do you think the workflow of publishing versions will be? It looks like the same library can both make new versions and use any version, which is good, but it seems like maybe the CI system will make new versions with the latest XML schema? And then the callers using the schemas will only be consuming them? This is somewhat important since it doesn't look like there's any strong ordering of versions, or pushing new versions, or even that version numbers have to be sequential?

@jonathansick (Member, Author)

Thanks!

That's a really good question. For SAL/EFD, I have no idea.

It is a little spooky that the Schema Registry's register endpoint is both used to register a schema and also to look up a schema's ID based on the schema document if the schema is already registered.

For SQuaRE Events, I was thinking of having a policy something like this:

  • Applications (producers or consumers) keep a copy of the schemas they use in their code base under git.
  • Normally schema changes are led by the producer. So if a producer introduces a schema change, that new schema first appears in the Git repo of the producer. Consumers would later follow and copy that schema to their repositories once they are updated to use those new schemas.
  • This way producers and consumers are always using exactly the schemas they want, but the schema ID is maintained solely in the Schema Registry. This is great because schema IDs and version numbers are kind of deployment specific.
  • I'm thinking of having a Deserializer that lets you provide "preferred" schemas. This way, as far as I understand it, the deserializer can automatically convert a message to the exact schema that the consumer wants, rather than whatever version the producer is producing.

Does that make any sense?

@cbanek commented Jan 30, 2019

That makes sense, although I'm not sure if there's any backward compatibility between the versions - it seems like they're completely different from each other. So that means you need to update everything in lock-step. That is, if a producer starts publishing a new version of a schema, the consumer is basically broken because it's listening for a version that will no longer be produced until it is updated. A single source of truth might help with this, depending on how deep the knowledge of the schemas might have to be. Plus, having multiple copies of what versions are used might be hard to keep in sync. It seems reasonable that the producer controls what schemas are published, though you might be able to do this at build time? (Like publish the latest version of the schemas when you make the containers or a release? Then you might have some kind of post-test hook that commits the new schema.)

It's also just possible to try a few things out until they break, or become unwieldy, then we can know the problems and figure out a better way of doing things in a more incremental way. I'm not even sure how often we'll update new versions, rather than just add new schemas... Just something to keep in mind.

@jonathansick (Member, Author)

This is what I'm thinking about for schema backwards compatibility:
https://fastavro.readthedocs.io/en/latest/reader.html#fastavro._read_py.schemaless_reader

reader = schemaless_reader(fh, writer_schema, reader_schema=preferred_schema)

So the writer_schema comes from the ID in the Confluent Wire Format and then the preferred_schema comes from the consumer's Git repo.
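
Roughly, in code (a sketch; the unpack_wire_format helper and get_schema_by_id method follow the earlier sketches and are assumptions, not kafkit's literal API):

# Read a wire-format message with the consumer's preferred schema.
import io

from fastavro import schemaless_reader

async def read_with_preferred_schema(raw_message, registry, preferred_schema):
    schema_id, avro_body = unpack_wire_format(raw_message)      # ID from the 5-byte prefix
    writer_schema = await registry.get_schema_by_id(schema_id)  # producer's schema, from the registry
    return schemaless_reader(
        io.BytesIO(avro_body), writer_schema, reader_schema=preferred_schema)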

To make this work, we'd have to turn on the Schema Registry feature that requires a subject to have transitive forward compatibility
(https://docs.confluent.io/current/schema-registry/docs/avro.html#forward-compatibility) across schema versions.

Hmm, the Confluent default is to have backwards compatibility so that consumers upgrade first and producers follow. I wonder if that really is better than changing producers first?

jonathansick merged commit b6d6e76 into master on Jan 30, 2019
jonathansick deleted the tickets/DM-17058 branch on July 12, 2022