Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Struct columns #5

Open
adriangb opened this issue Feb 28, 2023 · 6 comments
Open

Support Struct columns #5

adriangb opened this issue Feb 28, 2023 · 6 comments

Comments

@adriangb
Copy link
Owner

What should we do with these? Custom types in Postgres are messy because we need to know (1) the name to emit DDL and (2) the OIDs of the inner types, which is messy when dealing with nested types like arrays. Maybe we should copy these over as JSONB?

@adriangb
Copy link
Owner Author

adriangb commented Mar 9, 2023

So I think the thing to do here is ask the user to get the OIDs for us:

needed_metadata = analyze_metadata(schema)

metada = await some_util_funcion_per_db_driver(needed_metada, some_connection)

encoder = Encoder(schema, metadata)

The key is that the rust package remain sans-io and with no dependency on a particular database driver (which would make it much harder to be language agnostic).

@adriangb adriangb mentioned this issue Mar 19, 2023
@adriangb
Copy link
Owner Author

adriangb commented May 8, 2023

Making progress here. We can now specify the encoder to use for each column. We just need the utility function.

@adriangb adriangb added the good first issue Good for newcomers label May 8, 2023
@adriangb adriangb removed the good first issue Good for newcomers label Jun 15, 2023
@jpfeuffer
Copy link

Hi @adriangb ! Thank you for this helpful little lib.

I was investigating the problem a little and I am not sure if you really need the type names or their OIDs for an export into Postgres' binary format (anymore). I think restrictions on matches between expected server-side OID and OIDs in the pg.bin dump were loosened around v14: https://www.postgresql.org/message-id/E1jxxnr-0003Mn-S8%40gemulon.postgresql.org
Since then, only mismatches in predefined system-level OIDs (<10000) error out (which is fine for our purposes since you registered them for all equivalent pyarrow types already). Same for array types.

Therefore, I believe (as a first step or fallback) you could just put dummy OIDs (>10000, maybe better >16384) in your Encoder(Builder) for a pa.Struct or the pa.List Encoder when it encounters a pa.Struct as inner_type.

For pg versions <14 or as safe default behaviour, one could indeed require a dict[pa.struct, tuple[OID, OID]] (or Map equiv. in rust) that is used during the traversal of the column type hierarchy (where the tuple describes the type OID and corresponding array_type OID).
Note that by looking at the source code, you actually do not need any OID for the first level of a record/composite type column (only the first children and every struct below).
I don't think we would ever need the type names as in the DDL except if we want the some_util_funcion_per_db_driver function to work with that to make it a bit easier.

What do you think about working on this together? My rust is unfortunately not very good (we will need some recursion for the struct encoders) but I could do testing and an example for the analyze_metadata and util_functions.

@adriangb
Copy link
Owner Author

That's really interesting news! I would be happy to support only Postgres > 14.

The recursion shouldn't be too hard. Do we need arbitrary recursion, or just one level?

@jpfeuffer
Copy link

I think ideally full recursion but it only recurses whenever you encounter a struct or a list (which you kind of handled already).
Regarding total size that needs to be put in the beginning of a binary struct encoding for an entry: it should be possible to keep a total byte count while recursing down but you could also do it like duckdb and save the start position when starting to insert in the buffer, do a start-end when the insert for an entry is done, go back to the start position and fill those bytes. Not sure what fits the current style best.

@adriangb
Copy link
Owner Author

We can do full recursion. IIRC I used the DuckDB way for Lists and other variable length types - I would do the same thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants