support PG array dimensionality #411
Conversation
```python
    Build a list type with `ndims` dimensions containing nullable `base_type` as the innermost value type.
    """
    if ndims == 0:
        return UnionType(types=[NullType(), base_type])
```
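For orientation, here is a minimal sketch of how a recursive builder over `ndims` could work (the function name `build_list` is hypothetical, and the Recap type constructors are inferred from the snippet above, not the PR's verbatim code):

```python
from recap.types import IntType, ListType, NullType, RecapType, UnionType

def build_list(base_type: RecapType, ndims: int) -> RecapType:
    # Base case: the innermost value is the nullable base type.
    if ndims == 0:
        return UnionType(types=[NullType(), base_type])
    # Recursive case: wrap one ListType per declared dimension.
    return ListType(values=build_list(base_type, ndims - 1))

# INTEGER[][] (attndims = 2) becomes:
# list<list<union<null, int32>>>
nested = build_list(IntType(bits=32), ndims=2)
```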
I'm curious about this one. It seems right, but I'm not 100% sure. As I read it, there are a few things:
- DbapiConverter handles root-level NULLABLE fields (https://github.com/recap-build/recap/blob/main/recap/converters/dbapi.py#L15-L16)
- This code here handles NULLABLE items in a PG ARRAY field.
I think this is the right behavior. But I'm curious: are PG arrays always allowed NULLs in their dimensional values? I couldn't find good docs on this. I haven't tested it out.
I did some testing and digging, and afaict the answer is yes: the innermost value can always be null. Enforcing non-nulls requires adding some sort of CHECK constraint validation (per https://stackoverflow.com/a/59421233), which seems like a pretty challenging rabbit hole of digging through `information_schema.check_constraints`.
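To illustrate both halves (the table and constraint names here are made up; note also that `array_position` only supports one-dimensional arrays, so multi-dimensional enforcement is hairier still):

```sql
-- Element NULLs are always accepted, whatever the declared type:
CREATE TABLE null_demo (vals integer[]);
INSERT INTO null_demo VALUES ('{1, NULL, 3}');  -- succeeds

-- Enforcing non-null elements takes an explicit CHECK constraint:
ALTER TABLE null_demo ADD CONSTRAINT vals_no_null_elements
    CHECK (array_position(vals, NULL) IS NULL);
INSERT INTO null_demo VALUES ('{4, NULL}');     -- now rejected by the CHECK
```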
```diff
 @@ -8,32 +8,22 @@
     FloatType,
     IntType,
     ListType,
-    ProxyType,
```
Niiiice. Took me a sec to grok why we didn't need this anymore. Fully walking the n_dimensions means we don't need self-references. Awesome.
One question/nuance here: the PG dimensions are just a suggestion.

> The current implementation does not enforce the declared number of dimensions either. Arrays of a particular element type are all considered to be of the same type, regardless of size or number of dimensions. So, declaring the array size or number of dimensions in CREATE TABLE is simply documentation; it does not affect run-time behavior.

https://www.postgresql.org/docs/current/arrays.html

So the question is: do we want Recap to reflect the DB's data or its schema? My implementation (with ProxyType) reflected the data. Yours changes it to reflect the schema. Perhaps we want it configurable, with one as the default? WDYT?
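To make that concrete, a quick psql demo of what the docs describe (table name is made up):

```sql
-- Declared two-dimensional...
CREATE TABLE dims_demo (grid integer[][]);

-- ...but any dimensionality is accepted at run time:
INSERT INTO dims_demo VALUES ('{1,2,3}');        -- 1-D: succeeds
INSERT INTO dims_demo VALUES ('{{1,2},{3,4}}');  -- 2-D: succeeds

-- attndims reports the declared dimensions regardless:
SELECT attndims FROM pg_attribute
WHERE attrelid = 'dims_demo'::regclass AND attname = 'grid';  -- returns 2
```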
I like to think the schema is the beacon of truth for what the user intends for the column. If users are leveraging the column differently than the schema's representation, they should fix the schema. But I could see past mistakes leading to a situation where this isn't true, which would then lead to Recap constructing a false narrative about the data. I think making it configurable makes sense. Maybe default to ProxyType since that's the safer assumption? Would we want to add config params to the `PostgresqlConverter` constructor?
Ya, can you add a param to the init to configure it? Defaulting to proxy is safer, as you say.
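A rough sketch of what that could look like (the proxy-mode branch is an assumption; the param name `ignore_array_dimensionality` is the one that shows up in the diff below):

```python
class PostgresqlConverter:
    def __init__(self, ignore_array_dimensionality: bool = True) -> None:
        # True (default, safer): reflect the data. PG doesn't enforce
        # declared dims, so arrays map to one self-referential ProxyType list.
        # False: reflect the schema. Nest one ListType per attndims.
        self.ignore_array_dimensionality = ignore_array_dimensionality
```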
recap/clients/postgresql.py (Outdated)
```sql
    information_schema.columns.*,
    pg_attribute.attndims
FROM information_schema.columns
JOIN pg_attribute ON information_schema.columns.column_name = pg_attribute.attname
```
I don't think this is enough. I think you need to join on table, schema, and column. It appears `pg_attribute` doesn't have those:
Table "pg_catalog.pg_attribute"
Column | Type | Collation | Nullable | Default
----------------+-----------+-----------+----------+---------
attrelid | oid | | not null |
attname | name | | not null |
atttypid | oid | | not null |
attstattarget | integer | | not null |
attlen | smallint | | not null |
attnum | smallint | | not null |
attndims | integer | | not null |
attcacheoff | integer | | not null |
atttypmod | integer | | not null |
attbyval | boolean | | not null |
attalign | "char" | | not null |
attstorage | "char" | | not null |
attcompression | "char" | | not null |
attnotnull | boolean | | not null |
atthasdef | boolean | | not null |
atthasmissing | boolean | | not null |
attidentity |
"char" | | not null |
attgenerated | "char" | | not null |
attisdropped | boolean | | not null |
attislocal | boolean | | not null |
attinhcount | integer | | not null |
attcollation | oid | | not null |
attacl | aclitem[] | | |
attoptions | text[] | C | |
attfdwoptions | text[] | C | |
attmissingval | anyarray | | |
So perhaps we need another join here as well?
Per ChatGPT: [suggested query screenshot]
Ah nice catch, I'll add in the other joins
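For reference, a sketch of what the fuller join could look like (not the PR's final query; it resolves schema and table through pg_namespace and pg_class so that attname isn't matched across unrelated tables):

```sql
SELECT
    information_schema.columns.*,
    pg_attribute.attndims
FROM information_schema.columns
JOIN pg_namespace
    ON pg_namespace.nspname = information_schema.columns.table_schema
JOIN pg_class
    ON pg_class.relname = information_schema.columns.table_name
   AND pg_class.relnamespace = pg_namespace.oid
JOIN pg_attribute
    ON pg_attribute.attrelid = pg_class.oid
   AND pg_attribute.attname = information_schema.columns.column_name;
```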
```diff
-    test_bit_array BIT(8)[]
+    test_bit_array BIT(8)[],
+    test_int_array_2d INTEGER[][],
+    test_text_array_3d TEXT[][][]
```
Do you mind adding a NOT NULL array as well? I realized we haven't tested that.

This looks great. I think we're ready to merge after you add the ProxyType config stuff!
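Something along these lines in the test DDL would cover it (table and column names are made up; note that NOT NULL constrains the column, not the elements):

```sql
CREATE TABLE not_null_array_demo (
    test_not_null_array INTEGER[] NOT NULL
);
INSERT INTO not_null_array_demo VALUES (NULL);        -- fails: column is NOT NULL
INSERT INTO not_null_array_demo VALUES ('{1, NULL}'); -- succeeds: elements may still be NULL
```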
…e how the reader interprets the number of dimensions in an array
recap/converters/postgresql.py (Outdated)
```diff
-    def __init__(self, namespace: str = DEFAULT_NAMESPACE) -> None:
+    def __init__(
+        self,
+        ignore_array_dimensionality: bool = True,
```
I wonder if the negation, `support_array_dimensionality = False` or `include_array_dimensions = False`, would be more intuitive for the user.
My vote is for `enforce_array_dimensions`.
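Under that naming the flag would read positively at call sites, e.g. (hypothetical usage; the default stays on the safer proxy behavior):

```python
# Schema-faithful: nest one ListType per declared dimension.
converter = PostgresqlConverter(enforce_array_dimensions=True)

# Data-faithful (default): a single self-referential ProxyType list.
converter = PostgresqlConverter(enforce_array_dimensions=False)
```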
Sorry for the delayed review; things have been busy on the home front. 🎄 I'll try and get to this over the next few days.
@alexdemeo 0.9.6 is up on pypi with this change! Thanks!
Add array support to the postgres reader/converter, with dimensionality read from `pg_attribute.attndims`. This is grabbed by joining on `information_schema.columns.column_name = pg_attribute.attname`.

This is a followup to #410, which lays a tiny bit of groundwork for #264 (comment).