Data Representations #2523
Conversation
Ah sorry about that, I don't have more big refactors planned. Rebasing on main would be great to see how the test suite does. Looks like an interesting feature! 👀
This is great stuff. Looking at the higher-level architecture, I see we are storing those parsers and formatters in `dbTables` right now. This will prevent us from doing the same thing for computed columns and/or RPCs, correct?
Computed columns could be solved by making them "first-class" columns in terms of the schema cache - i.e. parse those functions and add them as columns in the cache. This would also improve the OpenAPI output, where those functions are currently missing. That's unrelated to this PR, though.
But with the current structure we could not support RPCs, I think. Would RPC support be useful? I think so. At least for the output part, but probably also to parse input for arguments. We should maybe think about RPCs from the beginning, because that might affect the overall architecture.
Thanks! Since the result of
Oh yes, I didn't know computed columns don't go into the cache! My vote, for what it's worth, would be to consider them normal columns, perhaps just with a flag indicating they are read-only? (Although even that isn't certain since I guess you could make computed columns read-write with
I think the need is slightly less pressing since with RPC you can already do whatever function calls you need to parse your inputs and format your outputs when you define your function. And if your function can't be changed, you can define a second function that calls the first with the appropriate layer of conversions. So RPC data representations are more of a nice-to-have than something that enables entirely new use cases. That said, it would be intuitive and probably also convenient to have that support. And if we do, it should definitely cover both input and output, for the same reason: intuitiveness. In order to add such support, my first instinct would perhaps be to split
Since we already know the types of a proc, both in and out, we could just look up each type in the second map to figure out what functions to apply. For tables we'd continue to rely on the first map, since we need that to build the
I think we should get all the formatters/parsers in a separate query. Right now we get them together with the column definitions. This has two drawbacks:
I think it's very convenient to return table types (row types for existing tables) from RPCs - and then potentially use further embedding on the result. This would not be possible if I wanted to return JSON from the RPC and make the transform myself. So having those formatters available for RPCs would be really helpful.
Ah yes, that's basically what I suggested above, too.
This PR has been basically rewritten from scratch. My initial work happened to coincide with a big refactoring, and since there were open points about the actual mechanism of the functionality, a 'big rewrite' ended up making sense. This PR depends on #2542.
This PR has been rebased on the latest update to #2542. The failing tests may be caused by this unexpected behaviour (tested on Postgres 9.6):
In recent versions

Indeed, this otherwise identical query does not cause an error (although it incorrectly gets `due_at` as the integer 2018): DB Fiddle.

A quick look around suggests early versions of

An easy solution is to make data reps for payload parsing require Postgres 13 or above. Thoughts?

Edit: another option would be to read the nested values as text (confirmed this does not error out) and ask the user to provide a text parser instead of a JSON parser. This would be illogical (a text parser to parse JSON), technically incorrect (
So the error for Postgres 10 and above is actually something else, it seems. One minor error message change and a lack of support for arrays of domain types breaks the
Summary of the last set of updates:
Most notably, I have resolved a hard-to-exercise edge case when using computed relations. When rewriting the PR I made a mistake that meant data representations were no longer applied to fields read via a computed relation. This has now been fixed and given test coverage.
@aljungberg Could you rebase this on `main`?
@aljungberg More of a high level question. Since you mention plans for binary types:
On #1582 (comment) we're thinking of using pg AGGREGATEs to output custom binary types. Do you see that feature conflicting with what you have in mind? Or would it play along? Ideally we'd maintain a cohesive feature set.

@wolfgangwalther Also, on #1582 (comment) you mentioned the possibility of using a custom function to parse input from the client - would data representations work better for that case? Or would both features be complementary?
Sure, no problem. I'm pencilling this in for next week (other work permitting). I'll take a look at your aggregate question too. Thanks for picking this up!
… fields. See PR PostgREST#2523. Most notable code changes:
- Load data representation casts into the schema cache.
- Data representations for reads, filters, inserts, updates, views, over joins.
- `CoercibleField` represents name references in queries where coercion may be needed.
- `ResolverContext` helps facilitate field resolution during planning.
- The planner 'resolves' names in the API query and pairs them with any implicit conversions to be used in the query builder stage.
- Tests for all of the above.
Here's the update that hopefully ties up all known loose ends, @steve-chavez. Summary:
(The only remaining failure is that thing where the code coverage checker flags all data fields, even if they're in fact exercised by tests.)
@aljungberg Is the above true? I'm trying to reproduce it as:

```sql
create table qux(id bigint, due_at timestamptz);
insert into qux values (1, now());
```

```
curl 'localhost:3000/qux'

[{"id":1,"due_at":"2023-01-26T18:30:16.697152-05:00"}]
```

The
Just finished trying the features this brings, really cool! So it seems "data representations" would be superior to computed columns, right? Because we can insert/update on them and they don't require an extra index for search. I guess computed columns would only make sense in cases where users can't or don't want to modify a table. Will review the code next.
Also, just remembered a previous case where this was needed: #1597. Basically it required calling

The solution was RPC, but this seems better as it doesn't require repeating the INSERT statement. It also seems useful for the PostGIS geometry type, which doesn't have a

Which brings me to a question... could we extend this feature to also work for base types? It would not only be useful for
…nsformation.
- With the previous method, very long queries such as `ANY (ARRAY[test.color('000100'), test.color('CAFE12'), test.color('01E240'), ...` could be generated. Consider the case where the parser function name is 45 characters and there are a hundred literals. That's 4.5kB of SQL for the function name alone!
- The new version uses `unnest`: `ANY (SELECT test.color(unnest('{000100,CAFE12,01E240,...}'::text[]))` to produce a much shorter query.
- This is likely to be more performant and, either way, much more readable and debuggable in the logs.
Reading more in #1582, I see I may inadvertently have intersected with your prior work in some regards. As you rightfully pointed out, the idea of using aggregates there is quite related. I think data reps solve similar issues without requiring

So your
Yes. I'm close to finishing #1582 (which has stalled for years already). Let me finish that one, and then I'll also help with merging this PR - we'll also figure out how the features can play together.
FYI, Wolfgang is proposing an idea with DOMAINs for #1582.
Thanks for the heads-up. That actually would work really well; I posted my thoughts over there.
… fields. See PR PostgREST#2523. Most notable code changes:
- Load data representation casts into the schema cache.
- Data representations for reads, filters, inserts, updates, views, over joins.
- `CoercibleField` represents name references in queries where coercion may be needed.
- `ResolverContext` helps facilitate field resolution during planning.
- The planner 'resolves' names in the API query and pairs them with any implicit conversions to be used in the query builder stage.
- Tests for all of the above.
- More consistent naming (TypedX -> CoercibleX).

New: unit tests for more data representation use cases; helpful as examples as well.
New: update CHANGELOG with data representations feature description.
Fixed failing idempotence test.
New: replace date formatter test with one that does something.
Fixup: inadvertent CHANGELOG change after rebase.
Cleanup: `tfName` -> `cfName` and related. Document what IRType means. Formatting.
New: use a subquery to interpret `IN` literals requiring data rep transformation.
- With the previous method, very long queries such as `ANY (ARRAY[test.color('000100'), test.color('CAFE12'), test.color('01E240'), ...` could be generated. Consider the case where the parser function name is 45 characters and there are a hundred literals. That's 4.5kB of SQL for the function name alone!
- The new version uses `unnest`: `ANY (SELECT test.color(unnest('{000100,CAFE12,01E240,...}'::text[]))` to produce a much shorter query.
- This is likely to be more performant and, either way, much more readable and debuggable in the logs.
I see that #2542 was merged. I don't know which version of PostgREST Supabase deploys, but I hope the bug I'm encountering has been fixed. Since I've added a counter column of type

After a bit of digging, I've found that PostgREST constructs a complex query from incoming requests. Here is one such query that was logged:

```sql
WITH pgrst_source AS (
  WITH pgrst_payload AS (SELECT $1 AS json_data),
  pgrst_body AS (
    SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val
    FROM pgrst_payload
  )
  INSERT INTO "public"."collections"("cat", "desc", "name")
  SELECT "cat", "desc", "name"
  FROM json_populate_recordset(null::"public"."collections", (SELECT val FROM pgrst_body)) _
  RETURNING "public"."collections".*
)
SELECT
  '' AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  array[]::text[] AS header,
  coalesce((json_agg(_postgrest_t)->0)::text, 'null') AS body,
  nullif(current_setting('response.headers', true), '') AS response_headers,
  nullif(current_setting('response.status', true), '') AS response_status
FROM (SELECT "collections".* FROM "pgrst_source" AS "collections") _postgrest_t
```
SELECT "cat", "desc", "name"
FROM json_populate_recordset(null::"public"."collections", '[{"name":"test book","desc":"","cat":"books"}]'); …and of course I got the same error. Anyway, since it was replaced with |
And did you get a better result for
No, I don't think so. This does not seem to be connected to this MR either. While we do a lot of stuff with domains in the context of data representations, considering the

Could you please open a separate issue for this with a minimal, reproducible example, etc.?
I'm coming back with the results and nope,
Yes, my apologies. I just wasn't sure whether to open a new issue. It should be fairly easy to reproduce. Let me come up with an example. Thanks.
🚀 Merged on #2839
Data Representations
Present and receive API fields with custom formatting rules.
For certain APIs, the default conversion to and from JSON performed by PostgreSQL may not be right. For example you might wish to represent dates in UNIX epoch timestamp format, fixed precision decimals as strings rather than JSON floats, colours as CSS hex strings, binary blobs as base64 data and so forth. Perhaps you even need different representations in different API endpoints.
Data Representations allow you to create custom, two-way convertible representations of your data on a per-field level using standard PostgreSQL functions. Just create a `domain`, which we can view as a kind of type alias that does nothing by default, and then define casts to and from JSON. (Normally PostgreSQL ignores domain casts, but Data Representations detects those casts and applies them automatically as needed.)

Features:
- Formatting output for `GET`.
- Parsing input for `POST` or `PATCH`.
- Filtering, via a `TEXT` -> my custom type cast.
- `RETURNING` operations, like if you request the full body of your patched resource with `return=presentation`.

Example
PostgreSQL has a built-in formatter for dates:
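Roughly, with hypothetical table and column names, and assuming the server timezone is UTC:

```sql
-- Hypothetical table: a plain timestamptz column rendered by PostgreSQL's
-- default JSON conversion.
create table events (
  id     bigint primary key,
  due_at timestamp with time zone
);

insert into events values (1, '2022-12-31 11:00:00+00');

-- GET /events
-- [{"id":1,"due_at":"2022-12-31T11:00:00+00:00"}]
```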
Let's replace the `+00:00` suffix with `Z`, an equivalent but shorter ISO 8601 date:
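A sketch of what that could look like, using an illustrative domain name (`app_timestamp`) and the hypothetical `events` table from above:

```sql
-- Illustrative domain: an alias for timestamptz whose JSON output ends in 'Z'.
create domain app_timestamp as timestamp with time zone;

-- Formatter: render the value in UTC with a trailing 'Z'.
create function json(app_timestamp) returns json as $$
  select to_json(to_char($1 at time zone 'utc', 'YYYY-MM-DD"T"HH24:MI:SS"Z"'));
$$ language sql immutable;

-- The implicit domain cast is what gets detected and applied on output.
create cast (app_timestamp as json) with function json(app_timestamp) as implicit;

-- Switch the column over to the domain; the stored data is unchanged.
alter table events alter column due_at type app_timestamp;

-- GET /events
-- [{"id":1,"due_at":"2022-12-31T11:00:00Z"}]
```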
The underlying data is unchanged since our custom domain is just an alias for `timestamp with time zone`.

Let's look at some colours.
In this case we store our colour as an integer, so it's being formatted as an integer by PostgreSQL. Let's consider that an implementation detail we wish to hide from our API consumers.
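A sketch of the kind of setup this describes, with illustrative names for the domain and its parser/formatter functions:

```sql
-- The colour is stored as an integer, with a built-in validity check.
create domain color as integer
  check (value between 0 and 16777215);   -- 0x000000 .. 0xFFFFFF

-- Parser: 'CAFE12' or '#CAFE12' -> 13303314
create function color(text) returns color as $$
  select (('x' || lpad(ltrim($1, '#'), 8, '0'))::bit(32)::int)::color;
$$ language sql immutable;

-- Parser for JSON payloads: extract the scalar and reuse the text parser.
create function color(json) returns color as $$
  select color($1 #>> '{}');
$$ language sql immutable;

-- Formatter: 13303314 -> '#CAFE12'
create function json(color) returns json as $$
  select to_json('#' || lpad(upper(to_hex($1)), 6, '0'));
$$ language sql immutable;

-- The implicit domain casts are what get picked up and applied automatically.
create cast (json as color) with function color(json) as implicit;
create cast (color as json) with function json(color) as implicit;
```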
To the outside world, it looks as if your column were a string holding hexadecimal values, while under the hood it's an integer that even comes with a built-in check constraint on validity. Your data is clean and space-efficient, yet presented and interpreted in a user-friendly manner.
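For instance, with a hypothetical `items` table that uses the domain:

```sql
-- Hypothetical table using the domain.
create table items (
  id   bigint primary key,
  code color
);

insert into items values (1, color('#CAFE12'));

-- GET /items
-- [{"id":1,"code":"#CAFE12"}]
--
-- POST /items with {"id":2,"code":"#01E240"} is parsed back to the integer 123456.
```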
What about filtering? For that we need one last converter. Since query parameters are given as strings, let's make a `text->color` cast. We already have the function for it. Now we can filter:
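Continuing the sketch, the cast and a made-up filter request might look like:

```sql
-- Query parameters arrive as text, so expose the existing parser as a cast.
create cast (text as color) with function color(text) as implicit;

-- Hypothetical request:
--   GET /items?code=eq.%23CAFE12
-- which can then be compared against the stored integer, conceptually:
--   WHERE "code" = color('#CAFE12')
```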
Under the hood the value is first converted to its underlying integer type before querying the table, which ensures maximum performance. Without Data Representations it would have been easy to end up with a situation where PostgreSQL instead had to format every row's colour as a hexadecimal string and then compare, making index usage hard.
TODO
This pull request implements #2310. I renamed the concept from "transformers" to "data representations", which seemed more intuitive.
Future Direction
Some things that could be implemented in the future: