Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Bug when Null Filtering on embedded resources #2951

Merged
merged 3 commits into from
Sep 15, 2023

Conversation

laurenceisla
Copy link
Member

Closes #2800

@laurenceisla laurenceisla marked this pull request as ready for review September 15, 2023 02:08
@laurenceisla laurenceisla merged commit fa182c2 into PostgREST:main Sep 15, 2023
@wolfgangwalther
Copy link
Member

I'm not sure whether the IS DISTINCT FROM fix is the best way to go here. We always have at least one column that is joined on. The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

@wolfgangwalther
Copy link
Member

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

@steve-chavez
Copy link
Member

steve-chavez commented Sep 15, 2023

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

That might be difficult as joins much change the shape of the query.

Edit: additionally ANTI JOINs and OR queries wouldn't work no more.

@steve-chavez
Copy link
Member

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

What would be the advantage of that one? DISTINCT FROM null is more similar to table=not.is.null.

@laurenceisla
Copy link
Member Author

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

It could be possible for anti joins but the OR queries wouldn't work this way. Although I get that it's not 1-1 with PostgreSQL (table=isdistinct.null might be more precise). With that in mind, using is not null gives the errors from the issue: it checks if there's a null value for every column in the row instead of the existence of the row.

@wolfgangwalther
Copy link
Member

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

That might be difficult as joins much change the shape of the query.

Edit: additionally ANTI JOINs and OR queries wouldn't work no more.

Yes. I didn't say table=is.null should be replaced with a join. Only table=is.not.null.

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

What would be the advantage of that one? DISTINCT FROM null is more similar to table=not.is.null.

Performance. I don't think the distinct from approach will be optimized very well, while comparing only the join columns is very likely optimized nicely.

@laurenceisla laurenceisla deleted the fix-notisnull branch September 18, 2023 18:41
@laurenceisla
Copy link
Member Author

laurenceisla commented Sep 19, 2023

I did some benchmarks using pgbench for each of the approaches to the resulting query of:

curl "http://localhost:3000/projects?select=name,clients(name)&clients=not.is.null"
Queries:
-- IS NOT NULL (Before the change)

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;


-- IS DISTINCT FROM NULL

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;


-- INNER JOIN

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   INNER JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
The `pgbench` for each of the queries:
-- IS NOT NULL (Before the change)

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 152304
number of failed transactions: 0 (0.000%)
latency average = 0.164 ms
initial connection time = 1.737 ms
tps = 6092.564790 (without initial connection time)


 -- IS DISTINCT FROM NULL

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 158569
number of failed transactions: 0 (0.000%)
latency average = 0.158 ms
initial connection time = 2.025 ms
tps = 6343.240056 (without initial connection time)


-- INNER JOIN

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 169379
number of failed transactions: 0 (0.000%)
latency average = 0.148 ms
initial connection time = 1.733 ms
tps = 6775.588488 (without initial connection time)

Performance. I don't think the distinct from approach will be optimized very well, while comparing only the join columns is very likely optimized nicely.

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR between ?or=(table1.not.is.null,table2.not.is.null), for instance.

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

I didn't know how to implement this for the test (and allow the OR across joins), just used the IS NOT NULL before the change in this PR.

@steve-chavez
Copy link
Member

-- IS NOT NULL (Before the change)
tps = 6092.564790 (without initial connection time)

-- IS DISTINCT FROM NULL
tps = 6343.240056 (without initial connection time)

So we getter perf now! 🎉

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR

Yeah, INNER JOIN cannot be done and this goes back to the introduction of top-level filtering: #1949 (comment). Iced-Sun's query helped us there.

@wolfgangwalther
Copy link
Member

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR between ?or=(table1.not.is.null,table2.not.is.null), for instance.

Right, the OR approach can't be written with JOIN easily. But in my latest comment I didn't say "write as JOIN", but "check only join columns". This is possible to do with OR as well.

The test you did is very artificial, because the tables are really small (few columns!). Once you add a lot more columns, those tests will show that both before the change and after the change with IS DISTINCT are both very slow compared to just testing the join columns.

So we getter perf now!

As I said before this is misleading. The performance is not really better, it's less bad. But not good, yet.

Yeah, INNER JOIN cannot be done

If you test the join columns, PG might be able to rewrite this to an inner join automatically.

@laurenceisla
Copy link
Member Author

laurenceisla commented Sep 21, 2023

But in my latest comment I didn't say "write as JOIN", but "check only join columns".

Yes, at the end of the tests I added that I couldn't find a way on how to do this, that's why it's not included there.

If you test the join columns, PG might be able to rewrite this to an inner join automatically.

Cool, then the problem right now is to rewrite the queries to allow filtering on the JOIN columns. It does not seem to be trivial: as it is now, we select from the LEFT JOIN LATERAL subquery, if the join column is not selected by the user ?select=joincol, then it's not included in that subquery, hence the filter cannot be done. We need to find a way around that, if anyone have some ideas I could give them a try.

Edit: opened an issue to keep track of this #2961

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Null filtering on Embedded Resources bug
3 participants