
postgres: add index for task_run_file_input(input_file_id) #608

Closed
wants to merge 3 commits

Conversation

fho
Collaborator

@fho fho commented Oct 2, 2024

Deleting untracked file_inputs from the database takes a very long time.
The table has a multicolumn B-tree index covering both of its columns.
The input_file_id column is the second column of that index, so filtering
on it alone still requires scanning the full index[^1]:

    Constraints on columns to the right of these columns are checked in
    the index, so they save visits to the table proper, but they do not
    reduce the portion of the index that has to be scanned.

Create an index for the input_file_id column.
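The migration can be sketched roughly as follows (the index name is taken from the EXPLAIN output below; `CONCURRENTLY` is an assumption to avoid blocking writes on the large table, the actual migration may create it differently):

```sql
-- Index name as it appears in the plan below; building it CONCURRENTLY
-- avoids locking writes on the ~56M-row table (assumption).
CREATE INDEX CONCURRENTLY task_run_file_input_input_file_id_idx
    ON task_run_file_input (input_file_id);
```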

Comparison of the query:

    EXPLAIN SELECT id FROM input_file WHERE NOT EXISTS (SELECT 1 FROM task_run_file_input AS trfi WHERE input_file.id = trfi.input_file_id);

Without index:

     Gather  (cost=2102870.71..2331530.97 rows=51672 width=4)
       Workers Planned: 2
       ->  Parallel Hash Anti Join  (cost=2101870.71..2325363.77 rows=21530 width=4)
             Hash Cond: (input_file.id = trfi.input_file_id)
             ->  Parallel Index Only Scan using input_file_pkey on input_file  (cost=0.42..1127.35 rows=28633 width=4)
             ->  Parallel Hash  (cost=1170539.13..1170539.13 rows=56766813 width=4)
                   ->  Parallel Seq Scan on task_run_file_input trfi  (cost=0.00..1170539.13 rows=56766813 width=4)

With index:

     Gather  (cost=1000.99..24457.81 rows=51672 width=4) (actual time=194.390..206.568 rows=0 loops=1)
       Workers Planned: 2
       Workers Launched: 2
       ->  Nested Loop Anti Join  (cost=0.99..18290.61 rows=21530 width=4) (actual time=153.282..153.283 rows=0 loops=3)
             ->  Parallel Index Only Scan using input_file_pkey on input_file  (cost=0.42..1127.35 rows=28633 width=4) (actual time=0.035..13.810 rows=22907 loops=3)
                   Heap Fetches: 19485
             ->  Index Only Scan using task_run_file_input_input_file_id_idx on task_run_file_input trfi  (cost=0.57..157.21 rows=7992 width=4) (actual time=0.006..0.006 rows=1 loops=68720)
                   Index Cond: (input_file_id = input_file.id)
                   Heap Fetches: 1352

[^1]: https://www.postgresql.org/docs/current/indexes-multicolumn.html

@fho fho self-assigned this Oct 2, 2024
fho added 2 commits November 11, 2024 17:23
Instead of requiring an up-to-date database schema, allow defining a
minimum and maximum compatible database schema version.
The maximum schema version is always the newest existing one at the time
of a release. The minimum schema version can be an older one if it is
still compatible.

This can make upgrading baur easier: when db schema changes are
backward compatible, multiple baur versions can be used with the same
database schema.
Even better would be to use a semantic version as the schema version.
That would require more work though :-)
@fho fho force-pushed the cleanup_file_inputs_query branch from f1a809a to 8ec4ac1 Compare November 11, 2024 16:24
@fho fho marked this pull request as ready for review November 11, 2024 16:40
@fho
Collaborator Author

fho commented Nov 12, 2024

has been merged manually

@fho fho closed this Nov 12, 2024
@fho fho deleted the cleanup_file_inputs_query branch November 12, 2024 12:06