Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database tables do not use indexes #43

Open
cameronmurdoch opened this issue Apr 25, 2022 · 0 comments
Open

Database tables do not use indexes #43

cameronmurdoch opened this issue Apr 25, 2022 · 0 comments

Comments

@cameronmurdoch
Copy link

Database tables hosts and hosts_source do not define any indexes. Even with small(ish) tables, this can lead to extra IO and 'slow' queries.

Background info:
https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING
https://www.postgresql.org/docs/11/datatype-json.html#JSON-INDEXING

In these examples hosts_source has 12763 rows and is approx 9.5MB.

A typical query:

db_cursor.execute(f"SELECT data FROM {self.db_source_table} WHERE data->>'hostname' = %s", [hostname])

This gives, for example, the following query plan:

# explain(analyse,buffers) select data from hosts_source WHERE data->>'hostname' = 'battersea.uio.no';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on hosts_source  (cost=0.00..1364.44 rows=64 width=696) (actual time=0.829..11.086 rows=1 loops=1)
   Filter: ((data ->> 'hostname'::text) = 'battersea.uio.no'::text)
   Rows Removed by Filter: 12762
   Buffers: shared hit=1217
 Planning Time: 0.054 ms
 Execution Time: 11.120 ms
(6 rows)

PostgreSQL has no choice but to perform a sequential scan on the whole table. This is pretty fast, around 11ms, as the table is only 9.5MB. However, the query can run approx 250 times faster if we create a GIN index and rewrite the query to use the jsonb containment operator:

# create index hosts_source_gin_idx on hosts_source using GIN (data jsonb_path_ops);
CREATE INDEX
# explain(analyse,buffers) select data from hosts_source WHERE data @> '{"hostname": "battersea.uio.no"}';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hosts_source  (cost=8.01..10.02 rows=1 width=696) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on hosts_source_gin_idx  (cost=0.00..8.01 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.190 ms
 Execution Time: 0.042 ms
(11 rows)

We can see that postgreSQL could use the index and required significantly less IO (buffers) than before.

Note the this will also speed up other queries (including updates and deletes) that search for hostname. For example:

db_cursor.execute(f"SELECT data FROM {self.db_source_table} WHERE data->>'hostname' = %s AND data->'sources' ? %s", [host.hostname, source])

# explain(analyse,buffers) select data from hosts_source WHERE data @> '{"hostname": "battersea.uio.no"}' and data->'sources' ? 'nivlheim';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hosts_source  (cost=8.01..10.03 rows=1 width=696) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
   Filter: ((data -> 'sources'::text) ? 'nivlheim'::text)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on hosts_source_gin_idx  (cost=0.00..8.01 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
         Index Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.113 ms
 Execution Time: 0.037 ms
(12 rows)

Here, PostgreSQL could use the index to find all the rows matching the hostname first and then apply the filter on sources.

However this query will not use the index:

db_cursor.execute(f"SELECT DISTINCT data->>'hostname' FROM {self.db_source_table} WHERE data->'sources' ? %s", [source])

It is possible to create an index that can be used here, eg

# create index hosts_source_gin_sources_idx on hosts_source using GIN ((data -> 'sources'));

But any performance improvement is somewhat dependent on how many rows can be discarded by the filter and how big the table is. The postgresql documentation says that if querying for particular items within the "sources" key is common, defining an index like this may be worthwhile. Some quick experiments led to 0x to 3-4x speed improvements depending on the search key used.

A final note is that on for example postgresql 14, jsonpath queries will use the jsonb_path_ops index created above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant