Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More restrictive spatial join yielding worse join strategy #455

Open
cpa opened this issue Nov 17, 2024 · 1 comment
Open

More restrictive spatial join yielding worse join strategy #455

cpa opened this issue Nov 17, 2024 · 1 comment

Comments

@cpa
Copy link

cpa commented Nov 17, 2024

Given a set of polygons, I'm trying to get all pairs of polygons whose intersection is at least some (non-zero) number.

My first query yields a IE_JOIN, which is good and indeed quite fast, but the second query (the same as first, just adding where ST_Area(ST_Intersection(p1.geom, p2.geom)) > 10) yields a BLOCKWISE_NL_JOIN which is very slow. I've tried to trick duckdb using a with construct, but to no avail.

These plans are from a test db (see below) but I've spotted the issue on a 200GB on my laptop so I can confirm that's it's not duckdb doing smart stuff because the data is small. Also, I can confirm that saving the result of the first query in a table and then performing the area filtering on this table is much faster than queries 2 or 3.

Actual results: queries 2 and 3 use BLOCKWISE_NL_JOIN
Expected results: queries 2 and 3 use IE_JOIN

D explain select p1.id, p2.id from polygon_data p1 join polygon_data p2 on ST_Intersects(p1.geom, p2.geom);

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│             id            │
│                           │
│        ~100000 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, geom) │
│                           │
│        ~100000 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          IE_JOIN          │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        │
│ ST_XMin(ST_Extent(geom)) <│
│ = ST_XMax(ST_Extent(geom))│
│ ST_XMax(ST_Extent(geom)) >├──────────────┐
│ = ST_XMin(ST_Extent(geom))│              │
│ ST_YMin(ST_Extent(geom)) <│              │
│ = ST_YMax(ST_Extent(geom))│              │
│ ST_YMax(ST_Extent(geom)) >│              │
│ = ST_YMin(ST_Extent(geom))│              │
│                           │              │
│        ~100000 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│        polygon_data       ││        polygon_data       │
│                           ││                           │
│        Projections:       ││        Projections:       │
│            geom           ││            geom           │
│             id            ││             id            │
│                           ││                           │
│        ~100000 Rows       ││        ~100000 Rows       │
└───────────────────────────┘└───────────────────────────┘
D explain select p1.id, p2.id from polygon_data p1 join polygon_data p2 on ST_Intersects(p1.geom, p2.geom) where ST_Area(ST_Intersection(p1.geom, p2.geom)) > 10;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│             id            │
│                           │
│        ~100000 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│     BLOCKWISE_NL_JOIN     │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│         Condition:        ├──────────────┐
│ (ST_Intersects(geom, geom)│              │
│        AND (ST_Area       │              │
│   (ST_Intersection(geom,  │              │
│       geom)) > 10.0))     │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│        polygon_data       ││        polygon_data       │
│                           ││                           │
│        Projections:       ││        Projections:       │
│            geom           ││            geom           │
│             id            ││             id            │
│                           ││                           │
│        ~100000 Rows       ││        ~100000 Rows       │
└───────────────────────────┘└───────────────────────────┘

D explain with tmp as (select p1.id id1, p2.id id2, p1.geom g1, p2.geom g2 from polygon_data p1 join polygon_data p2 on ST_Intersects(p1.geom, p2.geom)) select id1, id2 from tmp where ST_Area(ST_Intersection(g1, g2)) > 10;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            id1            │
│            id2            │
│                           │
│        ~100000 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│     BLOCKWISE_NL_JOIN     │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│         Condition:        ├──────────────┐
│ (ST_Intersects(geom, geom)│              │
│        AND (ST_Area       │              │
│ (ST_Intersection(g1, g2)) │              │
│          > 10.0))         │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│        polygon_data       ││        polygon_data       │
│                           ││                           │
│        Projections:       ││        Projections:       │
│            geom           ││            geom           │
│             id            ││             id            │
│                           ││                           │
│        ~100000 Rows       ││        ~100000 Rows       │
└───────────────────────────┘└───────────────────────────┘

You can generate dummy data with this script:

import duckdb
import random
import math

def generate_valid_polygon(num_sides, min_x, max_x, min_y, max_y):
    # Generate points in polar coordinates centered within or around the bounding box
    center_x = (min_x + max_x) / 2
    center_y = (min_y + max_y) / 2
    radius = (max_x - min_x) / 2
    angles = sorted([random.uniform(0, 2 * math.pi) for _ in range(num_sides)])
    vertices = [
        (
            center_x + random.uniform(radius * 0.5, radius * 1.5) * math.cos(angle),
            center_y + random.uniform(radius * 0.5, radius * 1.5) * math.sin(angle)
        )
        for angle in angles
    ]
    # Ensure the polygon is closed by repeating the first point at the end
    vertices.append(vertices[0])
    return vertices

def main(num_polygons, num_sides, min_x, max_x, min_y, max_y):
    # Initialize an in-memory DuckDB instance
    con = duckdb.connect('polygons.db')
    con.execute("INSTALL 'spatial';")
    con.execute("LOAD 'spatial';")

    # Create a table for storing polygon data as text initially
    con.execute("CREATE TABLE IF NOT EXISTS polygon_data (id INTEGER, polygon TEXT)")

    # Generate N valid polygons and insert them into the database
    for i in range(1, num_polygons + 1):
        polygon_vertices = generate_valid_polygon(num_sides, min_x, max_x, min_y, max_y)
        wkt = 'POLYGON((' + ', '.join(f"{x:.2f} {y:.2f}" for x, y in polygon_vertices) + '))'
        con.execute("INSERT INTO polygon_data (id, polygon) VALUES (?, ?)", (i, wkt))

    # Alter the table to add a new geometry column and populate it
    con.execute("ALTER TABLE polygon_data ADD COLUMN geom GEOMETRY;")
    con.execute("UPDATE polygon_data SET geom = ST_MakeValid(ST_GeomFromText(polygon));")

    # Verify the conversion by selecting and printing some geometric data
    result = con.execute("SELECT id, ST_AsText(geom) FROM polygon_data LIMIT 10").fetchall()
    for row in result:
        print(row)

    # Close the connection
    con.close()

# Parameters
num_polygons = 100_000  # Number of polygons to generate
num_sides = 6           # Number of sides for each polygon
min_x, max_x = 0, 100   # Define bounding box
min_y, max_y = 0, 100

# Run the script
main(num_polygons, num_sides, min_x, max_x, min_y, max_y)
@Maxxen
Copy link
Member

Maxxen commented Nov 17, 2024

Hello!
This is currently expected, we only turn the join into an I.E. join if there is a single spatial predicate as the join condition.
However, it would theoretically be possible to extract other simple predicates from the join and apply them afterwards as filters, but that would require more advanced query planning and optimizer logic to weigh the different alternatives (either do a normal predicate join and apply the spatial filter afterwards, or do a spatial I.E join and apply the filter afterwards). Im not sure we have the capabilities to do that intelligently just yet given that we don't have any spatial statistics for the geometry type to estimate selectivity with (and DuckDB doesn't support custom statistics for custom types yet either). But we can leave this issue open as a tracking issue to implement this functionality in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants