OOM aggregate / group by query on Geometry type #435
Hi! Thanks for opening this issue! Depending on how many groups row_id creates, you will have to keep all the unioned geometries in memory at the same time until the full operation finishes, so it's unlikely that adjusting the number of threads DuckDB uses would make any difference. We do have some work in progress in DuckDB core that can make use of partitioning (and eventually sorting) information to avoid keeping aggregate states around once the system knows no more entries will be part of a group, but that's probably a long way out.

In general, the spatial union operation is one of the most memory-intensive things you can do in geospatial processing. I haven't looked at our implementation in a while, so there might be some optimizations I missed the first time around, but as long as we rely on GEOS I think there will always be some amount of uncontrollable memory usage. Like most things related to spatial, I'd like to one day provide our own implementation, but building an overlay engine in particular would require a ton of work and testing to catch all the edge cases. Even then, I doubt st_union_agg would ever be immune to memory constraints.

To summarize: I'll try to find some time to revisit this part of the code and see if it can be optimized, but it's probably always going to cause a lot of memory pressure, and I don't think there are any workarounds currently.
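One way to shrink the size of any single aggregate state is a two-stage union: union within smaller sub-buckets of each group first, then union the partial results. This is only a hedged sketch, not a feature of the spatial extension; `my_geometry_table`, `row_id`, and `geom` are placeholder names, and the 16-way bucketing is arbitrary:

```sql
-- Stage 1: union within smaller sub-buckets of each group,
-- so each aggregate state holds fewer geometries at once.
CREATE TEMP TABLE partial_unions AS
SELECT row_id,
       hash(rowid) % 16 AS bucket,
       ST_Union_Agg(geom) AS geom
FROM my_geometry_table
GROUP BY row_id, bucket;

-- Stage 2: union the (far fewer) partial results per group.
CREATE TABLE unioned AS
SELECT row_id,
       ST_MakeValid(ST_Union_Agg(geom)) AS geom
FROM partial_unions
GROUP BY row_id;
```

This does not remove the fundamental pressure described above, since all groups are still live until each aggregate finishes, but it can lower the peak size of each individual union state.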
Thank you for such a prompt and thoughtful reply. I'm currently using dbt to orchestrate model creation, so I'm going to have to employ some sort of dynamic batching. With no memory limit set, and with the underlying table this CTAS pulls from reduced to only 10_000_000 records, the build completed in 15 seconds with no OOM kill. Great. Thank you for clearing up my path of approach!
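In case it helps others who hit this: the batching could look something like the sketch below (table, column, and bound values are illustrative; the orchestrator, e.g. dbt, would template the range per batch). Note that the batch predicate is on the grouping key itself, so every group lands wholly inside one batch and no cross-batch merge is needed:

```sql
-- One batch of the model: restrict the scan to a slice of the
-- grouping key, aggregate that slice, and append to the target.
INSERT INTO my_model
SELECT row_id,
       ST_MakeValid(ST_Union_Agg(geom)) AS geom
FROM my_geometry_table
WHERE row_id >= 0 AND row_id < 10000000  -- bounds templated per batch
GROUP BY row_id;
```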
Feel free to mark this as closed if you'd like, as I believe it's a design issue on my end at this point.
Problem

Good morning! I want this query to run and perform an aggregate over the `GEOMETRY` type column. Any help would be greatly appreciated.

Traceback

alternatively, will get:

Steps to re-create

Running this query against a row table stored in a `.duckdb` file, locally:

- `my_geometry_table` is 35GiB in the `.duckdb` file and comprises 153180272 rows.
- Tried a `limit 10` clause just to see; same OOM timeframe.
- Set `memory_limit` to varying values from `512MB` up to `60GB` (the maximum memory on the machine).
- Tried `threads=1` as well.
- Without the `ST_MakeValid(ST_Union_agg())` it runs.
- Tried `array_agg()` on the `GEOMETRY` column first, then performing the `ST_Union_agg()` in a subsequent query. Even the `array_agg()` gets OOM killed in moments.

I can't share the data underneath the query, unfortunately, but I'm wondering if the information provided suffices to diagnose a memory leak or some optimization I am missing.
Environment:
My Machine: