High memory usage reading parquet file generated from DuckDB #255

Open
niger-prequel opened this issue Jul 25, 2024 · 0 comments

niger-prequel commented Jul 25, 2024

Description

We're experiencing unexpectedly high memory usage when reading a parquet file using go-duckdb. The memory usage is orders of magnitude larger than the file being read. The issue arises during the final step, where we read a parquet file that DuckDB compacted from multiple smaller files. We also raised a parallel issue on the main repository because we were able to reproduce this with other clients.

Steps to Reproduce

Please refer to the provided repository, which includes a main.go file and the parquet files necessary to reproduce this issue. Clone the repository and follow the README instructions to set up and trigger the problem. We experience the high memory utilization in the final step, where we read the Parquet file.

Expected Behavior

Memory usage should be proportional to the size of the parquet file being read, similar to executing the SQL commands directly without involving the DuckDB Golang driver.
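
To put a number on "proportional", the peak resident memory of the process can be compared with the on-disk size of the parquet file. The following is a minimal sketch, not code from the reproduction repository: the helper names peakRSSBytes and memoryToFileRatio are hypothetical, and it assumes Linux or macOS, where getrusage is available. It reads the process-wide peak RSS rather than Go heap statistics, since memory allocated by a native library would not show up in the Go runtime's own counters.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// peakRSSBytes returns the peak resident set size of the current process.
// Hypothetical helper: getrusage reports ru_maxrss in kilobytes on Linux
// and in bytes on macOS, so the value is normalized to bytes here.
func peakRSSBytes() (int64, error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, err
	}
	rss := int64(ru.Maxrss)
	if runtime.GOOS != "darwin" {
		rss *= 1024
	}
	return rss, nil
}

// memoryToFileRatio reports how many times larger the peak RSS is than the
// file. It returns 0 when the file size is not positive.
func memoryToFileRatio(rssBytes, fileBytes int64) float64 {
	if fileBytes <= 0 {
		return 0
	}
	return float64(rssBytes) / float64(fileBytes)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path of the parquet file as the first argument")
		return
	}
	info, err := os.Stat(os.Args[1])
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}

	// Assumption: the query that reads the parquet file runs at this point,
	// before the peak RSS is sampled.

	rss, err := peakRSSBytes()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	fmt.Printf("file: %d bytes, peak RSS: %d bytes, ratio: %.1fx\n",
		info.Size(), rss, memoryToFileRatio(rss, info.Size()))
}
```

Because the value is a process-wide peak, it also includes the Go runtime's own baseline, so sampling once before the read and once after gives a fairer comparison.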

Actual Behavior

The memory consumption spikes significantly on both our production Kubernetes cluster and local machine setups, going well beyond the actual size of the parquet file. This high memory usage is specific to when using the go-duckdb driver, as direct SQL execution does not replicate the issue. You can use the pure.sql script and instructions in the README to run a version of this without using the Go driver.

Production Kubernetes Memory Monitoring
Screenshot 2024-07-25 at 2 05 08 PM

Memory Usage of Script on OSX
Screenshot 2024-07-25 at 2 26 30 PM

Screenshot 2024-07-25 at 2 36 49 PM

Environment

  • Go version: 1.21.7
  • DuckDB version: 1.0.0 and 0.10.0
  • go-duckdb version: 1.7.0
  • Operating System: Debian Buster and OSX Sonoma 14.5

Additional Information

  • The issue persists regardless of the number of threads configured (1 or 2).
  • We have set several DuckDB configurations and pragmas as part of our initialization process (e.g., memory limits, thread count, etc.).
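
For context, initialization settings of this kind could look roughly like the sketch below. It is illustrative only: the helper name initStatements and the values "1GB" and 2 are hypothetical, not our actual production configuration. It builds the DuckDB SET statements for the memory_limit and threads settings; each resulting string would then be executed on the connection (for example with Exec on a database/sql handle) right after opening it.

```go
package main

import (
	"fmt"
	"strings"
)

// initStatements builds the DuckDB settings statements to run after opening
// a connection. Hypothetical helper: memoryLimit is a size string such as
// "1GB", and threads is the worker thread count.
func initStatements(memoryLimit string, threads int) []string {
	// Double any single quote so the value stays inside the SQL string literal.
	limit := strings.ReplaceAll(memoryLimit, "'", "''")
	return []string{
		fmt.Sprintf("SET memory_limit = '%s'", limit),
		fmt.Sprintf("SET threads = %d", threads),
	}
}

func main() {
	// Illustrative values only.
	for _, stmt := range initStatements("1GB", 2) {
		fmt.Println(stmt)
	}
}
```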

Impact

This issue is causing significant resource allocation challenges in our production environment, leading to potential service disruptions.
