Out of memory when sorting #157

Closed
djouallah opened this issue Jan 28, 2023 · 4 comments · Fixed by #176
Labels
bug Something isn't working

Comments

@djouallah

djouallah commented Jan 28, 2023

Describe the bug
Sorting a table and exporting the result to a Parquet file in Colab fails with an out-of-memory error.

To Reproduce

# Download the test data (notebook shell command)
!curl -L 'https://drive.google.com/uc?export=download&id=18gv0Yd_a-Zc7CSolol8qeYVAAzSthnSN&confirm=t' > lineitem.parquet

from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet('lineitem', 'lineitem.parquet')
# Sort the whole table, then write the result out as Parquet
df = ctx.sql("select * from lineitem order by l_shipdate")
df.write_parquet("lineitem_Datafusion.parquet")

Expected behavior
I expected the query to stay within the available memory instead of running out of it.

Here is a notebook comparing the same workload with Polars and DuckDB:
https://colab.research.google.com/drive/1pfAPpIG7jpvGB_aHj-PXX66vRaRT0xlj#scrollTo=O8-lyg1y6RT2

@djouallah added the bug label on Jan 28, 2023
@andygrove
Member

I filed an issue against the core DataFusion repo - apache/datafusion#5108

@comphead

@djouallah please check the advice in apache/datafusion#5108 on how to enable a memory pool with spilling, so that the query uses only the allocated memory instead of all available memory.
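
For readers unfamiliar with the idea of sorting under a memory limit with spilling, here is a minimal, standard-library-only sketch of the general technique (an external merge sort). It is not DataFusion's implementation and does not use the DataFusion API; the function name `external_sort`, the `max_rows_in_memory` parameter, and the sample dates are all hypothetical, and rows are assumed to be strings without newlines.

```python
import heapq
import os
import tempfile


def external_sort(rows, max_rows_in_memory, spill_dir=None):
    """Yield rows in sorted order while buffering at most max_rows_in_memory rows."""
    run_paths = []
    buffer = []

    def spill(chunk):
        # Sort one bounded chunk and write it to a temporary "run" file on disk.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", dir=spill_dir)
        with os.fdopen(fd, "w") as f:
            f.writelines(row + "\n" for row in chunk)
        run_paths.append(path)

    def read_run(path):
        # Stream a run back one row at a time.
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_rows_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    try:
        # heapq.merge keeps only one pending row per run in memory.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


# Hypothetical sample values standing in for a date column.
dates = ["1998-09-02", "1992-01-03", "1995-06-17", "1993-11-30", "1996-04-21"]
result = list(external_sort(dates, max_rows_in_memory=2))
```

The memory bound here is a row count for simplicity; a real engine would track bytes against a configured pool size and choose the spill directory through its disk manager settings.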

@andygrove
Member

We need to expose the memory/disk config in the Python bindings so that they can be set when creating the SessionContext. Here is Rust code for reference.

I will try and look at this over the weekend unless someone beats me to it.

let runtime_config = crate::execution::runtime_env::RuntimeConfig::new()
    .with_memory_pool(Arc::new(crate::execution::memory_pool::GreedyMemoryPool::new(1024 * 1024 * 1024)))
    .with_disk_manager(crate::execution::disk_manager::DiskManagerConfig::new_specified(vec!["/Users/a/spill/".into()]));
let runtime = Arc::new(crate::execution::runtime_env::RuntimeEnv::new(runtime_config).unwrap());
let ctx = SessionContext::with_config_rt(SessionConfig::new(), runtime);

@andygrove
Member

@djouallah Hopefully this helps: #165
