Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Add new streaming CSV source #19694

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Nov 8, 2024

Adds a simple CSV source - we have one thread scanning for line batches which get sent to other threads for parsing

The performance is mostly equivalent with the in-memory-engine, except for one very specific type of query -

q = pl.scan_csv(path_base + "yellow_tripdata_2015-01.csv").slice(10_000_000)
print(q.collect())
# shape: (2_748_986, 19)

# 9.60s user 1.47s system 1093% cpu 1.012 total # in-memory engine
# 2.23s user 0.41s system  414% cpu 0.638 total # new-streaming engine

@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Nov 8, 2024
@nameexhaustion nameexhaustion added blocked Cannot be worked on due to external dependencies, or significant new internal features needed first and removed blocked Cannot be worked on due to external dependencies, or significant new internal features needed first labels Nov 8, 2024
@nameexhaustion nameexhaustion force-pushed the csv-source branch 2 times, most recently from 3fb0430 to 868d5a8 Compare November 13, 2024 11:06
@@ -14,7 +14,7 @@ use super::row_group_decode::RowGroupDecoder;
use super::{AsyncTaskData, ParquetSourceNode};
use crate::async_executor;
use crate::async_primitives::connector::connector;
use crate::async_primitives::wait_group::{WaitGroup, WaitToken};
use crate::async_primitives::wait_group::IndexedWaitGroup;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor drive-by refactors of parquet source

line_batch_receivers,
// TODO: Refactor so that we don't unwrap, it's currently hard because
// `ComputeNode::{initialize, spawn}` doesn't return a `PolarsResult`
Arc::new(self.try_init_chunk_reader().unwrap()),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unwrap() is not ideal, but I don't currently have any good ideas on how to avoid this

Copy link

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 1.52381% with 517 lines in your changes missing coverage. Please review.

Project coverage is 79.37%. Comparing base (8cb7839) to head (644dbd9).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-stream/src/nodes/csv_source.rs 0.00% 486 Missing ⚠️
...s/polars-stream/src/async_primitives/wait_group.rs 0.00% 16 Missing ⚠️
py-polars/polars/io/csv/functions.py 37.50% 6 Missing and 4 partials ⚠️
...tes/polars-stream/src/nodes/parquet_source/init.rs 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19694      +/-   ##
==========================================
- Coverage   79.55%   79.37%   -0.19%     
==========================================
  Files        1544     1545       +1     
  Lines      213240   213735     +495     
  Branches     2441     2445       +4     
==========================================
+ Hits       169643   169647       +4     
- Misses      43048    43536     +488     
- Partials      549      552       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion marked this pull request as ready for review November 14, 2024 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant