feat: Add support for IO[bytes] and bytes in scan_{...} functions #18532

Merged
ritchie46 merged 27 commits into pola-rs:main from feat-bytesio-scan on Sep 9, 2024

Conversation

coastalwhite
Collaborator

@coastalwhite coastalwhite commented Sep 3, 2024

This PR adds support for the bytes and IO[bytes] interface to all scan_{parquet, csv, ...} functions.

Since the same code was being touched, this PR also fixes #18581.

Fixes #4950.
Fixes #12617.
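
To illustrate what accepting bytes and IO[bytes] alongside paths means for a scan-style function, here is a minimal stdlib-only sketch. It is not Polars code: the `Source` alias and the `read_source_bytes` helper are hypothetical names invented for this example, and the real implementation in this PR lives on the Rust side.

```python
import io
import os
from pathlib import Path
from typing import IO, Union

# Hypothetical alias for illustration only; not part of the Polars API.
Source = Union[str, Path, bytes, IO[bytes]]


def read_source_bytes(source: Source) -> bytes:
    """Return the raw bytes behind a path, a bytes object, or a binary file-like object."""
    # In-memory buffer: use it directly.
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    # Filesystem path: open in binary mode and read everything.
    if isinstance(source, (str, os.PathLike)):
        with open(source, "rb") as f:
            return f.read()
    # File-like object: it must yield bytes, not str.
    if hasattr(source, "read"):
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("expected a binary file-like object (IO[bytes])")
        return data
    raise TypeError(f"unsupported source type: {type(source).__name__}")


csv_bytes = b"a,b\n1,2\n"
from_bytes = read_source_bytes(csv_bytes)
from_buffer = read_source_bytes(io.BytesIO(csv_bytes))
```

Both calls return the same bytes, which is the point of the change: a caller should not need to write data to disk just to scan it lazily.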

@github-actions github-actions bot added the enhancement, python, and rust labels Sep 3, 2024
separator: u8,
quote_char: Option<u8>,
comment_prefix: Option<&CommentPrefix>,
eol_char: u8,
has_header: bool,
) -> PolarsResult<usize> {
let file = if is_cloud_url(path) || config::force_async() {
- #[cfg(feature = "cloud")]
- {
+ feature_gated!("cloud", {
Collaborator Author

drive-by: this is the preferred way

Member

Yes, much nicer.

let file_options = FileScanOptions {
slice: self.n_rows.map(|x| (0, x)),
with_columns: None,
cache: false,
row_index: self.row_index,
rechunk: self.rechunk,
file_counter: 0,
- hive_options: Default::default(),
+ hive_options: HiveOptions {
+ enabled: Some(false),
Collaborator Author

HiveOptions::default sets enabled to Some(true), which is problematic, so I changed it here.


let memslice = match source {
ScanSourceRef::File(path) => {
let file = if run_async {
Collaborator Author

drive-by: feature_gated is preferred


- #[cfg(feature = "cloud")]
- {
+ feature_gated!("cloud", {
Collaborator Author

drive-by: feature_gated is preferred

@@ -11,7 +10,7 @@ use rayon::prelude::*;
use super::*;

pub struct IpcExec {
Collaborator Author

We remove the options here because the memory-map options no longer do anything.

/// # Notes
///
/// - Scan sources with in-memory buffers are ignored.
pub(crate) fn agg_source_paths<'a>(
Collaborator Author

drive-by: remove the allocation here
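
The doc-comment note above (in-memory buffers are ignored when aggregating source paths) can be sketched roughly as follows. This is a Python illustration of the idea only, not a port of the Rust function: `collect_source_paths` and the `ScanSource` alias are hypothetical names, and the real `agg_source_paths` may differ in signature and behavior.

```python
from pathlib import Path
from typing import Iterable, Set, Union

# Hypothetical alias: a scan source is either a path or an in-memory buffer.
ScanSource = Union[str, Path, bytes]


def collect_source_paths(sources: Iterable[ScanSource]) -> Set[str]:
    """Collect the distinct file paths among the sources, skipping in-memory buffers."""
    paths: Set[str] = set()
    for source in sources:
        # Buffers have no path, so they contribute nothing here.
        if isinstance(source, (bytes, bytearray, memoryview)):
            continue
        paths.add(str(source))
    return paths


found = collect_source_paths(["a.parquet", b"raw bytes", Path("b.csv"), "a.parquet"])
```

Here `found` contains only the two distinct paths; the bytes source is skipped silently rather than raising an error.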

@coastalwhite coastalwhite marked this pull request as draft September 6, 2024 16:30
@coastalwhite
Collaborator Author

Putting this back into draft for a second. I want to try to remove the specialized read_{_} impls as well with this PR now so I made it possible to pass opened files. This did create some further problems, which I still need to fix.

@coastalwhite coastalwhite changed the title from "feat: Add support for BytesIO in scan_{...} functions" to "feat: Add support for IO[bytes] and bytes in scan_{...} functions" Sep 8, 2024
@coastalwhite coastalwhite marked this pull request as ready for review September 8, 2024 12:43
@coastalwhite
Collaborator Author

> Putting this back into draft for a second. I want to try to remove the specialized read_{_} impls as well with this PR now so I made it possible to pass opened files. This did create some further problems, which I still need to fix.

In the end I chose not to do this, because it would require significant mappings.

@@ -339,29 +340,3 @@ def test_ipc_decimal_15920(
path = f"{tmp_path}/data"
df.write_ipc(path)
assert_frame_equal(pl.read_ipc(path), df)

Collaborator Author

I removed this test as it does not really make much sense. The mmaps are unmapped when they are finished; I suspect a bug was triggering this behavior before. For the streaming engine, it does make a lot of sense to have a test, though, and the checks for this are still in place.

@coastalwhite
Collaborator Author

This solves one of the subgoals of #13040.


codecov bot commented Sep 8, 2024

Codecov Report

Attention: Patch coverage is 81.60813% with 199 lines in your changes missing coverage. Please review.

Project coverage is 79.89%. Comparing base (6076421) to head (1e2fa0d).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-python/src/file.rs | 79.31% | 24 Missing ⚠️ |
| crates/polars-plan/src/plans/ir/scan_sources.rs | 83.91% | 23 Missing ⚠️ |
| crates/polars-lazy/src/scan/csv.rs | 65.00% | 21 Missing ⚠️ |
| crates/polars-plan/src/plans/functions/count.rs | 75.00% | 21 Missing ⚠️ |
| ...ates/polars-plan/src/plans/optimizer/count_star.rs | 15.00% | 17 Missing ⚠️ |
| crates/polars-python/src/conversion/mod.rs | 73.68% | 10 Missing ⚠️ |
| crates/polars-lazy/src/scan/file_list_reader.rs | 0.00% | 8 Missing ⚠️ |
| crates/polars-plan/src/plans/functions/mod.rs | 69.23% | 8 Missing ⚠️ |
| crates/polars-lazy/src/scan/ndjson.rs | 66.66% | 6 Missing ⚠️ |
| crates/polars-plan/src/plans/conversion/scans.rs | 93.25% | 6 Missing ⚠️ |
| ... and 22 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18532      +/-   ##
==========================================
- Coverage   79.93%   79.89%   -0.04%     
==========================================
  Files        1505     1506       +1     
  Lines      202628   203040     +412     
  Branches     2873     2889      +16     
==========================================
+ Hits       161976   162226     +250     
- Misses      40104    40264     +160     
- Partials      548      550       +2     


@coastalwhite
Collaborator Author

In the end, a lot more things had to be touched than I thought originally. @ritchie46 sorry for the large review 😓

crates/polars-io/src/ipc/ipc_file.rs (outdated, resolved)
@ritchie46 ritchie46 added the highlight label Sep 9, 2024
@ritchie46 ritchie46 merged commit ac2456c into pola-rs:main Sep 9, 2024
27 checks passed
@coastalwhite coastalwhite deleted the feat-bytesio-scan branch September 9, 2024 08:28
Labels
enhancement, highlight, python, rust
2 participants