support skip_rows for CsvFormat #8824

Open
universalmind303 opened this issue Jan 10, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@universalmind303
Contributor

Is your feature request related to a problem or challenge?

Some tools generate CSVs that contain extra data at the top or bottom, such as bank statement exports:

"Account Name : REDACTED"
"Account Number : REDACTED"
"Date Range : 01/01/2023-12/31/2023"
Transaction Date,Posted Date,Category,Debit
2023-01-01,2023-01-01,Payment/Credit,100.0

I'd like a way to skip the first n rows.

Describe the solution you'd like

Add a new option to CsvFormat for skipping rows:

CsvFormat::default().with_skip_rows(2)
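
For illustration, here is roughly how the option might plug into the existing listing-table APIs (with_skip_rows is the hypothetical part; the surrounding calls are DataFusion's current API):

    use std::sync::Arc;
    use datafusion::datasource::file_format::csv::CsvFormat;
    use datafusion::datasource::listing::ListingOptions;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();

        // Hypothetical option: drop the metadata lines before the real header.
        let format = CsvFormat::default()
            .with_has_header(true)
            .with_skip_rows(2); // proposed in this issue, does not exist yet

        let options = ListingOptions::new(Arc::new(format)).with_file_extension(".csv");
        ctx.register_listing_table("transactions", "statements/", options, None, None)
            .await?;

        ctx.sql("SELECT * FROM transactions").await?.show().await?;
        Ok(())
    }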

Describe alternatives you've considered

None that I can think of.

Additional context

GlareDB/glaredb#2035

@universalmind303 universalmind303 added the enhancement New feature or request label Jan 10, 2024
@alamb
Contributor

alamb commented Jan 14, 2024

It might also make sense to add this upstream in arrow-rs (though we can add it first to DataFusion and then upstream it).

@comphead
Contributor

There is likely already support for this via SeekFrom::Start at the file level:

    let decoder = if is_whole_file_scanned {
        // Don't seek if no range as breaks FIFO files
        file_compression_type.convert_read(file)?
    } else {
        file.seek(SeekFrom::Start(result.range.start as _))?;
        file_compression_type.convert_read(
            file.take((result.range.end - result.range.start) as u64),
        )?
    };

But imagine a parallel read over a source containing multiple CSV files: how should skip_rows work in that case? 🤔
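
For a single unpartitioned file the skip itself would be simple: scan and discard the first n line terminators before handing the stream to the decoder. A minimal sketch (not DataFusion code), which also shows why it only works when reading from offset 0:

    use std::io::{self, BufRead, BufReader, Read};

    /// Consume the first `n` lines of `reader`, returning the bytes skipped.
    /// Only correct when starting at the beginning of the file, which is
    /// exactly what byte-range partitioned (parallel) scans do not do.
    fn skip_lines<R: Read>(reader: &mut BufReader<R>, n: usize) -> io::Result<u64> {
        let mut skipped = 0u64;
        let mut line = String::new();
        for _ in 0..n {
            line.clear();
            let bytes = reader.read_line(&mut line)?;
            if bytes == 0 {
                break; // EOF before n lines
            }
            skipped += bytes as u64;
        }
        Ok(skipped)
    }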

@Tangruilin
Contributor

I am interested in this.

If no one is working on it, you can assign it to me @alamb

@Tangruilin
Contributor

> I am interested in this.
>
> If no one is working on it, you can assign it to me @alamb

@alamb Do not forget it ~~~~

@comphead
Contributor

Thanks @Tangruilin. I think we need to figure out how to handle parallel reads and multiple files before development starts.

@alamb
Contributor

alamb commented Jan 26, 2024

I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.

@comphead
Contributor

I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which rows will be skipped, because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.
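
For reference, the DataFrame-level version is already expressible with the existing limit API, which takes a skip offset (a sketch; the file name is made up):

    use datafusion::prelude::*;

    // `limit(skip, fetch)` skips the first 2 rows. But this runs *after*
    // parsing, and without an explicit sort there is no guarantee which
    // 2 rows get skipped.
    async fn skip_first_two(ctx: &SessionContext) -> datafusion::error::Result<()> {
        let df = ctx.read_csv("transactions.csv", CsvReadOptions::new()).await?;
        df.limit(2, None)?.show().await?;
        Ok(())
    }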

I checked Spark, but I haven't found built-in functionality there, probably because of the concerns above.

@universalmind303 what is your vision as the ticket owner?

@Jefffrey
Contributor

> I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which rows will be skipped, because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.

If I understand the issue correctly, skipping rows at the DataFrame level would not work, as the file wouldn't even parse into a DataFrame in the first place: the initial rows are not valid CSV.

Could we perhaps allow skipping specific file lines, instead of the first N rows of a file? Though I'm not sure that would be any simpler than skipping the first N rows.

@comphead
Contributor

I think you can skip lines as in the code I posted above: #8824 (comment)

But my concern is with reading multiple files: is it expected that lines are skipped in all of them? And what if the formats are mixed, so the user silently loses some data?

We could also consider skipping unparseable rows, but that solution has tons of nuances.

@Tangruilin
Contributor

> I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.

Maybe I can write a draft that does not consider parallel reads, and then we can discuss it.

@comphead
Contributor

> I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.
>
> Maybe I can write a draft that does not consider parallel reads, and then we can discuss it.

This task still has some unclear parts; I would be happy if you could describe your proposal in the comments.

@bbannier
Member

I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.

I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit DataFusion's input handling.

Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function à la filter_map (it would probably need to return some Option<Cow> so as not to penalize use cases that only filter but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
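
A standalone sketch of that filter_map idea (everything here is hypothetical, not an existing DataFusion API; Cow avoids copying lines that pass through unchanged):

    use std::borrow::Cow;

    // Apply a line-based filter_map before the text reaches the CSV parser.
    fn filter_map_lines<'a>(
        input: &'a str,
        f: impl Fn(&'a str) -> Option<Cow<'a, str>>,
    ) -> String {
        input.lines().filter_map(|l| f(l)).collect::<Vec<_>>().join("\n")
    }

    fn main() {
        let raw = "# a comment\nTransaction Date,Debit\n2023-01-01,100.0";
        // Drop comment lines, keep everything else untouched.
        let clean = filter_map_lines(raw, |l| {
            (!l.starts_with('#')).then(|| Cow::Borrowed(l))
        });
        assert_eq!(clean, "Transaction Date,Debit\n2023-01-01,100.0");
    }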

@alamb
Contributor

alamb commented Apr 26, 2024

> I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.

I think a new issue would be good, as this one describes something different.

> I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit DataFusion's input handling.

I think you would have to provide your own TableProvider / format for the ListingTable provider if you wanted to do this today.

> Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function à la filter_map (it would probably need to return some Option<Cow> so as not to penalize use cases that only filter but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.

Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering that happens prior to parsing and would let you transform the input however you want 🤔
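
Purely as a sketch (none of this exists in DataFusion today), a Read adapter that drops lines matching a predicate before the bytes ever reach the CSV decoder could look like:

    use std::io::{BufRead, Read, Result};

    // Wraps any buffered reader and filters whole lines before they are
    // seen by whatever parser consumes the Read. Illustrative names only.
    struct LineFilter<R: BufRead> {
        inner: R,
        drop_line: fn(&str) -> bool,
        buf: Vec<u8>,
        pos: usize,
    }

    impl<R: BufRead> Read for LineFilter<R> {
        fn read(&mut self, out: &mut [u8]) -> Result<usize> {
            // Refill the internal buffer with the next line we want to keep.
            while self.pos == self.buf.len() {
                let mut line = String::new();
                if self.inner.read_line(&mut line)? == 0 {
                    return Ok(0); // EOF
                }
                if !(self.drop_line)(&line) {
                    self.buf = line.into_bytes();
                    self.pos = 0;
                }
            }
            let n = out.len().min(self.buf.len() - self.pos);
            out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
            self.pos += n;
            Ok(n)
        }
    }

Wrapping a file in such an adapter and handing it to the CSV reader would cover the comment-line case above without any changes to the parser itself.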
