support skip_rows for CsvFormat #8824

Open
universalmind303 opened this issue Jan 10, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@universalmind303
Contributor

Is your feature request related to a problem or challenge?

Some tools generate CSVs that contain extra data at the top or bottom, such as bank statement exports:

"Account Name : REDACTED"
"Account Number : REDACTED"
"Date Range : 01/01/2023-12/31/2023"
Transaction Date,Posted Date,Category,Debit
2023-01-01,2023-01-01,Payment/Credit,100.0

I'd like a way to skip the first n rows.

Describe the solution you'd like

Add a new option to CsvFormat for skipping rows:

CsvFormat::default().with_skip_rows(2)
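
For illustration, here is roughly how the option might plug into the existing listing-table APIs (with_skip_rows is the hypothetical part; the surrounding calls are DataFusion's current API):

    use std::sync::Arc;
    use datafusion::datasource::file_format::csv::CsvFormat;
    use datafusion::datasource::listing::ListingOptions;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();

        // Hypothetical option: drop the metadata lines before the real header.
        let format = CsvFormat::default()
            .with_has_header(true)
            .with_skip_rows(2); // proposed in this issue, does not exist yet

        let options = ListingOptions::new(Arc::new(format)).with_file_extension(".csv");
        ctx.register_listing_table("transactions", "statements/", options, None, None)
            .await?;

        ctx.sql("SELECT * FROM transactions").await?.show().await?;
        Ok(())
    }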

Describe alternatives you've considered

None that I can think of.

Additional context

GlareDB/glaredb#2035

@universalmind303 universalmind303 added the enhancement New feature or request label Jan 10, 2024
@alamb
Contributor

alamb commented Jan 14, 2024

It might also make sense to add this upstream in arrow-rs (though we can add it first to DataFusion and then upstream it).

@comphead
Contributor

There is likely already support for this via SeekFrom::Start at the file level:

    let decoder = if is_whole_file_scanned {
        // Don't seek if no range as breaks FIFO files
        file_compression_type.convert_read(file)?
    } else {
        file.seek(SeekFrom::Start(result.range.start as _))?;
        file_compression_type.convert_read(
            file.take((result.range.end - result.range.start) as u64),
        )?
    };

But imagine a parallel read over a source containing multiple CSV files: how should skip_rows work in that case? 🤔
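
For a single unpartitioned file the skip itself would be simple: scan and discard the first n line terminators before handing the stream to the decoder. A minimal sketch (not DataFusion code), which also shows why it only works when reading from offset 0:

    use std::io::{self, BufRead, BufReader, Read};

    /// Consume the first `n` lines of `reader`, returning the bytes skipped.
    /// Only correct when starting at the beginning of the file, which is
    /// exactly what byte-range partitioned (parallel) scans do not do.
    fn skip_lines<R: Read>(reader: &mut BufReader<R>, n: usize) -> io::Result<u64> {
        let mut skipped = 0u64;
        let mut line = String::new();
        for _ in 0..n {
            line.clear();
            let bytes = reader.read_line(&mut line)?;
            if bytes == 0 {
                break; // EOF before n lines
            }
            skipped += bytes as u64;
        }
        Ok(skipped)
    }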

@Tangruilin
Contributor

I am interested in this.

If no one is working on it, you can assign it to me @alamb

@Tangruilin
Contributor

> I am interested in this.
>
> If no one is working on it, you can assign it to me @alamb

@alamb Do not forget it ~~~~

@comphead
Contributor

Thanks @Tangruilin. I think we need to figure out how to handle parallel reads and multiple files before development starts.

@alamb
Contributor

alamb commented Jan 26, 2024

I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.

@comphead
Contributor

I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which rows will be skipped, because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.
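
For reference, the DataFrame-level version is already expressible with the existing limit API, which takes a skip offset (a sketch; the file name is made up):

    use datafusion::prelude::*;

    // `limit(skip, fetch)` skips the first 2 rows. But this runs *after*
    // parsing, and without an explicit sort there is no guarantee which
    // 2 rows get skipped.
    async fn skip_first_two(ctx: &SessionContext) -> datafusion::error::Result<()> {
        let df = ctx.read_csv("transactions.csv", CsvReadOptions::new()).await?;
        df.limit(2, None)?.show().await?;
        Ok(())
    }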

I checked Spark, but I haven't found built-in functionality there, probably because of the concerns above.

@universalmind303 what is your vision as the ticket owner?

@Jefffrey
Contributor

> I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which rows will be skipped, because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.

If I understand the issue correctly, skipping rows at the DataFrame level would not work, as the file wouldn't even parse into a DataFrame in the first place: the initial rows are not valid CSV.

Could we perhaps allow skipping specific file lines, instead of the first N rows of a file? Though I'm not sure that would be any simpler than skipping the first N rows.

@comphead
Contributor

I think you can skip lines as in the code I posted above: #8824 (comment)

But my concern is with reading multiple files: is it expected that lines are skipped in all of them? And what if the formats are mixed, so the user silently loses some data?

We could also consider skipping unparseable rows, but that solution has tons of nuances.

@Tangruilin
Contributor

> I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.

Maybe I can write a draft that does not consider parallel reads, and then we can discuss it.

@comphead
Contributor

> I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design before coding.
>
> Maybe I can write a draft that does not consider parallel reads, and then we can discuss it.

This task still has some unclear parts; I would be happy if you could describe your proposal in the comments.

@bbannier
Member

I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.

I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit DataFusion's input handling.

Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function à la filter_map (it would probably need to return some Option<Cow> so as not to penalize use cases that only filter but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
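
A standalone sketch of that filter_map idea (everything here is hypothetical, not an existing DataFusion API; Cow avoids copying lines that pass through unchanged):

    use std::borrow::Cow;

    // Apply a line-based filter_map before the text reaches the CSV parser.
    fn filter_map_lines<'a>(
        input: &'a str,
        f: impl Fn(&'a str) -> Option<Cow<'a, str>>,
    ) -> String {
        input.lines().filter_map(|l| f(l)).collect::<Vec<_>>().join("\n")
    }

    fn main() {
        let raw = "# a comment\nTransaction Date,Debit\n2023-01-01,100.0";
        // Drop comment lines, keep everything else untouched.
        let clean = filter_map_lines(raw, |l| {
            (!l.starts_with('#')).then(|| Cow::Borrowed(l))
        });
        assert_eq!(clean, "Transaction Date,Debit\n2023-01-01,100.0");
    }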

@alamb
Contributor

alamb commented Apr 26, 2024

> I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.

I think a new issue would be good, as this one describes something different.

> I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit DataFusion's input handling.

I think you would have to provide your own TableProvider / format for the ListingTable provider if you wanted to do this today.

> Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function à la filter_map (it would probably need to return some Option<Cow> so as not to penalize use cases that only filter but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.

Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering that happens prior to parsing and would let you transform the input however you want 🤔
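
Purely as a sketch (none of this exists in DataFusion today), a Read adapter that drops lines matching a predicate before the bytes ever reach the CSV decoder could look like:

    use std::io::{BufRead, Read, Result};

    // Wraps any buffered reader and filters whole lines before they are
    // seen by whatever parser consumes the Read. Illustrative names only.
    struct LineFilter<R: BufRead> {
        inner: R,
        drop_line: fn(&str) -> bool,
        buf: Vec<u8>,
        pos: usize,
    }

    impl<R: BufRead> Read for LineFilter<R> {
        fn read(&mut self, out: &mut [u8]) -> Result<usize> {
            // Refill the internal buffer with the next line we want to keep.
            while self.pos == self.buf.len() {
                let mut line = String::new();
                if self.inner.read_line(&mut line)? == 0 {
                    return Ok(0); // EOF
                }
                if !(self.drop_line)(&line) {
                    self.buf = line.into_bytes();
                    self.pos = 0;
                }
            }
            let n = out.len().min(self.buf.len() - self.pos);
            out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
            self.pos += n;
            Ok(n)
        }
    }

Wrapping a file in such an adapter and handing it to the CSV reader would cover the comment-line case above without any changes to the parser itself.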
