support skip_rows for CsvFormat #8824

Comments
It might also make sense to add this upstream to arrow-rs (though we can add it first to DataFusion and then upstream it).
Likely this could be supported at the file level via `SeekFrom::Start`.
But imagine a parallel read over a source containing multiple CSV files: how should skip_rows work in that case? 🤔
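For illustration, a minimal sketch of file-level skipping in plain Rust (not an existing DataFusion API; `skip_lines` is a hypothetical helper). After consuming `n` lines, the reader's byte position could equally be recorded and revisited later with `SeekFrom::Start`:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

// Hypothetical helper: consume the first `n` lines of a file and return a
// reader positioned on the first remaining byte, ready for a CSV parser.
fn skip_lines(path: &str, n: usize) -> std::io::Result<impl Read> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = String::new();
    for _ in 0..n {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // hit EOF before n lines were skipped
        }
    }
    Ok(reader)
}
```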
I am interested in it. If no one is working on it, you can assign it to me @alamb
Thanks @Tangruilin. I think we need to figure out how to deal with parallel reads and multiple files prior to development.
I don't know of anyone working on it @Tangruilin -- I have assigned it to you. As @comphead says, this is probably a task that needs some design prior to code.
I think the idea of skipping N rows at the file level doesn't make much sense. What we could probably do is skip N rows at the dataframe level, but again there is no guarantee exactly which rows will be skipped, because of ordering, shuffling, etc. IMHO this looks more like a user task than a DataFusion task, as the user has more context when executing the query.

I checked Spark but I haven't found built-in functionality for this, probably because of the concerns above. @universalmind303 what is your vision as the ticket owner?
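For reference, a dataframe-level skip can already be expressed with `DataFrame::limit(skip, fetch)`; a sketch, assuming the file parses as CSV at all (`data.csv` is a placeholder):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx.read_csv("data.csv", CsvReadOptions::new()).await?;
    // Skip the first 2 rows of the stream; without an ORDER BY there is
    // no guarantee which 2 rows these are.
    df.limit(2, None)?.show().await?;
    Ok(())
}
```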
If I understand the issue correctly, skipping rows at the DataFrame level would not work, since the file couldn't even be parsed into a DataFrame in the first place: the initial rows are not valid CSV. Would it be possible to specify skipping specific file lines instead of the first N rows of a file? Though I'm not sure that would be any simpler than skipping the first N rows.
I think you can skip lines as in the code I posted above (#8824 (comment)). But my concern is reading multiple files: is it expected to skip lines in every file? And what if the formats are mixed, so the user silently loses some data? We could instead think of skipping unparseable rows, but that solution has tons of nuances.
Maybe I can write a draft that does not consider parallel reads, and then we can discuss it.
This task still has some unclear parts; I would be happy if you described your proposal in the comments.
I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with a comment character). I was originally looking for a way to replace the reader, or hook in after the builtin one, but from reading the above comments this probably does not fit. Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a per-line predicate).
I think a new issue would be good as this one describes something different
I think you would have to provide your own TableProvider / format for the ListingTable provider if you wanted to do this today
Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering prior to parsing that would let you transform the input however you wanted 🤔
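One possible shape for that, sketched here as plain Rust rather than anything DataFusion exposes today (the `LineFilter` type and its predicate are hypothetical): a `Read` adapter that drops whole lines before the CSV parser ever sees them.

```rust
use std::io::{BufRead, BufReader, Read};

// Hypothetical adapter: wraps any reader and yields only the lines for
// which `keep` returns true, so a CSV parser downstream never sees the rest.
struct LineFilter<R, F> {
    inner: BufReader<R>,
    keep: F,
    buf: Vec<u8>, // bytes of the current kept line not yet handed out
    pos: usize,
}

impl<R: Read, F: Fn(&str) -> bool> LineFilter<R, F> {
    fn new(inner: R, keep: F) -> Self {
        Self { inner: BufReader::new(inner), keep, buf: Vec::new(), pos: 0 }
    }
}

impl<R: Read, F: Fn(&str) -> bool> Read for LineFilter<R, F> {
    fn read(&mut self, out: &mut [u8]) -> std::io::Result<usize> {
        // Refill: pull lines from the inner reader until one passes the filter.
        while self.pos == self.buf.len() {
            let mut line = String::new();
            if self.inner.read_line(&mut line)? == 0 {
                return Ok(0); // EOF
            }
            if (self.keep)(&line) {
                self.buf = line.into_bytes();
                self.pos = 0;
            }
        }
        let n = (self.buf.len() - self.pos).min(out.len());
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

Usage would be something like `LineFilter::new(file, |l| !l.starts_with(prefix))` for some chosen comment prefix; the open questions above about multiple files and silent data loss apply just the same.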
Is your feature request related to a problem or challenge?

Some tools generate CSVs that contain some extra data at the top or bottom, such as bank statement exports. I'd like a way to skip the first `n` rows.

Describe the solution you'd like
Add a new option to `CsvFormat` for skipping rows.
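A hypothetical sketch of what that option could look like (`with_skip_rows` does not exist today and is shown only to illustrate the request; the other builder calls do exist):

```rust
use datafusion::datasource::file_format::csv::CsvFormat;

fn csv_format_skipping_preamble() -> CsvFormat {
    CsvFormat::default()
        .with_has_header(true)
        .with_skip_rows(2) // hypothetical new option: ignore the first 2 lines
}
```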
Describe alternatives you've considered

None that I can think of.
Additional context
GlareDB/glaredb#2035