Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow to read non-standard CSV #326

Merged
merged 5 commits into from
May 25, 2021
Merged

allow to read non-standard CSV #326

merged 5 commits into from
May 25, 2021

Conversation

kazuk
Copy link
Contributor

@kazuk kazuk commented May 19, 2021

Which issue does this PR close?

Closes #315.

Rationale for this change

What changes are included in this PR?

Reader::from_reader split into Reader::build_csv_reader, Reader::from_csv_reader.

Reader::build_csv_reader builds csv_crate::Reader,with all CSV reader options.
Rader::from_csv_reader builds arrow::csv::reader::Reader for csv_crate::Reader.

add fn infer_file_schema_with_csv_options .

change infer_file_schema calls infer_file_schema_with_csv_options with default options.

add fn infer_reader_schema_with_csv_options.
change infer_reader_schema calls infer_reader_schema with default options.

add escape quote terminator to ReaderBuilder

ReaderBuilder::build uses added options.

Are there any user-facing changes?

currently minimized API change.

  • add fn ReaderBuilder::escape
  • add fn ReaderBuilder::quote
  • add fn ReaderBuilder::terminator

please concider make public for fn infer_file_schema_with_csv_options, infer_reader_schema_with_csv_options , Reader::build_csv_reader , Reader::from_csv_reader.

kazuk added 3 commits May 19, 2021 11:21
split into build_csv_reader, from_csv_reader
add escape, quote, terminator arg to build_csv_reader
schema inference support for non-standard CSV

  add fn infer_file_schema_with_csv_options
  add fn infer_reader_schema_with_csv_options

ReaderBuilder support for non-standard CSV

add escape, quote, terminator field
add fn with_escape, with_quote, with_terminator
change ReaderBuilder::build for non-standard CSV
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @kazuk

This code looks really nice and clean. 👍

The only thing that I think is missing from this PR is some basic tests -- specifically to show everything hooked together properly and the options get to the csv reader.

I think such tests would be of most value to ensure that as we change this code in the future, we don't accidentally break this new functionality. Give we are delegating to the implementation in the csv reader crate I don't think we need to test a large number of corner cases -- just that we can read a CSV file with a non default escape, quote and terminator character

Thanks again!

@codecov-commenter
Copy link

Codecov Report

Merging #326 (4bf2e0c) into master (c863a2c) will decrease coverage by 0.02%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #326      +/-   ##
==========================================
- Coverage   82.52%   82.50%   -0.03%     
==========================================
  Files         162      162              
  Lines       44007    44036      +29     
==========================================
+ Hits        36316    36331      +15     
- Misses       7691     7705      +14     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 86.97% <66.66%> (-1.39%) ⬇️
arrow/src/array/transform/boolean.rs 76.92% <0.00%> (-7.70%) ⬇️
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c863a2c...4bf2e0c. Read the comment docs.

add #[test] fn test_non_std_quote
add #[test] fn test_non_std_escape
add #[test] fn test_non_std_terminator
@kazuk
Copy link
Contributor Author

kazuk commented May 24, 2021

Thank you for review.

The only thing that I think is missing from this PR is some basic tests -- specifically to show everything hooked together properly and the options get to the csv reader.

I add a basic tests for detects feature broken.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @kazuk

@alamb alamb added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels May 24, 2021
@alamb
Copy link
Contributor

alamb commented May 24, 2021

The MIRI CI check is not related to this PR (edit #345)

However, the lint failure seems to be: https://github.com/apache/arrow-rs/pull/326/checks?check_run_id=2658268180

I think it can be resolved by running cargo fmt locally and then pushing the result.

@kazuk
Copy link
Contributor Author

kazuk commented May 24, 2021

oh!, Thanks.

I run cargo fmt and push.

@alamb
Copy link
Contributor

alamb commented May 25, 2021

The MIRI failure is unrelated to this PR: #345

@alamb alamb merged commit 1d72b26 into apache:master May 25, 2021
@alamb
Copy link
Contributor

alamb commented May 25, 2021

Thanks again @kazuk

alamb pushed a commit that referenced this pull request Jun 4, 2021
* refactor Reader::from_reader

split into build_csv_reader, from_csv_reader
add escape, quote, terminator arg to build_csv_reader

* add escape,quote,terminator field to ReaderBuilder

schema inference support for non-standard CSV

  add fn infer_file_schema_with_csv_options
  add fn infer_reader_schema_with_csv_options

ReaderBuilder support for non-standard CSV

add escape, quote, terminator field
add fn with_escape, with_quote, with_terminator
change ReaderBuilder::build for non-standard CSV

* minimize API change

* add tests

add #[test] fn test_non_std_quote
add #[test] fn test_non_std_escape
add #[test] fn test_non_std_terminator

* apply cargo fmt
alamb added a commit that referenced this pull request Jun 5, 2021
* refactor Reader::from_reader

split into build_csv_reader, from_csv_reader
add escape, quote, terminator arg to build_csv_reader

* add escape,quote,terminator field to ReaderBuilder

schema inference support for non-standard CSV

  add fn infer_file_schema_with_csv_options
  add fn infer_reader_schema_with_csv_options

ReaderBuilder support for non-standard CSV

add escape, quote, terminator field
add fn with_escape, with_quote, with_terminator
change ReaderBuilder::build for non-standard CSV

* minimize API change

* add tests

add #[test] fn test_non_std_quote
add #[test] fn test_non_std_escape
add #[test] fn test_non_std_terminator

* apply cargo fmt

Co-authored-by: kazuhiko kikuchi <kazuk.dll@kazuk.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add with_escape for csv::ReaderBuilder
3 participants