Add csv-core based reader (#3338) #3365

tustvold · 2022-12-18T23:27:19Z

Which issue does this PR close?

Part of #3338
Closes #3364

Rationale for this change

Yields anything from 25% to 75% faster, with larger improvements with larger batch sizes.

4096 u64(0) - 128       time:   [360.72 µs 360.87 µs 361.05 µs]
                        change: [-23.673% -23.491% -23.370%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 1024      time:   [331.16 µs 331.27 µs 331.38 µs]
                        change: [-48.717% -48.649% -48.543%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [330.40 µs 330.53 µs 330.71 µs]
                        change: [-75.546% -75.482% -75.444%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 128       time:   [356.13 µs 356.22 µs 356.32 µs]
                        change: [-25.166% -25.142% -25.120%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 i64(0) - 1024      time:   [328.21 µs 328.32 µs 328.43 µs]
                        change: [-48.985% -48.896% -48.843%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i64(0) - 4096      time:   [327.60 µs 327.77 µs 327.94 µs]
                        change: [-75.634% -75.622% -75.610%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f32(0) - 128       time:   [293.13 µs 293.22 µs 293.31 µs]
                        change: [-26.308% -26.126% -25.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [265.46 µs 265.52 µs 265.59 µs]
                        change: [-50.188% -50.037% -49.897%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 4096      time:   [266.15 µs 266.23 µs 266.32 µs]
                        change: [-75.647% -75.632% -75.618%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 f64(0) - 128       time:   [316.88 µs 316.95 µs 317.03 µs]
                        change: [-25.096% -24.949% -24.860%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 1024      time:   [285.10 µs 285.16 µs 285.23 µs]
                        change: [-51.985% -51.965% -51.946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

4096 f64(0) - 4096      time:   [285.62 µs 285.72 µs 285.84 µs]
                        change: [-77.933% -77.901% -77.848%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 128
                        time:   [163.92 µs 163.96 µs 164.00 µs]
                        change: [-37.060% -37.015% -36.956%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 1024
                        time:   [127.45 µs 127.49 µs 127.54 µs]
                        change: [-67.408% -67.348% -67.247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 4096
                        time:   [125.65 µs 125.71 µs 125.77 µs]
                        change: [-86.901% -86.880% -86.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(30, 0) - 128
                        time:   [225.21 µs 225.25 µs 225.30 µs]
                        change: [-31.578% -31.423% -31.306%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(30, 0) - 1024
                        time:   [182.79 µs 182.88 µs 182.98 µs]
                        change: [-64.628% -64.497% -64.342%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [183.36 µs 183.43 µs 183.52 µs]
                        change: [-84.714% -84.702% -84.691%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0) - 128
                        time:   [490.86 µs 491.27 µs 491.70 µs]
                        change: [-19.107% -18.963% -18.816%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 1024
                        time:   [445.64 µs 445.85 µs 446.07 µs]
                        change: [-49.217% -49.155% -49.065%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 4096
                        time:   [490.16 µs 490.55 µs 490.89 µs]
                        change: [-75.088% -75.052% -75.012%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  15 (15.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [357.47 µs 357.56 µs 357.66 µs]
                        change: [-27.187% -27.017% -26.761%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [316.25 µs 316.36 µs 316.49 µs]
                        change: [-58.868% -58.846% -58.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0.5) - 4096
                        time:   [325.66 µs 325.82 µs 326.00 µs]
                        change: [-77.121% -77.107% -77.094%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [1.1570 ms 1.1576 ms 1.1583 ms]
                        change: [-18.339% -18.157% -17.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) high mild
  16 (16.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [1.0139 ms 1.0146 ms 1.0154 ms]
                        change: [-35.902% -35.849% -35.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  7 (7.00%) low severe
  7 (7.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [1.0514 ms 1.0524 ms 1.0535 ms]
                        change: [-63.795% -63.754% -63.713%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [941.30 µs 941.55 µs 941.85 µs]
                        change: [-21.932% -21.887% -21.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [802.95 µs 803.27 µs 803.65 µs]
                        change: [-40.134% -40.095% -40.059%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [813.42 µs 813.84 µs 814.26 µs]
                        change: [-66.732% -66.690% -66.632%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

What changes are included in this PR?

Adds a custom record reader based on csv-core, that significantly reduces allocations whilst parsing arrow data.

Are there any user-facing changes?

Previously if provided with a schema that was a valid prefix of the columns, it wouldn't complain. It now will. This behaviour was undocumented, and I think an accident, but is technically a user-facing change

tustvold · 2022-12-18T23:29:50Z

arrow-csv/Cargo.toml

@@ -45,6 +45,7 @@ arrow-data = { version = "29.0.0", path = "../arrow-data" }
 arrow-schema = { version = "29.0.0", path = "../arrow-schema" }
 chrono = { version = "0.4.23", default-features = false, features = ["clock"] }
 csv = { version = "1.1", default-features = false }
+csv-core = { version = "0.1"}


This is already a dependency of csv, and has no default features

tustvold · 2022-12-18T23:30:53Z

arrow-csv/src/reader/mod.rs

 fn infer_reader_schema_with_csv_options<R: Read>(
    reader: R,
    roptions: ReaderOptions,
 ) -> Result<(Schema, usize), ArrowError> {
-    let mut csv_reader = Reader::build_csv_reader(
+    let mut csv_reader = build_csv_reader(


Schema inference still uses the old reader, both to reduce the size of this PR, but also because it is inherently row oriented, as opposed to parsing.

tustvold · 2022-12-18T23:31:25Z

arrow-csv/src/reader/mod.rs

@@ -383,6 +417,7 @@ impl<R: Read> Reader<R> {
    /// This constructor allows you more flexibility in what records are processed by the
    /// csv reader.
    #[allow(clippy::too_many_arguments)]
+    #[deprecated(note = "Use Reader::new or ReaderBuilder")]


Having two methods from_reader and new with identical signatures is a touch confusing. Let's just unify them

tustvold · 2022-12-18T23:31:53Z

arrow-csv/src/reader/mod.rs

-            schema,
-            projection,
-            reader: csv_reader,
-            line_number: if has_header { start + 1 } else { start },


This is the cause of #3364, it increments the start but not the end

tustvold · 2022-12-18T23:32:11Z

arrow-csv/src/reader/mod.rs

@@ -2074,4 +2032,31 @@ mod tests {
        let col1_arr = col1.as_any().downcast_ref::<StringArray>().unwrap();
        assert_eq!(col1_arr.value(5), "value5");
    }
+
+    #[test]
+    fn test_header_bounds() {


Test for #3364

tustvold · 2022-12-18T23:34:45Z

arrow-csv/src/reader/mod.rs

@@ -177,11 +179,36 @@ pub fn infer_reader_schema<R: Read>(
    infer_reader_schema_with_csv_options(reader, roptions)
 }

+/// Creates a `csv::Reader`
+fn build_csv_reader<R: Read>(


This method is just moved to be closer to where it is used. A subsequent PR might look to move the schema inference logic into its own file

tustvold · 2022-12-18T23:36:23Z

arrow-csv/src/reader/mod.rs

    col_idx: usize,
    precision: u8,
    scale: i8,
 ) -> Result<ArrayRef, ArrowError> {
    let mut decimal_builder = Decimal128Builder::with_capacity(rows.len());
-    for row in rows {


We now more strictly enforce that the schema actually matches the data read, I don't think it was a documented behaviour that rows could be missing fields, and I think largely an implementation quirk, but I think it is worth highlighting

tustvold · 2022-12-18T23:37:36Z

arrow-csv/src/reader/records.rs

+
+    /// Skips forward `to_skip` rows
+    pub fn skip(&mut self, mut to_skip: usize) -> Result<(), ArrowError> {
+        // TODO: This could be done by scanning for unquoted newline delimiters


Previously the implementation read to a ByteRecord, this will perform similarly or better, and so I don't think this is an issue

tustvold · 2022-12-18T23:47:16Z

The performance gain is significantly better than I expected, to the point where I wonder if I've messed something up 😅

In particular the timings not scaling with batch size seems somewhat suspect to me...

Dandandan · 2022-12-19T08:04:40Z

The performance gain is significantly better than I expected, to the point where I wonder if I've messed something up 😅

In particular the timings not scaling with batch size seems somewhat suspect to me...

Timings make sense to me - for a single batch performance difference will be pretty large (as in the benchmark), but for a full csv file with many batches the difference probably is smaller as the csv-based implementation re-uses the allocated StringRecords across batches (as long as they are large enough to hold the record).

Dandandan

Great improvement 👌

ursabot · 2022-12-19T09:02:27Z

Benchmark runs are scheduled for baseline = a8c9685 and contender = c344433. c344433 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

bjchambers · 2023-01-13T07:13:56Z

arrow-csv/src/reader/mod.rs

    col_idx: usize,
 ) -> Result<ArrayRef, ArrowError> {
    rows.iter()
        .enumerate()
        .map(|(row_index, row)| {
-            match row.get(col_idx) {
-                Some(s) => {
-                    if s.is_empty() {


This dropped the handling of null values for boolean arrays. Specifically, it removed the block below (which was previously present for primitives and booleans). The affect of this is that when parsing a CSV file containing a null value, it raises a parse error.

if s.is_empty() { return Ok(None); }

Oops will fix before the next release

DDtKey · 2023-01-19T09:05:51Z

arrow-csv/src/reader/records.rs

+                        ReadRecordResult::InputEmpty => break 'input, // Input exhausted, need to read more
+                        ReadRecordResult::OutputFull => break, // Need to allocate more capacity
+                        ReadRecordResult::OutputEndsFull => {
+                            return Err(ArrowError::CsvError(format!("incorrect number of fields, expected {} got more than {}", self.num_columns, field_count)))


Looks like something wrong with this condition.

I will investigate & create issue with MRE as soon as I will reproduce it on stable basis.
But just wanted to mention, I'm using datafusion & after updating to new CSV reader I started to encounter such errors:
incorrect number of fields, expected 7 got more than 7 (numbers always the same actually)

And it happens for cases which worked well before this update @tustvold

This may be expected, see apache/datafusion#4918

The TLDR is this may have been working by accident, and has never worked reliably

@tustvold wow. but CSV files are correct which produce this error (and schema passed directly without inference) - it has exactly number of fields as expected (however, let me check for null values inside).

Shouldn't the error message be changed in that case? I can't clearly understand the case when it happens, but trying

It happens when at least one row contains more fields than specified in the schema, i.e. more than schema.fields().len() - 1 delimiters

@tustvold I'm totally sure that number of rows were equal or less(I mean null values) than number of fields in schema

I'm double-checking that

When one of the rows contains more fields than specified in schema it's usually:
incorrect number of fields, expected X got Y (from ReadRecordResult::Record arm of match), but no this variant of error

If you update to the latest DataFusion, which includes arrow 31, it will print a line number which may help identify what is going on

@tustvold yes, I did & it prints. But the line in files are correct, it always refer to first one in my case. 🤔 As I said, I'll try to create a MRE, otherwise it's hard to explain

Yeah, that sounds like apache/datafusion#4918 and the error is occurring when it tries to skip the header

Add csv-core based reader (apache#3338)

9b8f8af

github-actions bot added the arrow Changes to the arrow crate label Dec 18, 2022

tustvold commented Dec 18, 2022

View reviewed changes

More docs

f45fd94

Dandandan approved these changes Dec 19, 2022

View reviewed changes

tustvold merged commit c344433 into apache:master Dec 19, 2022

alamb mentioned this pull request Dec 29, 2022

CSV Reader Bounds Incorrectly Handles Header #3364

Closed

This was referenced Jan 4, 2023

Update to arrow 30.0.1 apache/datafusion#4818

Merged

Improve CSV error messages #3468

Closed

bjchambers reviewed Jan 13, 2023

View reviewed changes

DDtKey reviewed Jan 19, 2023

View reviewed changes

tustvold mentioned this pull request Feb 8, 2023

RFC: Use Apache Arrow Parquet Crate pola-rs/polars#6735

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add csv-core based reader (#3338) #3365

Add csv-core based reader (#3338) #3365

tustvold commented Dec 18, 2022 •

edited

Loading

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold Dec 18, 2022

tustvold commented Dec 18, 2022 •

edited

Loading

Dandandan commented Dec 19, 2022

Dandandan left a comment

ursabot commented Dec 19, 2022

bjchambers Jan 13, 2023

tustvold Jan 13, 2023

DDtKey Jan 19, 2023

tustvold Jan 19, 2023 •

edited

Loading

DDtKey Jan 19, 2023 •

edited

Loading

tustvold Jan 19, 2023 •

edited

Loading

DDtKey Jan 19, 2023

DDtKey Jan 19, 2023 •

edited

Loading

tustvold Jan 19, 2023

DDtKey Jan 19, 2023 •

edited

Loading

tustvold Jan 19, 2023 •

edited

Loading

Add csv-core based reader (#3338) #3365

Add csv-core based reader (#3338) #3365

Conversation

tustvold commented Dec 18, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold commented Dec 18, 2022 • edited Loading

Dandandan commented Dec 19, 2022

Dandandan left a comment

Choose a reason for hiding this comment

ursabot commented Dec 19, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

DDtKey Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

tustvold Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DDtKey Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DDtKey Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

tustvold Jan 19, 2023 • edited Loading

Choose a reason for hiding this comment

tustvold commented Dec 18, 2022 •

edited

Loading

tustvold commented Dec 18, 2022 •

edited

Loading

tustvold Jan 19, 2023 •

edited

Loading

DDtKey Jan 19, 2023 •

edited

Loading

tustvold Jan 19, 2023 •

edited

Loading

DDtKey Jan 19, 2023 •

edited

Loading

DDtKey Jan 19, 2023 •

edited

Loading

tustvold Jan 19, 2023 •

edited

Loading