Add csv-core based reader (#3338) #3365

Merged
merged 2 commits into apache:master on Dec 19, 2022

Conversation

@tustvold tustvold commented Dec 18, 2022

Which issue does this PR close?

Part of #3338
Closes #3364

Rationale for this change

Yields speedups of anywhere from 25% to 75%, with larger improvements at larger batch sizes.

4096 u64(0) - 128       time:   [360.72 µs 360.87 µs 361.05 µs]
                        change: [-23.673% -23.491% -23.370%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 1024      time:   [331.16 µs 331.27 µs 331.38 µs]
                        change: [-48.717% -48.649% -48.543%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [330.40 µs 330.53 µs 330.71 µs]
                        change: [-75.546% -75.482% -75.444%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 128       time:   [356.13 µs 356.22 µs 356.32 µs]
                        change: [-25.166% -25.142% -25.120%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 i64(0) - 1024      time:   [328.21 µs 328.32 µs 328.43 µs]
                        change: [-48.985% -48.896% -48.843%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i64(0) - 4096      time:   [327.60 µs 327.77 µs 327.94 µs]
                        change: [-75.634% -75.622% -75.610%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f32(0) - 128       time:   [293.13 µs 293.22 µs 293.31 µs]
                        change: [-26.308% -26.126% -25.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [265.46 µs 265.52 µs 265.59 µs]
                        change: [-50.188% -50.037% -49.897%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 4096      time:   [266.15 µs 266.23 µs 266.32 µs]
                        change: [-75.647% -75.632% -75.618%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 f64(0) - 128       time:   [316.88 µs 316.95 µs 317.03 µs]
                        change: [-25.096% -24.949% -24.860%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 1024      time:   [285.10 µs 285.16 µs 285.23 µs]
                        change: [-51.985% -51.965% -51.946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

4096 f64(0) - 4096      time:   [285.62 µs 285.72 µs 285.84 µs]
                        change: [-77.933% -77.901% -77.848%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 128
                        time:   [163.92 µs 163.96 µs 164.00 µs]
                        change: [-37.060% -37.015% -36.956%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 1024
                        time:   [127.45 µs 127.49 µs 127.54 µs]
                        change: [-67.408% -67.348% -67.247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(10, 0) - 4096
                        time:   [125.65 µs 125.71 µs 125.77 µs]
                        change: [-86.901% -86.880% -86.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(30, 0) - 128
                        time:   [225.21 µs 225.25 µs 225.30 µs]
                        change: [-31.578% -31.423% -31.306%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(30, 0) - 1024
                        time:   [182.79 µs 182.88 µs 182.98 µs]
                        change: [-64.628% -64.497% -64.342%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [183.36 µs 183.43 µs 183.52 µs]
                        change: [-84.714% -84.702% -84.691%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0) - 128
                        time:   [490.86 µs 491.27 µs 491.70 µs]
                        change: [-19.107% -18.963% -18.816%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 1024
                        time:   [445.64 µs 445.85 µs 446.07 µs]
                        change: [-49.217% -49.155% -49.065%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 4096
                        time:   [490.16 µs 490.55 µs 490.89 µs]
                        change: [-75.088% -75.052% -75.012%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  15 (15.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [357.47 µs 357.56 µs 357.66 µs]
                        change: [-27.187% -27.017% -26.761%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [316.25 µs 316.36 µs 316.49 µs]
                        change: [-58.868% -58.846% -58.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0.5) - 4096
                        time:   [325.66 µs 325.82 µs 326.00 µs]
                        change: [-77.121% -77.107% -77.094%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [1.1570 ms 1.1576 ms 1.1583 ms]
                        change: [-18.339% -18.157% -17.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) high mild
  16 (16.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [1.0139 ms 1.0146 ms 1.0154 ms]
                        change: [-35.902% -35.849% -35.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  7 (7.00%) low severe
  7 (7.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, enable flat sampling, or reduce sample count to 60.
4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [1.0514 ms 1.0524 ms 1.0535 ms]
                        change: [-63.795% -63.754% -63.713%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [941.30 µs 941.55 µs 941.85 µs]
                        change: [-21.932% -21.887% -21.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [802.95 µs 803.27 µs 803.65 µs]
                        change: [-40.134% -40.095% -40.059%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [813.42 µs 813.84 µs 814.26 µs]
                        change: [-66.732% -66.690% -66.632%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

What changes are included in this PR?

Adds a custom record reader based on csv-core that significantly reduces allocations whilst parsing CSV data into arrow.
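
For a sense of how this works, here is a minimal, self-contained sketch of the csv-core decoding pattern the new reader builds on: records are decoded into a reusable byte buffer plus a reusable buffer of field-end offsets, rather than allocating a String per field. Buffer sizes, names, and the error handling are illustrative only, not the PR's actual code.

```rust
use csv_core::{ReadRecordResult, Reader};

fn main() {
    // Reused across records: one byte buffer for decoded field bytes and one
    // buffer of field-end offsets, instead of a fresh String per field.
    let mut data = [0u8; 1024];
    let mut ends = [0usize; 64];

    let mut reader = Reader::new();
    let mut remaining: &[u8] = b"a,b,c\n1,2,3\n";
    loop {
        // Returns (state, bytes consumed, bytes written, field ends written).
        let (result, nin, _nout, nend) = reader.read_record(remaining, &mut data, &mut ends);
        remaining = &remaining[nin..];
        match result {
            ReadRecordResult::Record => {
                // Each field is a slice of `data`, delimited by recorded ends.
                let mut start = 0;
                for &end in &ends[..nend] {
                    print!("[{}] ", std::str::from_utf8(&data[start..end]).unwrap());
                    start = end;
                }
                println!();
            }
            ReadRecordResult::End => break,
            // A chunked reader would fetch more input here; our input is
            // complete, so the next call (empty input) finalizes the stream.
            ReadRecordResult::InputEmpty => {}
            // A production reader grows its buffers and retries on these.
            ReadRecordResult::OutputFull | ReadRecordResult::OutputEndsFull => {
                panic!("scratch buffers too small for this sketch")
            }
        }
    }
}
```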

Are there any user-facing changes?

Previously, if provided with a schema that was a valid prefix of the columns, the reader wouldn't complain; it now will. This behaviour was undocumented, and I think accidental, but it is technically a user-facing change.

@github-actions bot added the arrow (Changes to the arrow crate) label on Dec 18, 2022
@@ -45,6 +45,7 @@ arrow-data = { version = "29.0.0", path = "../arrow-data" }
arrow-schema = { version = "29.0.0", path = "../arrow-schema" }
chrono = { version = "0.4.23", default-features = false, features = ["clock"] }
csv = { version = "1.1", default-features = false }
csv-core = { version = "0.1"}
tustvold (author):

This is already a dependency of csv, and has no default features

fn infer_reader_schema_with_csv_options<R: Read>(
    reader: R,
    roptions: ReaderOptions,
) -> Result<(Schema, usize), ArrowError> {
-    let mut csv_reader = Reader::build_csv_reader(
+    let mut csv_reader = build_csv_reader(
tustvold (author):

Schema inference still uses the old reader, both to reduce the size of this PR and because inference is inherently row oriented, unlike parsing.

@@ -383,6 +417,7 @@ impl<R: Read> Reader<R> {
    /// This constructor allows you more flexibility in what records are processed by the
    /// csv reader.
    #[allow(clippy::too_many_arguments)]
+   #[deprecated(note = "Use Reader::new or ReaderBuilder")]
tustvold (author):

Having two methods, from_reader and new, with identical signatures is a touch confusing. Let's just unify them.

    schema,
    projection,
    reader: csv_reader,
    line_number: if has_header { start + 1 } else { start },
tustvold (author):

This is the cause of #3364: it increments the start but not the end.
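
A hypothetical sketch of the corrected bounds handling (the names has_header and bounds follow the snippet above; the exact fix in the PR may differ): with a header row present, the whole row range shifts, not just its start.

```rust
// Hypothetical: shift both ends of the row range past the header row.
// Incrementing only `start` shrinks the range by one row, which is #3364.
fn row_bounds(has_header: bool, bounds: Option<(usize, usize)>) -> (usize, usize) {
    let (start, end) = bounds.unwrap_or((0, usize::MAX));
    if has_header {
        (start + 1, end.saturating_add(1))
    } else {
        (start, end)
    }
}
```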

@@ -2074,4 +2032,31 @@ mod tests {
        let col1_arr = col1.as_any().downcast_ref::<StringArray>().unwrap();
        assert_eq!(col1_arr.value(5), "value5");
    }

    #[test]
    fn test_header_bounds() {
tustvold (author):

Test for #3364

@@ -177,11 +179,36 @@ pub fn infer_reader_schema<R: Read>(
    infer_reader_schema_with_csv_options(reader, roptions)
}

/// Creates a `csv::Reader`
fn build_csv_reader<R: Read>(
tustvold (author):

This method is just moved to be closer to where it is used. A subsequent PR might look at moving the schema inference logic into its own file.

    col_idx: usize,
    precision: u8,
    scale: i8,
) -> Result<ArrayRef, ArrowError> {
    let mut decimal_builder = Decimal128Builder::with_capacity(rows.len());
    for row in rows {
tustvold (author):

We now more strictly enforce that the schema actually matches the data read. Rows with missing fields were never documented behaviour, and I think that was largely an implementation quirk, but it is worth highlighting.


/// Skips forward `to_skip` rows
pub fn skip(&mut self, mut to_skip: usize) -> Result<(), ArrowError> {
    // TODO: This could be done by scanning for unquoted newline delimiters
tustvold (author):

Previously the implementation read into a ByteRecord; this will perform similarly or better, so I don't think this is an issue.
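
For reference, a minimal sketch of how such a skip can look on top of csv-core (names and buffer sizes are illustrative, not the PR's code): decode records into scratch buffers and discard them, counting records rather than materializing a ByteRecord.

```rust
use csv_core::{ReadRecordResult, Reader};

/// Skip `to_skip` records, returning how many input bytes were consumed.
fn skip_records(reader: &mut Reader, mut input: &[u8], mut to_skip: usize) -> usize {
    let mut out = [0u8; 4096];
    let mut ends = [0usize; 256];
    let mut consumed = 0;
    while to_skip > 0 {
        let (result, nin, _, _) = reader.read_record(input, &mut out, &mut ends);
        input = &input[nin..];
        consumed += nin;
        match result {
            ReadRecordResult::Record => to_skip -= 1,
            ReadRecordResult::End => break,
            // Output is discarded, so full scratch buffers are simply reused
            // on the next call; exhausted input would need more bytes from
            // the underlying reader in a real implementation.
            _ => {}
        }
    }
    consumed
}
```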

tustvold commented Dec 18, 2022

The performance gain is significantly better than I expected, to the point where I wonder if I've messed something up 😅

In particular the timings not scaling with batch size seems somewhat suspect to me...

@Dandandan (Contributor) replied:

> The performance gain is significantly better than I expected, to the point where I wonder if I've messed something up 😅
>
> In particular the timings not scaling with batch size seems somewhat suspect to me...

Timings make sense to me: for a single batch the performance difference will be pretty large (as in the benchmark), but for a full CSV file with many batches the difference is probably smaller, as the csv-based implementation re-uses the allocated StringRecords across batches (as long as they are large enough to hold the record).
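
For illustration, a small sketch of that reuse pattern in the csv crate (a standalone example, not the old arrow code): one StringRecord is allocated once and reused for every read, so its allocation amortizes across all subsequent records.

```rust
use csv::{ReaderBuilder, StringRecord};

// Count data rows while reusing a single StringRecord allocation.
fn count_rows(data: &str) -> csv::Result<usize> {
    let mut reader = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(data.as_bytes());
    let mut record = StringRecord::new(); // allocated once, reused below
    let mut rows = 0;
    while reader.read_record(&mut record)? {
        rows += 1;
    }
    Ok(rows)
}
```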

@Dandandan Dandandan left a comment:

Great improvement 👌

@tustvold tustvold merged commit c344433 into apache:master Dec 19, 2022
ursabot commented Dec 19, 2022

Benchmark runs are scheduled for baseline = a8c9685 and contender = c344433. c344433 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q]
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

    col_idx: usize,
) -> Result<ArrayRef, ArrowError> {
    rows.iter()
        .enumerate()
        .map(|(row_index, row)| {
            match row.get(col_idx) {
                Some(s) => {
                    if s.is_empty() {
A contributor commented:

This dropped the handling of null values for boolean arrays. Specifically, it removed the block below (which was previously present for both primitives and booleans). The effect of this is that parsing a CSV file containing a null value now raises a parse error.

if s.is_empty() {
  return Ok(None);
}
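
A minimal sketch of the parse path with the empty-string-as-null handling restored (the function shape and names are illustrative, not arrow's actual code, and arrow's boolean parser accepts more spellings than Rust's bool::from_str):

```rust
use arrow_array::BooleanArray;
use arrow_schema::ArrowError;

/// Parse one boolean column, mapping missing or empty fields to null.
fn parse_bool_column<'a>(
    fields: impl Iterator<Item = Option<&'a str>>,
) -> Result<BooleanArray, ArrowError> {
    fields
        .map(|field| match field {
            // Restored behaviour: empty (or absent) field => null, not error.
            None | Some("") => Ok(None),
            Some(s) => s
                .parse::<bool>()
                .map(Some)
                .map_err(|_| ArrowError::ParseError(format!("invalid boolean: {s}"))),
        })
        .collect()
}
```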

tustvold (author):

Oops, will fix before the next release.

ReadRecordResult::InputEmpty => break 'input, // Input exhausted, need to read more
ReadRecordResult::OutputFull => break, // Need to allocate more capacity
ReadRecordResult::OutputEndsFull => {
    return Err(ArrowError::CsvError(format!(
        "incorrect number of fields, expected {} got more than {}",
        self.num_columns, field_count
    )))
@DDtKey (Contributor):

Looks like something is wrong with this condition.

I will investigate and create an issue with an MRE as soon as I can reproduce it on a stable basis.
But I just wanted to mention: I'm using datafusion, and after updating to the new CSV reader I started to encounter errors like:
incorrect number of fields, expected 7 got more than 7 (the numbers are always the same, actually)

And it happens for cases that worked well before this update @tustvold

@tustvold tustvold (author) Jan 19, 2023:

This may be expected, see apache/datafusion#4918

The TL;DR is that this may have been working by accident, and has never worked reliably.

@DDtKey DDtKey Jan 19, 2023:

@tustvold wow. But the CSV files that produce this error are correct (and the schema is passed directly, without inference); they have exactly the expected number of fields (however, let me check for null values inside).

Shouldn't the error message be changed in that case? I can't clearly understand the case in which it happens, but I'm trying.

@tustvold tustvold (author) Jan 19, 2023:

It happens when at least one row contains more fields than specified in the schema, i.e. more than schema.fields().len() - 1 delimiters.
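
For example, a hypothetical reproduction (assuming the ReaderBuilder API of this arrow-csv era): every row must have exactly the schema's field count, so a single over-long row fails the whole batch.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow_csv::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Int64, true),
        Field::new("c", DataType::Int64, true),
    ]));
    // The second row has a 4th field, i.e. more than
    // schema.fields().len() - 1 = 2 delimiters.
    let data = "1,2,3\n4,5,6,7\n";
    let mut reader = ReaderBuilder::new()
        .with_schema(schema)
        .build(Cursor::new(data))
        .unwrap();
    // Expect: Err("incorrect number of fields, expected 3 got more than 3")
    println!("{:?}", reader.next().unwrap());
}
```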

@DDtKey (Contributor):

@tustvold I'm totally sure the number of fields in each row was equal to or less than (less meaning null values) the number of fields in the schema.

I'm double-checking that.

@DDtKey DDtKey Jan 19, 2023:

When one of the rows contains more fields than specified in the schema, the error is usually
incorrect number of fields, expected X got Y (from the ReadRecordResult::Record arm of the match), not this variant of the error.

tustvold (author):

If you update to the latest DataFusion, which includes arrow 31, it will print a line number, which may help identify what is going on.

@DDtKey DDtKey Jan 19, 2023:

@tustvold yes, I did, and it prints one. But the lines in the files are correct; it always refers to the first one in my case. 🤔 As I said, I'll try to create an MRE; otherwise it's hard to explain.

@tustvold tustvold (author) Jan 19, 2023:

Yeah, that sounds like apache/datafusion#4918; the error is occurring when it tries to skip the header.

Labels
arrow (Changes to the arrow crate)
Development

Successfully merging this pull request may close these issues:

CSV Reader Bounds Incorrectly Handles Header (#3364)

5 participants