
Simplified reading parquet #532

Merged
merged 5 commits into main from parquet_compression on Oct 18, 2021

Conversation

@jorgecarleitao (Owner) commented Oct 16, 2021

This PR simplifies the code to read parquet, making it a bit more future-proof and opening the door to improving performance in writing by re-using buffers (improvements upstream).

I do not observe differences in performance (vs main) in the following parquet configurations:

  • single page vs multiple pages
  • compressed vs uncompressed
  • different types

It generates flamegraphs that imo are quite optimized:

[Flamegraph screenshot, 2021-10-16 06:44:55]

This corresponds to

cargo flamegraph --features io_parquet,io_parquet_compression \
    --example parquet_read fixtures/pyarrow3/v1/multi/snappy/benches_1048576.parquet \
    1 0

i.e. reading an f64 column from a single row group with 1M rows and a page size of 1 MB (the default in pyarrow).

[Flamegraph screenshot, 2021-10-16 06:53:44]

(the same, but for a utf8 column; column index 2)

The majority of the time is spent deserializing the data to arrow, which means that the main remaining gains continue to be on that front.

Backward incompatible changes

  • The API to read parquet now uses FallibleStreamingIterator instead of StreamingIterator (of Result<Page>). As before, we re-export these APIs in io::parquet::read. See the sketch after this list for what the trait change means for callers.
  • The API to write parquet now expects the user to compress the pages. This is only relevant when not using RowGroupIterator (i.e. when parallelizing). This is now enforced by the type system (DataPage vs CompressedDataPage), so that we do not get it wrong.
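
For illustration, a minimal sketch of the trait change, using the fallible-streaming-iterator crate; Page and PageReader here are illustrative stand-ins, not arrow2's actual types:

use fallible_streaming_iterator::FallibleStreamingIterator;

// Illustrative stand-ins, not arrow2's actual types.
struct Page(Vec<u8>);

struct PageReader {
    remaining: usize,
    buffer: Option<Page>, // re-used across calls to `advance`
}

impl FallibleStreamingIterator for PageReader {
    type Item = Page;
    type Error = std::io::Error;

    // Errors surface here, instead of as `Result<Page>` items.
    fn advance(&mut self) -> Result<(), Self::Error> {
        if self.remaining == 0 {
            self.buffer = None;
            return Ok(());
        }
        self.remaining -= 1;
        // Re-use the existing allocation when possible.
        let page = self.buffer.get_or_insert_with(|| Page(Vec::new()));
        page.0.clear();
        page.0.resize(1024, 0); // stand-in for "read + decompress into the buffer"
        Ok(())
    }

    // Items are borrowed, which is what enables the buffer re-use.
    fn get(&self) -> Option<&Page> {
        self.buffer.as_ref()
    }
}

fn main() -> std::io::Result<()> {
    let mut pages = PageReader { remaining: 3, buffer: None };
    while let Some(page) = pages.next()? {
        println!("page of {} bytes", page.0.len());
    }
    Ok(())
}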

codecov bot commented Oct 16, 2021

Codecov Report

Merging #532 (56c98c7) into main (13f8d09) will decrease coverage by 0.02%.
The diff coverage is 80.00%.


@@            Coverage Diff             @@
##             main     #532      +/-   ##
==========================================
- Coverage   81.28%   81.26%   -0.02%     
==========================================
  Files         380      380              
  Lines       23432    23376      -56     
==========================================
- Hits        19046    18997      -49     
+ Misses       4386     4379       -7     
Impacted Files Coverage Δ
src/io/parquet/mod.rs 0.00% <0.00%> (ø)
src/io/parquet/write/binary/basic.rs 84.72% <ø> (-0.42%) ⬇️
src/io/parquet/write/binary/nested.rs 94.73% <ø> (-0.92%) ⬇️
src/io/parquet/write/boolean/basic.rs 97.43% <ø> (-0.13%) ⬇️
src/io/parquet/write/boolean/nested.rs 94.73% <ø> (-0.92%) ⬇️
src/io/parquet/write/fixed_len_bytes.rs 97.22% <ø> (+6.52%) ⬆️
src/io/parquet/write/primitive/basic.rs 95.23% <ø> (-0.22%) ⬇️
src/io/parquet/write/primitive/nested.rs 94.73% <ø> (-0.92%) ⬇️
src/io/parquet/write/utf8/basic.rs 91.66% <ø> (-0.34%) ⬇️
src/io/parquet/write/utf8/nested.rs 94.73% <ø> (-0.92%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ritchie46 (Collaborator)

Any opportunity for simdutf8 in the validation of utf8?

Do you think we could parallelize the decoding of the pages (I believe it was the pages, correct me if I am wrong), similar to what was done for writing? Maybe accept a closure that takes the pages and returns them decoded?
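
A rough sketch of the shape this suggestion could take (purely illustrative: CompressedPage, Page and decode_pages are hypothetical stand-ins, and rayon is assumed for the parallelism):

use rayon::prelude::*;

// Hypothetical stand-ins for the page types discussed above.
struct CompressedPage(Vec<u8>);
struct Page(Vec<u8>);

// A hypothetical hook: the reader hands a batch of compressed pages to a
// user-provided closure, which is free to decode them in parallel.
fn decode_pages<F>(pages: Vec<CompressedPage>, decode: F) -> Vec<Page>
where
    F: Fn(Vec<CompressedPage>) -> Vec<Page>,
{
    decode(pages)
}

fn main() {
    let compressed = vec![CompressedPage(vec![0; 16]), CompressedPage(vec![1; 16])];
    let pages = decode_pages(compressed, |pages| {
        pages
            .into_par_iter() // decode on the rayon thread pool
            .map(|CompressedPage(bytes)| Page(bytes)) // stand-in for real decompression
            .collect()
    });
    println!("decoded {} pages", pages.len());
}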

@jorgecarleitao (Owner, Author)

After jorgecarleitao/parquet2#63 we get a nice flamegraph composed of 3 main blocks:

  1. thrift deserialization
  2. decompression
  3. deserialization

[Flamegraph screenshot, 2021-10-17 08:25:34]

@Dandandan (Collaborator)

> Any opportunity for simdutf8 in the validation of utf8?

@jorgecarleitao is the flamegraph taken with or without the simdutf8 feature on? I think it's either disabled or the individual strings are smaller than the simdutf8 boundary.

@jorgecarleitao (Owner, Author)

Sorry, what do you mean? There is no simdutf8 feature: it is always on.

@Dandandan (Collaborator)

> Sorry, what do you mean? There is no simdutf8 feature: it is always on.

You're right. I suppose the from_utf8 from core comes from the fallback case in simdutf8, which depends on the string length.

@ritchie46 (Collaborator)

> You're right. I suppose the from_utf8 from core comes from the fallback case in simdutf8, which depends on the string length.

In the csv-parser, I circumvent this by doing the simd utf8 check at the end, on the whole &[u8] buffer. Maybe that's an option here as well?
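
For illustration, a minimal sketch of that whole-buffer approach, assuming the simdutf8 crate (the discussion below shows why this alone may not be sufficient when offsets can split code points):

// One simdutf8 pass over the concatenated data instead of one
// `from_utf8` call per string.
fn validate_all(data: &[u8]) -> Result<(), simdutf8::basic::Utf8Error> {
    simdutf8::basic::from_utf8(data).map(|_| ())
}

fn main() {
    let buffer = b"hello\xCF\x80world"; // concatenated strings; CF 80 encodes "π"
    assert!(validate_all(buffer).is_ok());
}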

@Dandandan (Collaborator)

> You're right. I suppose the from_utf8 from core comes from the fallback case in simdutf8, which depends on the string length.
>
> In the csv-parser, I circumvent this by doing the simd utf8 check at the end, on the whole &[u8] buffer. Maybe that's an option here as well?

Is that enough, though? Individually invalid UTF-8 strings might still form valid UTF-8 on the whole buffer (I am not sure though)?

@ritchie46 (Collaborator)

> Is that enough, though? Individually invalid UTF-8 strings might still form valid UTF-8 on the whole buffer (I am not sure though)?

Yeah, come to think of it. 🤔

@jorgecarleitao (Owner, Author)

I do not think it is sufficient, but I can't find an example. I asked SO for guidance.

@jorgecarleitao (Owner, Author)

The answer seems to be no; a buffer that is valid UTF-8 as a whole can contain a slice that is not:

fn main() {
    let a = "π"; // valid utf8
    let a = a.as_bytes(); // [207, 128]
    // slicing in the middle of a code point yields invalid utf8:
    assert!(std::str::from_utf8(&a[..1]).is_err());
}

@ritchie46 (Collaborator) commented Oct 18, 2021

I did a local benchmark with the csv-parser and found that delaying validation and then verifying in a short loop yielded significant wins.

Best case: no simd validation

 Performance counter stats for 'target/release/memcheck':

         33.047,24 msec task-clock                #   10,120 CPUs utilized          
             8.085      context-switches          #    0,245 K/sec                  
               167      cpu-migrations            #    0,005 K/sec                  
         1.044.339      page-faults               #    0,032 M/sec                  
   117.592.547.005      cycles                    #    3,558 GHz                    
   188.107.619.210      instructions              #    1,60  insn per cycle         
    36.279.636.823      branches                  # 1097,811 M/sec                  
       136.945.502      branch-misses             #    0,38% of all branches        

       3,265390685 seconds time elapsed

      30,962444000 seconds user
       2,082338000 seconds sys

Immediate validation

 Performance counter stats for 'target/release/memcheck':

         41.289,61 msec task-clock                #   10,303 CPUs utilized          
            10.193      context-switches          #    0,247 K/sec                  
               134      cpu-migrations            #    0,003 K/sec                  
         1.065.864      page-faults               #    0,026 M/sec                  
   147.827.020.236      cycles                    #    3,580 GHz                    
   226.553.476.805      instructions              #    1,53  insn per cycle         
    45.155.132.062      branches                  # 1093,620 M/sec                  
       333.913.124      branch-misses             #    0,74% of all branches        

       4,007702461 seconds time elapsed

      39,149625000 seconds user
       2,133354000 seconds sys

Delayed validation: short loop

 Performance counter stats for 'target/release/memcheck':

         37.373,08 msec task-clock                #   10,390 CPUs utilized          
             8.738      context-switches          #    0,234 K/sec                  
                97      cpu-migrations            #    0,003 K/sec                  
         1.295.653      page-faults               #    0,035 M/sec                  
   133.258.060.058      cycles                    #    3,566 GHz                    
   224.312.115.282      instructions              #    1,68  insn per cycle         
    46.320.508.217      branches                  # 1239,408 M/sec                  
       220.777.819      branch-misses             #    0,48% of all branches        

       3,597132052 seconds time elapsed

      34,999141000 seconds user
       2,374095000 seconds sys

Example of the code used:

  let mut is_valid = true;
  if delay_utf8_validation(v.encoding, v.ignore_errors) {
      let mut start = 0usize;

      for &end in &v.offsets[1..] {
          // `get_unchecked` skips bounds checks and therefore requires `unsafe`
          let slice = unsafe { v.data.get_unchecked(start..end as usize) };
          start = end as usize;
          is_valid &= simdutf8::basic::from_utf8(slice).is_ok();
      }

      if !is_valid {
          return Err(PolarsError::ComputeError("invalid utf8 data in csv".into()));
      }
  }

@jorgecarleitao merged commit fdbaa18 into main on Oct 18, 2021
@jorgecarleitao deleted the parquet_compression branch on Oct 18, 2021 at 17:14