
fix RowWriter index out of bounds error #2968

Merged: 4 commits merged into apache:master on Jul 29, 2022

Conversation

comphead
Contributor

@comphead comphead commented Jul 27, 2022

Which issue does this PR close?

Closes #2910, #2963.

Rationale for this change

A strange bug was reported; it can be reproduced locally in row_writer_resize_test. The cause is that when the column is not nullable, one extra byte is added to the initial offset, which breaks the buffer resize logic and causes out-of-range panics on self.data[varlena_offset..varlena_offset + size].copy_from_slice(bytes);
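For illustration only (this is not code from the PR), a minimal sketch of why gating the resize on capacity() instead of len() can panic; write_varlena and the plain Vec buffer here are simplified stand-ins for the real writer:

    // Hypothetical, simplified stand-in for the varlena write path.
    // Indexing a Vec is bounded by len(), not capacity(), so a guard
    // against capacity() can let the slice below go out of bounds.
    fn write_varlena(data: &mut Vec<u8>, varlena_offset: usize, bytes: &[u8]) {
        let size = bytes.len();
        let new_width = varlena_offset + size;
        // Buggy guard: capacity() may already cover new_width even though
        // len() is smaller, so no resize happens here...
        if new_width > data.capacity() {
            data.resize(new_width, 0);
        }
        // ...and this slice then panics with an index-out-of-range error.
        data[varlena_offset..varlena_offset + size].copy_from_slice(bytes);
    }

    fn main() {
        let mut data: Vec<u8> = Vec::with_capacity(64); // len() == 0, capacity() == 64
        write_varlena(&mut data, 8, b"hello"); // panics at the copy_from_slice
    }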

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the core (Core DataFusion crate) label Jul 27, 2022
@comphead
Contributor Author

@yjshen please check this PR draft, as I noticed the row_writer logic was mostly written by you.

@comphead comphead marked this pull request as ready for review July 27, 2022 01:04
@codecov-commenter

codecov-commenter commented Jul 27, 2022

Codecov Report

Merging #2968 (7901ccc) into master (176f432) will increase coverage by 0.00%.
The diff coverage is 96.29%.

@@           Coverage Diff           @@
##           master    #2968   +/-   ##
=======================================
  Coverage   85.75%   85.76%           
=======================================
  Files         281      281           
  Lines       51494    51511   +17     
=======================================
+ Hits        44161    44179   +18     
+ Misses       7333     7332    -1     
Impacted Files Coverage Δ
datafusion/row/src/writer.rs 91.21% <91.66%> (+1.35%) ⬆️
datafusion/core/src/dataframe.rs 89.05% <100.00%> (+0.76%) ⬆️
datafusion/expr/src/window_frame.rs 92.43% <0.00%> (-0.85%) ⬇️
datafusion/core/src/physical_plan/metrics/value.rs 86.93% <0.00%> (-0.51%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 77.43% <0.00%> (ø)


@thomas-k-cameron
Contributor

Hi,

I'm the one who reported the bug.

If the resizing logic needs to be driven by the length of the data field rather than its capacity, shouldn't we do the same for the pub(crate) fn write_field_binary as well?

I wonder if pub(crate) fn write_field_binary could create a similar bug.

@kmitchener
Contributor

Per my note on #2963, this PR fixes that issue.

@liukun4515 liukun4515 added the bug (Something isn't working) label Jul 27, 2022
@liukun4515 liukun4515 requested a review from yjshen July 27, 2022 13:56
@liukun4515
Contributor

@yjshen PTAL

@yjshen
Member

yjshen commented Jul 27, 2022

I'm aware of this PR but ran out of time today. Will come back tomorrow morning.

Contributor

@alamb alamb left a comment


Thanks @comphead

I also think @yjshen should review this if he has time before we merge this in

Comment on lines 1301 to 1303
// let df = ctx.sql("SELECT * FROM test").await.unwrap();
// df.show_limit(10).await.unwrap();
// dbg!(df.schema());
Contributor


Should we leave it in?

Member


We'd better remove this I suppose.

@alamb alamb changed the title from "fix writer index out of bounds" to "fix RowWriter index out of bounds error" Jul 27, 2022
@comphead
Contributor Author

Thanks @comphead

I also think @yjshen should review this if he has time before we merge this in

Hi @alamb, I'm waiting for @yjshen; after his review I can do the same for the binary datatype.

Member

@yjshen yjshen left a comment


Thanks @comphead!

I propose we fix the following:

  • The new_width > to.data.len() change makes sense to me, but on the resize part:
    if new_width > to.data.len() {
        to.data.resize(max(to.data.capacity(), new_width), 0);
    }

I think we can just resize to the current capacity to avoid reallocating, instead of the previous approach of doubling the capacity (see the small standalone sketch after this comment).

  • write_field_binary should be adapted accordingly as well.

  • The capacity() bug also exists in the end_padding logic; it should be:

    /// End each row at 8-byte word boundary.
    pub(crate) fn end_padding(&mut self) {
        let payload_width = self.current_width();
        self.row_width = round_upto_power_of_2(payload_width, 8);
        if self.data.len() < self.row_width {
            self.data.resize(self.row_width, 0);
        }
    }

@liukun4515 I cannot quite follow your +8 logic for calculating the size in the comments above, and I didn't get a panic with your 110 empty string case either. Please correct me if I have missed something important.
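A minimal standalone sketch (not part of this PR) of the Vec behavior the resize suggestion relies on: resizing up to the existing capacity fills the spare space without reallocating, and max(capacity, new_width) still grows the buffer when the required width is larger.

    use std::cmp::max;

    fn main() {
        let mut data: Vec<u8> = Vec::with_capacity(64);
        let before = data.as_ptr();

        // Need 40 bytes: resize to the larger of the current capacity
        // and the required width, so the existing allocation is reused.
        let new_width = 40;
        data.resize(max(data.capacity(), new_width), 0);

        assert_eq!(data.len(), data.capacity()); // length now covers the spare space
        assert_eq!(data.as_ptr(), before);       // same allocation, no reallocation
    }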

@@ -269,29 +269,29 @@ impl RowWriter {
 
 /// Stitch attributes of tuple in `batch` at `row_idx` and returns the tuple width
 pub fn write_row(
-    row: &mut RowWriter,
+    row_writer: &mut RowWriter,
Member


This renaming is unrelated, I think?

Contributor Author


I thought it was better naming: when reading the code, I assumed row was either the incoming row or the row buffer, and neither is correct. row_writer better reflects the object, imho.


@liukun4515
Contributor

@liukun4515 I cannot quite follow your +8 logic for calculating the size in the comments above, and I didn't get a panic with your 110 empty string case either. Please correct me if I have missed something important.

Sorry for the erroneous comments; I had the wrong understanding of the row layout.

@@ -349,9 +349,9 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {
         // double the capacity to avoid repeated resize
Member

@yjshen yjshen Jul 29, 2022


outdated code comment

@@ -365,9 +365,9 @@ pub(crate) fn write_field_binary(
     let from = from.as_any().downcast_ref::<BinaryArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {
         // double the capacity to avoid repeated resize
Member


same here

@yjshen
Member

yjshen commented Jul 29, 2022

@alamb want to take another look?

Contributor

@alamb alamb left a comment


looks great -- thank you all for your help and attention to this matter

@alamb alamb merged commit 811bad4 into apache:master Jul 29, 2022
@ursabot

ursabot commented Jul 29, 2022

Benchmark runs are scheduled for baseline = 176f432 and contender = 811bad4. 811bad4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: bug (Something isn't working), core (Core DataFusion crate)
Linked issue: index out of range error from datafusion_row::write::write_field
8 participants