Fixed error reading unbounded Avro list #1253
Codecov Report
Base: 83.15% // Head: 83.16% // Increases project coverage by +0.01%

Coverage Diff
              main     #1253      +/-
Coverage    83.15%    83.16%   +0.01%
Files          359       359
Lines        38063     38158      +95
Hits         31650     31735      +85
Misses        6413      6423      +10
    -len
} else {
    len
};
This PR looks correct to me, but it actually fixes a different bug than the one I was referring to in #1252. The fix here handles the case where an Avro list item block encodes an optional byte_size for data skipping, as <-1*block_count><byte_size>...
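For illustration, the block header described above could be decoded roughly as follows. This is a minimal standalone sketch, not the code in this PR: `read_long` and `read_block_count` are hypothetical helper names, and the sketch assumes the Avro encoding of a `long` as a zigzag varint.

```rust
use std::convert::TryFrom;

/// Decode one Avro `long` (variable-length, zigzag-encoded).
/// Advances `buf` past the consumed bytes; returns `None` on truncated
/// or over-long input. Hypothetical helper, not part of the PR.
fn read_long(buf: &mut &[u8]) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None; // more than 10 bytes: not a valid 64-bit varint
        }
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo the zigzag mapping (0, -1, 1, -2, ... stored as 0, 1, 2, 3, ...).
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Read the header of one array block and return its item count.
/// A negative count means "the absolute value is the count, and a
/// byte size follows"; the byte size is read and discarded here.
fn read_block_count(buf: &mut &[u8]) -> Option<usize> {
    let count = read_long(buf)?;
    if count < 0 {
        let _byte_size = read_long(buf)?; // only needed for skipping
        usize::try_from(count.unsigned_abs()).ok()
    } else {
        usize::try_from(count).ok()
    }
}
```

In this sketch the byte size is simply dropped; a reader that wants to skip a block without decoding its items would use it to advance the buffer instead.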
The issue I was referring to, however, is the call to array.try_push_valid()?; on line 145 (136 original), which is called for each block deserialised. Shouldn't this call be moved right outside of the loop {...}, onto line 147, so that we only push valid once per list item?
Thinking out loud: for an empty list, it looks like the if len == 0 { break; } currently short-circuits this loop, meaning we won't push a valid for that case either. I wonder if this is another unintended bug caused by having the call to try_push_valid inside the block-reading loop rather than outside it.
To clarify, I am concerned about the handling of the following two cases:
- List [1,2,3,4], encoded as two "blocks"
  - Currently we make 2 calls to try_push_valid whereas I would expect 1
- List [], encoded as 0 blocks, indicated by a prefixed byte of len == 0
  - Currently we make 0 calls to try_push_valid whereas I would expect 1, as you mentioned before that empty lists/structs are still "valid" in Arrow so as to maintain O(1) validity checks.
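The two cases above can be sketched in isolation as follows. This is a hedged illustration rather than the actual deserialiser: `ListBuilder`, `push_valid`, `read_long`, and `read_list` are hypothetical stand-ins (the real code calls `try_push_valid` on the mutable list array), and items are assumed to be Avro `long`s. The point is only that the validity push sits after the block loop, so it runs exactly once per list.

```rust
/// Hypothetical stand-in for a mutable list array: flat values plus,
/// per list, an end offset and a validity flag.
#[derive(Debug, Default)]
struct ListBuilder {
    values: Vec<i64>,
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl ListBuilder {
    /// Close the current list and mark it valid.
    fn push_valid(&mut self) {
        self.offsets.push(self.values.len());
        self.validity.push(true);
    }
}

/// Decode one Avro `long` (zigzag varint), advancing `buf`.
fn read_long(buf: &mut &[u8]) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Read one list: a series of blocks terminated by a zero count.
fn read_list(buf: &mut &[u8], out: &mut ListBuilder) -> Option<()> {
    loop {
        let count = read_long(buf)?;
        if count == 0 {
            break; // end of this list (also the only "block" of an empty list)
        }
        let count = if count < 0 {
            let _byte_size = read_long(buf)?; // optional size, unused here
            count.unsigned_abs()
        } else {
            count as u64
        };
        for _ in 0..count {
            out.values.push(read_long(buf)?);
        }
    }
    // Outside the loop: exactly one validity entry per list,
    // regardless of how many blocks (including zero) encoded it.
    out.push_valid();
    Some(())
}
```

With this shape, a list split across two blocks and an empty list each produce a single validity entry, which matches the expectation stated in the two cases.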
Again, I am not well versed in the Avro format, so feel free to correct me if my understanding here is incorrect. I came across these potential bugs while reading the deserialisation code thoroughly, as I was considering taking a stab at implementing Map type support for Avro, which will be very similar to List. Thank you
Agree. Really good call.
Did another push with an extra fix
Awesome, LGTM
Fixed #1252 - Thanks to @shaeqahmed for the report!