ARROW-2822: [C++] Zero padding bytes in PoolBuffer #2239
Codecov Report
@@ Coverage Diff @@
## master #2239 +/- ##
=========================================
+ Coverage 84.25% 86.66% +2.4%
=========================================
Files 290 236 -54
Lines 44305 41724 -2581
=========================================
- Hits 37330 36160 -1170
+ Misses 6944 5564 -1380
+ Partials 31 0 -31
54282ec to 372c5e2
pitrou
left a comment
Thanks for doing this! In principle this looks sane. A couple of comments below.
cpp/src/arrow/builder.cc
Outdated
Is this desirable? If an error occurs while building an array, it is normal not to finish the builder.
Yes, as mentioned in JIRA, it throws a lot of false positives. How about I set the warning level to ARROW_DEBUG, so it doesn't get printed by default but can still be used to find problems if the need arises?
I don't think there should be a warning at this point at all. What could emit a warning are the various builder methods that allow bypassing the Finish call (for example ArrayBuilder::null_bitmap).
How about hiding this behind a flag like ARROW_LOG_UNFINISHED_BUILDERS? Always having this output in debug builds seems like too much to me.
I want to rework it and move the warning to null_bitmap and data methods. It seems to be doable, but I'd need some more time.
We can still then move it behind a special compile flag.
cpp/src/arrow/builder.cc
Outdated
Note null buffers are problematic, see PR #2243.
I just left a review + comments there. I think we should try to merge this before merging #2243 so we can decide what to do there
cpp/src/arrow/buffer.h
Outdated
Not a native English speaker, but "between padding" looks grammatically incorrect.
"in" instead of "between" sounds fine
Yeah, comment editing fail, I'll fix it.
cpp/src/arrow/builder.cc
Outdated
What's the rationale for this change? All values in NullBuilder are null
Oh, then I've misunderstood its purpose. I thought it should be used as in python_to_arrow.cc.
I'll revert it and add some unit test cases you could look at.
cpp/src/arrow/builder.cc
Outdated
Ditto re: decision above
cpp/src/arrow/builder.cc
Outdated
Is setting unfinished here needed?
Since unfinished signals that there were values appended after the last FinishInternal call, I would say yes.
cpp/src/arrow/builder.cc
Outdated
Curious why this is needed
There are nulls appended to the values_builder_, so it could be resized and contain uninitialized padding if we never call finish on it.
cpp/src/arrow/builder.h
Outdated
Seems like this should only be set when the builder is initialized
That was what I tried at first. Then I hit the problem that sometimes the builders are initialized but never appended to and never finished, e.g. in DictionaryBuilder. So I changed the logic to "set to false on every append".
I'll see if I can change it back after the rework I've mentioned before.
cpp/src/arrow/io/file.cc
Outdated
It seems error prone to have to remember to zero the padding. What would be the performance consequences (if any, realistically) of zeroing in Resize?
Some things may be repeatedly resized if you don't know the exact size up front (e.g. in BinaryBuilder).
Right, in the event that you shrank the buffer in BinaryBuilder when Finish is called, you would want to zero the padding there, too
I guess if you shrink the memory then the memory you padded before goes unused
The worst case is when you Reserve a huge amount of memory and then gradually Resize it, as the new elements come in. You would expect these operations to be virtually free while in reality you are zeroing everything between size and capacity on every call.
Got it. OK, let's leave things as they are, then
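The trade-off discussed in this thread can be illustrated with a small sketch. `ToyBuffer` below is a hypothetical stand-in, not Arrow's `PoolBuffer`: the 64-byte rounding, the `0xAB` fill that simulates uninitialized memory, and all method names are assumptions made for illustration. It counts how many bytes get zeroed when the padding is cleared on every `Resize` compared with once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy buffer (not Arrow's PoolBuffer): capacity is rounded up to
// a multiple of 64 bytes and the bytes in [size, capacity) are "padding".
class ToyBuffer {
 public:
  void Reserve(int64_t capacity) {
    assert(capacity >= 0);
    const int64_t padded = (capacity + 63) / 64 * 64;
    if (padded > this->capacity()) {
      // 0xAB simulates the garbage a real allocator may hand back.
      storage_.resize(static_cast<std::size_t>(padded), 0xAB);
    }
  }

  // Changing the logical size is cheap unless we also zero the padding.
  void Resize(int64_t new_size, bool zero_padding) {
    Reserve(new_size);
    size_ = new_size;
    if (zero_padding) ZeroPadding();
  }

  // Clear every byte between size and capacity.
  void ZeroPadding() {
    const int64_t n = capacity() - size_;
    if (n > 0) {
      std::memset(storage_.data() + size_, 0, static_cast<std::size_t>(n));
      bytes_zeroed_ += n;
    }
  }

  bool PaddingIsZero() const {
    for (int64_t i = size_; i < capacity(); ++i) {
      if (storage_[static_cast<std::size_t>(i)] != 0) return false;
    }
    return true;
  }

  int64_t size() const { return size_; }
  int64_t capacity() const { return static_cast<int64_t>(storage_.size()); }
  int64_t bytes_zeroed() const { return bytes_zeroed_; }

 private:
  std::vector<uint8_t> storage_;
  int64_t size_ = 0;
  int64_t bytes_zeroed_ = 0;
};

// Reserve once, then grow one byte at a time, as a builder appending values might.
int64_t BytesZeroed(int64_t reserved, int64_t final_size, bool zero_on_every_resize) {
  ToyBuffer buf;
  buf.Reserve(reserved);
  for (int64_t n = 1; n <= final_size; ++n) buf.Resize(n, zero_on_every_resize);
  if (!zero_on_every_resize) buf.ZeroPadding();  // zero once, at "finish" time
  assert(buf.PaddingIsZero());
  return buf.bytes_zeroed();
}
```

In this toy model, reserving 1024 bytes and growing to 1000 bytes one byte at a time zeroes 523,500 bytes in total if every `Resize` clears the padding, against 24 bytes if the padding is cleared once at the end. Both variants end with zeroed padding; only the amount of redundant work differs.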
The way that NullBuilder is being used in this module doesn't look right; it is probably related to the NullBuilder changes above. cc @pcmoritz. It looks like a BooleanBuilder should be used instead.
I reverted the NullBuilder changes and switched nones_ to BooleanBuilder.
@pcmoritz could you check if 6753dcc is fine?
Hi guys. I'll try to incorporate the changes and rework the warning logic to be more precise in the coming days (hopefully tomorrow).
Ok, so there was a possibility I totally hadn't considered before: we can just remove the Does anyone know if some 3rd party code relies on these methods? I would argue they should be considered an implementation detail, but we might want to put a deprecation warning before removing them altogether.
I would deprecate them first indeed.
Rebased on the master.
pitrou
left a comment
Looks like the deprecation may be problematic after all. Sorry for all the back-and-forth!
Hmm... I think it's a bit problematic that we seem to be introducing a second data copy here (first from source to converted, then from converted to the builder). Can we avoid the second copy without using builder->data()? Otherwise we may have to undeprecate the latter.
I was also confused to see a stack allocation here and above. Is it not possible to use source in AppendValues?
Note we don't have any unit tests for this code path, so we have to be extra careful.
cpp/src/arrow/buffer.h
Outdated
I think "fixed-size" is the correct spelling. Perhaps a native English speaker can confirm?
I googled a bit. "fixed size" seems more common, and reads OK to me
cpp/src/arrow/buffer.h
Outdated
I think the article here ("a resizeable buffer") shouldn't be removed.
wesm
left a comment
It's good to make all this more precise + clean things up. I left some comments / questions
stack allocation here
cpp/src/arrow/builder.h
Outdated
Are these lines equivalent (I thought so)?
Yes, I changed it back.
cpp/src/arrow/builder.h
Outdated
typo
cpp/src/arrow/builder.h
Outdated
typo
cpp/src/arrow/builder.h
Outdated
If this is trivial, it can be omitted altogether?
@pcmoritz or @robertnishihara take a look to make sure these changes look right.
Should this be Append(false)? I would expect nones_ to be an all non-null boolean array
Union array expects a null_count in addition to the null_bitmap. Since the boolean array doesn't track the number of true/false values, we add a null here. We could of course count falses manually after finishing.
That's the reason why I thought the NullBuilder behaviour I had previously made sense: you could add both null and non-null values and then extract the null bitmap and the null count.
python/pyarrow/tests/test_builder.py
Outdated
Is this still needed?
Removed.
So, what the ORC adapter did was to use
My solution is to overload the
An additional problem was that the ORC adapter used the data pointer to directly add a computed value (time in nanoseconds). What is easily done in Python with generators is not directly supported in any current C++ version (excluding non-standard extensions). I added the
I really think we should remove/deprecate
Ok, so funny thing: MSVC14 (i.e. 2015) issues a warning on narrowing conversion when using
Which raises the question of whether the ORC adapter in its current form has even been through CI, because [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/adapters/orc/adapter.cc#L535] it totally copies
With the supplied changes it's actually pretty easy to fix the warning: just wrap the iterator with a lambda which does a
An alternative would be to silence the warning, since it's apparently absolutely non-standard. EDIT: Later MSVC versions do raise the same warning. gcc and clang aren't bothered, even on
@alendit can you confirm when the PR is ready for review again? Thanks for working on this!
Sounds good to me
2811817 to 9ad72a4
Reviewing this again now
wesm
left a comment
+1, wow thanks @alendit for going to the effort of adding iterator versions for all the AppendValues functions, definitely going above and beyond. This will be really helpful going forward.
I took a pretty careful read through this and it looks good to me. @pitrou do you want to look any more or do you mind if I go ahead and merge? #2243 will need to be rebased after but it shouldn't be too bad
target_type* target = reinterpret_cast<target_type*>(builder->data()->mutable_data());
auto cast_iter = internal::MakeLazyRange(
    [&source](int64_t index) { return static_cast<target_type>(source[index]); },
    length);
Cool
int64_t bit_offset = length_ % 8;
uint8_t bitset = null_bitmap_data_[byte_offset];

for (auto iter = begin; iter != end; ++iter) {
As an aside, would you expect the optimized code to be significantly different here vs. the original const uint8_t* valid_bytes version?
With STL iterators and raw pointers it should be exactly the same. With more complicated iterators, it's hard to tell, though the LazyRange ones give the same performance as loops, at least on GCC.
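As a rough illustration of the iterator-based approach discussed here, the following sketch appends validity flags from an arbitrary iterator range to a bitmap. `AppendValidity` and its parameters are hypothetical and are not the builder's actual interface; the sketch assumes least-significant-bit-first ordering within each byte and a bitmap that already holds exactly the bytes needed for the current length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Arrow's builder API: append validity flags from any
// iterator range to an LSB-first bitmap, tracking length and null count.
template <typename ValidIter>
void AppendValidity(ValidIter begin, ValidIter end, std::vector<uint8_t>* bitmap,
                    int64_t* length, int64_t* null_count) {
  assert(bitmap != nullptr && length != nullptr && null_count != nullptr);
  for (auto iter = begin; iter != end; ++iter) {
    const std::size_t byte_offset = static_cast<std::size_t>(*length / 8);
    const int bit_offset = static_cast<int>(*length % 8);
    assert(byte_offset <= bitmap->size());
    if (byte_offset == bitmap->size()) {
      bitmap->push_back(0);  // fresh bytes start zeroed, so unused bits stay 0
    }
    if (*iter) {
      (*bitmap)[byte_offset] =
          static_cast<uint8_t>((*bitmap)[byte_offset] | (1u << bit_offset));
    } else {
      ++*null_count;  // a cleared bit marks a null slot
    }
    ++*length;
  }
}
```

The same template accepts a raw `const uint8_t*` pair, `std::vector<bool>` iterators, or a generator-backed iterator, which is the flexibility the iterator overloads are after. Because new bytes are pushed as zero, the trailing bits of the last byte never hold uninitialized data.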
cpp/src/arrow/builder.h
Outdated
I don't think returning a const value makes sense? Either return a value or a const reference;
Since these are always C types I would think that just value_type is fine; I am not an expert on the fine details, though
Yupp, the first const should go.
BENCHMARK(BM_lazy_copy)->Repetitions(3)->Unit(benchmark::kMillisecond);

// std::copy with a lazy iterator which does static cast
I am curious: what do you call a lazy iterator? ISTM iterators are always lazy.
I believe he's referring to the object returned by internal::MakeLazyRange
Yes, the LazyRange is lazy about calling the supplied generator function instead of generating a new container eagerly like std::generate, for example.
With these benchmarks I wanted to make sure that the compilers can inline the lambdas properly, so we don't sacrifice performance compared to a loop.
EDIT: reading this again, maybe it should be a lazy container to avoid confusion.
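To make the "lazy" idea concrete, here is a minimal sketch of a range whose elements are produced by calling a generator only when an iterator is dereferenced. `ToyLazyRange`, `MakeToyLazyRange` and `WidenLazily` are hypothetical names invented for this illustration; the real `internal::MakeLazyRange` may differ in interface and details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for the idea behind internal::MakeLazyRange;
// the names and interface are assumptions, not Arrow's actual code.
template <typename Generator>
class ToyLazyRange {
 public:
  class Iterator {
   public:
    using iterator_category = std::input_iterator_tag;
    using value_type = typename std::decay<decltype(
        std::declval<const Generator&>()(std::declval<int64_t>()))>::type;
    using difference_type = int64_t;
    using pointer = const value_type*;
    using reference = value_type;

    Iterator(const Generator* gen, int64_t index) : gen_(gen), index_(index) {}

    // The generator runs only when an element is actually read.
    value_type operator*() const { return (*gen_)(index_); }
    Iterator& operator++() { ++index_; return *this; }
    Iterator operator++(int) { Iterator old = *this; ++index_; return old; }
    bool operator==(const Iterator& other) const { return index_ == other.index_; }
    bool operator!=(const Iterator& other) const { return index_ != other.index_; }

   private:
    const Generator* gen_;  // owned by the range, which must outlive the iterator
    int64_t index_;
  };

  ToyLazyRange(Generator gen, int64_t length)
      : gen_(std::move(gen)), length_(length) {
    assert(length >= 0);
  }

  Iterator begin() const { return Iterator(&gen_, 0); }
  Iterator end() const { return Iterator(&gen_, length_); }

 private:
  Generator gen_;
  int64_t length_;
};

template <typename Generator>
ToyLazyRange<Generator> MakeToyLazyRange(Generator gen, int64_t length) {
  return ToyLazyRange<Generator>(std::move(gen), length);
}

// Widen int32 values to int64 without materializing a temporary converted array.
std::vector<int64_t> WidenLazily(const std::vector<int32_t>& source) {
  auto range = MakeToyLazyRange(
      [&source](int64_t i) {
        return static_cast<int64_t>(source[static_cast<std::size_t>(i)]);
      },
      static_cast<int64_t>(source.size()));
  return std::vector<int64_t>(range.begin(), range.end());
}
```

Constructing the range calls the generator zero times; each dereference calls it once. The explicit `static_cast` inside the lambda is also where a narrowing or widening conversion can be made visible to the compiler, so the consumer of the iterators never performs an implicit conversion.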
By the way, these were just nits in case @alendit wants to gold-plate it. I'm ok for this to go in :-)
Nice, glad it went through. Thx both of you for the help and patience :) One further question: for really small changes, like removing that one
You can open a PR but don't bother creating a JIRA. I'll merge once the build runs
… comments
Few changes we discussed in the PR #2239.
Author: Dimitri Vorona <vorona@in.tum.de>
Closes #2293 from alendit/zero-padding-addendum and squashes the following commits:
b5f2a8a <Dimitri Vorona> Revert "Fix type cast"
c0aed06 <Dimitri Vorona> Fix type cast
cb1e8dd <Dimitri Vorona> Remove the unneeded const qualifier and clarify the comments
See the JIRA discussion.