Skip to content

Conversation

@alendit
Copy link

@alendit alendit commented Jul 10, 2018

See the JIRA discussion.

@codecov-io
Copy link

codecov-io commented Jul 10, 2018

Codecov Report

Merging #2239 into master will increase coverage by 2.4%.
The diff coverage is 89.45%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #2239     +/-   ##
=========================================
+ Coverage   84.25%   86.66%   +2.4%     
=========================================
  Files         290      236     -54     
  Lines       44305    41724   -2581     
=========================================
- Hits        37330    36160   -1170     
+ Misses       6944     5564   -1380     
+ Partials       31        0     -31
Impacted Files Coverage Δ
cpp/src/arrow/util/io-util.h 100% <ø> (ø) ⬆️
cpp/src/arrow/adapters/orc/adapter.cc 0.27% <0%> (ø) ⬆️
cpp/src/arrow/io/hdfs.cc 0.33% <0%> (-0.01%) ⬇️
cpp/src/arrow/test-util.h 74.61% <100%> (-1.36%) ⬇️
cpp/src/arrow/util/lazy.h 100% <100%> (ø)
cpp/src/arrow/buffer-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/io/memory.cc 86.3% <100%> (+0.09%) ⬆️
cpp/src/arrow/util/hash.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/array-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 89.06% <100%> (ø) ⬆️
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8cd36a...95bae66. Read the comment docs.

@alendit alendit force-pushed the zero-padding branch 4 times, most recently from 54282ec to 372c5e2 Compare July 11, 2018 17:36
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! On the principle this looks sane. A couple comments below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this desirable? If an error occurs while building an array, it is normal not to finish the builder.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as mentioned in JIRA, it throws a lot of false positive. How about I set the warning level to ARROW_DEBUG, so it doesn't get printed by default, but still can be used to find problems, if the need arises?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there should be a warning at this point at all. What could emit a warning are the various builder methods that allow to bypass the Finish call (for example ArrayBuilder::null_bitmap).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about hiding this behind a flag like ARROW_LOG_UNFINISHED_BUILDERS. Having this output in debug builds always seems like too much to me

Copy link
Author

@alendit alendit Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to rework it and move the warning to null_bitmap and data methods. It seems to be doable, but I'd need some more time.

We can still then move it behind a special compile flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note null buffers are problematic, see PR #2243.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a review + comments there. I think we should try to merge this before merging #2243 so we can decide what to do there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a native English speaker, but "between padding" looks grammatically incorrect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in" instead of "between" sounds fine

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, comment editing fail, I'll fix it.

@pitrou pitrou changed the title [C++] Zero padding bytes in PoolBuffer ARROW-2822: [C++] Zero padding bytes in PoolBuffer Jul 12, 2018
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in" instead of "between" sounds fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about hiding this behind a flag like ARROW_LOG_UNFINISHED_BUILDERS. Having this output in debug builds always seems like too much to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for this change? All values in NullBuilder are null

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then I've misunderstood it's purpose. I thought it should be used as in python_to_arrow.cc.

I'll revert it and add some unit cases, you could look at.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto re: decision above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is setting unfinished here needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since unfinished signals that there were values appended after the last FinishInternal call, I would say yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is needed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are nulls appended to the values_builder_, so it could resized and contain uninitialized padding, if we never call finish on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should only be set when the builder is initialized

Copy link
Author

@alendit alendit Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was what I've tried at first. Then I hit the problem that sometimes the builders are initialized but never appended to and never finished, i.e. in DictionaryBuilder. So I changed the logic to "set to false on every append".

I'll see if I can change it back after the rework I've mentioned before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems error prone to have to remember to zero the padding. What would be the performance consequences (if any, realistically) of zeroing in Resize?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things may be repeatedly resized if you don't know the exact size up front (e.g. in BinaryBuilder).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in the event that you shrank the buffer in BinaryBuilder when Finish is called, you would want to zero the padding there, too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if you shrink the memory then the memory you padded before goes unused

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worst case is when you Reserve a huge amount of memory and then gradually Resize it, as the new elements come in. You would expect these operations to be virtually free while in reality you are zeroing everything between size and capacity on every call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. OK, let's leave things as they are, then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way that NullBuilder is being used in this module doesn't look right; probably related to the NullBuilder changes above cc @pcmoritz. It looks like a BooleanBuilder should be used instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the NullBuilder changes and switched nones_ to BooleanBuilder.

@pcmoritz could you check if 6753dcc is fine?

@alendit
Copy link
Author

alendit commented Jul 12, 2018

Hi guys. I'll try to incorporate the changes and rework the warning logic to be more precise in the comming days (hopefully tomorrow).

@alendit
Copy link
Author

alendit commented Jul 13, 2018

Ok, so there was a possibility I totally haven't considered before: we can just remove the data() and null_bitmap from the builders. Amazingly, it doesn't break much besides couple of tests and the DictionaryBuilder. I adjusted DictionaryBuilder and it seems to work like a charm. No is_finished_ shenanigans and no hacks to filter out false positives.

Does anyone know, if some 3rd party code relies on these methods? I would argue, they should be considered implementation detail, but we might want to put a deprecation warning before removing them alltogether.

@pitrou
Copy link
Member

pitrou commented Jul 13, 2018

Does anyone know, if some 3rd party code relies on these methods? I would argue, they should be considered implementation detail, but we might want to put a deprecation warning before removing them alltogether.

I would deprecate them first indeed.

@alendit
Copy link
Author

alendit commented Jul 16, 2018

Rebased on the master.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the deprecation may be problematic after all. Sorry for all the back-and-forth!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I think it's a bit problematic that we seem to be introducing a second data copy here (first from source to converted, then from converted to the builder). Can we avoid the second copy without using builder->data()? Otherwise we may have to undeprecate the latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused to see a stack allocation here and above. Is it not possible to use source in AppendValues?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we don't have any unit tests for this code path so have to be extra careful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "fixed-size" is the correct spelling. Perhaps a native English speaker can confirm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I googled a bit. "fixed size" seems more common, and reads OK to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the article here ("a resizeable buffer") shouldn't be removed.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to make all this more precise + clean things up. I left some comments / questions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused to see a stack allocation here and above. Is it not possible to use source in AppendValues?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stack allocation here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I googled a bit. "fixed size" seems more common, and reads OK to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a review + comments there. I think we should try to merge this before merging #2243 so we can decide what to do there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these lines equivalent (I thought so)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I changed it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is trivial, it can be omitted altogether?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcmoritz or @robertnishihara take a look to make sure these changes look right.

Should this be Append(false)? I would expect nones_ to be an all non-null boolean array

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union array expects a null_count in addition to the null_bitmap. Since the boolean array doesn't track the number of true/false values, we add a null here. We could of course count falses manually after finishing.

That's the reason why I though the NullBuilder behaviour which I had previously made sense: you could add both null and non-null values and then extract the null bitmap and the null count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@alendit
Copy link
Author

alendit commented Jul 18, 2018

So, what orc adapter did was to use std::copy to copy data from int64_t* input to a Int32 builder by specifying data() as a target. This works because std::copy takes iterators and they just need to conform to the interator interface. IThe builders, on the other hand, either accept a vector or a raw pointer, both have to have the exact expected type to work.

My solution is to overload the AppendValues method to accept iterators. Then we can just use std::copy internally. The number of existing overloads with all possible combination of pointer and vector arguments suggests that we might want to switch to iterator-accepting interfaces at some point in the future, since they are more general. In the ideal world we'd have ranges, but alas...

An additional problem was that orc adapter used data pointer to directly add a computed value (time in nanoseconds). What is easily done in python with generators, is not supported in any current C++ version directly (excluding non-standard extensions). I added the LazyRange, a thin wrapper around a callable (e.g. lambda) which returns objects conforming to the iterator interface. What looks like a complicated construct, can be optimized by the compiler very well (I added some benchmarks).

I really think we should remove/deprecate data() and null_bitmap() from the builders, but it's apparent, we'd need some interface restructuring. I think the changes would be beneficial in the long term, but I'm also not happy about the size of this diff. I would suggest that as long as the general direction is acceptable, we fix this PR's code and leave the more farreaching consideration to be fixed in further PRs.

@alendit
Copy link
Author

alendit commented Jul 18, 2018

Ok, so funny thing: MSVC14 (i.e. 2015) issues warning on narrowing conversion when using std::copy. The funny part is that none other compiler, not even later MSVC versions, seem to be bothered by it.

Which raises the question, if the orc adapter in its current form has even been through CI, because [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/adapters/orc/adapter.cc#L535] it totaly copies int64_t to int32_t and smaller. Of course, it's not a problem, since we check the supplied type beforehand, but it shouldn't compile with MSVC 2015.

With the supplied changes it's actually pretty easy to fix the warning: just wrap the iterator with a lambda which does a static_cast to the right type. Should I do it? Should be 0 costs when compiled on Release.

An alternative would be to silence the warning, since it's apparently absolutely non-standard.

EDIT: Later MSVC do raise the same warning. gcc and clang aren't bothered, even on -Wall -Wextra -pedantic.

@pitrou
Copy link
Member

pitrou commented Jul 18, 2018

@alendit can you confirm when the PR is ready for review again? Thanks for working on this!

@wesm
Copy link
Member

wesm commented Jul 18, 2018

I would suggest that as long as the general direction is acceptable, we fix this PR's code and leave the more farreaching consideration to be fixed in further PRs.

Sounds good to me

@alendit alendit force-pushed the zero-padding branch 3 times, most recently from 2811817 to 9ad72a4 Compare July 19, 2018 11:44
@wesm
Copy link
Member

wesm commented Jul 19, 2018

Reviewing this again now

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, wow thanks @alendit for going to the effort of adding iterator versions for all the AppendValues functions, definitely going above and beyond. This will be really helpful going forward.

I took a pretty careful read through this and it looks good to me. @pitrou do you want to look any more or do you mind if I go ahead and merge? #2243 will need to be rebased after but it shouldn't be too bad

target_type* target = reinterpret_cast<target_type*>(builder->data()->mutable_data());
auto cast_iter = internal::MakeLazyRange(
[&source](int64_t index) { return static_cast<target_type>(source[index]); },
length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool

int64_t bit_offset = length_ % 8;
uint8_t bitset = null_bitmap_data_[byte_offset];

for (auto iter = begin; iter != end; ++iter) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, would you expect the optimized code to be significantly different here vs. the original const uint8_t* valid_bytes version?

Copy link
Author

@alendit alendit Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With STL iterators and raw pointers it should be exactly the same. With more complicated iterators, it's hard to tell, though the LazyRange ones give the same performance as loops, at least on GCC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think returning a const value makes sense? Either return a value or a const reference;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are always C types I would think that just value_type is fine; I am not an expert on the fine details, though

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yupp, the first const should go.


BENCHMARK(BM_lazy_copy)->Repetitions(3)->Unit(benchmark::kMillisecond);

// std::copy with a lazy iterator which does static cast
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious: what do you call a lazy iterator? ISTM iterators are always lazy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe he's referring to the object returned by internal::MakeLazyRange

Copy link
Author

@alendit alendit Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the LazyRange is lazy about calling the supplied generator function instead of generating a new container eagerly like std::generate, for example.

With these benchmarks I wanted to make sure, that the compilers can inline the lambdas properly so we don't sacrifice performance compared to a loop.

EDIT: reading this again, maybe it should be a lazy container to avoid confusion.

@pitrou
Copy link
Member

pitrou commented Jul 19, 2018

By the way, these were just nits in case @alendit want to gold-plate it. I'm ok for this to go in :-)

@wesm
Copy link
Member

wesm commented Jul 19, 2018

👍 ok, I'm merging so we can get your other PR rebased. Thanks @alendit @pitrou for your careful attention on this issue

@wesm wesm closed this in aedba2c Jul 19, 2018
@alendit
Copy link
Author

alendit commented Jul 19, 2018

Nice, glad it went through. Thx both of you for the help and patience :)

Once further question: for really small changes, like removing that one const, do I open a JIRA issue, too? Or is there a simplified procedure, so to not to swamp the queue?

@wesm
Copy link
Member

wesm commented Jul 19, 2018

You can open a PR but don't bother creating a JIRA. I'll merge once the build runs

wesm pushed a commit that referenced this pull request Jul 19, 2018
… comments

Few changes we discussed in the PR #2239.

Author: Dimitri Vorona <vorona@in.tum.de>

Closes #2293 from alendit/zero-padding-addendum and squashes the following commits:

b5f2a8a <Dimitri Vorona> Revert "Fix type cast"
c0aed06 <Dimitri Vorona> Fix type cast
cb1e8dd <Dimitri Vorona> Remove the unneeded const qualifier and clarify the comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants