ARROW-94: [Format] Expand list example to clarify null vs empty list #58

Closed
wants to merge 8 commits into from

emkornfield
Contributor

WIP to make sure what I've done so far looks good. Per the discussion on the JIRA item, I started converting the example images to "text diagrams", but I wanted feedback on whether this is actually desirable (and whether the way I'm approaching it is desirable). The remaining diagrams are for unions, which I can convert if the existing changes look OK to others (although I think the Union diagrams are pretty reasonable/compact).

This change also includes some other minor cleanup, as well as including a statement about endianness per the discussion on the mailing list.

Rendered markdown can be seen at: https://github.com/emkornfield/arrow/blob/emk_doc_fixes_PR3/format/Layout.md


While a struct does not have physical storage for each of its semantic slots
(i.e. each scalar C-like struct), an entire struct slot can be set to null via
the null bitmap. Any of the child field arrays can have null values according
to their respective independent null bitmaps.
In the example above, the child arrays have valid entries for the null struct
but are 'hidden' from the consumer by the parent array's null bitmap.

Member

As I recall there have been some questions around whether the child arrays' bitmaps must necessarily be set to null if the parent struct slot is null. I think the answer is "no" (and, in fact, you could combine a set of immutable constructed-elsewhere arrays with a bitmap that you layer on top to "null out" those other values), so having to twiddle other bitmaps would be onerous and ultimately not that useful.

If you agree, perhaps we can also spell this out explicitly here.
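
The "layer a bitmap on top" idea can be sketched in a few lines of plain Python. This is only an illustration, not Arrow's implementation: `Int32Column`, `StructColumn`, and `get_bit` are hypothetical names, and the bitmap is assumed to use least-significant-bit-first ordering. The child columns are built once, carry no nulls of their own, and are never modified; only the parent's bitmap hides slots 1 and 2.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


def get_bit(bitmap: bytes, i: int) -> bool:
    """Return bit i of a bitmap (assumed least-significant-bit first)."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


@dataclass(frozen=True)
class Int32Column:
    """Hypothetical child array: values plus an optional validity bitmap."""
    values: Sequence[int]
    validity: Optional[bytes] = None  # None means "no nulls in this column"

    def get(self, i: int) -> Optional[int]:
        if self.validity is not None and not get_bit(self.validity, i):
            return None
        return self.values[i]


@dataclass(frozen=True)
class StructColumn:
    """Hypothetical struct array: child columns plus its own validity bitmap."""
    fields: Dict[str, Int32Column]
    validity: Optional[bytes] = None

    def get(self, i: int) -> Optional[dict]:
        # The parent bitmap is consulted first; children are never touched
        # for a null struct slot, so their bitmaps can say anything there.
        if self.validity is not None and not get_bit(self.validity, i):
            return None
        return {name: col.get(i) for name, col in self.fields.items()}


# Immutable children "constructed elsewhere", with no nulls of their own.
foo = Int32Column((1, 2, 3, 4))
bar = Int32Column((10, 20, 30, 40))

# Parent bitmap 0b00001001: slots 0 and 3 are valid, slots 1 and 2 are null.
masked = StructColumn({"foo": foo, "bar": bar}, validity=bytes([0b00001001]))

print(masked.get(0))  # {'foo': 1, 'bar': 10}
print(masked.get(1))  # None
print(foo.get(1))     # 2
```

Read through the struct, slot 1 is null; read directly from `foo`, the same slot still holds a valid 2, which is the "hidden but non-null" situation discussed above.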

Contributor Author

I do agree, and I thought I read this elsewhere in the spec, but it sounds like it is worth confirming on the mailing list?

Member

There is this statement already: "Any of the child field arrays can have null values according to their respective independent null bitmaps." I don't think it was ever a controversial point but rather just unclear -- feel free to raise it on the mailing list if you like, though.

Contributor Author

Rereading the spec, I agree. I will update the main text accordingly.

Member

wesm commented Apr 9, 2016

This looks great to me -- I think using text diagrams going forward will be more sustainable and clear to external readers, though hopefully we are close to finalizing some of the lingering format details!

emkornfield changed the title from "[WIP] ARROW-94: [Format] Expand list example to clarify null vs empty list" to "ARROW-94: [Format] Expand list example to clarify null vs empty list" on Apr 10, 2016
@@ -198,7 +308,8 @@ types (which can all be distinct), called its fields.
Typically the fields have names, but the names and their types are part of the
type metadata, not the physical memory layout.

-A struct does not have any additional allocated physical storage.
+A struct array does not have any additional allocated physical storage for its values.
+A struct array must still have an allocated null bitmap, if it has one or more null values.

I thought a struct array would always have a null bitmap, regardless of whether it has any null entries?

Member

It isn't required to have the memory allocated if there are no nulls, like the other array types.

Got it, thx. I'm not sure how these types are used. When creating an object of such a type, must it be known beforehand whether any nulls exist, by scanning the data to be put into it?

Member

Consider the following dataset:

data = [
  {foo: 1, bar: null},
  null,
  null,
  {foo: 0, bar: 5}
]       
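
One way to read this dataset against the question above: a builder can append rows one at a time and allocate a null bitmap only when the first null shows up, so no up-front scan is needed. The sketch below is an assumption-laden illustration, not Arrow's builder API: `NullableBuilder`, `Int32Builder`, and `StructBuilder` are hypothetical names, bitmaps are assumed to be least-significant-bit first, and a placeholder `0` is written into the children for null struct slots (one possible choice, since those slots are hidden by the parent bitmap anyway).

```python
from typing import Dict, List, Optional


class NullableBuilder:
    """Hypothetical append-only builder that allocates its bitmap lazily."""

    def __init__(self) -> None:
        self.length = 0
        self.validity: Optional[bytearray] = None  # allocated on first null

    def _append_validity(self, valid: bool) -> None:
        i = self.length
        if not valid and self.validity is None:
            # First null seen: allocate the bitmap and mark all earlier
            # slots as valid, since none of them were null.
            self.validity = bytearray((i + 7) // 8)
            for j in range(i):
                self.validity[j // 8] |= 1 << (j % 8)
        if self.validity is not None:
            if i // 8 >= len(self.validity):
                self.validity.append(0)  # grow by one byte when needed
            if valid:
                self.validity[i // 8] |= 1 << (i % 8)
        self.length += 1


class Int32Builder(NullableBuilder):
    def __init__(self) -> None:
        super().__init__()
        self.values: List[int] = []

    def append(self, value: Optional[int]) -> None:
        # A null slot still occupies space; 0 is an arbitrary placeholder.
        self.values.append(0 if value is None else value)
        self._append_validity(value is not None)


class StructBuilder(NullableBuilder):
    def __init__(self, field_names: List[str]) -> None:
        super().__init__()
        self.children: Dict[str, Int32Builder] = {
            name: Int32Builder() for name in field_names
        }

    def append(self, row: Optional[dict]) -> None:
        for name, child in self.children.items():
            # Children must stay as long as the parent. For a null struct we
            # write a valid placeholder; the parent bitmap hides it.
            child.append(0 if row is None else row[name])
        self._append_validity(row is not None)


data = [
    {"foo": 1, "bar": None},
    None,
    None,
    {"foo": 0, "bar": 5},
]

builder = StructBuilder(["foo", "bar"])
for row in data:
    builder.append(row)

print(bin(builder.validity[0]))                  # 0b1001
print(builder.children["foo"].validity)          # None
print(bin(builder.children["bar"].validity[0]))  # 0b1110
```

With this choice of placeholders, `foo` never sees a null of its own and so never allocates a bitmap, while the struct and `bar` each allocate one at their first null.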

Member

wesm commented Apr 13, 2016

Do you want to also delete the existing graphical diagrams as part of this patch?

@emkornfield
Contributor Author

Might as well. I just pushed another commit.

Member

wesm commented Apr 14, 2016

+1. Merging this. If there is something causing lingering controversy from this patch I encourage everyone to continue discussion on the mailing list so we can make further decisions there where relevant.

asfgit closed this in 37f7271 on Apr 14, 2016
emkornfield deleted the emk_doc_fixes_PR3 branch on April 19, 2016 07:36
wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
…related refactoring

This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build.

Author: Wes McKinney <wesm@apache.org>

Closes apache#58 from wesm/PARQUET-457 and squashes the following commits:

4bf12ed [Wes McKinney]
* SerializeThriftMsg now writes into an OutputStream.
* Refactor page serialization in advance of compression tests
* Test compression roundtrip on random bytes for snappy and gzip
* Trying LZO compression results in ParquetException
* Don't lose user's CMAKE_CXX_FLAGS
* Remove Travis CI directory caching for now
* Fix gzip memory leak if you do not call inflateEnd, deflateEnd
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Dec 22, 2021
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
GerHobbelt pushed a commit to GerHobbelt/arrow that referenced this pull request Oct 18, 2024
Add List input and output types for Gandiva functions. Add new reference implementations for array_contains and array_remove, tested via integration with Dremio. int32, int64, double and float list types have been tested.

Support List types in function specification and llvm code generation.
Pass back function type information through the expression registry.
See 1p here: https://docs.google.com/document/d/1exwXdUUnk5FqZLzVZyTdhqgwxTk0u9bL54aLVNM5Tas/edit
3 participants