Support nested types in ORC writer #3696

Merged: GaryShen2008 merged 13 commits into NVIDIA:branch-21.12 from the orc-write branch on Oct 13, 2021

@firestarman firestarman commented Sep 29, 2021

This fixes #3494.

It also adds support for lists.

Signed-off-by: Firestarman firestarmanllc@gmail.com

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman firestarman marked this pull request as draft September 29, 2021 06:12
Collaborator Author

firestarman commented Sep 29, 2021

This depends on rapidsai/cudf#9334, so I am marking it as a draft.

@firestarman firestarman added the cudf_dependency label (an issue or PR with this label depends on a new feature in cudf) Sep 29, 2021
@firestarman firestarman linked an issue Sep 29, 2021 that may be closed by this pull request
Signed-off-by: Firestarman <firestarmanllc@gmail.com>

/**
* (We could try to merge this with `parquetWriterOptionsFromSchema` after fixing the issue
* https://github.com/rapidsai/cudf/issues/7654)
Collaborator

This issue is circumvented by pruning the masks ourselves, and we already call remove_validity_if_needed from writeORCChunk in TableJni.cpp. So I don't see why this should prevent us from merging the two methods.

Collaborator Author

@firestarman firestarman Oct 8, 2021

Thanks for the info. But I tried it locally and still get the exception below when running the Parquet test test_write_map_nullable after changing nullable back to the actual setting.

E                   Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/liangcail/work/projects/on_github/cudf/cpp/src/io/parquet/writer_impl.cu:377: Mismatch in metadata prescribed nullability and input column nullability. Metadata for nullable input column cannot prescribe nullability = false
E                   	at ai.rapids.cudf.Table.writeParquetChunk(Native Method)
E                   	at ai.rapids.cudf.Table.access$300(Table.java:39)

Collaborator Author

This is probably a bug in the plugin. I am working on it.

Collaborator Author

Fixed it by using m.valueContainsNull instead of the parameter nullable when building options from MapType.
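
The exception quoted above comes from a consistency check between the nullability declared in the writer metadata and the nullability of the column actually passed in. The Python sketch below illustrates that rule and the fix described here. It is only a model of the behavior reported in this thread: `check_nullability`, `value_option_nullable`, and this `MapType` are hypothetical stand-ins, not the cudf or Spark APIs.

```python
from dataclasses import dataclass


@dataclass
class MapType:
    # Hypothetical stand-in for Spark's MapType; only the flag discussed here.
    value_contains_null: bool


def check_nullability(meta_nullable: bool, column_nullable: bool) -> None:
    # Models the writer's rule: a nullable input column may not be
    # described by metadata that prescribes nullability = false.
    if column_nullable and not meta_nullable:
        raise ValueError(
            "metadata prescribes nullability = false for a nullable input column"
        )


def value_option_nullable(m: MapType, parent_nullable: bool, fixed: bool) -> bool:
    # Before the fix the `nullable` parameter was reused for the map values;
    # after the fix the map's own value_contains_null flag is used.
    return m.value_contains_null if fixed else parent_nullable


m = MapType(value_contains_null=True)
# Fixed path: the metadata matches the nullable value column, so no error.
check_nullability(value_option_nullable(m, parent_nullable=False, fixed=True), True)
```

With `fixed=False` and a non-nullable parent, the same call raises, which corresponds to the mismatch reported in the stack trace.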

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Nested map is not supported yet.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman firestarman changed the title Support structs and lists in ORC writer Support nested types in ORC writer Oct 9, 2021
@firestarman firestarman marked this pull request as ready for review October 11, 2021 00:05
@firestarman firestarman marked this pull request as draft October 11, 2021 00:08
@firestarman firestarman marked this pull request as ready for review October 11, 2021 00:19
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman firestarman linked an issue Oct 11, 2021 that may be closed by this pull request
@firestarman
Collaborator Author

@razajafri Could you review it again?

@@ -21,7 +21,8 @@ import java.util.Optional
import scala.collection.mutable.ArrayBuffer
import scala.language.implicitConversions

-import ai.rapids.cudf.{ColumnView, Table}
+import ai.rapids.cudf._
Collaborator

I don't think importing everything from cudf is a good idea.

Collaborator Author

Yes, good suggestion, but this was done by the IDE. If you prefer, I can change it back.

Collaborator

Not really. Since it's a breaking change, let's merge it as soon as possible.

@@ -21,7 +21,8 @@ import java.util.Optional
import scala.collection.mutable.ArrayBuffer
import scala.language.implicitConversions

-import ai.rapids.cudf.{ColumnView, Table}
+import ai.rapids.cudf._
+import ai.rapids.cudf.ColumnWriterOptions._
Collaborator

I didn't find ColumnWriterOptions on the cudf side. Is there a pending PR?

Collaborator Author

@firestarman firestarman Oct 12, 2021

Yes, it depends on rapidsai/cudf#9334. I will merge that cudf PR once this PR is approved.

Collaborator

wbo4958 commented Oct 12, 2021

Overall, LGTM

@firestarman
Collaborator Author

I will trigger CI after rapidsai/cudf#9334 is merged.

@revans2 revans2 marked this pull request as draft October 12, 2021 11:44
Collaborator

revans2 commented Oct 12, 2021

Marked as draft just because rapidsai/cudf#9334 is not merged yet. When it is merged this can go back to ready.

Collaborator Author

firestarman commented Oct 12, 2021

> Marked as draft just because rapidsai/cudf#9334 is not merged yet. When it is merged this can go back to ready.

rapidsai/cudf#9334 is a breaking change for the plugin, so this PR should be merged as soon as possible after #9334 is merged. I plan to get an approval first, then merge the cudf PR and trigger CI, and then merge this PR once CI is done. Otherwise it will take longer to get this in.
Even with an approval, this still cannot be merged until CI passes.

Collaborator

@res-life res-life left a comment

LGTM

@firestarman
Collaborator Author

build

@firestarman firestarman marked this pull request as ready for review October 13, 2021 01:35
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build


orc_write_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(orc_write_basic_gens)])

# Some array/struct gens, but not all because of nesting
Collaborator

Mind expanding on this comment a bit? Why "not all because of nesting"?

Collaborator Author

Good suggestion. I will update it in a follow-up PR, since this PR should be merged as soon as possible.

Collaborator Author

I removed this comment.
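
For readers unfamiliar with the `orc_write_basic_struct_gen` line in the hunk above, the comprehension builds one `[name, generator]` pair per child. The sketch below shows only the naming pattern. The strings in `basic_gens` are hypothetical placeholders for the real data generators, and `StructGen` itself is not reproduced.

```python
# Placeholder strings stand in for the real data generators (assumption).
basic_gens = ["byte_gen", "int_gen", "string_gen"]

# Same comprehension shape as in the diff: one [name, generator] pair per child,
# named child0, child1, ... in the order of the input list.
struct_fields = [["child" + str(ind), sub_gen] for ind, sub_gen in enumerate(basic_gens)]

print(struct_fields)
```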

@@ -852,7 +852,8 @@ object GpuOverrides extends Logging {
(OrcFormatType, FileFormatChecks(
cudfRead = (TypeSig.commonCudfTypes + TypeSig.ARRAY + TypeSig.DECIMAL_64 +
TypeSig.STRUCT + TypeSig.MAP).nested(),
-cudfWrite = TypeSig.commonCudfTypes,
+cudfWrite = (TypeSig.commonCudfTypes + TypeSig.ARRAY +
Collaborator

@abellina abellina Oct 13, 2021

I also get confused by TypeSig. So we can read nested maps, but not write them? Or did I misunderstand? Also, is there a follow-on issue to handle the same types that the reader supports?

Collaborator Author

Yes, that is correct. We can read nested maps but cannot write them, which is a limitation of the cudf native ORC writer.
I will file an issue with cudf first. Once cudf supports it, we can update the TypeSig here.

Collaborator Author

I have now removed map support because of test failures that happen only in pre-merge builds.
Issue #3784 tracks this.
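
The read/write asymmetry discussed in this thread can be pictured as two allow-lists that are checked recursively against a column's type. The Python sketch below is only an illustration of that idea. `is_supported`, the tuple encoding of types, and both sets are hypothetical and do not reproduce the plugin's `TypeSig` API. The write set reflects the state at the end of this thread, where map support was removed.

```python
# Types are encoded as ("kind", child, child, ...); leaves have no children.
READ_KINDS = {"basic", "array", "struct", "map"}
WRITE_KINDS = {"basic", "array", "struct"}  # assumption: no map on the write side


def is_supported(dtype: tuple, allowed: set) -> bool:
    # A type is supported only if its own kind and every nested child kind
    # are in the allow-list (the "nested" check).
    kind, *children = dtype
    return kind in allowed and all(is_supported(c, allowed) for c in children)


array_of_struct = ("array", ("struct", ("basic",), ("basic",)))
struct_with_map = ("struct", ("map", ("basic",), ("basic",)))

print(is_supported(array_of_struct, WRITE_KINDS))  # True
print(is_supported(struct_with_map, READ_KINDS))   # True
print(is_supported(struct_with_map, WRITE_KINDS))  # False
```

A struct that contains a map passes the read check but fails the write check, which is the situation the reviewer asked about.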

builder.withDecimalColumn(name, dt.precision, nullable)
case TimestampType =>
builder.withTimestampColumn(name, writeInt96, nullable)
case s: StructType =>
Collaborator

The previous version of this code for struct and array defaulted both to nullable (and had a comment that is missing in your change).

// we are setting this to nullable, in case the parent is a Map's key and wants to
// set this to false

Why don't these columns need to be set nullable?

Collaborator Author

We discussed this above in response to Zara's comments.
I moved similar comments to the beginning of the method writerOptionsFromSchema.

Previously these were set to nullable, with that comment, because nullable was hard-coded to true due to the issue rapidsai/cudf#7654.
That issue has since been circumvented by PR rapidsai/cudf#9061, so we can keep them nullable, but the comment is no longer needed.

Collaborator Author

@firestarman firestarman Oct 13, 2021

You can see the structBuilder and listBuilder are still using nullable for parent columns, while for child columns, we should use containsNull for array and valueContainsNull for map to tell whether the children are nullable.
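
The rule in this comment can be sketched as a recursive walk over the schema. The Python below is a hedged illustration only. The dataclasses imitate Spark's `StructType`, `ArrayType`, and `MapType`, and `column_options` is a hypothetical stand-in for the plugin's builder logic that returns plain dictionaries rather than cudf writer options. The child names `element`, `key`, and `value` are also assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Basic:
    name: str = "int"


@dataclass
class ArrayType:
    element: object
    contains_null: bool


@dataclass
class MapType:
    key: object
    value: object
    value_contains_null: bool


@dataclass
class StructType:
    # Each field is (name, type, nullable).
    fields: List[Tuple[str, object, bool]] = field(default_factory=list)


def column_options(name, dtype, nullable):
    # The parent column always takes the `nullable` flag passed in.
    opts = {"name": name, "nullable": nullable, "children": []}
    if isinstance(dtype, StructType):
        # Each struct field carries its own nullable flag.
        opts["children"] = [column_options(n, t, nl) for n, t, nl in dtype.fields]
    elif isinstance(dtype, ArrayType):
        # Array elements use contains_null, not the parent's flag.
        opts["children"] = [column_options("element", dtype.element, dtype.contains_null)]
    elif isinstance(dtype, MapType):
        # Spark map keys cannot be null; values use value_contains_null.
        opts["children"] = [
            column_options("key", dtype.key, False),
            column_options("value", dtype.value, dtype.value_contains_null),
        ]
    return opts


schema = ArrayType(element=Basic(), contains_null=True)
print(column_options("a", schema, nullable=False))
```

Here a non-nullable array column still produces a nullable child, because the child's flag comes from `contains_null` rather than from the parent.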

@firestarman
Collaborator Author

build

Collaborator Author

firestarman commented Oct 13, 2021

I cannot tell from the log which test failed, so I am rebuilding.
My local build for 320 passed.

@firestarman
Collaborator Author

build

GaryShen2008 previously approved these changes Oct 13, 2021
Collaborator

@GaryShen2008 GaryShen2008 left a comment

Approved first to unblock the CI.

Try to fix a build error in premerge builds, which run tests in parallel.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Because map tests failed in premerge builds where tests run in parallel.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Collaborator Author

firestarman commented Oct 13, 2021

The tests that write maps to ORC always fail in premerge builds, so I removed map support. The failure cannot be reproduced locally. We will add map support back in the future.

Collaborator Author

firestarman commented Oct 13, 2021

@revans2 @razajafri
I am going to merge this to unblock CI.
If there are any concerns, I will address them in follow-up PRs.

Collaborator

@pxLi pxLi left a comment

Approving to unblock CI. Please file follow-up issues later.

@firestarman
Collaborator Author

> approve to unblock CI. Please file following issues later

Thanks. The test failure is tracked by #3784.

@GaryShen2008 GaryShen2008 merged commit 7d73931 into NVIDIA:branch-21.12 Oct 13, 2021
@firestarman firestarman deleted the orc-write branch October 13, 2021 10:20
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf
Development

Successfully merging this pull request may close these issues.

[FEA] Support structs in ORC writer
8 participants