Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-27805][Connectors/ORC] bump orc version to 1.7.5 #19844

Closed
wants to merge 4 commits into from

Conversation

liujiawinds
Copy link
Contributor

@liujiawinds liujiawinds commented May 30, 2022

What is the purpose of the change

In order to use new features (zstd compression, column encryption etc.) in 1.6.x and 1.7.x.

Release Notes

Brief change log

  • Update orc.version to 1.7.5
  • Clone a new version of PhysicalFsWriter for files to create a PhysicalWriterImpl for streams
  • Enable encryption & mask configuration.
  • Extract encryption setup methods from WriterImpl for PhysicalWriterImpl.
  • Unify the column names used in the test case to match the column names in the file.

Verifying this change

This change added tests and can be verified as follows:

  • Read and write ORC file with ZSTD compression
  • Read and write ORC file with encryption &mask key configuration

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented May 30, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@liujiawinds
Copy link
Contributor Author

@JingsongLi Could you help review this pr?

@liujiawinds
Copy link
Contributor Author

I have submitted two pr to ORC community.
ORC-1200: Extracting encryption setup logic from WriterImpl
ORC-1198: Add a new PhysicalFsWriter constructor with FSDataOutputStream parameter

I will refactor this part of the code after they are merged.

@liujiawinds liujiawinds changed the title [FLINK-27805][Connectors/ORC] bump orc version to 1.7.2 [FLINK-27805][Connectors/ORC] bump orc version to 1.7.5 Jun 20, 2022
@liujiawinds
Copy link
Contributor Author

@lirui-apache Please take a look.

@dongjoon-hyun
Copy link
Member

For a record, to reviewers, ORC-1198 is already shipped via Apache ORC 1.7.5 and included in this PR.

@dongjoon-hyun
Copy link
Member

Could you review this when you have some time, @mbalassi and @gyfora ? :)

@gyfora
Copy link
Contributor

gyfora commented Jul 21, 2022

Sure, we will take a look :)

* This class is designed to not close the underlying flink stream to avoid exceptions when
* checkpointing.
*/
public class HadoopNoCloseStream extends FSDataOutputStream {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit scary, @gyfora?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujiawinds could you please clarify why we need the HadoopNoCloseStream here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@morhidi Because BulkWriter needs to rely on a flink FSDataOutputStream, and ORC writer uses hadoop FSDataOutputStream. So I wrapped flink FSDataOutputStream.
Additionally, BulkWriter closes the underlying ORC writer stream at checkpoint, which will cause flink to throw a ClosedChannelException if the close action is passed to flink FSDataOutputStream.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: Is this covered in a test case somewhere?

Copy link

@morhidi morhidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR contains reasonable changes, added some minor comments

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To Apache Flink community, @mbalassi , @gyfora , @morhidi and the author, @liujiawinds of this PR.

  • As you see, this patch's encryption part is not a part of official Apache ORC.
  • ORC-1200 is not accepted by Apache ORC community yet and not reviewed properly. So, Apache ORC community doesn't provide any backward compatibility for this encryption part and still reserves all rights to change in the future. Please consider this patch as some 3rd party approach to hack those part.

Personally, I'd like to recommend you to remove Encryption part from this PR completely.

@MartijnVisser
Copy link
Contributor

Personally, I'd like to recommend you to remove Encryption part from this PR completely.

Since the encryption is not part of the official Apache ORC, +1 for removing this from the PR

@dongjoon-hyun
Copy link
Member

Also, cc @williamhyun since he works as a release manager of Apache ORC 1.8.0.

@liujiawinds
Copy link
Contributor Author

@dongjoon-hyun @MartijnVisser Encryption part has been removed from this PR.

@@ -48,14 +49,30 @@ public class OrcBulkWriterTest {

@Test
public void testOrcBulkWriter() throws Exception {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using @ParameterizedTest @EnumSource(CompressionKind.class) would be more elegant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dongjoon-hyun
Copy link
Member

Hi, could you review this once more, @mbalassi , @gyfora , @morhidi , @MartijnVisser ?

@pgaref
Copy link
Contributor

pgaref commented Mar 13, 2023

@liujiawinds are you still working on this one? Happy to take over if its up for grabs

@liujiawinds
Copy link
Contributor Author

@pgaref Feel free to take over this.

@dongjoon-hyun
Copy link
Member

Thank you, @pgaref and @liujiawinds .

@dmvk
Copy link
Member

dmvk commented May 3, 2023

Surpassed by #22481

@dmvk dmvk closed this May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants