[FLINK-32585] Use flink-connector-pulsar shade pattern #23003

Closed
wants to merge 3 commits into from

tisonkun
Member

The commit apache/flink-connector-pulsar@c78689c changes the relocation strategy for flink-connector-pulsar to try to resolve a class loading issue. As it doesn't introduce a regression, I'd prefer to apply the new shade pattern to all the deps there for consistency.

If we had a suppression file for these checks, this would go more smoothly, but I'm OK with cross-repo updates as long as the traffic is temporary and we eventually converge.

cc @zentol

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Signed-off-by: tison <wander4096@gmail.com>
@flinkbot
Collaborator

flinkbot commented Jul 16, 2023

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

Signed-off-by: tison <wander4096@gmail.com>
Contributor

@zentol zentol left a comment

I'm a bit wary about modifying existing entries; wouldn't that prevent the pulsar 2.x connector version from working against 1.18?

@zentol zentol self-assigned this Jul 17, 2023
@tisonkun
Member Author

@zentol IIUC LicenseCheck works for CI only and this is a false positive. As long as previous connector versions pin a previous ci-tool version, it should be fine.

But I've noticed that changing the relocation pattern may not fix the actual class-not-found issue; if that proves true, I'll close this PR.

@tisonkun tisonkun marked this pull request as draft July 17, 2023 10:29
@zentol
Contributor

zentol commented Jul 17, 2023

But I've noticed that changing the relocation pattern may not fix the actual class-not-found issue; if that proves true, I'll close this PR.

Can you link some ticket or context for the problem?

@tisonkun
Member Author

@zentol this is the related PR - apache/flink-connector-pulsar#54

I'm filing a ticket now.

It seems that without this new shade pattern the tests always fail. Following your suggestion above, could listing both the old and the new shade patterns be a solution?

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun marked this pull request as ready for review July 17, 2023 13:27
@zentol
Contributor

zentol commented Jul 17, 2023

I've noticed that changing the relocation pattern may not fix the actual class-not-found issue,

It seems that without this new shade pattern the tests always fail.

I'd love to get a summary of what works and what doesn't with what changes to the shade-plugin / ci-tools; I can't wrap my head around the current state.

The PR description states that the original PR failed with a CNFE for a relocated netty class, using the original pulsar relocation.
You then tried to change the relocation pattern (why?), which required changes to ci-tools; what's still unclear to me is whether it fixed the issue or not.

If some class is missing from pulsar then another round of relocations shouldn't make a difference unless you also bundle another netty version and relocate that as well.
Double-check whether the final jar actually contains the relocated netty.
In fact, double-check that the original pulsar jars actually contain the relocated netty.
Let's make sure we aren't trying to build on a rotten foundation here (== packaging issues on the pulsar side).

@tisonkun
Member Author

a rotten foundation

Unfortunately, this packaging issue is too complicated to explain in a few words.

Double-check whether the final jar actually contains the relocated netty.

I checked the bundled 'flink-sql-connector-pulsar' JAR, and it contains the relocated netty classes:

unzip -t flink-connector-pulsar/flink-sql-connector-pulsar/target/flink-sql-connector-pulsar-4.0-SNAPSHOT.jar | grep ChannelMatchers 
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$InstanceMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$ClassMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$1.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$CompositeMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$InvertMatcher.class   OK

... and so does the pulsar-client-all 3.0.0 JAR:

unzip -t ~/.m2/repository/org/apache/pulsar/pulsar-client-all/3.0.0/pulsar-client-all-3.0.0.jar | grep ChannelMatchers
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$InstanceMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$ClassMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$1.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$CompositeMatcher.class   OK
    testing: org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$InvertMatcher.class   OK
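
The same check can be scripted so that it also reports any copy of the class that is not under the relocated prefix. This is only a minimal sketch: the helper names and the tiny in-memory demo JAR are made up for illustration, and only the `org/apache/pulsar/shade/` prefix and the `ChannelMatchers` class name come from the listings above.

```python
import io
import zipfile


def outer_class(entry):
    """Map 'a/b/Foo$Bar.class' to 'Foo' (the outer class's simple name)."""
    base = entry.rsplit("/", 1)[-1]
    if base.endswith(".class"):
        base = base[: -len(".class")]
    return base.split("$", 1)[0]


def find_classes(jar, simple_name):
    """List .class entries in `jar` (path or file object) for a class and its nested classes."""
    with zipfile.ZipFile(jar) as zf:
        return sorted(
            e for e in zf.namelist()
            if e.endswith(".class") and outer_class(e) == simple_name
        )


def split_by_prefix(entries, shaded_prefix):
    """Separate entries under the relocated prefix from any other copies."""
    shaded = [e for e in entries if e.startswith(shaded_prefix)]
    other = [e for e in entries if not e.startswith(shaded_prefix)]
    return shaded, other


def build_demo_jar():
    """Hypothetical stand-in for a real JAR so the sketch runs without a build."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in (
            "org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers.class",
            "org/apache/pulsar/shade/io/netty/channel/group/ChannelMatchers$1.class",
            "org/apache/pulsar/client/api/PulsarClient.class",
        ):
            zf.writestr(name, b"")
    return buf


if __name__ == "__main__":
    entries = find_classes(build_demo_jar(), "ChannelMatchers")
    shaded, other = split_by_prefix(entries, "org/apache/pulsar/shade/")
    print("relocated copies:", shaded)
    print("other copies:", other)
```

To run it against a real artifact, pass the JAR path to `find_classes` instead of `build_demo_jar()`. A non-empty "other copies" list would mean the JAR bundles the class under more than one package.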

You then tried to change the relocation pattern (why?)

I don't know, actually. If a class loading issue happens, it can be a name resolution issue. Then I found it strange that we shade the "flink-connector-pulsar" classes as "org.apache.pulsar.shade", which can conflict with the classes bundled by Pulsar. Since the connector uses its own classes, it should have used its own shade pattern from the beginning.

Now the result is that without the new relocation, the tests always fail. With the change, they pass: apache/flink-connector-pulsar#54

The proposed shade pattern change isn't a hack but what should have been done from the start (IMO), and it doesn't cause regressions.
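
The naming concern can be illustrated with a simplified model of prefix relocation. This is a sketch under stated assumptions, not the shade plugin's actual matching logic: `relocate` and `colliding_names` are illustrative helpers, and `HYPOTHETICAL_CONNECTOR_PREFIX` is a made-up value, since the thread does not state the connector's new pattern. Only `org.apache.pulsar.shade` is taken from the discussion.

```python
PULSAR_PREFIX = "org.apache.pulsar.shade.io.netty"
# Hypothetical: the connector's own pattern is not given in this thread.
HYPOTHETICAL_CONNECTOR_PREFIX = "org.apache.flink.connector.pulsar.shaded.io.netty"


def relocate(class_name, pattern, shaded_pattern):
    """Rewrite `pattern` or `pattern.*` to the shaded prefix; leave other names alone.

    Simplified model: matching on a package boundary is an assumption here.
    """
    if class_name == pattern or class_name.startswith(pattern + "."):
        return shaded_pattern + class_name[len(pattern):]
    return class_name


def colliding_names(first, second):
    """Fully qualified names that both artifacts would define."""
    return sorted(set(first) & set(second))


netty_classes = ["io.netty.channel.group.ChannelMatchers"]

# Classes as already relocated inside the Pulsar client JAR.
from_pulsar = [relocate(c, "io.netty", PULSAR_PREFIX) for c in netty_classes]

# Connector relocating its own netty copy into the same prefix: names overlap.
same_prefix = [relocate(c, "io.netty", PULSAR_PREFIX) for c in netty_classes]

# Connector relocating into its own (hypothetical) prefix: names stay distinct.
own_prefix = [relocate(c, "io.netty", HYPOTHETICAL_CONNECTOR_PREFIX) for c in netty_classes]

if __name__ == "__main__":
    print("same prefix collisions:", colliding_names(from_pulsar, same_prefix))
    print("own prefix collisions:", colliding_names(from_pulsar, own_prefix))
```

Overlapping names are not necessarily a failure on their own; the sketch only shows why a separate prefix removes the ambiguity about which copy gets loaded.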

@tisonkun
Member Author

I hope the upgrade doesn't cause anything else to change, or that I can at least understand why the tests failed.

But the tests pass locally:

image

So it may be a CI environment issue 🤣

@tisonkun
Member Author

It seems the failure is caused by other exceptions rather than a CNFE. I'll check the details.

It's strange that it's related to the change, but if the solution that would have been correct from the start resolves the issue, it isn't a hack that leaves us with debt.

@tisonkun
Member Author

tisonkun commented Jul 17, 2023

From another angle: if the shade pattern change lets us avoid the issue, but we cannot update LicenseChecker and therefore change the shade pattern only partially (like apache/flink-connector-pulsar@2cb31bf, where changing only the shaded io.netty pattern "fixes" the current state), that's a hack to me.

@tisonkun
Member Author

The original patch seems to have failed for a different reason, an OOM; perhaps a retry helps. I don't know whether we can raise the memory limit, or whether the Pulsar connector simply consumes too much memory.

@tisonkun
Member Author

Closing...

Although I'd prefer to use the connector's own shade pattern instead of Pulsar's even without the CI failure, I can spend time investigating the other issue now.

@tisonkun tisonkun closed this Jul 17, 2023
@tisonkun tisonkun reopened this Jul 17, 2023
@tisonkun
Member Author

tisonkun commented Jul 17, 2023

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either Flink Master requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'jobmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The Flink Master has to be shutdown...

The OOM is about class loading. The backtrace is:

[JobManager] STDOUT: 2023-07-17 12:09:55,039 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
[JobManager] STDOUT: java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either Flink Master requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'jobmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The Flink Master has to be shutdown...
[JobManager] STDOUT: 2023-07-17 12:09:55,039 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
[JobManager] STDOUT: java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either Flink Master requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'jobmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The Flink Master has to be shutdown...
[JobManager] STDOUT: 	at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist-1.17.0.jar:1.17.0]
[JobManager] STDOUT: 	at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:51) ~[flink-dist-1.17.0.jar:1.17.0]
[JobManager] STDOUT: 	at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_342]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:86) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1015) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.channel.AbstractChannel.write(AbstractChannel.java:301) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsQueryContext.writeQuery(DnsQueryContext.java:178) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsQueryContext.sendQuery(DnsQueryContext.java:141) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsQueryContext.query(DnsQueryContext.java:136) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsNameResolver.query0(DnsNameResolver.java:1322) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:450) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:1154) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsResolveContext.internalResolve(DnsResolveContext.java:362) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsResolveContext.resolve(DnsResolveContext.java:215) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsNameResolver.resolveNow(DnsNameResolver.java:1208) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsNameResolver.doResolveAllUncached0(DnsNameResolver.java:1194) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsNameResolver.access$500(DnsNameResolver.java:93) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.resolver.dns.DnsNameResolver$7.run(DnsNameResolver.java:1142) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]
[JobManager] STDOUT: 	at org.apache.pulsar.shade.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[blob_p-b27adc8f726f7be55998719e732c5f4ecfae67b5-de2cfa0e85f194197c05016b94980ff1:4.0-SNAPSHOT]

... so at least relocating doesn't help.
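
To see which code triggered the class load that ran out of Metaspace, one option is to strip the log prefix and look for the first frame outside the JDK and Flink's class loaders. This is a rough sketch: the function names and the regular expression are illustrative, and the sample lines are copied from the trace above with the trailing JAR identifiers shortened.

```python
import re

# Matches lines such as "[JobManager] STDOUT: \tat pkg.Class.method(File.java:1) ..."
FRAME = re.compile(r"^\[JobManager\] STDOUT:\s+at (?P<method>[\w.$<>]+)\((?P<location>[^)]*)\)")

SAMPLE = [
    "[JobManager] STDOUT: java.lang.OutOfMemoryError: Metaspace.",
    "[JobManager] STDOUT: \tat java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_342]",
    "[JobManager] STDOUT: \tat org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist-1.17.0.jar:1.17.0]",
    "[JobManager] STDOUT: \tat java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_342]",
    "[JobManager] STDOUT: \tat org.apache.pulsar.shade.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:86) ~[blob:4.0-SNAPSHOT]",
]


def parse_frames(lines):
    """Return the fully qualified method of every stack frame line, in order."""
    frames = []
    for line in lines:
        match = FRAME.match(line)
        if match:
            frames.append(match.group("method"))
    return frames


def first_frame_outside(frames, prefixes):
    """First frame whose name starts with none of `prefixes`, or None."""
    for frame in frames:
        if not frame.startswith(tuple(prefixes)):
            return frame
    return None


if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    print(first_frame_outside(frames, ("java.", "org.apache.flink.")))
```

On the sample lines, the first such frame is in the shaded netty `MessageToMessageEncoder.write`, so the load was requested from netty code under the `org.apache.pulsar.shade` prefix.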
