Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Reject illformed GBK Coders #21935

Open
pskevin opened this issue Jun 17, 2022 · 6 comments
Open

[Bug]: Reject illformed GBK Coders #21935

pskevin opened this issue Jun 17, 2022 · 6 comments

Comments

@pskevin
Copy link
Contributor

pskevin commented Jun 17, 2022

What happened?

p.apply(
        Create.of(KV.of("X", "Y"))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())))
        .apply(GroupByKey.create())
        .setCoder(
            SerializableCoder.of(
                TypeDescriptors.kvs(
                    TypeDescriptors.strings(),
                    TypeDescriptors.iterables(TypeDescriptors.strings()))));

The pipeline above exemplifies how it is, at the moment, acceptable to author pipelines with opaque input/output GBK coders whose semantics clearly violate that of Beam's programming model.
Pipelines of this kind should be rejected by SDKs during pipeline creation time itself.

Issue Priority

Priority: 1

Issue Component

Component: sdk-java-core

@manuzhang
Copy link
Contributor

I'd suggest to initiate a discussion on dev@beam.apache.org for wider audience.

@kennknowles
Copy link
Member

I think this is probably P2 only because most users are not going in and manually overriding coders for primitive operations. Totally agree with the issue though.

@apilloud
Copy link
Member

apilloud commented Aug 5, 2022

This also affects the python SDK and appears to be much easier to do there with type hints.

lukecwik added a commit to lukecwik/incubator-beam that referenced this issue Aug 11, 2022
lukecwik added a commit that referenced this issue Aug 17, 2022
MarcoRob pushed a commit to MarcoRob/beam that referenced this issue Sep 5, 2022
@RustedBones
Copy link
Contributor

Recent changes on java SDK for coder validation are affecting scio: spotify/scio#4549

In scio, we wrap coders into a thin layer able to reference user code in the exception stack trace upon failure.
Here, as you are checking strict equality, users MUST use beam's KVCoder and IterableCoder which makes things difficult for customization

@kennknowles
Copy link
Member

Ah, interesting. The reason for it is that a runner needs to be able to understand the format in order to handle the element for GBK and state operations, and is expected to output from GBK in a standard format. Is there a way to have the wrapper in the Java layer but move the check to the portability layer perhaps?

@lukecwik
Copy link
Member

The error seen in SCIO was:

input: KV(MaterializedCoder(StringUtf8Coder), MaterializedCoder(VarIntCoder)) expected output KV(MaterializedCoder(StringUtf8Coder), Iterable(MaterializedCoder(VarIntCoder)) actual output KV(MaterializedCoder(StringUtf8Coder), MaterializedCoder(Iterable(VarIntCoder))

It seems as though there is logic already in SCIO to handle the case where the input/output is a KvCoder and is not wrapped with a MaterializedCoder. I suggested on the SCIO issue that the logic be extended for the value coder on the output PCollection to wrap the Iterable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants