Broken metadata values encoding #56
Comments
Thanks for the report @slavirok, I'll look into it.
Can you paste your MirrorMaker configs? I'll set up a repro. I wonder if this may be IoT Hub sending in a specific encoding.
Yup - if you have an EH producer adding headers to AMQP messages (e.g. IoT Hub enriching messages), the message headers will be AMQP-encoded when read by a Kafka consumer. EH is encoding-agnostic; we just pass around bytes. I'll still do a repro on my own, but you should try doing an AMQP decoding on the headers.
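For illustration, a minimal Java sketch of what this looks like on the Kafka consumer side (the bootstrap server, group id, topic name, and a Kafka client >= 2.0 are assumptions for the example, not details from this issue): each header value arrives as raw AMQP-encoded bytes, so a string-valued header begins with an AMQP type constructor byte such as 0xA1 (str8-utf8) or 0xB1 (str32-utf8) rather than plain UTF-8 text.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class DumpAmqpHeaders {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "header-dump");                // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("mirrored-topic"));  // placeholder topic
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                for (Header header : record.headers()) {
                    byte[] raw = header.value();
                    if (raw == null || raw.length == 0) {
                        continue;
                    }
                    // The leading byte is the AMQP type constructor, e.g. 0xA1
                    // (str8-utf8) or 0xB1 (str32-utf8) for string-valued headers.
                    System.out.printf("%s -> first byte 0x%02X (%d bytes total)%n",
                            header.key(), raw[0] & 0xFF, raw.length);
                }
            }
        }
    }
}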
Producer:
Consumer:
@arerlend, thanks for the reply.
By any chance, do you have an example of what AMQP decoding of the headers looks like?
Sorry about taking a while to respond, that is a good question. Here's the spec: http://docs.oasis-open.org/amqp/core/v1.0/amqp-core-types-v1.0.xml - but I don't know if that's the most helpful answer.
The .NET AMQP library (https://www.nuget.org/packages/Microsoft.Azure.Amqp/) supports it. This is one example.
If you use Java, take a look at the proton-j package (https://mvnrepository.com/artifact/org.apache.qpid/proton-j). Using the DecoderImpl class, you should be able to read an object from a buffer.
I see, thanks Xin. Link to the docs for the proton-j decoder class: https://qpid.apache.org/releases/qpid-proton-j-0.33.1/api/index.html
Thanks for your help, @arerlend, @xinchen10. In case someone is interested, this is Java code to decode Kafka header values using the proton-j library.
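A rough sketch of that approach with proton-j's DecoderImpl (the class and helper names, and the assumption that each Kafka header value holds exactly one complete AMQP-encoded value, are illustrative rather than the poster's original snippet):

import java.nio.ByteBuffer;
import org.apache.qpid.proton.codec.AMQPDefinedTypes;
import org.apache.qpid.proton.codec.DecoderImpl;
import org.apache.qpid.proton.codec.EncoderImpl;

public final class AmqpHeaderDecoder {

    // Turns the raw bytes of one Kafka header value back into a Java object
    // (String, Boolean, a numeric type, ...), assuming the bytes hold a single
    // complete AMQP-encoded value.
    public static Object decodeHeader(byte[] rawHeaderValue) {
        DecoderImpl decoder = new DecoderImpl();
        EncoderImpl encoder = new EncoderImpl(decoder);
        AMQPDefinedTypes.registerAllTypes(decoder, encoder);

        decoder.setByteBuffer(ByteBuffer.wrap(rawHeaderValue));
        return decoder.readObject();
    }

    public static void main(String[] args) {
        // 0xA1 = str8-utf8 constructor, 0x05 = length in octets, then "hello" in UTF-8.
        byte[] encoded = new byte[] { (byte) 0xA1, 0x05, 'h', 'e', 'l', 'l', 'o' };
        System.out.println(decodeHeader(encoded)); // prints: hello
    }
}

In a consumer loop, decodeHeader can be applied to each header.value() before the value is forwarded or logged.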
Does anyone know of a way to decode these values in a Python UDF? We have a use case where we can't run the Scala example.
Have you looked into the pamqp package? https://github.com/gmr/pamqp
If someone happens to want to decode AMQP-encoded strings in Spark (e.g. IoT Hub headers), I found that this worked (as long as you know the values are strings, since all it does is ditch the typing information in the first bytes):

from pyspark.sql import functions as F

def decode_amqp_str(val_col):
    """AMQP adds a few bytes of type metadata; we ditch them and parse the rest as a string."""
    # 2147483647: https://stackoverflow.com/questions/57867088/pyspark-substr-without-length
    return F.substring(val_col, 3, 2147483647).cast("string")
Here are some slightly more robust ones:

from pyspark.sql.functions import udf

# Requires Python 3.10+ for structural pattern matching (match/case).

@udf("string")
def string_from_amqp(val: bytes):
    # See https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-types-v1.0-os.html#type-string
    if val is None:
        return None
    match val[0]:
        # An AMQP null value
        case 0x40:
            return None
        # An AMQP string up to 2^8 - 1 octets worth of UTF-8 Unicode (with no byte order mark)
        case 0xa1:
            return val[2:].decode('UTF-8', 'strict')
        # An AMQP string up to 2^32 - 1 octets worth of UTF-8 Unicode (with no byte order mark)
        case 0xb1:
            return val[5:].decode('UTF-8', 'strict')
    # TODO: maybe this should fail here instead
    return val.hex()

@udf("boolean")
def bool_from_amqp(val: bytes):
    # See https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-types-v1.0-os.html#type-boolean
    if val is None:
        return None
    match val[0]:
        # An AMQP boolean encoded as a constructor octet (0x56) followed by a value octet
        case 0x56:
            if val[1] == 0x01:
                return True
            return False
        # The fixed-width encodings for true and false
        case 0x41:
            return True
        case 0x42:
            return False
    # TODO: maybe this should fail here instead
    return None
Description
We have been using MirrorMaker to copy data from EventHub to Kafka for a while. Everything has worked well so far, except for one thing.
When consuming messages from Kafka, we noticed that the header values have a strange encoding. Please see the screenshot below.
P.S. I skipped the checklist because I don't think it would bring any value.