Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] Fix handling of bytes #124

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

brighton1101
Copy link
Contributor

Closes #79

One of our operators has an optional field that requires bytes. If that field is populated, it will error out when it gets formatted. This should fix that.

@lepe92 lepe92 requested a review from vchiapaikeo July 22, 2021 16:43
@vchiapaikeo
Copy link
Contributor

@brighton1101 , won't the leading b cause issues with decoding?

@vchiapaikeo
Copy link
Contributor

vchiapaikeo commented Jul 22, 2021

I think this will turn it into a string and we don't have to worry about bytes after that?

https://github.com/etsy/boundary-layer/pull/125/files

Copy link
Contributor

@vchiapaikeo vchiapaikeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should resolve but really appreciate the help here!

https://github.com/etsy/boundary-layer/pull/125/files

@brighton1101
Copy link
Contributor Author

@vchiapaikeo I think that hack is alright, but the leading b means that it is a byte literal. It has nothing to do with encoding/decoding by design, since the contents are only bytes. Python docs.

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

Afaik Calling str on a byte literal just outputs the byte literal string like below:

>>> tm = "™".encode("utf-8")
>>> print(tm)
b'\xe2\x84\xa2'
>>> print(str(tm)) # here is what we'd be doing
b'\xe2\x84\xa2'
>>> str(tm)
"b'\\xe2\\x84\\xa2'"
>>> tm.decode("utf-8")
'™'

That operator is expecting bytes. I feel like it is a regression for boundary-layer to pass the incorrect parameter for a type just because it currently works.

@brighton1101
Copy link
Contributor Author

Also - for context this likely wasn't in boundary layer before because Python 2.x did not support byte literals. Since we don't support python 2 anymore, we don't have to worry about this.

@vchiapaikeo
Copy link
Contributor

I think the problem here is that if you call the str function around a bytes type, it will fail to b64 decode later on. Example:

>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1}'.encode('utf-8'))
>>> 
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> str(encoded_bytes)
"b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'"
>>> 
>>> # Now try to decode this
... 
>>> 
>>> base64.b64decode(str(encoded_bytes))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Alternatively:

>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1, "dataflow_service_account": "dataflow-dev-pii@etsy-hadoop-sandbox-dev.iam.gserviceaccount.com", "bq_project": "etsy-data-warehouse-prod"}'.encode('utf-8'))
>>> 
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> encoded_bytes.decode('utf-8')
'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> decoded_bytes = encoded_bytes.decode('utf-8')
>>> 
>>> # This will decode properly
... 
>>> base64.b64decode(decoded_bytes)
b'{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1}'

@vchiapaikeo
Copy link
Contributor

vchiapaikeo commented Jul 22, 2021

Ah I think your code snippet only works this way in Py2 and not Py3

>>> tm = "™".encode("utf-8")
>>> print(tm)
b'\xe2\x84\xa2'
>>> print(str(tm)) # here is what we'd be doing
b'\xe2\x84\xa2'
>>> str(tm)
"b'\\xe2\\x84\\xa2'"
>>> tm.decode("utf-8")
'™'

@@ -120,7 +120,7 @@ def format_value(value):

return '{{ {items} }}'.format(items=','.join(pairs))

if isinstance(value, (int, float, type(None))):
if isinstance(value, (int, float, type(None), bytes)):
Copy link
Contributor

@vchiapaikeo vchiapaikeo Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support bytes, we should just return the bytes and not pass it through the str function. e.g.,

if isinstance(value, bytes):
    return value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vchiapaikeo I considered that, however since this function is returning exclusively strs apart from this, I wanted to keep the return type uniform. I understand that now with jinja this is not causing problems. However, what if someone decorated this method call with more logic thinking they were getting a str and instead got bytes? Maybe this is not a problem and i am overthinking

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GCP Pubsub Publish Operator / Preprocessor Output - Bytes not supported generating DAGs
2 participants