[CI] linux://doc:source/data/doc_code/loading_data is failing/flaky on master. #35586

Closed
ArturNiederfahrenhorst opened this issue May 22, 2023 · 4 comments · Fixed by #35638

@ArturNiederfahrenhorst
Contributor

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc:source/data/doc_code/loading_data-END
....

@ArturNiederfahrenhorst ArturNiederfahrenhorst added release-blocker P0 Issue that blocks the release flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ labels May 22, 2023
@ArturNiederfahrenhorst
Contributor Author

ArturNiederfahrenhorst commented May 22, 2023

[Image: Screenshot 2023-05-21 at 22 23 56]

@ericl This issue appears to have been introduced by #35419.
It also exists on the release branch, where it was cherry-picked in #35525.

@zhe-thoughts
Collaborator

@ericl is OOO. @raulchen @amogkam could you help take a look? Thanks

@raulchen raulchen self-assigned this May 22, 2023
@raulchen
Contributor

raulchen commented May 22, 2023

I'll take a look.
@amogkam is already looking at this.

@amogkam
Contributor

amogkam commented May 22, 2023

This is the issue: apache/arrow#26470

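The recommendation there seems to be to use object dtype for variable-length binary data (i.e. any output from read_binary_files). The session below shows the difference: the plain list round-trips intact, while the ndarray version is cut off at the first zero byte.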

In [4]: regular_list = {"bytes": [b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x00\xf2IDATx\x9cc`\x18x`<\xff\xef|#\x1cr\x
   ...: 06\xef\xfe\xfc\xf9\xf3\x16\xbb\x9c\xd9\xe3\xbf\x7f\xde\xbf\xfcc\xc9\x86)\xc5e\xf3\xe0\xcf\xdf?\xa7B\xfe\xfc\xad\xc2\x94\\\x0c4\xf2\xef\x9f?\t{\xff.\xc7t\xcb\xbb\xbf\x7f\xf7\
   ...: x15\xfd}\xa2\xef\xff\x7f\x056\xb7l\xe6\xf1\xae\x14e`\xf8\xfb\x19\xcd\xc1jK\xff\xbe\xbc\x10\x02a\xff\xfd\xb3\x14E\x8e}\xd3\x9f\x0f\xee\xc220\xc9\xc3(\x92\x96\x7f\xfe\xd8\xc39
   ...: \xe8\x92\xc7\xfe\xeeCp\xfe\xff=\x82,\xe7\xf3\xedO\x01\x82\xf7\xf7\xcf\x14d\xc9\xd0?\xcf$\xe1\xd6\xb7\xff\xdd\xc5\x83*y\x1f.\xd7\xfc\xe7\xa1;\x03\xaa\xe4D(\xcb`\xe9\x9f\xb5\x
   ...: 0c\xa8 \xec\xefC\x08\xa3\xe8\xdd\xdfEhr@\x9d?'\x19\xc8\x86nz\xf8\xf7\xfer\x0bL\xc9?\x7f\x9e^\x07\x12G\x9a\xd0\xa5\x18\x18d\x8e\x83#\xe4\xe5DL) \x90l\x00J\xf6\xaab\x95\xa3+\x
   ...: 00\x00\xac\xbeyx\x8en\x844\x00\x00\x00\x00IEND\xaeB`\x82"]}

In [5]: ndarray = {"bytes": np.array([b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x00\xf2IDATx\x9cc`\x18x`<\xff\xef|#\x1
   ...: cr\x06\xef\xfe\xfc\xf9\xf3\x16\xbb\x9c\xd9\xe3\xbf\x7f\xde\xbf\xfcc\xc9\x86)\xc5e\xf3\xe0\xcf\xdf?\xa7B\xfe\xfc\xad\xc2\x94\\\x0c4\xf2\xef\x9f?\t{\xff.\xc7t\xcb\xbb\xbf\x7f\
   ...: xf7\x15\xfd}\xa2\xef\xff\x7f\x056\xb7l\xe6\xf1\xae\x14e`\xf8\xfb\x19\xcd\xc1jK\xff\xbe\xbc\x10\x02a\xff\xfd\xb3\x14E\x8e}\xd3\x9f\x0f\xee\xc220\xc9\xc3(\x92\x96\x7f\xfe\xd8\
   ...: xc39\xe8\x92\xc7\xfe\xeeCp\xfe\xff=\x82,\xe7\xf3\xedO\x01\x82\xf7\xf7\xcf\x14d\xc9\xd0?\xcf$\xe1\xd6\xb7\xff\xdd\xc5\x83*y\x1f.\xd7\xfc\xe7\xa1;\x03\xaa\xe4D(\xcb`\xe9\x9f\x
   ...: b5\x0c\xa8 \xec\xefC\x08\xa3\xe8\xdd\xdfEhr@\x9d?'\x19\xc8\x86nz\xf8\xf7\xfer\x0bL\xc9?\x7f\x9e^\x07\x12G\x9a\xd0\xa5\x18\x18d\x8e\x83#\xe4\xe5DL) \x90l\x00J\xf6\xaab\x95\xa
   ...: 3+\x00\x00\xac\xbeyx\x8en\x844\x00\x00\x00\x00IEND\xaeB`\x82"])}
In [10]: pyarrow.Table.from_pydict(regular_list)
Out[10]:
pyarrow.Table
bytes: binary
----
bytes: [[89504E470D0A1A0A0000000D494844520000001C0000001C080000000057668048000000F249444154789C63601878603CFFEF7C231C7206EFFEFCF9F316BB9CD9E3BF7FDEBFFC63C98629C565F3E0CFDF3FA742FEFCADC2945C0C34F2EF9F3F097BFF2EC774CBBBBF7FF715FD7DA2EFFF7F0536B76CE6F1AE146560F8FB19CDC16A4BFFBEBC100261FFFDB314458E7DD39F0FEEC23230C9C32892967FFED8C339E892C7FEEE4370FEFF3D822CE7F3ED4F0182F7F7CF1464C9D03FCF24E1D6B7FFDDC5832A791F2ED7FCE7A13B03AAE44428CB60E99FB50CA820ECEF4308A3E8DDDF456872409D3F2719C8866E7AF8F7FE720B4CC93F7F9E5E0712479AD0A51818648E8323E4E5444C2920906C004AF6AA6295A32B0000ACBE79788E6E84340000000049454E44AE426082]]
In [14]: pyarrow.Table.from_pydict(ndarray)
Out[14]:
pyarrow.Table
bytes: binary
----
bytes: [[89504E470D0A1A0A]]

amogkam added a commit that referenced this issue May 24, 2023
Closes #35586
See #35586 (comment)

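NumPy stores byte strings in a fixed-width, zero-padded dtype, so variable-length byte data ends up being treated as zero-terminated. If the byte string itself contains zero bytes, data is lost: in the example from the linked comment, everything from the first zero byte onward was dropped during conversion to Arrow.

Instead, per the recommendation in apache/arrow#26470, variable-length bytes should be stored as Python objects (object dtype).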

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
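
The zero-padding described in the commit message above can be seen with NumPy alone. This is a small sketch with made-up two-element data, assuming only that `numpy` is installed. Note that NumPy on its own strips only trailing zero bytes when an element is read back; the truncation at the first interior zero byte reported in this issue showed up during conversion to Arrow.

```python
import numpy as np

# Made-up sample data: one value with an interior zero byte,
# one value that ends in a zero byte.
with_interior_zero = b"ab\x00cd"
with_trailing_zero = b"ab\x00"

# Fixed-width bytes dtype: every element is padded with zero bytes
# up to the length of the longest element (5 bytes here).
fixed = np.array([with_interior_zero, with_trailing_zero])
print(fixed.dtype.itemsize)  # 5
print(fixed.tobytes())       # raw buffer, including the zero padding

# Reading elements back drops trailing zero bytes, because padding and
# real trailing zeros are indistinguishable in this representation.
print(fixed.tolist())        # [b'ab\x00cd', b'ab']

# Object dtype stores references to the original bytes objects instead,
# so nothing is padded or stripped.
as_objects = np.array([with_interior_zero, with_trailing_zero], dtype=object)
print(as_objects.tolist())   # [b'ab\x00cd', b'ab\x00']
```

Because the fixed-width layout cannot tell padding from data, any consumer that reads it as zero-terminated will lose bytes, which is why object dtype is the safer choice for arbitrary binary payloads.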
amogkam added a commit to amogkam/ray that referenced this issue May 24, 2023
…oject#35638)

amogkam added a commit that referenced this issue May 24, 2023
#35703)

scv119 pushed a commit to scv119/ray that referenced this issue Jun 16, 2023
…oject#35638)

arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
…oject#35638)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>