Adding big endian support in XLA code base #5

kun-lu20 · 2022-08-10T18:09:55Z

kun-lu20
Aug 10, 2022

Several XLA-related test cases in the TensorFlow repo fail on s390x (a big-endian architecture) because of endianness issues, for example //tensorflow/compiler/xla/tests:bitcast_convert_test_cpu.

From my understanding, the root cause is that some tensor buffers may contain endian-specific data when TensorFlow/XLA code performs model input/output or serialization/deserialization. If endianness isn't taken into consideration, this code may run correctly on LE platforms but fail on BE platforms.

For this test case, the Bitcast operation cannot be performed on BE platforms the same way as on LE platforms (i.e., by a direct memory copy).
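As a minimal illustration of why a direct memory copy is not portable here (a Python sketch, not XLA's actual implementation; the helper name bitcast_u32_to_u8 is hypothetical), consider reinterpreting one 32-bit value as four 8-bit elements. The same value yields a different element order depending on the byte order of the buffer it is copied from:

```python
import struct


def bitcast_u32_to_u8(value, byteorder):
    """Reinterpret one unsigned 32-bit integer as four 8-bit elements.

    Hypothetical helper: it mimics a raw memory copy from a buffer that
    stores the value in the given byte order ("little" or "big").
    """
    fmt = "<I" if byteorder == "little" else ">I"
    return list(struct.pack(fmt, value))


# The element order of the result depends on the byte order of the source.
print(bitcast_u32_to_u8(0x01020304, "little"))  # [4, 3, 2, 1]
print(bitcast_u32_to_u8(0x01020304, "big"))     # [1, 2, 3, 4]
```

A test that hard-codes the little-endian element order as its expected result would therefore pass on x86 and fail on s390x unless the operation or the test accounts for byte order.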

PR tensorflow/tensorflow#57067 has been raised to fix this test case on BE machines. However, a holistic solution for adding big-endian support to the XLA code base would solve similar issues once and for all.

Is it possible to add a plan for solving the endianness issue in the XLA code base? Any suggestions or ideas would be greatly appreciated. Thanks!

joker-eph · 2022-08-10T20:26:45Z

joker-eph
Aug 10, 2022
Collaborator

Supporting big endian seems valuable, thanks for working on this. As I commented on tensorflow/tensorflow#57060, I would mostly be wary of sprinkling endianness-handling code everywhere.

That is, I would look to identify where endianness matters and build abstractions around it. For example, if this is only about how we access the data for constant buffers/arrays, we should work on the class abstracting access to the data to make it portable.
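One way to picture such an abstraction is sketched below in Python. This is only an assumption about what a portable accessor could look like; PortableBuffer and its methods are hypothetical names, not an existing TensorFlow/XLA class. The idea is that the stored bytes always use one fixed byte order (little-endian here) and all reads and writes go through the class, so callers never touch raw memory:

```python
import struct
from typing import Sequence


class PortableBuffer:
    """Hypothetical accessor that keeps element bytes little-endian on every host."""

    # Element type name -> struct format character (illustrative subset).
    _CODES = {"int32": "i", "uint16": "H", "float32": "f"}

    def __init__(self, raw: bytes, dtype: str):
        self._code = self._CODES[dtype]
        self._itemsize = struct.calcsize("<" + self._code)
        if len(raw) % self._itemsize != 0:
            raise ValueError("buffer length is not a multiple of the element size")
        self.raw = bytes(raw)

    @classmethod
    def from_values(cls, values: Sequence, dtype: str) -> "PortableBuffer":
        # "<" forces little-endian packing regardless of the host byte order.
        code = cls._CODES[dtype]
        return cls(struct.pack("<%d%s" % (len(values), code), *values), dtype)

    def __len__(self) -> int:
        return len(self.raw) // self._itemsize

    def get(self, index: int):
        # Decode one element explicitly as little-endian.
        return struct.unpack_from("<" + self._code, self.raw, index * self._itemsize)[0]


buf = PortableBuffer.from_values([1, 2, 3], "int32")
print(buf.get(1), buf.raw[:4])
```

With this shape, the endianness handling lives in one place instead of being repeated at every call site that reads a constant.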

3 replies

kun-lu20 Aug 10, 2022
Author

Thanks @joker-eph ! It would be great if the endianness issue could be addressed via a holistic solution in TensorFlow/XLA code. Yes, to my knowledge most of the test case failures are related to constant tensor buffers. Please keep us updated regarding your findings.

joker-eph Aug 10, 2022
Collaborator

To be clear: I don't plan to investigate further, I was giving a general direction for whoever is interested in looking into solving this!

kun-lu20 Aug 10, 2022
Author

@joker-eph Got it. Thanks for your valuable advice! That really makes sense.

sherhut · 2022-08-11T19:11:22Z

sherhut
Aug 11, 2022

@kun-lu20 as you already looked into some of the issues, would it be possible to provide a list of the kinds of issues you have seen? That would be a great first step to look into more general solutions for groups of issues and avoid many small one-off fixes.

1 reply

kun-lu20 Aug 11, 2022
Author

Yes, sure, I'll sort out a list of known issues and post it here.

Thanks @sherhut !

kun-lu20 · 2022-08-15T15:42:50Z

kun-lu20
Aug 15, 2022
Author

Hi All,

The following is a list of endianness-related scenarios in TF/XLA code that we've seen so far:

In the TensorFlow core module, when TensorFlow models are saved/loaded across LE/BE systems, the tensor data in constant buffers/arrays should be adjusted according to endianness. We have contributed PRs such as "Fix Const op tensor_content on s390x during save/load" (tensorflow/tensorflow#45339) and "Fix swapping of tensor_content when loading a SavedModel on s390x" (tensorflow/tensorflow#50152) to address this issue. They swap the data from BE to LE format before saving models on BE systems, and from LE to BE after loading them.
In the TensorFlow Lite module, when serialized models are moved across architectures with different endianness, incorrectly accessing the buffers field in the FlatBuffers causes endianness issues. This was discussed in "TF Lite issue when loading a saved TF Lite model on platforms with different endianness" (tensorflow/tensorflow#45009). The TF community might have opened an internal issue to track it. The quantization operation can also cause endianness issues in FlatBuffers, as seen in "Fix endianness issue in un-quantized quant_model on s390x" (tensorflow/tensorflow#57065).
In XLA code and the TensorFlow compiler module, we also encountered endianness issues in test cases related to certain operations implemented in XLA backends, such as the Bitcast op, as seen in "Fix endianness issue in BitcastConvert operation of HloEvaluator on s390x" (tensorflow/tensorflow#57067).
In some test cases (such as the TensorFlow Tools doctest), LE output is used in the expected results, as seen in "io_ops.py docstrings for serialize_tensor method generates different output on s390x architecture" (tensorflow/tensorflow#56937). We've also observed that tensorflow/lite/python:lite_test uses a pre-generated LE model (ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18.tar.gz), and reshaping fails on BE systems.
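The byte swapping described in the first item can be sketched as follows. This is a simplified Python illustration, not the code from the PRs above; byteswap_tensor_content is a hypothetical helper, and it assumes the content is a flat sequence of fixed-width elements whose size in bytes is known:

```python
def byteswap_tensor_content(content: bytes, itemsize: int) -> bytes:
    """Reverse the bytes of every fixed-width element in a raw tensor buffer.

    Hypothetical helper: converts BE element data to LE (or back), since
    swapping is its own inverse.
    """
    if itemsize <= 0 or len(content) % itemsize != 0:
        raise ValueError("content length must be a multiple of itemsize")
    swapped = bytearray(len(content))
    for offset in range(0, len(content), itemsize):
        # Reverse one element at a time; element order itself is unchanged.
        swapped[offset:offset + itemsize] = content[offset:offset + itemsize][::-1]
    return bytes(swapped)


# Two 32-bit elements stored big-endian, converted to little-endian.
be_content = bytes([0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x02])
le_content = byteswap_tensor_content(be_content, 4)
print(le_content.hex())  # 0100000002000000
```

Single-byte element types need no swap (itemsize 1 leaves the buffer unchanged), which is why the element width has to be known before any conversion is applied.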

Any additions or corrections to this list are highly appreciated. I hope it can be a good starting point for generic big-endian support in the TF/XLA code base.

@sherhut

2 replies

sherhut Aug 22, 2022

Thanks for the overview. For the constant loading issues, it would be nicer if we had some general policy and support for handling big endian when loading/storing constants in the various formats that are used. Otherwise, I am worried that we will have special casing through tests without fixing the underlying issue.

For the XLA hlo evaluator, I agree that a local fix is needed.

Is there an s390x CI running already that ensures this does not regress? Do you plan to have one?

Also, what is the longer-term plan for the XLA compiler itself? Do you intend to make the existing LLVM-based code generation work, or do you have your own backend?

kun-lu20 Dec 5, 2022
Author

Hi @joker-eph , @sherhut ,

Hope all is well. Thanks very much for your valuable suggestions in this discussion and for your support in reviewing our previous PRs.

Regarding the generic solution for adding big-endian support, I've recently raised 3 PRs in the TensorFlow repo:

PR "Add big-endian support to TFLite FlatBuffers" (tensorflow/tensorflow#58494) fixes the TFLite FlatBuffers endianness issue.
PRs "Fix the endianness issue in v1 frozen graphs in python:lite_test on BE machines" (tensorflow/tensorflow#58601) and "Fix graphdef2mlir:const-values.pbtxt test failure on s390x" (tensorflow/tensorflow#58769) refactor and improve the previous byte-swapping code (in C++ and Python) for saving/loading TF v2 saved_model, and add support for v1 frozen graphs.

These PRs aim to provide a generic solution for the endianness issue in the TF code base. I've tested these code changes on s390x: with TF v2.9.1, all the endianness-related test failures in the lite module and most of the failures in the compiler module are fixed. I also verified the effectiveness of these changes on the master branch.

After applying these PRs, we can avoid small one-off fixes such as tensorflow/tensorflow#57060 or tensorflow/tensorflow#57065.

Please take a look at the above 3 PRs when you have some time. Thanks again!

kun-lu20 · 2022-08-22T16:39:24Z

kun-lu20
Aug 22, 2022
Author

Thanks @sherhut !

Currently we have an s390x CI running nightly builds on the TF master branch (http://ibmz-ci.osuosl.org/job/TensorFlow_IBMZ_CI/). We've run the regression tests on the fix for the XLA hlo evaluator and no regression was found.

Regarding the longer-term plan, I think the SystemZ LLVM backend works on s390x, but it needs more work to add support for FP16/F16-related operations. We've filed an issue here: llvm/llvm-project#50374

0 replies

cheshire · 2022-08-24T17:34:46Z

cheshire
Aug 24, 2022

@kun-lu20 Could you provide more information on how far your CI is from being able to pass all XLA tests? Looking at https://ibmz-ci.osuosl.org/job/TensorFlow_IBMZ_CI/, the last two builds are green, but I presume not all tests are passing?

0 replies

kun-lu20 · 2022-08-24T18:11:23Z

kun-lu20
Aug 24, 2022
Author

Thanks @cheshire . Currently our s390x CI is not running the tests. We ran the test suite from the TensorFlow code base locally and some XLA tests in the compiler module failed.

Some of the failed tests pass when -c opt --copt=-O is passed to the bazel test command; we think the cause lies in issue llvm/llvm-project#50374.

Some test cases failed due to the endianness issue, such as the failure in tensorflow/tensorflow#57067.

0 replies

cheshire · 2022-08-24T18:27:16Z

cheshire
Aug 24, 2022

Some test cases failed due to the endianness issue

Would it be possible to know how many? And how many PRs would you anticipate to fix them?

2 replies

kun-lu20 Aug 24, 2022
Author

So far we have encountered only one test case failure in the XLA category that is due to the endianness issue (//tensorflow/compiler/xla/tests:bitcast_convert_test_cpu). PR tensorflow/tensorflow#57067 was raised to address it. The other test case failures are mainly in the mlir_quantization category.

kun-lu20 Aug 24, 2022
Author

We'll let you know if more test case failures in XLA category are detected on s390x.
