
[XLA:CPU] [oneDNN] Enable Dot op (MatMul) in BF16 Type #8402

Conversation

mahmoud-abuzaina
Contributor

This PR adds BF16 support to the oneDNN MatMul op by allowing the Dot op to keep the BF16 type until it is handled by the OneDnnMatMulRewriter pass.
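For context, here is a minimal JAX sketch (my illustration, not code from this PR) of the kind of computation the change targets on XLA:CPU: a matmul whose operands and result are bf16, which, as I read the description, can now stay in bf16 until the OneDnnMatMulRewriter pass handles it rather than being upcast to f32 earlier in the pipeline.

```python
# Hypothetical illustration, not code from this PR: a bf16 matmul that lowers
# to an XLA Dot op. On an XLA:CPU build with oneDNN enabled, this is the kind
# of Dot the OneDnnMatMulRewriter pass is expected to pick up in bf16.
import jax
import jax.numpy as jnp

@jax.jit
def matmul(a, b):
    return jnp.dot(a, b)  # lowers to an XLA Dot (MatMul) op

a = jnp.ones((128, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 64), dtype=jnp.bfloat16)
out = matmul(a, b)
print(out.dtype)  # bfloat16
```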

@github-actions github-actions bot added the kokoro:force-run Forces CI to rerun label Jan 11, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Forces CI to rerun label Jan 11, 2024
@golechwierowicz golechwierowicz requested review from ezhulenev and d0k and removed request for ezhulenev January 12, 2024 09:02
@penpornk penpornk added ready to pull PR ready for merge process kokoro:force-run Forces CI to rerun labels Jan 15, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Forces CI to rerun label Jan 15, 2024
copybara-service bot pushed a commit that referenced this pull request Jan 16, 2024
Imported from GitHub PR #8402

This PR adds BF16 support to the oneDNN MatMul op by allowing the Dot op to keep the BF16 type until it is handled by the OneDnnMatMulRewriter pass.
Copybara import of the project:

--
4f7ddbc by Mahmoud Abuzaina <mahmoud.abuzaina@intel.com>:

Enable MatMul op in BF16

Merging this change closes #8402

FUTURE_COPYBARA_INTEGRATE_REVIEW=#8402 from Intel-tensorflow:mabuzain/enable-bf16-matmul 4f7ddbc
PiperOrigin-RevId: 598823232
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 19, 2024
…recision

Imported from GitHub PR openxla/xla#10687

Several weeks ago, a change enabled the "simplify-fp-conversions" pass in cpu_compiler.cc unconditionally for Intel CPUs:

[PR-8402](openxla/xla#8402) - [XLA:CPU] [oneDNN] Enable Dot op (MatMul) in BF16 Type

I noticed the following issue when the "simplify-fp-conversions" pass in cpu_compiler.cc is enabled unconditionally.

My model uses bf16 operators (e.g. convolution). I want to JIT-compile and run it on CPU while preserving intermediate bf16 accuracy.

The CPU compiler runs the `float-normalization-bf16` pass, which converts a bf16 convolution to f32_convolution + convert_to_bf16 + convert_to_f32 (because a typical CPU does not support bf16 computation).

The CPU compiler (on Xeon) also runs the `simplify-fp-conversions` pass, which simplifies `f32_convolution + convert_to_bf16 + convert_to_f32` to just `f32_convolution`.

As a result, the whole model is computed in f32 precision internally, and the conversion to bf16 happens only at the very end.

In some cases we want to execute a bf16 model on CPU but get results whose accuracy is close to what bf16 hardware would produce.

To control the accuracy we can use the debug option `xla_allow_excess_precision`. It defaults to true, so the `simplify-fp-conversions` pass is enabled.

If we need to emulate bf16 computation on an Intel CPU, we can set `XLA_FLAGS="--xla_allow_excess_precision=false"`; in that case `simplify-fp-conversions` is not added to the cpu_compiler pipeline, f32 op results are converted back to bf16 immediately, and bf16 accuracy is preserved internally.

[gpu_compiler.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/gpu_compiler.cc#L1359) already enables the `SimplifyFPConversions` pass only if `debug_options.xla_allow_excess_precision()` is true.
Copybara import of the project:

--
796dc83ef34455e53b83c02dc68cd6d71306e654 by Alexander Pivovarov <pivovaa@amazon.com>:

[CPU] Add SimplifyFPConversions only if xla_allow_excess_precision

Merging this change closes #10687

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#10687 from apivovarov:fix_cpu_SimplifyFPConversions 796dc83ef34455e53b83c02dc68cd6d71306e654
PiperOrigin-RevId: 617252815
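As a usage sketch of the flag discussed in the commit message above (my addition, assuming a JAX front end; the flag name comes from the commit message, the surrounding script is illustrative), this is how one might disable excess precision so the XLA:CPU pipeline keeps intermediate results rounded to bf16:

```python
# Illustrative sketch: set XLA_FLAGS before the XLA CPU backend is initialized
# so that the simplify-fp-conversions pass is not added to the cpu_compiler
# pipeline and f32 results of normalized bf16 ops are rounded back to bf16.
import os
os.environ["XLA_FLAGS"] = "--xla_allow_excess_precision=false"

import jax
import jax.numpy as jnp

@jax.jit
def f(x, w):
    y = jnp.dot(x, w)   # bf16 dot; may be normalized to f32 + converts on CPU
    return jnp.tanh(y)  # with excess precision disabled, y is rounded to bf16 first

x = jnp.ones((8, 16), dtype=jnp.bfloat16)
w = jnp.ones((16, 4), dtype=jnp.bfloat16)
print(f(x, w).dtype)  # bfloat16
```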