XLA not fusing reduce for CPU tensor with > 32 elements #16792

JeffGreen · 2024-09-04T15:08:09Z

Hello!

Following up on jax-ml/jax#23427: XLA on CPU does not fuse the cosine into the sum reduction for a very simple function once the input tensor has more than 32 elements:

import jax
import jax.numpy as jnp

@jax.jit
def func_a(input_tensor):
    return jnp.sum(jnp.cos(input_tensor))

n_elem = 32  # change to 33 to reproduce the unfused case

print("************LOWERED**********")
print(jax.jit(func_a).lower(jnp.ones(n_elem, dtype=jnp.float32)).as_text())

print("************COMPILED**********")
print(jax.jit(func_a).lower(jnp.ones(n_elem, dtype=jnp.float32)).compile().as_text())

If I run this with n_elem = 32 and then again with n_elem = 33, I get the same lowered StableHLO (apart from the tensor shape), but different compiled output. In the case of the tensor with length 32, I see loop fusion:

HloModule jit_func_a, is_scheduled=true, entry_computation_layout={(f32[32]{0})->f32[]}, allow_spmd_sharding_propagation_to_parameters={true}, allow_spmd_sharding_propagation_to_output={true}

%region_0.2 (Arg_0.3: f32[], Arg_1.4: f32[]) -> f32[] {
  %Arg_0.3 = f32[] parameter(0), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  %Arg_1.4 = f32[] parameter(1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  ROOT %add.5 = f32[] add(f32[] %Arg_0.3, f32[] %Arg_1.4), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

%fused_computation (param_0.2: f32[32]) -> f32[] {
  %param_0.2 = f32[32]{0} parameter(0)
  %cosine.1 = f32[32]{0} cosine(f32[32]{0} %param_0.2), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/cos" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
  %constant.1 = f32[] constant(0)
  ROOT %reduce.1 = f32[] reduce(f32[32]{0} %cosine.1, f32[] %constant.1), dimensions={0}, to_apply=%region_0.2, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

ENTRY %main.12 (Arg_0.1: f32[32]) -> f32[] {
  %Arg_0.1 = f32[32]{0} parameter(0), metadata={op_name="input_tensor"}
  ROOT %fusion = f32[] fusion(f32[32]{0} %Arg_0.1), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

In the case of the tensor with length 33, the fusion goes away and I'm seeing two passes over the data (one for the cosine, then another for the reduction, which is itself split into a reduce-window followed by a reduce):

HloModule jit_func_a, is_scheduled=true, entry_computation_layout={(f32[33]{0})->f32[]}, allow_spmd_sharding_propagation_to_parameters={true}, allow_spmd_sharding_propagation_to_output={true}

%region_0.2 (Arg_0.3: f32[], Arg_1.4: f32[]) -> f32[] {
  %Arg_0.3 = f32[] parameter(0), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  %Arg_1.4 = f32[] parameter(1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  ROOT %add.5 = f32[] add(f32[] %Arg_0.3, f32[] %Arg_1.4), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

ENTRY %main.12 (Arg_0.1: f32[33]) -> f32[] {
  %Arg_0.1 = f32[33]{0} parameter(0), metadata={op_name="input_tensor"}
  %cosine.0 = f32[33]{0} cosine(f32[33]{0} %Arg_0.1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/cos" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
  %constant.0 = f32[] constant(0)
  %reduce-window = f32[2]{0} reduce-window(f32[33]{0} %cosine.0, f32[] %constant.0), window={size=32 stride=32 pad=15_16}, to_apply=%region_0.2
  ROOT %reduce.0 = f32[] reduce(f32[2]{0} %reduce-window, f32[] %constant.0), dimensions={0}, to_apply=%region_0.2, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}
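For reference, here is how I read the `window={size=32 stride=32 pad=15_16}` attribute in the dump above, as a rough NumPy sketch rather than what XLA actually executes. It assumes the padding is filled with the reduction's init value (0 here), 15 elements at the low end and 16 at the high end. The helper name `two_stage_sum` is made up for illustration.

```python
import numpy as np

def two_stage_sum(x, window=32, pad_lo=15, pad_hi=16):
    """Emulate reduce-window(size=32, stride=32, pad=15_16) followed by a reduce.

    Only valid when stride == window and the padded length is a multiple
    of the window, which is the situation in the dump above.
    """
    # Pad with the init value of the sum (0), low side then high side.
    padded = np.pad(x, (pad_lo, pad_hi), constant_values=0.0)
    assert padded.size % window == 0
    # One partial sum per non-overlapping window (the reduce-window step).
    partial = padded.reshape(-1, window).sum(axis=1)
    # Final reduce over the partial sums.
    return partial, partial.sum()

x = np.cos(np.ones(33, dtype=np.float32))
partial, total = two_stage_sum(x)
print(partial.shape)  # (2,) since 33 + 15 + 16 = 64 = 2 * 32
print(total, x.sum())
```

Under these assumptions the 33-element input becomes two 32-wide windows, which matches the `f32[2]` result of the reduce-window in the dump.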

I'd expect both cases to be fused - am I missing something here?
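In case it helps anyone reproduce this across sizes, below is a small sketch for listing the opcodes in the ENTRY computation of a compiled dump, so a fused and an unfused result can be told apart without reading the full text. The helpers `entry_ops` and `compiled_text` are hypothetical names, and the regex is only a rough match for the HLO text format shown in this post. It is not a real parser and will likely miss tuple-typed instructions.

```python
import re

# Rough match for "[ROOT] %name = type opcode(" in HLO text; captures the opcode.
_OP_RE = re.compile(r"^\s*(?:ROOT\s+)?%?[\w.\-]+\s+=\s+\S+\s+([a-z][a-z0-9\-]*)\(")

def entry_ops(hlo_text):
    """Return the opcodes of the ENTRY computation, in order of appearance."""
    ops, in_entry = [], False
    for line in hlo_text.splitlines():
        if line.startswith("ENTRY"):
            in_entry = True
            continue
        if in_entry:
            if line.strip() == "}":
                break
            m = _OP_RE.match(line)
            if m:
                ops.append(m.group(1))
    return ops

def compiled_text(fn, n_elem):
    """Compile fn for a float32 vector of n_elem ones and return the HLO text.

    jax is imported lazily so entry_ops can also be used on saved dumps.
    """
    import jax
    import jax.numpy as jnp
    x = jnp.ones(n_elem, dtype=jnp.float32)
    return jax.jit(fn).lower(x).compile().as_text()

# ENTRY block of the 32-element dump from this post, with metadata trimmed.
fused_entry = """\
ENTRY %main.12 (Arg_0.1: f32[32]) -> f32[] {
  %Arg_0.1 = f32[32]{0} parameter(0), metadata={op_name="input_tensor"}
  ROOT %fusion = f32[] fusion(f32[32]{0} %Arg_0.1), kind=kLoop, calls=%fused_computation
}
"""
print(entry_ops(fused_entry))  # ['parameter', 'fusion']
```

With jax installed, `entry_ops(compiled_text(func_a, n))` could be run over a range of `n` to see where the `fusion` op stops appearing. The exact threshold may depend on the XLA version.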

Thanks.
