Replies: 3 comments 5 replies
-
CPU is probably not fusing it, but on TPU the optimized HLO does show the fusion.
You should look at the optimized HLO for fusions! On CPU I can't see the fusion, but on TPU you can.
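For anyone following along: the optimized, backend-specific HLO is what `.compile().as_text()` returns, as opposed to the pre-compilation StableHLO. A minimal sketch; the function `f` here is a stand-in, not the original code:

```python
import jax
import jax.numpy as jnp

# Stand-in function for illustration only.
def f(x):
    return jnp.sum(jnp.cos(x))

x = jnp.ones(1000)

# Optimized HLO after XLA's backend passes; fusion ops show up here.
# The output differs by backend (CPU vs TPU), since fusion decisions
# are made per backend during compilation.
print(jax.jit(f).lower(x).compile().as_text())
```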
-
Interestingly, I think I get fusion on the CPU if my parameter array is <= 32 elements: the compiled HLO contains a fusion op. But with 33 or more elements, the fusion op disappears.
This doesn't make much sense to me.
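A quick way to check this threshold; a sketch reusing the hypothetical sum-of-cosines reduction, where the substring test is just a crude heuristic for spotting fusion ops in the HLO text:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.cos(x))

for n in (32, 33):
    hlo = jax.jit(f).lower(jnp.ones(n)).compile().as_text()
    # Crude check: look for a fusion op in the optimized HLO text.
    print(n, "fusion" in hlo)
```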
-
One more bit of info: the jaxpr and the StableHLO (`jit(f).lower(x).as_text()`) are identical for the 32-element and 33-element arrays, aside from the tensor dimension. The only difference appears in the compiled version (`jit(f).lower(x).compile().as_text()`), with one containing the fusion operator and the other not. Does this indicate a potential problem with XLA as opposed to JAX?
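For context, these are the three artifacts being compared. The jaxpr and StableHLO are produced by JAX's tracing and lowering, while the compiled text reflects XLA's backend-specific optimization passes, which is where fusion decisions are made. A sketch, again with a stand-in function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.cos(x))

x = jnp.ones(33)

print(jax.make_jaxpr(f)(x))                     # jaxpr: JAX trace-level IR
print(jax.jit(f).lower(x).as_text())            # StableHLO: backend-independent
print(jax.jit(f).lower(x).compile().as_text())  # optimized HLO: backend-specific
```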
-
Hello! What is the most efficient way to do reductions so that data is passed over only once?
The following code:
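(The original snippet isn't preserved in this copy; the following is a minimal stand-in matching the description below, namely an elementwise cosine followed by a sum reduction.)

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # Elementwise transform followed by a reduction.
    return jnp.sum(jnp.cos(x))

x = jnp.arange(1000, dtype=jnp.float32)
print(f.lower(x).as_text())  # StableHLO for the computation
```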
produces HLO in which an elementwise cosine op fills a full array that then feeds a separate stablehlo.reduce.
This implies to me that the data is iterated over once to compute an array of cosine values and then again to do the reduction, which is inefficient. Is that two-pass understanding correct? If so, what is the way to do a reduction plus a computation in one pass?
Incidentally, I noticed that a version I tried with jax.lax.reduce seems to have a "reducer" block instead of the stablehlo.reduce line. Is there any difference between these two as far as the number of passes over the data?
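For reference, a sketch of how that jax.lax.reduce variant might look, assuming the same stand-in computation: `lax.reduce` takes the operand, an initial value, a binary computation, and the dimensions to reduce over.

```python
import jax
import jax.numpy as jnp
from jax import lax

def g(x):
    # Same computation, expressed with lax.reduce instead of jnp.sum.
    return lax.reduce(jnp.cos(x), 0.0, lax.add, dimensions=(0,))

x = jnp.arange(1000, dtype=jnp.float32)
print(jax.jit(g).lower(x).as_text())
```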
I'm using the CPU backend if that's relevant.
Thanks!