ENH: Optimize LUT with transposed loads and 4-way row unrolling by seiko2plus · Pull Request #7 · numpy/numpy-simd-routines

seiko2plus · 2025-12-22T16:32:49Z

This patch adds several improvements to the Lookup Table (LUT) for non-scalable architectures.

- **Transposed loads:** Introduced a pre-calculated transposed storage (`trans_`) and a matching loading strategy. This allows using optimized interleaved loads (X2 and X4) when the number of vector lanes matches the blocking factor, reducing the need for expensive gather operations.
- **FourTablesLookup:** Handles the case where the vector length is exactly one-quarter of the table width using 4-way table lookups.
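
To make the two ideas above concrete, here is a minimal scalar sketch. It is not the code from this patch and does not use Highway: the sizes (`kRows`, `kCols`, `kLanes`) and all function names are hypothetical, and the layout is only one plausible reading of "transposed storage". The first half shows that keeping all row values for one index contiguous yields the same result as per-row strided reads. The second half models a four-way table lookup, where a row four vectors wide is held in four registers and each index selects a register and a lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t kRows = 4;   // values fetched per index
constexpr std::size_t kCols = 8;   // entries per table row
constexpr std::size_t kLanes = 2;  // lanes per "vector"; kCols == 4 * kLanes

using Table = std::array<std::array<double, kCols>, kRows>;
using Transposed = std::array<double, kRows * kCols>;
using Vec = std::array<double, kLanes>;
using Idx = std::array<std::size_t, kLanes>;

// Pre-compute the transposed copy: the kRows values that belong to one
// index are stored next to each other.
Transposed MakeTransposed(const Table& t) {
  Transposed out{};
  for (std::size_t c = 0; c < kCols; ++c)
    for (std::size_t r = 0; r < kRows; ++r) out[c * kRows + r] = t[r][c];
  return out;
}

// Row-major lookup: one strided, gather-like access per row and lane.
std::array<Vec, kRows> LookupGather(const Table& t, const Idx& idx) {
  std::array<Vec, kRows> out{};
  for (std::size_t r = 0; r < kRows; ++r)
    for (std::size_t l = 0; l < kLanes; ++l) {
      assert(idx[l] < kCols);
      out[r][l] = t[r][idx[l]];
    }
  return out;
}

// Transposed lookup: each lane reads one contiguous run of kRows values,
// which is then de-interleaved into one vector per row.
std::array<Vec, kRows> LookupTransposed(const Transposed& tr, const Idx& idx) {
  std::array<Vec, kRows> out{};
  for (std::size_t l = 0; l < kLanes; ++l) {
    assert(idx[l] < kCols);
    const double* run = &tr[idx[l] * kRows];
    for (std::size_t r = 0; r < kRows; ++r) out[r][l] = run[r];
  }
  return out;
}

// Four-way table lookup on a single row: the row is split across four
// kLanes-wide registers; idx / kLanes picks the register and
// idx % kLanes picks the lane inside it.
Vec FourTablesSelect(const std::array<Vec, 4>& regs, const Idx& idx) {
  Vec out{};
  for (std::size_t l = 0; l < kLanes; ++l) {
    assert(idx[l] < 4 * kLanes);
    out[l] = regs[idx[l] / kLanes][idx[l] % kLanes];
  }
  return out;
}
```

In a real SIMD build the inner loops would map to vector instructions. The sketch is only meant to show why the transposed copy and the register-resident table avoid per-row gathers.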

Maintenance:

- Removed runtime specializations for scalable extensions. These now explicitly fall back to Highway's `GatherIndex` (no performance gain on SVE).


Mousius · 2026-01-05T12:18:10Z

Looks good to me @seiko2plus!

One thought for the future: we will likely want to stop using `HWY_HAVE_SCALABLE`, to enable compilation that switches to fixed-width 128-bit vectors within a scalable kernel. Not important for now, though 😸

seiko2plus marked this pull request as ready for review December 22, 2025 17:21

Mousius approved these changes Jan 5, 2026


Mousius merged commit 7c40f72 into numpy:main Jan 5, 2026
1 check failed

Mousius merged 1 commit into numpy:main from seiko2plus:enhance-lut
