@andygrove
Member

Which issue does this PR close?

  • Closes #.

Rationale for this change

I ran some microbenchmarks comparing DataFusion and DuckDB (see apache/datafusion-benchmarks#28) and found that CASE WHEN expressions were much slower in DataFusion, so I asked Claude to make it go faster.

What changes are included in this PR?

This adds a new optimization path for searched CASE WHEN expressions (CASE WHEN condition THEN result) with 3 or more branches.

Instead of evaluating conditions sequentially on progressively shrinking batches, this approach:

  1. Evaluates all conditions upfront on the full batch
  2. Builds a branch_index array indicating which branch matched each row
  3. Filters once per branch and evaluates THEN expressions
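The three steps above can be sketched in plain Rust. This is only an illustration under stated assumptions, not the code from this PR: it uses a `&[i64]` slice as a stand-in for a batch, function pointers as stand-ins for physical expressions, and the names `NO_MATCH`, `build_branch_index`, and `eval_case` are hypothetical. It also ignores NULL handling in conditions.

```rust
/// Hypothetical sentinel meaning "no WHEN branch matched this row".
const NO_MATCH: usize = usize::MAX;

/// Steps 1 and 2: evaluate every condition on the full batch, then record
/// the index of the first matching branch for each row.
fn build_branch_index(batch: &[i64], conditions: &[fn(i64) -> bool]) -> Vec<usize> {
    // Step 1: one boolean mask per condition, each computed over all rows.
    let masks: Vec<Vec<bool>> = conditions
        .iter()
        .map(|cond| batch.iter().map(|&v| cond(v)).collect())
        .collect();

    // Step 2: first match wins, so a later branch never overwrites an earlier one.
    let mut branch_index = vec![NO_MATCH; batch.len()];
    for (branch, mask) in masks.iter().enumerate() {
        for (row, &hit) in mask.iter().enumerate() {
            if hit && branch_index[row] == NO_MATCH {
                branch_index[row] = branch;
            }
        }
    }
    branch_index
}

/// Step 3: filter once per branch with an integer equality test and evaluate
/// the THEN expression only on the rows selected for that branch.
/// Rows with no match (and no ELSE in this sketch) stay `None`.
fn eval_case(
    batch: &[i64],
    conditions: &[fn(i64) -> bool],
    thens: &[fn(i64) -> i64],
) -> Vec<Option<i64>> {
    let branch_index = build_branch_index(batch, conditions);
    let mut out = vec![None; batch.len()];
    for (branch, then) in thens.iter().enumerate() {
        let rows: Vec<usize> = (0..batch.len())
            .filter(|&row| branch_index[row] == branch)
            .collect();
        for row in rows {
            out[row] = Some(then(batch[row]));
        }
    }
    out
}
```

Note that THEN expressions are still evaluated only on the rows of their own branch, which matches the description: only the conditions lose short-circuit behavior.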

This provides better performance due to:

  • Better cache locality (conditions evaluated on same data)
  • Simpler filter predicates (integer equality vs boolean expressions)
  • No progressive batch shrinking overhead

Important: This changes short-circuit semantics for CONDITIONS (not THEN expressions). All conditions are evaluated even for rows where an earlier condition matched. This is safe for simple comparisons but may cause issues if conditions can error (e.g., division by zero in a condition).
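To make the semantic change concrete, here is a small plain-Rust sketch of a case like `CASE WHEN v = 0 THEN ... WHEN 100 / v > 10 THEN ... END`. It is an assumption-laden illustration rather than DataFusion code: conditions are modeled as fallible function pointers, and `Cond`, `first_match_sequential`, and `first_match_upfront` are hypothetical names.

```rust
/// Hypothetical fallible condition: returns an error instead of panicking.
type Cond = fn(i64) -> Result<bool, String>;

fn is_zero(v: i64) -> Result<bool, String> {
    Ok(v == 0)
}

/// Models the condition `100 / v > 10`, which errors when `v` is zero.
fn hundred_over_v_gt_10(v: i64) -> Result<bool, String> {
    100i64
        .checked_div(v)
        .map(|q| q > 10)
        .ok_or_else(|| "division by zero".to_string())
}

/// Sequential evaluation: each condition only sees rows that no earlier
/// branch has matched, so later conditions are short-circuited per row.
fn first_match_sequential(batch: &[i64], conds: &[Cond]) -> Result<Vec<Option<usize>>, String> {
    let mut out = vec![None; batch.len()];
    let mut remaining: Vec<usize> = (0..batch.len()).collect();
    for (branch, cond) in conds.iter().enumerate() {
        let mut still_unmatched = Vec::new();
        for &row in &remaining {
            if cond(batch[row])? {
                out[row] = Some(branch);
            } else {
                still_unmatched.push(row);
            }
        }
        remaining = still_unmatched;
    }
    Ok(out)
}

/// Upfront evaluation: every condition is evaluated on every row, even rows
/// that an earlier branch already matched, so an error can surface there.
fn first_match_upfront(batch: &[i64], conds: &[Cond]) -> Result<Vec<Option<usize>>, String> {
    let mut out = vec![None; batch.len()];
    for (branch, cond) in conds.iter().enumerate() {
        for (row, &v) in batch.iter().enumerate() {
            let hit = cond(v)?;
            if hit && out[row].is_none() {
                out[row] = Some(branch);
            }
        }
    }
    Ok(out)
}
```

With a batch containing `0`, the sequential version never evaluates `100 / v` on that row because the first branch already claimed it, while the upfront version evaluates it and returns the error.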

🤖 Generated with Claude Code

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Jan 1, 2026
@andygrove andygrove closed this Jan 1, 2026
@andygrove andygrove deleted the case-vectorized-conditions branch January 1, 2026 01:32