Skip to content

Use a balanced tree instead of unbalanced one to prevent recursion error in create_match_filter #1830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 31, 2025

Conversation

koenvo
Copy link
Contributor

@koenvo koenvo commented Mar 22, 2025

Use a balanced tree instead of an unbalanced one to prevent recursion error in create_match_filter

Rationale for this change

In the create_match_filter function, the previous implementation used functools.reduce(operator.or_, filters) to combine expressions. This approach constructed a right-heavy, unbalanced tree, which could lead to a RecursionError when dealing with a large number of expressions (e.g., over 1,000).

To address this, we've introduced the _build_balanced_tree function. This utility constructs a balanced binary tree of expressions, reducing the maximum depth to O(log n) and thereby preventing potential recursion errors. This makes expression construction more stable and scalable, especially when working with large datasets.

# Past behavior

Or(*[A, B, C, D]) = Or(A, Or(B, Or(C, D))

# New behavior
Or(*[A, B, C, D]) = Or(Or(A, B), Or(C, D))

Are these changes tested?

Yes, existing tests cover the functionality of Or. Additional testing was done with large expression sets (e.g., 10,000 items) to ensure that balanced tree construction avoids recursion errors.

Are there any user-facing changes?

No, there are no user-facing changes. This is an internal implementation improvement that does not affect the public API.

Closes #1759
Closes #1785

@koenvo koenvo marked this pull request as ready for review March 24, 2025 15:18
@@ -257,7 +298,7 @@ class Or(BooleanExpression):

def __new__(cls, left: BooleanExpression, right: BooleanExpression, *rest: BooleanExpression) -> BooleanExpression: # type: ignore
if rest:
return reduce(Or, (left, right, *rest))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also want to apply this in the And situation :)

@Fokko
Copy link
Contributor

Fokko commented Mar 24, 2025

I think this is pretty elegant, thoughts @kevinjqliu?

I expressed some concerns with in #1783 around performance. I have played around with the singledispatch stuff in the past, and it was always crazy fast because the type dispatching is pushed down into the C-layer of Python.

@Fokko Fokko added this to the PyIceberg 0.9.1 milestone Mar 26, 2025
@Fokko
Copy link
Contributor

Fokko commented Mar 31, 2025

Moving this forward since I think it is a great idea in general :)

@Fokko Fokko merged commit bae62df into apache:main Mar 31, 2025
7 checks passed
@Fokko
Copy link
Contributor

Fokko commented Mar 31, 2025

Thanks @koenvo

Fokko pushed a commit that referenced this pull request Apr 17, 2025
**Use a balanced tree instead of an unbalanced one to prevent recursion
error in `create_match_filter`**

<!-- Closes #1776 -->

## Rationale for this change

In the `create_match_filter` function, the previous implementation used
`functools.reduce(operator.or_, filters)` to combine expressions. This
approach constructed a right-heavy, unbalanced tree, which could lead to
a `RecursionError` when dealing with a large number of expressions
(e.g., over 1,000).

To address this, we've introduced the `_build_balanced_tree` function.
This utility constructs a balanced binary tree of expressions, reducing
the maximum depth to O(log n) and thereby preventing potential recursion
errors. This makes expression construction more stable and scalable,
especially when working with large datasets.

```python

# Past behavior

Or(*[A, B, C, D]) = Or(A, Or(B, Or(C, D))

# New behavior
Or(*[A, B, C, D]) = Or(Or(A, B), Or(C, D))

```

## Are these changes tested?

Yes, existing tests cover the functionality of `Or`. Additional testing
was done with large expression sets (e.g., 10,000 items) to ensure that
balanced tree construction avoids recursion errors.

## Are there any user-facing changes?

No, there are no user-facing changes. This is an internal implementation
improvement that does not affect the public API.

Closes #1759
Closes #1785

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug] bind visitor causes RecursionError: maximum recursion depth exceeded Issue during Upsert
2 participants