BUG: Fix #57608: queries on categorical string columns in HDFStore.select() return unexpected results. #61225

Open
wants to merge 2 commits into
base: main

SofiaSM45

In `__init__()` of class `Selection` (pandas/core/io/pytables.py), the call `self.terms.evaluate()` was not returning the correct value for the where condition. The issue stemmed from `convert_value()` of class `BinOp` (pandas/core/computation/pytables.py), where `searchsorted()` did not return the correct index when matching the where condition against the metadata (the categories table). Replacing `searchsorted()` with `np.where()` resolves this issue.
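A minimal sketch of the failure mode described above, using plain NumPy rather than the actual pandas code path. The `metadata` array here is a made-up stand-in for the categories table, deliberately stored out of sorted order; `searchsorted()` performs a binary search and is only meaningful on sorted input, so it can point at the wrong position.

```python
import numpy as np

# Hypothetical stand-in for the categories table, NOT in sorted order.
metadata = np.array(["b", "c", "a"])
conv_val = "a"

# Binary search assumes sorted input, so the result is unreliable here.
wrong = metadata.searchsorted(conv_val, side="left")

# A linear scan finds the first position where the value actually occurs.
right = np.where(metadata == conv_val)[0][0]

print("searchsorted:", wrong, "np.where:", right)
```

With unsorted categories the two lookups disagree, which is consistent with the wrong rows being selected.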

@@ -239,7 +239,8 @@ def stringify(value):
     if conv_val not in metadata:
         result = -1
     else:
-        result = metadata.searchsorted(conv_val, side="left")
+        # Find the index of the first match of conv_val in metadata
+        result = np.where(metadata == conv_val)[0][0]
Member


Suggested change
- result = np.where(metadata == conv_val)[0][0]
+ result = np.flatnonzero(metadata == conv_val)[0]

Also, is it possible to know ahead of time whether metadata is sorted, so we can use searchsorted? It would be much faster in that case.
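To illustrate the suggestion, here is a small sketch (with a made-up `metadata` array, not the real pandas internals) showing that for a 1-D array `np.flatnonzero(mask)` yields the same indices as `np.where(mask)[0]`, and that `searchsorted()` agrees with the linear scan once the array is actually sorted.

```python
import numpy as np

# Hypothetical unsorted categories array.
metadata = np.array(["b", "c", "a"])
mask = metadata == "c"

# For 1-D input these two expressions select the same first index.
via_where = np.where(mask)[0][0]
via_flatnonzero = np.flatnonzero(mask)[0]

# On a sorted copy, binary search and linear scan agree.
sorted_metadata = np.sort(metadata)
via_searchsorted = sorted_metadata.searchsorted("c", side="left")
via_scan_sorted = np.flatnonzero(sorted_metadata == "c")[0]
```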

@@ -239,7 +239,13 @@ def stringify(value):
     if conv_val not in metadata:
         result = -1
     else:
-        result = metadata.searchsorted(conv_val, side="left")
+        # Check if metadata is sorted
Member


Well, we probably won't want to do this check here, because the check itself incurs a performance penalty. I was just asking whether something in the preprocessing code above already checks this for us.

If not, just using np.flatnonzero here directly is fine
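One way to picture the trade-off discussed in this thread is a lookup that only takes the binary-search path when sortedness is already known from earlier processing, and otherwise falls back to `np.flatnonzero`. This is a hedged sketch only: `find_category_code` and its `known_sorted` flag are hypothetical names for illustration and are not part of pandas.

```python
import numpy as np

def find_category_code(metadata, conv_val, known_sorted=False):
    """Return the first position of conv_val in metadata, or -1 if absent.

    known_sorted is a hypothetical flag the caller would set only when
    sortedness was already established elsewhere; it is not checked here.
    """
    if known_sorted:
        # O(log n) binary search, valid only for sorted input.
        pos = int(np.searchsorted(metadata, conv_val, side="left"))
        if pos < len(metadata) and metadata[pos] == conv_val:
            return pos
        return -1
    # O(n) linear scan, correct regardless of ordering.
    matches = np.flatnonzero(metadata == conv_val)
    return int(matches[0]) if matches.size else -1
```

For example, `find_category_code(np.array(["b", "c", "a"]), "a")` scans linearly, while passing `known_sorted=True` with a sorted array uses the faster path without re-checking the order on every query.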

Successfully merging this pull request may close these issues.

BUG: queries on categorical string columns in HDFStore.select() return unexpected results