Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize plain "not" queries #124

Merged
merged 2 commits into from
Oct 30, 2021
Merged

Optimize plain "not" queries #124

merged 2 commits into from
Oct 30, 2021

Conversation

jensens
Copy link
Member

@jensens jensens commented Oct 28, 2021

In Plone 6 we do queries returning a result except one item (the current context). A not query on the UID catalog is used to do this. In larger projects with several 10 or 100 thousands object cataloged, this slowed down queries to take several seconds. Without the not everything was fast.

After some profiling we found the time is wasted in this loop

for k in record.keys:
if k is None:
# Prevent None from being looked up. None doesn't
# have a valid ordering definition compared to any
# other object. BTrees 4.0+ will throw a TypeError
# "object has default comparison".
continue
try:
s = index.get(k, None)
except TypeError:
# key is not valid for this Btree so the value is None
LOG.error(
'%(context)s: query_index tried '
'to look up key %(key)r from index %(index)r '
'but key was of the wrong type.', dict(
context=self.__class__.__name__,
key=k,
index=self.id,
)
)
s = None
# If None, try to bail early
if s is None:
if operator == 'or':
# If union, we can possibly get a bigger result
continue
# If intersection, we can't possibly get a smaller result
if cachekey is not None:
# If operator is 'and', we have to cache a list of
# IISet objects
cache[cachekey] = [IISet()]
return IISet()
elif isinstance(s, int):
s = IISet((s,))
setlist.append(s)
# If we only use one key return immediately
if len(setlist) == 1:
result = setlist[0]
if isinstance(result, int):
result = IISet((result,))
if cachekey is not None:
if operator == 'or':
cache[cachekey] = result
else:
cache[cachekey] = [result]
if not_parm:
exclude = self._apply_not(not_parm, resultset)
result = difference(result, exclude)
return result
if operator == 'or':
# If we already get a small result set passed in, intersecting
# the various indexes with it and doing the union later is
# faster than creating a multiunion first.
if resultset is not None and len(resultset) < 200:
smalllist = []
for s in setlist:
smalllist.append(intersection(resultset, s))
r = multiunion(smalllist)
# 'r' is not invariant of resultset. Thus, we
# have to remember the union of 'setlist'. But
# this is maybe a performance killer. So we do not cache.
# if cachekey is not None:
# cache[cachekey] = multiunion(setlist)
else:
r = multiunion(setlist)
if cachekey is not None:
cache[cachekey] = r
else:
# For intersection, sort with smallest data set first
if len(setlist) > 2:
setlist = sorted(setlist, key=len)
# 'r' is not invariant of resultset. Thus, we
# have to remember the union of 'setlist'
if cachekey is not None:
cache[cachekey] = setlist
r = resultset
for s in setlist:
r = intersection(r, s)
# If intersection, we can't possibly get a smaller result
if not r:
break
primary while factoring IISets here https://github.com/zopefoundation/Products.ZCatalog/blob/master/src/Products/PluginIndexes/unindex.py#L608

Whats happens?
The not query is written to return all values except the ones matching. Thus, for a single UID, this creates an result containing all index keys, except one. This is very expensive.

What I have done:
If the code detects a simple not query (without any operators) excluding only one or more keys, I shortcut the whole index_query and return the previous result w/o the current catalog id.

Does it help?
Yes. We bench-marked our query on a customer database with ~367000 catalogued objects. It cut down the whole requests time from ~1200ms to ~80ms (with 40ms of it been not query time in both).

cc @tisto @mauritsvanrees @cekk

@jensens jensens force-pushed the jensens-optimize-not-queries branch from b33c889 to f0e9acc Compare October 29, 2021 09:52
@jensens jensens force-pushed the jensens-optimize-not-queries branch from f0e9acc to d94e333 Compare October 29, 2021 10:27
@jensens
Copy link
Member Author

jensens commented Oct 29, 2021

Coverage tests are red on master too: https://coveralls.io/github/zopefoundation/Products.ZCatalog

@jensens
Copy link
Member Author

jensens commented Oct 29, 2021

see also plone/plone.restapi#1252

@dataflake
Copy link
Member

I trust your changes are OK, but don't have any ZCatalog Fu.

@dataflake dataflake removed their request for review October 29, 2021 12:08
* Further optimize excluding results in not queries

Usually the number of the parameters that have to be excluded in the not query is much lower than the number of values in the index, so it makes sense to actually try to pop them out from the list

* duration1 appears to be lower after the optimizations
@jensens
Copy link
Member Author

jensens commented Oct 30, 2021

this branch includes now #125

@jensens jensens requested a review from jamadden October 30, 2021 09:32
Copy link
Member

@ale-rt ale-rt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jensens jensens merged commit ae1199b into master Oct 30, 2021
@jensens jensens deleted the jensens-optimize-not-queries branch October 30, 2021 09:38
@ale-rt
Copy link
Member

ale-rt commented Oct 30, 2021

We see that it is only possible to squash and merge the commits:
image
Is that intended?
CC @icemac

@icemac
Copy link
Member

icemac commented Nov 9, 2021

@ale-rt Yes this is intended to keep the commit history straight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants