Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find and sort has poor performance #645

Closed
cspecter opened this issue Jan 23, 2018 · 8 comments
Closed

Find and sort has poor performance #645

cspecter opened this issue Jan 23, 2018 · 8 comments

Comments

@cspecter
Copy link

So far LokiJS has been working great, extremely fast for $eq type find queries. The problem I am having is when I need to find either a single document or set of documents by some arbitrary sort parameter which is causing some serious performance degredation. For example I have a collection with about 100k documents. I need to get either the first doc or last doc based on a subset of about 10k docs in this collection. Since findOne() will not support a sort option I've implemented this call using chain() and find(). Each operation by itself is very fast. I can run a simplesort() or find() the whole collection in 1ms each, but if I run find() then simplesort() my query takes like 200ms. Am I dong something wrong? It seems like it is not using the index for the sort after the find.

let cache = new loki();

let collection = cache.addCollection("someData", {indices: ["date", "name", "killingTrolls"]});

//These queries are slow, >200ms response time
//Get first
collection.chain().find({name: "Odin", killingTrolls: true}).simplesort("date").limit(1).data();

//Get last
collection.chain().find({name: "Odin", killingTrolls: true}).simplesort("date", true).limit(1).data();

//Get last 100
collection.chain().find({name: "Odin", killingTrolls: true}).simplesort("date", true).limit(100).data().reverse();

//Get last 100, skip 10
collection.chain().find({name: "Odin", killingTrolls: true}).simplesort("date", true)offset(10).limit(100).data().reverse();

//These queries are fast, ~1ms response time.
//Just sort
collection.chain().simplesort("date", true).limit(1).data();
//~1ms response

//Just find
collection.chain().find({name: "Odin", killingTrolls: true}).limit(1).data();
//~1ms response

//Query that didn't work as intended
collection.chain().simplesort("date", true).find({name: "Odin", killingTrolls: true}).limit(1).data();
@obeliskos
Copy link
Collaborator

obeliskos commented Jan 23, 2018

In a query chain, only the first operation can use binary index which is why they are super fast alone.

So its the sorting that's killing you as you have 10k docs that need to be sorted (without use of index). You might try calling sort() instead of simplesort() and supplying it with a javascript compare function similar to that provided in our sort example.

  coll.chain().sort(function(obj1, obj2) {
     if (obj1.name === obj2.name) return 0;
     if (obj1.name > obj2.name) return 1;
     if (obj1.name < obj2.name) return -1;
   });

You are still doing a full sort on 10k docs but that lighter weight function might yield faster results if your data is clean and compare is not thrown off by dirty data. I'd be interested in hearing your results with that.

We might be able to explore array intersection algorithms for simplesort() to see if there is any performance increase over a manual sort() on interim results when a binary index exists for the sort column and interim results are already filtered.

To elaborate, a scaled down version of your query might have arrays of integer array positions for each binary index. Lets say you have only 6 docs, the indices :
'name' : [5, 2, 1, 3, 0, 4]
'killingTrolls' : [0, 4, 1, 5, 2, 3]
'date' : [3, 2, 1, 4, 5, 0]

...and after you run your find the interim results are [5, 2], we might want to sort by date index [3, 2, 1, 4, 5, 0].

So you want to filter down our sort index [3, 2, 1, 4, 5, 0] keeping its ordering but getting rid of values not in our interim resultset [5, 2]. That's a lot of linear scans but worth exploring benchmarking various algorithms to see if there could be gains.

@cspecter
Copy link
Author

Ok, tried out the sort() method with a javascript compare function. Definitely much faster. Getting response times between 35-37ms for the big query, not as lightning fast a normal find, but much, much better than before. I'd be interested in seeing if there are gains to be made with the array intersection algorithm you have laid out.

Will LokiJS2 eventually support the use of multiple binary indexes or binary indexes for each operation or compound indexes? That seems like it would solve some of these problems given how fast just a simplesort() or find() is. Or maybe if there was a way to pre-sort a collection and run a find with a sort either ascending or descending based on that pre-sort.

A little more info on my use case. I'm using LokiJS to cache data locally from MongoDB inside a Meteor app. I've written an interpreter that sits inbetween Meteor and Mongo, if there is cached data it will convert the Mongo selector and options into a LokiJS query. Read only. I have a lot of findOne() queries where I need to grab the first element from a descending sort (grabbing the most recent entry that meets my criteria type things). It would be great if LokiJS eventually could handle this out-of-the-box. Right now I am just caching a whole collection, though I plan on implementing the ability to cache specific queries and keep the cache in sync with the DB

@obeliskos
Copy link
Collaborator

Sure we can add that to the LokiJS2 roadmap... it will require research along the same lines as outlined above where we do multiple (result) set intersection or index / (result) set intersection. This would be paired with benchmarks to confirm or refute benefits before adding.

Our $or op already does set union where each or clause can utilize binary index... we might explore implementing $and similarly except with an intersect (instead of union) like sorting. The biggest performance increase would be the sort though... everything else is just array scan performance hit.

@obeliskos
Copy link
Collaborator

I did some initial experiments that did not yield beneficial results so this won't be included in a 1.5.2 release but I will continue experimenting with this using other techniques.

@cspecter
Copy link
Author

cspecter commented Feb 2, 2018

I also did a little more testing and sort() is definitely faster on larger collections than simplesort(). But, I was able to get a lot more performance out of the query by just changing the order and number of selectors. I'm assuming that since it's only using one index it's all about picking the right index and that any selections beyond the first one will decrease performance because they are not indexed, correct?

For example lets say I have a set of docs like this:

// Model
{
   name,
   type,
   group,
   rank,
   date
}

[{
   name: "Odin",
   type: "God",
   group: "Aesir",
   rank: 1,
   date: "2018-01-01"
},
{
   name: "Frey",
   type: "God",
   group: "Vanir",
   rank: null,
   date: "2018-01-01"
},
{
   name: "Thor",
   type: "God",
   group: "Aesir",
   rank: 2,
   date: "2018-01-01"
}]

I want to get the last 100 days for rank 1 gods that are Aesir between a range of dates. My initial query (in Meteor) is structured like this:

Data.find({type: "God", group: "Aesir", rank: 1, date: {$gte: "2017-01-01", $lte: "2018-02-01"}, {sort: {date: -1}, limit: 100})

Querying MongoDB, which has a compound index on that whole selector, takes about 60-80ms.

To get both selectors for the date to work in LokiJs I had to restructure the query like this:

Data.chain().find({type: "God", group: "Aesir", rank: 1, $and: [{date: {$gte: "2017-01-01"}}, {date: {$lte: "2018-02-01"}}]}).sort(...fun).limit(100).data()

All keys have an index and only Aesir are ranked, other gods are not ranked. Now this query without the limit in LokiJs might return 4000 documents and take about 45-60ms as structured when indexing off the type key using sort(). If I just query for {type: "God"}, and nothing else but the sort(), I will get about 55k documents returned and it will take like 4ms or less. If I restructure the query with the first key as one of the other keys (either group, rank or date) the query will take a lot longer, between 130-250ms. Now, as only the Aesir have a rank in my data, when I dropped the group key, as it was semi-redundant, the query to LokiJs took only 9ms or less.

Data.chain().find({type: "God", rank: 1, $and: [{date: {$gte: "2017-01-01"}}, {$lte: "2018-02-01"}}]}).sort(...fun).limit(100).data()

In your indexing doc it might be worth going into a little more detail on how indexes work and how to optimize queries to best use them. Now that I have all my queries tweaked to work with LokiJs, they are runnin pretty fast, though it took a lot of experimentation to get there.

@obeliskos
Copy link
Collaborator

obeliskos commented Feb 2, 2018

Good ideas, yes we do cover alot of general topics in these wiki pages :
Indexing and Query performance
Query Examples

Depending on your data flows you might also want to experiment with Dynamic views, which can optimize result fetching in various situations where you may repeat the same query over and over. The final sort is the tricky area where, if you are willing to expend idle time after inserts/updates/removes, we can also automatically apply sorting (sortPriority).

obeliskos added a commit that referenced this issue Feb 3, 2018
exploring an 'array intersection' algorithm for leveraging multiple binary indices, also adding a binary index thrash test to verify index validity across high volume adaptive inserts/updates/removes
@obeliskos
Copy link
Collaborator

I did manage come up with a simplesort algorithm which can utilize additional binary index via an array intersection algorithm.

To summarize it looks like it will be faster than the old simplesort unless you were able to filter 90% of documents before calling simplesort. It would seem more dependent on that ratio than how many documents were ultimately in your collection or resultset.

The less effective your filtering, the faster the sorting phase will be.

If you want to experiment with it, I have created a script in the benchmark folder called 'benchmark_sorting.js' and it uses an experimental version of lokijs also in that folder.

obeliskos added a commit that referenced this issue Feb 4, 2018
- we now utilize array intersection algorithm to leverage binary index on simplesort if you have filtered out at least 10% of total documents from results
- added diagnostic collection.checkIndex function to test binary index validity
- i do not expect most users will ever need to use this checkIndex function
- coll.checkIndex(propertyName) returns true/false whether binary index is valid
- coll.checkIndex{propertyName, { repair: true }) will fix binary index if invalid
obeliskos added a commit that referenced this issue Feb 4, 2018
…k to js sorting

with no indexing on sort property, basic javascript sorting will be faster than loki sorting but may yield  unpredictable results if sort property contains dirty or mixed data.

when indexing is available on sort property, passing 'useJavascriptSorting' option will fallback to javascript sorting if your filtered results amount to 10% or more of total document count.  (best performance tradeoff for most data situations)

See #645 for more info.
@obeliskos
Copy link
Collaborator

obeliskos commented Feb 4, 2018

Ok, oddly enough your exact use case is right on the border of being able to use (or benefit from) the array intersect performance increases which have now been committed to master branch. (go figure)

To provide the best balance for your situation (where javascript sorting is perfectly acceptable), i multipurposed the second parameter to be either boolean or options object. If you pass an options object, I allow passing 'useJavascriptSorting' option.

When an index is defined on your sort property, passing 'useJavascriptSorting:true' option will fallback to simple javascript sort function (instead of slower loki sort) on the property supplied if it is deemed to be more efficient than index intercept (less than 10% of total document count in current resultset). That would look like :

coll.chain().find({a: { $gte: 99 }).simplesort('b', { useJavascriptSorting: true }).data();

If an index were not defined on that property it would always use javascript sort function.
If you wanted to force javascript sorting even when an index was defined you could use the (slower) options to disable index intersect :

coll.chain().find({a: { $gte: 99 }).simplesort('b', { disableIndexIntersect: true, useJavascriptSorting: true }).data();

Allowing simplesort to perform javascript sorting would allow persisting those queries/sorts in dynamic views and transforms within the database itself, should you or others wish to go that route.

Since using the 'useJavascriptSorting' fallback option might produce slightly different sort ordering results on mixed types, depending on whether results meet or exceed the 10% ratio of total docs, that option is false by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants