
Scaling/Performance with a large number of raters #90

Merged
3 commits merged into davidcelis:master from ajjahn:optimize
Apr 23, 2014

Conversation

ajjahn
Contributor

ajjahn commented Apr 23, 2014

With a small number of raters, refreshing similarities and recommendations is quite fast and works great. Unfortunately my system needs to support more than half a million raters. My refresh time for a single rater increases significantly with the total number of raters. Even at 10,000 raters, refreshing can take between 1-3 minutes.

I've tried tweaking (and lowering) the config.nearest_neighbors, config.furthest_neighbors, and config.recommendations_to_store settings without seeing much improvement.

Obviously, 1-3 minutes will only get much longer as the number of raters gets closer to half a million. So here are a few questions:

First, I'm wondering if this situation is unique to me? Is this a typical situation?

Is there a way to limit the similar raters calculation to use a subset of the total raters in the system?

Are there any other places you can suggest I could look to decrease refreshing time?

I noticed a few places in the recommendable code where Redis operations are called inside a loop; is there any reason not to multi or pipeline in those situations?

Any direction you can offer is appreciated!
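The loop-vs-pipeline pattern described above can be sketched with a stand-in connection object. This is illustrative only: FakeRedis, the zscore calls, and the key name are hypothetical, not Recommendable's actual code, but they show why batching commands into one round trip matters.

```ruby
# Hypothetical stand-in for a Redis connection that counts network
# round trips, to show the cost of issuing one command per rater.
class FakeRedis
  attr_reader :round_trips

  def initialize
    @round_trips = 0
  end

  # One command issued directly = one round trip
  def zscore(_key, _member)
    @round_trips += 1
    1.0
  end

  # Pipelining: buffer the commands, then flush them in a single round trip
  def pipelined
    buffer = []
    proxy = Object.new
    proxy.define_singleton_method(:zscore) { |key, member| buffer << [key, member] }
    yield proxy
    @round_trips += 1
    buffer.map { 1.0 }
  end
end

rater_ids = (1..100).to_a

naive = FakeRedis.new
rater_ids.each { |id| naive.zscore('similarities', id) }   # 100 round trips

batched = FakeRedis.new
batched.pipelined do |pipe|
  rater_ids.each { |id| pipe.zscore('similarities', id) }  # 1 round trip
end

puts "naive: #{naive.round_trips}, pipelined: #{batched.round_trips}"
```

Recent versions of the real redis-rb client expose the same shape via redis.pipelined { |pipe| ... }, with replies returned as an array once the pipeline is flushed.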

@davidcelis
Owner

Any recommender system will see degraded performance as the data set grows, but I do suspect it's particularly bad with Recommendable. As for multi/pipelining, that's actually something I've wanted to try, but I simply haven't gotten around to it yet. It's the sort of thing that would be awesome to get a PR for, if you'd be willing to help out; I haven't really worked with Redis pipelining before.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

I'll pick through the code and see if I can't get a PR together to minimize Redis connections, although I suspect that will only marginally boost performance. To really handle large data sets, I'm inclined to think it might need to support something along the lines of set sampling.
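One hedged sketch of what set sampling might look like, using plain Ruby's Array#sample to stand in for a server-side command such as Redis's SRANDMEMBER (the variable names and sizes here are illustrative, not part of Recommendable):

```ruby
# Illustrative only: restrict the similarity pass to a random subset of
# raters instead of comparing the current rater against all of them.
all_rater_ids = (1..500_000).to_a
SAMPLE_SIZE = 1_000

# Array#sample draws distinct elements without replacement; Redis's
# SRANDMEMBER with a positive count behaves analogously server-side.
candidates = all_rater_ids.sample(SAMPLE_SIZE)

puts candidates.size  # 1000 raters to compare against, not 500,000
```

The trade-off, of course, is that sampled similarities are approximate: a rater's true nearest neighbors may simply not be in the sample.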

In the meantime, do you know of anyone successfully using Recommendable with larger data sets in the wild?

@davidcelis
Owner

If any of my users have large data sets, they haven't let me know... I think Recommendable definitely needs some performance improvements but, truth be told, I'm not really up to date on the computer science behind really performant and scalable recommender systems. Stuff's not easy...

When I built Recommendable, my primary focus was a really nice and elegant interface, and I figured I'd improve the performance as I went along. I've done so bit by bit, but at this point I only have a few ideas for improving the current Redis logic, and like you said, I worry they'd only marginally boost performance. One of those ideas is using multi/pipelining; the other is putting all of that logic into a Lua script that Redis can run directly.

Though for the next major version I've been meaning to support alternate ways (with a pluggable system) of generating similarity values and recommendations that don't rely on Redis.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Here's a first stab at it. I'll keep plugging away and attempt some benchmarking.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Surprisingly, pipelining has yielded a significant boost in performance (at least on my data set).

Updating similarities and recommendations 10 times
Without pipelining:

       user     system      total        real
 836.470000 113.540000 950.010000 (953.499563)

With pipelining:

       user     system      total        real
 200.050000  48.630000 248.680000 (250.104407)
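For reference, columns like the ones above come from Ruby's standard Benchmark module. A minimal sketch of how such a measurement is produced (the update_similarities body here is a placeholder for the real refresh work, not Recommendable's code):

```ruby
require 'benchmark'

# Placeholder for the actual similarity/recommendation refresh being timed
def update_similarities
  (1..100_000).reduce(:+)
end

# Benchmark.measure returns a Benchmark::Tms with user, system,
# total, and (real) wall-clock times, matching the columns above.
tms = Benchmark.measure { 10.times { update_similarities } }
puts tms
```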

@davidcelis
Owner

Daaaaang please PR that.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

@mhuggins it would seem that way, but the tests fail without it. The reason is that sometimes there are no members in the liked_by_set/disliked_by_set, which means the similarity values array is empty.

[].map(&:to_f) returns the same empty array, but [].reduce(&:+) returns nil. So we need that to_f to ensure that, in the case of empty liked sets, we end up with 0.0. It threw me for a loop when I was first writing that line of code.
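The empty-array edge case in question, spelled out:

```ruby
liked_scores = []

# map on an empty array is still an empty array...
mapped = liked_scores.map(&:to_f)     # []

# ...but reduce with no initial value returns nil for an empty array
sum = mapped.reduce(&:+)              # nil

# so the trailing to_f coerces nil to 0.0 when the liked sets are empty
similarity = mapped.reduce(&:+).to_f  # 0.0

# with a non-empty array, to_f is a harmless no-op on the Float sum
puts [0.5, 0.25].reduce(&:+).to_f     # 0.75
```

An alternative that avoids the nil entirely is reduce(0.0, :+), which supplies an explicit initial value.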

@davidcelis Isn't this already a PR?

@davidcelis
Owner

Oh it is. Crazy. I forgot you could turn issues into a PR. I'll take a closer look and probably merge when I'm not on a bus!

@mhuggins

@ajjahn makes sense RE: to_f :)

@davidcelis
Owner

🎆

davidcelis added a commit that referenced this pull request Apr 23, 2014
Scaling/Performance with a large number of raters
@davidcelis davidcelis merged commit 03e3d80 into davidcelis:master Apr 23, 2014
@ajjahn ajjahn deleted the optimize branch April 23, 2014 17:37
@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Rad! Thanks! I'll keep digging for other ways to optimize, but I'd say this was a good start.
