
Scaling/Performance with a large number of raters #90

Merged
3 commits merged into davidcelis:master from ajjahn:optimize
Apr 23, 2014

Conversation

ajjahn
Contributor

ajjahn commented Apr 23, 2014

With a small number of raters, refreshing similarities and recommendations is quite fast and works great. Unfortunately my system needs to support more than half a million raters. My refresh time for a single rater increases significantly with the total number of raters. Even at 10,000 raters, refreshing can take between 1-3 minutes.

I've tried tweaking (and lowering) the config.nearest_neighbors, config.furthest_neighbors, and config.recommendations_to_store settings without seeing much improvement.

Obviously, 1-3 minutes will only get much longer as the number of raters gets closer to half a million. So here are a few questions:

First, I'm wondering if this situation is unique to me? Is this a typical situation?

Is there a way to limit the similar raters calculation to use a subset of the total raters in the system?

Are there any other places you can suggest I could look to decrease refreshing time?

I noticed a few places in the recommendable code where Redis operations are called inside a loop; is there any reason not to multi or pipeline in those situations?

Any direction you can offer is appreciated!
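The loop-vs-pipeline pattern described above can be sketched with a stand-in connection object. This is illustrative only: FakeRedis, the zscore calls, and the key name are hypothetical, not Recommendable's actual code, but they show why batching commands into one round trip matters.

```ruby
# Hypothetical stand-in for a Redis connection that counts network
# round trips, to show the cost of issuing one command per rater.
class FakeRedis
  attr_reader :round_trips

  def initialize
    @round_trips = 0
  end

  # One command issued directly = one round trip
  def zscore(_key, _member)
    @round_trips += 1
    1.0
  end

  # Pipelining: buffer the commands, then flush them in a single round trip
  def pipelined
    buffer = []
    proxy = Object.new
    proxy.define_singleton_method(:zscore) { |key, member| buffer << [key, member] }
    yield proxy
    @round_trips += 1
    buffer.map { 1.0 }
  end
end

rater_ids = (1..100).to_a

naive = FakeRedis.new
rater_ids.each { |id| naive.zscore('similarities', id) }   # 100 round trips

batched = FakeRedis.new
batched.pipelined do |pipe|
  rater_ids.each { |id| pipe.zscore('similarities', id) }  # 1 round trip
end

puts "naive: #{naive.round_trips}, pipelined: #{batched.round_trips}"
```

Recent versions of the real redis-rb client expose the same shape via redis.pipelined { |pipe| ... }, with replies returned as an array once the pipeline is flushed.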

@davidcelis
Owner

Any recommender system will see degraded performance as the data set grows, but I do suspect it's particularly bad with Recommendable. As for multi/pipelining, that's actually something I've wanted to try, but I simply haven't gotten around to it yet. It's the sort of thing that would be awesome to get a PR for, if you'd be willing to help out; I haven't really worked with Redis pipelining before.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

I'll pick through the code and see if I can't get a PR together to minimize Redis connections, although I suspect that will only marginally boost performance. To really handle large data sets, I'm inclined to think it might need to support something along the lines of set sampling.
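One hedged sketch of what set sampling might look like, using plain Ruby's Array#sample to stand in for a server-side command such as Redis's SRANDMEMBER (the variable names and sizes here are illustrative, not part of Recommendable):

```ruby
# Illustrative only: restrict the similarity pass to a random subset of
# raters instead of comparing the current rater against all of them.
all_rater_ids = (1..500_000).to_a
SAMPLE_SIZE = 1_000

# Array#sample draws distinct elements without replacement; Redis's
# SRANDMEMBER with a positive count behaves analogously server-side.
candidates = all_rater_ids.sample(SAMPLE_SIZE)

puts candidates.size  # 1000 raters to compare against, not 500,000
```

The trade-off, of course, is that sampled similarities are approximate: a rater's true nearest neighbors may simply not be in the sample.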

In the meantime, do you know of anyone successfully using Recommendable with larger data sets in the wild?

@davidcelis
Owner

If any of my users have large data sets, they haven't let me know... I think Recommendable definitely needs some performance improvements but, truth be told, I'm not really up to date on the computer science behind really performant and scalable recommender systems. Stuff's not easy...

When I built Recommendable, my primary focus was a really nice and elegant interface, and I figured I'd improve the performance as I went along. I've done so bit by bit, but at this point I only have a few ideas for improving the current Redis logic, and like you said, I worry they'd only marginally boost performance. One of those ideas is using multi/pipelining; the other is putting all of that logic into a Lua script that Redis can run directly.

Though for the next major version I've been meaning to support alternate ways (with a pluggable system) of generating similarity values and recommendations that don't rely on Redis.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Here's a first stab at it. I'll keep plugging away and attempt some benchmarking.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Surprisingly, pipelining has yielded a significant boost in performance (at least on my data set).

Updating similarities and recommendations 10 times
Without pipelining:

       user     system      total        real
 836.470000 113.540000 950.010000 (953.499563)

With pipelining:

       user     system      total        real
 200.050000  48.630000 248.680000 (250.104407)
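For reference, columns like the ones above come from Ruby's standard Benchmark module. A minimal sketch of how such a measurement is produced (the update_similarities body here is a placeholder for the real refresh work, not Recommendable's code):

```ruby
require 'benchmark'

# Placeholder for the actual similarity/recommendation refresh being timed
def update_similarities
  (1..100_000).reduce(:+)
end

# Benchmark.measure returns a Benchmark::Tms with user, system,
# total, and (real) wall-clock times, matching the columns above.
tms = Benchmark.measure { 10.times { update_similarities } }
puts tms
```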

@davidcelis
Owner

Daaaaang please PR that.

@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

@mhuggins it would seem that way, but the tests fail without it. The reason is that sometimes there are no members in the liked_by_set/disliked_by_set, which means the similarity values array is empty.

[].map(&:to_f) returns the same empty array, but [].reduce(&:+) returns nil. So we need that to_f to ensure that, in the case of empty liked sets, we end up with 0.0. It threw me for a loop when I was first writing that line of code.
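The empty-array edge case in question, spelled out:

```ruby
liked_scores = []

# map on an empty array is still an empty array...
mapped = liked_scores.map(&:to_f)     # []

# ...but reduce with no initial value returns nil for an empty array
sum = mapped.reduce(&:+)              # nil

# so the trailing to_f coerces nil to 0.0 when the liked sets are empty
similarity = mapped.reduce(&:+).to_f  # 0.0

# with a non-empty array, to_f is a harmless no-op on the Float sum
puts [0.5, 0.25].reduce(&:+).to_f     # 0.75
```

An alternative that avoids the nil entirely is reduce(0.0, :+), which supplies an explicit initial value.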

@davidcelis Isn't this already a PR?

@davidcelis
Owner

Oh it is. Crazy. I forgot you could turn issues into a PR. I'll take a closer look and probably merge when I'm not on a bus!

@mhuggins

@ajjahn makes sense RE: to_f :)

@davidcelis
Owner

🎆

davidcelis added a commit that referenced this pull request Apr 23, 2014
Scaling/Performance with a large number of raters
@davidcelis davidcelis merged commit 03e3d80 into davidcelis:master Apr 23, 2014
@ajjahn ajjahn deleted the optimize branch April 23, 2014 17:37
@ajjahn
Contributor Author

ajjahn commented Apr 23, 2014

Rad! Thanks! I'll keep digging for other ways to optimize, but I'd say this was a good start.
