Scaling/Performance with a large number of raters #90
Any recommender system will have degraded performance as the data set grows, but I do suspect that it's particularly not great with Recommendable. With regards to multi/pipelining, this has actually been something I've wanted to try, but I simply haven't gotten around to it yet... Though it's the sort of thing that would be awesome to get a PR for, if you'd be willing to help out with it? I haven't really worked with Redis pipelining before.
I'll pick through the code and see if I can't get a PR together to minimize Redis connections, although I suspect that will only marginally boost performance. I'm inclined to think that to really handle large data sets, it might need to support something along the lines of set sampling. In the meantime, do you know of anyone successfully using Recommendable with larger data sets in the wild?
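For illustration, one way "set sampling" could look against Redis is `SRANDMEMBER`, which returns a random subset of a set's members. A rough sketch with the redis-rb gem; the key name and sample size are made up here and are not part of Recommendable's actual API:

```ruby
require 'redis'

redis = Redis.new

# Hypothetical key; Recommendable's real key layout may differ.
liked_by_key = "recommendable:items:42:liked_by"

# SRANDMEMBER with a positive count returns up to that many distinct
# random members, so similarities could be computed against a sample
# instead of the full set.
sampled_raters = redis.srandmember(liked_by_key, 1_000)

# ...then compute similarity against sampled_raters rather than
# against redis.smembers(liked_by_key)
```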
If any of my users have large data sets, they haven't let me know... I think that Recommendable definitely needs some performance improvements but, truth be told, I'm not really up to date on the computer science behind really performant and scalable recommender systems. Stuff's not easy... When I built Recommendable, my primary focus was a really nice and elegant interface, and I figured I'd improve the performance as I went along. I've done so bit by bit, but at this point I only have a few ideas of how I could improve the current performance of the Redis logic, and like you said, I worry that they'd only marginally boost performance. One of those ideas is using multi/pipelining; the other is putting all of that logic into a Lua script that Redis can run directly. For the next major version, though, I've been meaning to support alternate, pluggable ways of generating similarity values and recommendations that don't rely on Redis.
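For reference, a minimal sketch of what pipelining looks like with the redis-rb client (the block-argument style of newer redis-rb versions); the keys and loop below are illustrative, not Recommendable's actual internals:

```ruby
require 'redis'

redis = Redis.new

# Hypothetical similarity lookup for one user against many others.
other_ids = redis.smembers("recommendable:users") # illustrative key

# Each command inside the block is queued and sent in a single batch,
# avoiding one network round trip per ZSCORE. `pipelined` returns the
# replies in order once the batch has executed.
scores = redis.pipelined do |pipe|
  other_ids.each do |id|
    pipe.zscore("recommendable:users:#{id}:similarities", "42")
  end
end
```

A `redis.multi` block batches commands the same way but additionally wraps them in MULTI/EXEC, so the batch also executes atomically.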
There's the first stab at it. I'll keep plugging away at it and attempt some benchmarking.
Surprisingly, pipelining has yielded a significant boost in performance (at least on my data set) when updating similarities and recommendations 10 times.

With pipelining: *(benchmark output not preserved)*
@mhuggins it would seem that way, but the tests fail without it. The reason is that sometimes there are no members in the liked_by_set/disliked_by_set, which would mean the similarity values array is empty.
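In other words, the guard covers the case where nobody has rated the item yet. A hypothetical illustration (not the actual patch) of why skipping the write matters; names like `similarity_for` are made up:

```ruby
# Hypothetical sketch; Recommendable's real keys and helpers differ.
liked_by    = redis.smembers("recommendable:items:7:liked_by")    # may be []
disliked_by = redis.smembers("recommendable:items:7:disliked_by") # may be []
raters      = liked_by | disliked_by

unless raters.empty?
  # ZADD with an empty member list raises an error, so only write when
  # there is at least one similarity value to store.
  scores = raters.map { |id| [similarity_for(id), id] }
  redis.zadd("recommendable:users:42:similarities", scores)
end
```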
@davidcelis Isn't this already a PR?
Oh it is. Crazy. I forgot you could turn issues into a PR. I'll take a closer look and probably merge when I'm not on a bus!
@ajjahn makes sense RE:
🎆
Rad! Thanks! I'll keep digging for other ways to optimize, but I'd say this was a good start.
With a small number of raters, refreshing similarities and recommendations is quite fast and works great. Unfortunately, my system needs to support more than half a million raters, and my refresh time for a single rater increases significantly with the total number of raters. Even at 10,000 raters, refreshing can take between 1 and 3 minutes.
I've tried tweaking (and lowering) the `config.nearest_neighbors`, `config.furthest_neighbors`, and `config.recommendations_to_store` settings without seeing much improvement (a configuration sketch follows the questions below). Obviously, 1-3 minutes will only get much longer as the number of raters gets closer to half a million. So here are a few questions:
First, I'm wondering if this situation is unique to me. Is this a typical situation?
Is there a way to limit the similar raters calculation to use a subset of the total raters in the system?
Are there any other places you can suggest I could look to decrease refreshing time?
I noticed a few places in the Recommendable code where Redis operations are called inside a loop; is there any reason not to `multi` or `pipeline` in those situations?

Any direction you can offer is appreciated!