-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use multithreading in basic operations #2647
Conversation
It's my understanding that the @sync for i = 1:Threads.nthreads()
Threads.@spawn expr
end Spawning individual tasks allows nested multithreading to all operate cooperatively, whereas if |
@quinnj - thank you for commenting - just pushed this in parallel after discussing the PR with @nalimilan. |
Some benchmarks: Machine 1 (faster, less memory)
Machine 2 (more memory but slower)
So minimum time is OK, but - probably expectedly - GC kills mean performance in threaded code. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like NEWS needs a conflict resolution
Here are the current benchmarks (code generating them is in
|
This should be done. Could you please comment that you have looked at the PR and all looks OK before merging (these things are more tricky than they seem). Thank you ! |
Looks good, but I wonder whether it would be worth introducing a All branches could be replaced with something like this, and the new_columns = tcollect(AbstractVector, (_columns(df)[i][selected_rows] for i in selected_columns),
threaded=length(selected_rows) >= 1_000_000)
return DataFrame(new_columns, idx, copycols=false) What do you think? |
OK - I will add |
I could not find a better solution that what is currently implemented without regression in performance. I will merge this PR tomorrow if no more comments are raised. |
Thank you! Of course code organization improvements are welcome - there would be a proposal that would not regress the performance please open a PR. |
With this PR operations that subset a data frame will use multi threading.
This should be non-breaking.
I am not sure if it is worth to use threading in operations that operate on short columns (or how costly it is to use it if there is only one thread in the Julia session). I will ask for advice on Slack.