Release GIL for Merge #13745

Open
mrocklin opened this issue Jul 22, 2016 · 9 comments
Labels
Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@mrocklin
Contributor

I think that the title says it all. The pd.merge function can be compute intensive and can benefit (I think) from parallel computing.

It does not currently appear to release the GIL: I can easily push my CPU to 100%, but no higher, when performing parallel joins.
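
As a rough sketch of the kind of workload I have in mind (frame sizes and names here are purely illustrative): running independent merges in threads still keeps total CPU pinned at about one core's worth.

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

left = pd.DataFrame({'key': list(range(1000)) * 1000, 'x': 1.0})
right = pd.DataFrame({'key': range(1000), 'y': 2.0})

def do_merge(_):
    # independent merges; if pd.merge released the GIL these could
    # overlap across four cores instead of serializing on one
    return pd.merge(left, right, on='key', how='inner')

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(do_merge, range(4)))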

@sinhrks added the Performance (Memory or execution speed performance) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Jul 22, 2016
@jorisvandenbossche
Member

@mrocklin do you have an example case of such an intensive merge where you do not see a speedup from parallelizing? On some examples I tried, I already see some speedup (though it can probably be improved).

For example, with

import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})   # 1,000,000 rows, 10 distinct keys
right = pd.DataFrame({'key': range(10), 'val': range(10)})  # small lookup table

I already see some speedup:

def f():
    left.merge(right, how='inner')

def g4():
    # serial baseline: the same merge, 4 times on one thread
    for i in range(4):
        f()

from pandas.util.testing import test_parallel

@test_parallel(num_threads=4)
def pg4():
    # parallel: the same merge, run simultaneously in 4 threads
    f()

In [21]: %timeit g4()
10 loops, best of 3: 149 ms per loop

In [22]: %timeit pg4()
10 loops, best of 3: 99.2 ms per loop

When I profile this merge operation (prof_merge3.out), the main operations that take time are (the numbers are for this specific example, but with others I see similar trends):

  • factorization (ca 36%) -> hashtable Factorizer -> this already releases the GIL where possible, I think
  • the actual inner join (ca 31%)
    • ca 2/3 of the time is spent in algos.groupsort_indexer -> this also already releases the GIL (code)
    • the remaining logic in the _join.inner_join function itself -> this could release the GIL further, but I think it is only ca 10% of the overall merge time
  • combining the results (ca 20%) -> comes down mainly to the take_1d/2d algos -> these also already release the GIL to some extent (at least the 1d ones; for some reason the 2d ones do not)

So from a first quick exploration, there are certainly some small improvements to be made, but it seems the bigger ones are already done (though with further analysis it may well be possible to improve things further).
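
For reference, a rough sketch of one way to reproduce this kind of profile yourself (the exact tool and output filename are just an example, not necessarily what I used):

import cProfile
import pstats
import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})

# profile a single merge and list the most expensive internal calls
cProfile.run("left.merge(right, how='inner')", 'prof_merge.out')
pstats.Stats('prof_merge.out').sort_stats('cumulative').print_stats(15)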

@mrocklin
Contributor Author

mrocklin commented Dec 2, 2016

OK. Let me come up with a few examples and get back to you. If, as you say, most of this is already done, then I'll be quite happy to be incorrect here :)

@jreback
Contributor

jreback commented Dec 2, 2016

FYI: jreback@a295e83

This makes factorization about 30% faster and releases the GIL in the core parts (but it currently breaks other stuff).
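
To sanity-check the effect, something along these lines should show whether threaded factorization scales (a rough sketch only; the array size is arbitrary and it reuses the test_parallel decorator from above):

import numpy as np
import pandas as pd
from pandas.util.testing import test_parallel

values = pd.Series(np.random.randint(0, 1000, size=1000000))

def fact():
    pd.factorize(values)

def serial4():
    # 4 factorizations back-to-back on one thread
    for _ in range(4):
        fact()

@test_parallel(num_threads=4)
def parallel4():
    # the same factorization, run concurrently in 4 threads
    fact()

# compare %timeit serial4() vs %timeit parallel4() in IPython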

@jorisvandenbossche
Member

But still, I only get a speedup of about a factor of 1.5 on 4 cores, so it is also not that impressive.

@jreback
Contributor

jreback commented Dec 2, 2016

@mrocklin I think that to make this a truly parallel merge, you would need to change the problem a bit, e.g. partition across workers, replicate the DataFrame, then concat?
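
Something like this sketch (threads here just for illustration; the chunk count and the assumption that right is small enough to replicate are mine):

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

def partitioned_merge(left, right, on, n_chunks=4):
    # partition the left frame, replicate the (small) right frame to each
    # worker, merge the pieces independently, then concat the results
    chunks = np.array_split(left, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        pieces = pool.map(lambda part: part.merge(right, on=on, how='inner'), chunks)
    return pd.concat(list(pieces), ignore_index=True)

With the GIL held inside each merge, threads gain little here; the same pattern with processes (or dask) sidesteps that.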

@mrocklin
Contributor Author

mrocklin commented Dec 2, 2016

@jreback yes, it could be that by operating on different dataframes we would have less memory contention and see larger speedups.

@jorisvandenbossche I'm hearing two things:

  1. We can get about a 50% speedup on 4 cores
  2. Most of the gains have already occurred

This raises the fundamental question: why isn't something closer to a 4x speedup possible? Is this a memory-hierarchy-bound operation?
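
Back-of-the-envelope with Amdahl's law, using a parallel fraction that is only my rough guess from the profile percentages above:

# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
p = 0.6   # assumed fraction of the merge that actually runs with the GIL released
n = 4     # threads
print(1 / ((1 - p) + p / n))   # ~1.8x -- not far from the ~1.5x observed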

@jreback
Contributor

jreback commented Dec 2, 2016

@jorisvandenbossche is your test with processes? or threads?

@mrocklin
Contributor Author

mrocklin commented Dec 2, 2016

from pandas.util.testing import test_parallel

@test_parallel(num_threads=4)
def pg4():
    f()

@jorisvandenbossche
Member

Yes, I was using the test_parallel decorator, so I was testing with threads.

I don't have much experience with this, but the GIL-free operations are spread throughout the merge (the full merge operation separately releases the GIL in potentially 5 or 6 different algos). Could that be a reason for overhead, less efficient use of multiple threads, and hence less speedup?

jreback added a commit to jreback/pandas that referenced this issue Dec 12, 2016
allows releasing the GIL on these dtypes

xref pandas-dev#13745
jreback added a commit that referenced this issue Dec 15, 2016
xref #13745

provides a modest speedup for all string hashing. The key thing is, it will release the GIL on more operations where this is possible (mainly factorize). Can be easily extended to value_counts() and .duplicated() (for strings).

Author: Jeff Reback <jeff@reback.net>

Closes #14859 from jreback/string and squashes the following commits:

98f46c2 [Jeff Reback] PERF: use StringHashTable for strings in factorizing