SparseDataFrame loc multiple index is slow #14310

Closed
DeoLeung opened this issue Sep 28, 2016 · 10 comments · Fixed by #28425
Labels
Performance (memory or execution speed) · Sparse (sparse data type)

Comments

@DeoLeung

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 0, 0],[0,0,0]])

# non sparse
In [19]: %time df.loc[0]
CPU times: user 367 µs, sys: 14 µs, total: 381 µs
Wall time: 374 µs
Out[19]:
0    1
1    0
2    0
Name: 0, dtype: int64

In [20]: %time df.loc[[0]]
CPU times: user 808 µs, sys: 57 µs, total: 865 µs
Wall time: 827 µs
Out[20]:
   0  1  2
0  1  0  0

In [3]: df = df.to_sparse()

# sparse
In [5]: %time df.loc[0]
CPU times: user 429 µs, sys: 14 µs, total: 443 µs
Wall time: 436 µs
Out[5]:
0    1.0
1    0.0
2    0.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

In [6]: %time df.loc[[0]]
CPU times: user 4.07 ms, sys: 406 µs, total: 4.48 ms
Wall time: 6.16 ms
Out[6]:
     0    1    2
0  1.0  0.0  0.0

It seems that in a SparseDataFrame, selecting rows as a dataframe slows down far more than it does in a plain DataFrame; I would expect it to run faster. (On my 1m x 500 SparseDataFrame the loc operation takes seconds.)
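For reference, the same list-indexer path can still be exercised on modern pandas (SparseDataFrame was removed in 1.0) through the sparse dtype; a minimal sketch, assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

# Build a mostly-zero frame and convert each column to the sparse dtype.
dense = pd.DataFrame(np.eye(1000))
sparse = dense.astype(pd.SparseDtype("float64", 0.0))

scalar_row = sparse.loc[0]    # scalar label: returns a Series
list_row = sparse.loc[[0]]    # list label: the slow path discussed here
```

Timing the two calls with %timeit shows the same pattern as the original report.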

## INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 24.0.2
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: None
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Sep 28, 2016

I guess. Why would this small difference actually matter in practice? Repeatedly using a tiny indexer is non-performant in either case; you should be indexing via a larger indexer in one go.

@jreback added the Performance, Sparse and Difficulty Intermediate labels Sep 28, 2016
@jreback added this to the Next Major Release milestone Sep 28, 2016
@jreback
Contributor

jreback commented Sep 28, 2016

If you would like to investigate and see where / propose fix, then that would be great.

@nesdis

nesdis commented Aug 31, 2017

Indexing of a sparse dataframe looks to be slow because the elements are stored as multiple sparse series. It would be much faster if pandas stored the underlying elements as a coo_matrix instead when all the columns share the same dtype. @jreback do you suppose sparse dataframe would benefit from this optimization?
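To illustrate the suggestion (this is not how pandas stores data internally): when every column shares one dtype, the values already form a single 2-D block that scipy can ingest directly. A sketch, assuming scipy is available:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

df = pd.DataFrame(np.zeros((100, 50)))
df.iat[1, 2] = 3.0

# All columns are float64, so .values is one contiguous 2-D array
# that coo_matrix can consume without a per-column pass.
coo = coo_matrix(df.values)
```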

@nesdis

nesdis commented Sep 1, 2017

Python sparse dataframes are ridiculously slow at most operations compared with sparse matrices:

Sparse matrix instantiation vs Sparse dataframe instantiation from numpy 2darray:

import timeit
import numpy
import pandas
from scipy.sparse import coo_matrix

d = numpy.zeros((10000, 10000))
d[1, 2] = 3

timeit.timeit('m = coo_matrix(d)', globals=globals(), number=1)
0.7182237296299819

timeit.timeit('df = pandas.DataFrame(d).to_sparse(0)', globals=globals(), number=1)
206.5695096827077

Sparse dataframe instantiation is about 280 times slower than sparse matrix instantiation.

Sparse matrix slicing vs Sparse dataframe slicing

r = m.tocsr()
timeit.timeit('r[:5,:].toarray()', globals=globals(), number=1)
0.0005268476787705367

timeit.timeit('df.iloc[:5,:]', globals=globals(), number=1)

'''
MEMORY EXCEPTION!!
python ended up consuming 6GB of my RAM
'''

I don't understand why this bug is marked a duplicate of "Row slicing of a sparse dataframe is too slow" #17408.

This issue (#14310) is about multi-row indexing being slow; #17408 is about the sparse dataframe being buggy and slow overall.

@jreback
Contributor

jreback commented Sep 1, 2017

@nesdis you are welcome to have a look. However, virtually all support for sparse has to be community driven; it is not really well supported, though @kernc has done a very nice job fixing some things up.

DataFrames are column based, so changing the internal data storage to COO would likely be a herculean task. If you need a scipy COO matrix, simply use one; the use cases are different.

@nesdis

nesdis commented Sep 1, 2017

@jreback can't the internal structure be changed only for the sparse-dataframe case? The main advantage of a dataframe is the automatic reindexing and alignment based on labelled indexes for arithmetic operations; using coo means I have to re-index and align manually for my arithmetic.
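The label bookkeeping can be emulated by hand next to a scipy matrix; a minimal sketch (the `index` object and its labels are illustrative, not a pandas API):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

mat = csr_matrix(np.eye(4))
index = pd.Index(["a", "b", "c", "d"])  # row labels kept alongside the matrix

# Manual "loc": translate labels to integer positions, then slice positionally.
positions = index.get_indexer(["b", "d"])
sub = mat[positions, :].toarray()
```

This keeps slicing at scipy speed at the cost of doing the alignment yourself, which is exactly the trade-off being discussed.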

@jreback
Contributor

jreback commented Sep 1, 2017

@nesdis the current design of pandas would make this quite difficult. In any event, I suspect your issues are related to how you are indexing: indexing by a small list a handful of times makes a slightly slower access pattern acceptable, but repeated indexing (IOW, in a loop) is an anti-pattern.

@nesdis

nesdis commented Sep 1, 2017

@jreback I am not sure, but does pandas convert a 2d numpy array (used to instantiate a dataframe) into multiple 1d arrays, or does it refer to the array as a single 2d block? If the latter, then treating the input data as a single sparse 2d block should be possible too (instead of multiple 1d series/arrays).

Yes, in my case the slicing happens inside a for loop to get a subset of values; the sliced matrix/df is then multiplied with other matrices to generate probabilities for a subset of elements. At the moment I have not gotten rid of the loop.

@kernc
Contributor

kernc commented Sep 3, 2017

It would be much faster if pandas stored the underlying elements as a coo_matrix instead, when all the columns are of the same datatype.

Imho not general enough: the case where all columns are of the same dtype is quite uncommon in the space of all possible columns/dtypes. What would probably work, however, is SparseBlock collecting all columns of the same dtype and no longer sharing the limitations of NonConsolidatableMixIn. One would have to investigate. PRs are welcome.

Sparse dataframe instantiation is about 280 times slower vs sparse matrix

A better instantiation of SparseDataFrame would be via coo_matrix; just twice as slow:

>>> %timeit coo_matrix(d)
858 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.SparseDataFrame(coo_matrix(d))
1.83 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

does pandas convert 2d numpy array (which is used to instantiate a dataframe) into multiple 1d arrays

Yes, pandas uses column-oriented storage. Columns of the same dtype are consolidated into a single 2d block, but not SparseBlock:

class SparseBlock(NonConsolidatableMixIn, Block):

Optimize tight loops by converting first to a format with less overhead:

coo = sdf.to_coo()
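On current pandas the equivalent escape hatch is the `.sparse` accessor (available since 0.25, and only when every column is sparse); a hedged sketch of converting once and slicing cheaply inside the loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(500)).astype(pd.SparseDtype("float64", 0.0))

# Convert once, outside the loop; CSR supports fast row slicing.
csr = df.sparse.to_coo().tocsr()

total = 0.0
for start in range(0, 500, 100):
    total += csr[start:start + 5, :].sum()  # cheap positional row slices
```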

@TomAugspurger
Contributor

This affects DataFrame[sparse] as well, but I don't think there's an obvious fix. We aren't planning to adopt arbitrary 2D extension arrays, so this will always be slower.

Would certainly welcome performance improvements given the current design though.
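For context, "DataFrame[sparse]" here means an ordinary DataFrame holding sparse-dtype columns, the replacement for SparseDataFrame; a small sketch:

```python
import pandas as pd

# Each column is a 1-D SparseArray; there is no 2-D sparse block,
# which is why row-wise indexing stays on the generic take path.
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1]),
    "b": pd.arrays.SparseArray([0, 2, 0]),
})
sub = df.loc[[0, 2]]
```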
