SparseDataFrame loc multiple index is slow #14310
I guess. Why would this small difference actually matter in practice? If you are repeatedly indexing, that is completely non-performant in either case, and you should be indexing via a larger indexer instead. |
If you would like to investigate, see where the time goes, and propose a fix, that would be great. |
Indexing of a sparse dataframe looks to be slow because the elements are stored as multiple sparse series. It would be much faster if pandas stored the underlying elements as a coo_matrix instead, when all the columns are of the same datatype. @jreback Do you suppose SparseDataFrame would benefit from this optimization? |
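As a loose illustration of what is being proposed (not how pandas internals actually work), the sketch below contrasts per-column sparse storage with a single scipy coo_matrix holding the same data; the names and sizes are purely for demonstration:

```python
# Illustration only: column-wise sparse storage vs. one 2-D COO matrix for the
# same data. This mirrors the idea in the comment above, not pandas internals.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

dense = np.zeros((1000, 50))
dense[1, 2] = 3.0

# Column-wise: one sparse 1-D array per column (roughly the "multiple sparse series" layout).
columns = [pd.arrays.SparseArray(dense[:, j], fill_value=0.0) for j in range(dense.shape[1])]

# Single 2-D structure: one COO matrix of (row, col, value) triplets.
coo = coo_matrix(dense)
print(len(columns), coo.nnz)  # 50 separate column objects vs. one matrix with 1 stored value
```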
Python sparse dataframes are ridiculously slow when it comes to most operations vs sparse matrices.

Sparse matrix instantiation vs sparse dataframe instantiation from a numpy 2D array:

import numpy, pandas, timeit
from scipy.sparse import coo_matrix

d = numpy.zeros((10000, 10000))
d[1, 2] = 3
timeit.timeit('m = coo_matrix(d)', globals=globals(), number=1)
0.7182237296299819
timeit.timeit('df = pandas.DataFrame(d).to_sparse(0)', globals=globals(), number=1)
206.5695096827077

Sparse dataframe instantiation is about 280 times slower than sparse matrix instantiation.

Sparse matrix slicing vs sparse dataframe slicing:

r = m.tocsr()
timeit.timeit('r[:5,:].toarray()', globals=globals(), number=1)
0.0005268476787705367
timeit.timeit('df.iloc[:5,:]', globals=globals(), number=1)
'''
MEMORY EXCEPTION!!
python ended up consuming 6GB of my RAM
'''

I don't understand why this bug is marked a duplicate of "Row slicing of a sparse dataframe is too slow" (#17408). This bug (#14310) is about multi-row indexing being slow; #17408 is about the sparse dataframe being buggy and slow overall. |
@nesdis you are welcome to have a look. However, virtually all support for sparse would have to be community driven. This is not really well supported, though @kernc has done a very nice job fixing some things up. DataFrames are column based, so changing the internal data storage to coo would likely be a herculean task. If you need a scipy coo matrix, simply use one. The usecases are different. |
@jreback Can't the internal structure be changed only for the case of a sparse dataframe? The main advantage of a dataframe is the automatic reindexing and alignment based on labelled indexes for arithmetic operations; using a coo matrix means I have to re-index and align manually for my arithmetic. |
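A tiny example of the alignment point above (illustrative only): pandas aligns operands by label automatically, whereas with scipy matrices the caller must keep row/column order consistent by hand.

```python
import pandas as pd

# pandas aligns on the labelled index, regardless of row order.
a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "x"])
print(a + b)  # x -> 21.0, y -> 12.0 -- aligned by label, not by position
```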
@nesdis the current design of pandas would make this quite difficult. In any event, I would suspect your issues are related to how you are indexing. Indexing by a list a small number of times makes a slightly slower access pattern acceptable; repeated indexing (IOW in a loop) is an anti-pattern. |
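For what it's worth, a minimal sketch of the two access patterns being contrasted (the frame and labels here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10_000, 10)))
wanted = list(range(0, 1_000, 7))

# Anti-pattern: repeated single-label .loc calls inside a loop.
rows_slow = [df.loc[i] for i in wanted]

# Preferred: one lookup with the whole list of labels ("a larger indexer").
rows_fast = df.loc[wanted]
```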
@jreback I am not sure, but does pandas convert a 2d numpy array (used to instantiate a dataframe) into multiple 1d arrays, or does it keep the array as a single 2d block? If it is the latter, then treating the input data as a sparse 2d array block should be possible too (instead of multiple 1d series/arrays)? Yes, in my case the slicing happens inside a for loop to get a subset of values, and the sliced matrix/df is multiplied with other matrices to generate probabilities for a subset of elements. At the moment I have not gotten rid of the loop. |
Imho not general enough. The case when all columns are of the same dtype is quite uncommon in the space of all possible columns/dtypes. What would probably work, however, is a better instantiation of SparseDataFrame via coo_matrix; that path is only about twice as slow:

>>> %timeit coo_matrix(d)
858 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pd.SparseDataFrame(coo_matrix(d))
1.83 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Yes, pandas is a column-oriented storage. Some columns are consolidated into a single 2d block of the same dtype, but not sparse columns (see pandas/pandas/core/internals.py, line 2606 at commit 870b6a6).

Optimize tight loops by converting first to a format with less overhead: coo = sdf.to_coo() |
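A minimal sketch of that tight-loop advice, assuming an existing SparseDataFrame `sdf` and a placeholder per-chunk function `process()` (both hypothetical):

```python
# Convert once, outside the loop, then do all repeated slicing on the scipy side.
csr = sdf.to_coo().tocsr()           # one-time conversion; COO -> CSR for cheap row slicing

for start in range(0, csr.shape[0], 5):
    chunk = csr[start:start + 5, :]  # fast CSR row slice
    process(chunk)                   # e.g. multiply with other sparse matrices
```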
This affects DataFrame[sparse] as well, but I don't think there's an obvious fix. We aren't planning to adopt arbitrary 2D extension arrays, so this will always be slower. Would certainly welcome performance improvements given the current design though. |
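For current pandas, where SparseDataFrame has been removed in favour of sparse-dtype columns, a rough equivalent (a sketch of the workaround, not a fix for the underlying slowness) is to go through the `.sparse` accessor before any heavy row selection:

```python
import numpy as np
import pandas as pd

dense = np.zeros((10_000, 100))
dense[1, 2] = 3.0

# DataFrame with sparse-dtype columns (the DataFrame[sparse] case mentioned above).
df = pd.DataFrame(dense).astype(pd.SparseDtype("float64", fill_value=0.0))

# Requires every column to be sparse; the repeated slicing then happens on the scipy matrix.
csr = df.sparse.to_coo().tocsr()
subset = csr[:5, :].toarray()
```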
It seems that in a SparseDataFrame, selecting multiple rows as a dataframe takes much longer than it does in a DataFrame; I would expect it to run faster (on my 1M x 500 SparseDataFrame the loc operation takes seconds).
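A hedged reproduction sketch of this report, using the old pandas-0.18-era API from the environment listed below; the sizes are chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd
import timeit

dense = pd.DataFrame(np.zeros((100000, 50)))
sparse = dense.to_sparse(fill_value=0)   # old SparseDataFrame API
rows = list(range(0, 50000, 100))

# Selecting many rows by label: dense vs. sparse.
print(timeit.timeit(lambda: dense.loc[rows], number=1))
print(timeit.timeit(lambda: sparse.loc[rows], number=1))
```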
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 24.0.2
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: None
boto: None
pandas_datareader: None