ENH: add arrow engine to read_csv #31817
Conversation
It appears that the arrow engine does better on larger datasets. For example, on scikit-learn's diabetes dataset (24 KB vs the 3 KB iris), performance was ...
You need to hook into the test framework via the fixture all_parsers, which exhaustively tests things. This parser likely has a very narrow case where it actually passes.

Secondly, on the performance comparisons, please show the ASVs for this. Again, this might be slightly more performant than the C parser, but it's important to narrow down when and in what cases.
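A benchmark of the kind the reviewer is asking for could be sketched as an asv-style class. This is a hypothetical sketch, not one of pandas' actual `asv_bench` benchmarks; the class name and CSV shape are arbitrary, and only the `params`/`setup`/`time_*` conventions come from asv itself.

```python
from io import StringIO

import numpy as np
import pandas as pd


class ReadCSVEngines:
    # asv runs each time_* method once per entry in params
    params = ["c", "python"]  # "pyarrow" could be added where available
    param_names = ["engine"]

    def setup(self, engine):
        # build the CSV payload once, outside the timed region
        df = pd.DataFrame(np.random.randn(10_000, 5), columns=list("abcde"))
        self.data = df.to_csv(index=False)

    def time_read_csv(self, engine):
        pd.read_csv(StringIO(self.data), engine=engine)
```

Running `asv` over such a class would show per-engine timings across data sizes, which is what "narrow down when and in what cases" requires.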
Thanks for working on this!
There are a few things where pyarrow has different defaults, for example which values to recognize as NA values (but this is configurable). Do we want to pass our default values as options to pyarrow by default (so the different engines in pandas are consistent on that), or are we fine with following pyarrow's defaults?
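Forwarding pandas' defaults would look roughly like the sketch below. This is an illustration of the idea, not the PR's actual code: `STR_NA_VALUES` is a pandas-internal constant (its import location is an implementation detail), and `null_values`/`strings_can_be_null` are the corresponding `pyarrow.csv.ConvertOptions` keywords.

```python
# Sketch: build pyarrow ConvertOptions kwargs from pandas' default NA strings,
# so both engines agree on what counts as missing.
from pandas._libs.parsers import STR_NA_VALUES  # internal; location may change

convert_options = {
    "null_values": sorted(STR_NA_VALUES),  # pandas' default NA sentinels
    "strings_can_be_null": True,           # let string columns contain nulls
}
# These kwargs would then be passed to pyarrow.csv.ConvertOptions(**convert_options)
# when constructing the pyarrow reader.
```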
Put up what you want and I will have a look.
@lithomas1 can you move the release note to 1.2

@lithomas1 Can you merge master to fix the conflict?
I would like to see this move forward, and can also put some effort into updating it (in case @lithomas1 has no time in the short term). @jreback can you clarify what exactly you don't like about the current ...
This needs a merge in master and a green build; will have a look.
Yes, it needs to be updated with the latest master, but in the meantime it would still be useful to understand your concern about the fixture (as I asked in #31817 (comment) already). So can you clarify your "it's very hard to grok the way you are using the fixtures" (see also @lithomas1's answers to your comments at #31817 (comment))?
The testing method I think is easily fixed; see my comments.

This needs to be integrated into the very different option validation structure that exists in master.

Also, the actual processing of the arrow engine is pretty awkward.
engine : {``'c'``, ``'pyarrow'``, ``'python'``}
    Parser engine to use. In terms of performance, the pyarrow engine,
    which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which
    is faster than the python engine. However, the pyarrow and C engines
add a ``versionchanged`` 1.2 tag here
possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported
options include:

Currently, pandas supports using three engines, the C engine, the python engine,
add a versionchanged 1.2 tag here
the pyarrow engine is much less robust than the C engine, which in turn lacks a
couple of features present in the Python parser.

Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall
We might want to refactor this entire section to provide a more table-like comparison of all of the parsers; could you create an issue for this?
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -271,6 +272,14 @@ change, as ``fsspec`` will still bring in the same packages as before.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/

read_csv() now accepts pyarrow as an engine
move to 1.2
@@ -93,10 +100,16 @@ def import_optional_dependency(
            raise ImportError(msg) from None
        else:
            return None

    minimum_version = VERSIONS.get(name)
    # Handle submodules: if we have a submodule, grab the parent module from sys.modules
why is all this needed?
This has been answered before: #31817 (comment) (and the above comment was added based on your comment).

It's to 1) import a submodule (`pyarrow.csv` in this case) and 2) support passing a different version than the one in our global minimum versions dictionary.
Now I suppose that the submodule importing is not necessarily needed. Right now this PR does:

```python
csv = import_optional_dependency("pyarrow.csv", min_version="0.15")
```

but I suppose this could also be:

```python
import_optional_dependency("pyarrow", min_version="0.15")
from pyarrow import csv
```

And then this additional code to directly import a submodule with `import_optional_dependency` is not needed (although where it is used, I think it is a bit cleaner to be able to directly import the submodule).
@jorisvandenbossche importing as a submodule is required; you can't access the csv module by doing `pyarrow.csv` as far as I remember, and if you do `import pyarrow.csv`, then it won't validate the version and will not error for pyarrow<0.15.
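The submodule mechanics being discussed can be demonstrated with a stdlib package instead of pyarrow (the behaviour is the same import-machinery rule): importing a parent package alone does not load or expose its submodules, so an explicit dotted import is needed, which is why the PR teaches `import_optional_dependency` about dotted names.

```python
import importlib
import sys

# Start from a clean state for the demonstration.
sys.modules.pop("xml.dom", None)
import xml  # imports only the parent package

if hasattr(xml, "dom"):       # scrub any earlier import in this process
    delattr(xml, "dom")

has_submodule_before = hasattr(xml, "dom")  # submodule not loaded yet

importlib.import_module("xml.dom")          # analogous to `import pyarrow.csv`

has_submodule_after = hasattr(xml, "dom")   # now accessible as xml.dom
```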
        return content.encode(self.encoding)


class ArrowParserWrapper(ParserBase):
You will need to refactor this, as the current code is very different from this.

Also, I really don't like doing all of this validation in a single function.
> you will need to refactor this as the current code is very different from this.

Can you clarify a bit more what you mean? Or point to recent changes related to this?
For example, also on master the C parser is using a very similar mechanism with the CParserWrapper class.
Or do you only mean that it needs to split some validation into separate methods (as you indicate in a comment below as well; fully agreed with that)?
read_options["skip_rows"] = skiprows
read_options["autogenerate_column_names"] = True
read_options = pyarrow.ReadOptions(**read_options)
table = pyarrow.read_csv(
please add line breaks between sections and comments.
parse_options = {k: v for k, v in kwdscopy.items() if k in parseoptions}
convert_options = {k: v for k, v in kwdscopy.items() if k in convertoptions}
headerexists = True if self.header is not None else False
The whole structure of this class is really odd. The class should do construction, option validation, reading, and post-processing by calling methods.
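The shape the reviewer is asking for could be sketched like this. Everything here is hypothetical: the class name, the particular options checked, and the placeholder methods are illustrations of the "small methods called from `read`" structure, not pandas' actual `ArrowParserWrapper`.

```python
class ArrowParserWrapperSketch:
    """Construction, validation, reading, and post-processing as separate steps."""

    def __init__(self, src, **kwds):
        self.src = src
        self.kwds = kwds
        self._validate_options()          # construction triggers validation only

    def _validate_options(self):
        # reject options the pyarrow engine cannot honor (example set)
        unsupported = {"skipfooter", "chunksize"} & set(self.kwds)
        if any(self.kwds.get(k) for k in unsupported):
            raise ValueError(
                f"unsupported options for pyarrow engine: {sorted(unsupported)}"
            )

    def _read_table(self):
        # would call pyarrow.csv.read_csv(self.src, read_options=..., ...)
        raise NotImplementedError

    def _finalize(self, table):
        # would convert the Arrow table to a DataFrame and apply
        # index/dtype/column post-processing
        raise NotImplementedError

    def read(self):
        return self._finalize(self._read_table())
```

Each step is then independently testable and readable, which addresses the "all validation in a single function" complaint.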
pandas/io/parsers.py
Outdated
@@ -3400,7 +3570,7 @@ def _isindex(colspec):
         colspec = orig_names[colspec]
         if _isindex(colspec):
             continue
-        data_dict[colspec] = converter(data_dict[colspec])
+        data_dict[colspec] = converter(np.array(data_dict[colspec]))
use np.asarray
why do you need this?
converter doesn't work on pandas Series, only on np arrays, if I remember correctly. The alternative would be to convert the dataframe to a dict of np arrays or something like that, which hurts perf a lot. This has much less impact and also doesn't hurt perf in the non-pyarrow case.
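The reason for the `np.asarray` suggestion: `np.array` copies its input by default, while `np.asarray` is a no-op for an existing ndarray, so the non-pyarrow path pays nothing extra; for a pandas Series it extracts the underlying ndarray, which is the form the converters expect. A small demonstration:

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
no_copy = np.asarray(arr) is arr      # True: same object, nothing copied
copied = np.array(arr) is arr         # False: np.array made a fresh copy

s = pd.Series([1.0, 2.0, 3.0])
vals = np.asarray(s)                  # ndarray of the Series' values,
                                      # usable by converters that need arrays
```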
        parser._engine = MyCParserWrapper(StringIO(data), **parser.options)

        result = parser.read()
        tm.assert_frame_equal(result, expected)


-def test_empty_decimal_marker(all_parsers):
+def test_empty_decimal_marker(all_parsers, pyarrow_xfail):
Instead of this, why don't you just define a new fixture (all_parsers_xpyarrow or something) and then just change the inputs to all of these functions? Otherwise this is awkward and unmaintainable.
And this `all_parsers_xpyarrow` fixture would be "all parsers but without pyarrow"?
If so, how is that necessarily more maintainable? You still need to update each test function that doesn't support pyarrow to change `all_parsers` to `all_parsers_xpyarrow`, while with the current approach you need to update this to `all_parsers, pyarrow_xfail`. That doesn't seem a big difference?
I might have a way to make this clearer.

At the moment the fixture is xfailing tests explicitly and not using an xfail marker. I have changed this, but there are some tests that seem to lock up, so we also need a skip fixture.

Am working through the failing/timing-out tests; will do a WIP commit for discussion. (Just revert if not happy.)
Hmm, frequent lock-ups on a full test run of tests/io/parser/, so several xfails changed to skips.
Hopefully this won't time out, and the decorators make it easier to review which functionality should be working.
macOS py37_macos didn't time out this time around, but it looks like Windows and Linux still have a problem.
@simonjayhawkins yeah, I had problems with it getting stuck on my computer too. I think the tests are actually failing but pytest just gets stuck or something. I think that some of the CI machines don't have pyarrow on them and skip the pyarrow tests, which is probably why they don't fail. Does it work with the xfail when you run that single test?
@jreback thanks a lot for the review! I added some comments/questions for further clarification.
pandas/io/parsers.py
Outdated
    currently more feature-complete.
engine : {{'c', 'python', 'pyarrow'}}, optional
    Parser engine to use. The C and pyarrow engines are faster, while the python engine
    is currently more feature-complete. The pyarrow engine requires ``pyarrow`` >= 0.15
Since our current project-wide min is 0.15, this can probably be removed.
Hi all, I still don't think I have the time/energy to finish this one, but I can commit the WIP code that I have and provide some guidance on the code if necessary.
Continued in #38370
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
ASVs: 100,000 rows, 5 columns, reading from BytesIO & StringIO buffers (running on a 2-core machine).