ENH: add arrow engine to read_csv #31817
Conversation
It appears that the arrow engine does better on larger datasets. For example, on scikit-learn's diabetes dataset (24 KB vs the 3 KB iris), performance was ...
You need to hook into the test framework via the fixture all_parsers, which exhaustively tests things. This parser likely has a very narrow case where it actually passes.

Secondly, on the performance comparisons, please show the ASVs for this. Again, this might be slightly more performant than the C parser, but it's important to narrow down when and in what cases.
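A benchmark of the kind the reviewer is asking for could be sketched as an asv-style class. This is a hypothetical sketch, not one of pandas' actual `asv_bench` benchmarks; the class name and CSV shape are arbitrary, and only the `params`/`setup`/`time_*` conventions come from asv itself.

```python
from io import StringIO

import numpy as np
import pandas as pd


class ReadCSVEngines:
    # asv runs each time_* method once per entry in params
    params = ["c", "python"]  # "pyarrow" could be added where available
    param_names = ["engine"]

    def setup(self, engine):
        # build the CSV payload once, outside the timed region
        df = pd.DataFrame(np.random.randn(10_000, 5), columns=list("abcde"))
        self.data = df.to_csv(index=False)

    def time_read_csv(self, engine):
        pd.read_csv(StringIO(self.data), engine=engine)
```

Running `asv` over such a class would show per-engine timings across data sizes, which is what "narrow down when and in what cases" requires.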
Thanks for working on this!
There are a few things where pyarrow has different defaults, for example which values to recognize as NA values (but this is configurable). Do we want to pass our default values as options to pyarrow by default (so the different engines in pandas are consistent on that), or are we fine with following pyarrow's defaults?
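Forwarding pandas' defaults would look roughly like the sketch below. This is an illustration of the idea, not the PR's actual code: `STR_NA_VALUES` is a pandas-internal constant (its import location is an implementation detail), and `null_values`/`strings_can_be_null` are the corresponding `pyarrow.csv.ConvertOptions` keywords.

```python
# Sketch: build pyarrow ConvertOptions kwargs from pandas' default NA strings,
# so both engines agree on what counts as missing.
from pandas._libs.parsers import STR_NA_VALUES  # internal; location may change

convert_options = {
    "null_values": sorted(STR_NA_VALUES),  # pandas' default NA sentinels
    "strings_can_be_null": True,           # let string columns contain nulls
}
# These kwargs would then be passed to pyarrow.csv.ConvertOptions(**convert_options)
# when constructing the pyarrow reader.
```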
Put up what you want and I will have a look.
@lithomas1 can you move the release note to 1.2

@lithomas1 Can you merge master to fix the conflict?
I would like to see this move forward, and can also put some effort into updating it (in case @lithomas1 has no time in the short term). @jreback can you clarify what exactly you don't like about the current ...
This needs a merge in master and a green build; will have a look.
Yes, it needs to be updated with the latest master, but in the meantime it would still be useful to understand your concern about the fixture (as I asked in #31817 (comment) already). So can you clarify your "it's very hard to grok the way you are using the fixtures" (see also @lithomas1's answers to your comments at #31817 (comment))?
The testing method I think is easily fixed; see my comments.

This needs to be integrated into the very different option validation structure that exists in master.

Also, the actual processing of the arrow engine is pretty awkward.
engine : {``'c'``, ``'pyarrow'``, ``'python'``}
    Parser engine to use. In terms of performance, the pyarrow engine,
    which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which
    is faster than the python engine. However, the pyarrow and C engines
add a ``versionchanged`` 1.2 tag here
possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported
options include:

Currently, pandas supports using three engines, the C engine, the python engine,
add a versionchanged 1.2 tag here
the pyarrow engine is much less robust than the C engine, which in turn lacks a
couple of features present in the Python parser.

Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall
We might want to refactor this entire section to provide a more table-like comparison of all of the parsers; could you create an issue for this?
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -271,6 +272,14 @@ change, as ``fsspec`` will still bring in the same packages as before.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/

read_csv() now accepts pyarrow as an engine
move to 1.2
@@ -93,10 +100,16 @@ def import_optional_dependency(
            raise ImportError(msg) from None
        else:
            return None

    minimum_version = VERSIONS.get(name)
    # Handle submodules: if we have a submodule, grab the parent module from sys.modules
why is all this needed?
This has been answered before: #31817 (comment) (and the above comment was added based on your comment).

It's to 1) import a submodule (`pyarrow.csv` in this case) and 2) support passing a different version than the one in our global minimum versions dictionary.
Now I suppose that the submodule importing is not necessarily needed. Right now this PR does:

```python
csv = import_optional_dependency("pyarrow.csv", min_version="0.15")
```

but I suppose this could also be:

```python
import_optional_dependency("pyarrow", min_version="0.15")
from pyarrow import csv
```

And then this additional code to directly import a submodule with `import_optional_dependency` is not needed (although where it is used, I think it is a bit cleaner to be able to directly import the submodule).
@jorisvandenbossche importing as a submodule is required; you can't access the csv module by doing `pyarrow.csv` as far as I remember, and if you do `import pyarrow.csv`, then it won't validate the version and will not error for pyarrow<0.15.
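The submodule mechanics being discussed can be demonstrated with a stdlib package instead of pyarrow (the behaviour is the same import-machinery rule): importing a parent package alone does not load or expose its submodules, so an explicit dotted import is needed, which is why the PR teaches `import_optional_dependency` about dotted names.

```python
import importlib
import sys

# Start from a clean state for the demonstration.
sys.modules.pop("xml.dom", None)
import xml  # imports only the parent package

if hasattr(xml, "dom"):       # scrub any earlier import in this process
    delattr(xml, "dom")

has_submodule_before = hasattr(xml, "dom")  # submodule not loaded yet

importlib.import_module("xml.dom")          # analogous to `import pyarrow.csv`

has_submodule_after = hasattr(xml, "dom")   # now accessible as xml.dom
```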
        return content.encode(self.encoding)


class ArrowParserWrapper(ParserBase):
You will need to refactor this, as the current code is very different from this.

Also, I really don't like doing all of this validation in a single function.
> you will need to refactor this as the current code is very different from this.

Can you clarify a bit more what you mean? Or point to recent changes related to this?
For example, also on master the C parser is using a very similar mechanism with the CParserWrapper class.
Or do you only mean that it needs to split some validation into separate methods (as you indicate in a comment below as well; fully agreed with that)?
read_options["skip_rows"] = skiprows
read_options["autogenerate_column_names"] = True
read_options = pyarrow.ReadOptions(**read_options)
table = pyarrow.read_csv(
please add line breaks between sections and comments.
parse_options = {k: v for k, v in kwdscopy.items() if k in parseoptions}
convert_options = {k: v for k, v in kwdscopy.items() if k in convertoptions}
headerexists = True if self.header is not None else False
The whole structure of this class is really odd. The class should do construction, option validation, reading, and post-processing by calling methods.
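The shape the reviewer is asking for could be sketched like this. Everything here is hypothetical: the class name, the particular options checked, and the placeholder methods are illustrations of the "small methods called from `read`" structure, not pandas' actual `ArrowParserWrapper`.

```python
class ArrowParserWrapperSketch:
    """Construction, validation, reading, and post-processing as separate steps."""

    def __init__(self, src, **kwds):
        self.src = src
        self.kwds = kwds
        self._validate_options()          # construction triggers validation only

    def _validate_options(self):
        # reject options the pyarrow engine cannot honor (example set)
        unsupported = {"skipfooter", "chunksize"} & set(self.kwds)
        if any(self.kwds.get(k) for k in unsupported):
            raise ValueError(
                f"unsupported options for pyarrow engine: {sorted(unsupported)}"
            )

    def _read_table(self):
        # would call pyarrow.csv.read_csv(self.src, read_options=..., ...)
        raise NotImplementedError

    def _finalize(self, table):
        # would convert the Arrow table to a DataFrame and apply
        # index/dtype/column post-processing
        raise NotImplementedError

    def read(self):
        return self._finalize(self._read_table())
```

Each step is then independently testable and readable, which addresses the "all validation in a single function" complaint.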
pandas/io/parsers.py
Outdated
@@ -3400,7 +3570,7 @@ def _isindex(colspec):
         colspec = orig_names[colspec]
         if _isindex(colspec):
             continue
-        data_dict[colspec] = converter(data_dict[colspec])
+        data_dict[colspec] = converter(np.array(data_dict[colspec]))
use np.asarray
why do you need this?
converter doesn't work on pandas Series, only on np arrays, if I remember correctly. The alternative would be to convert the dataframe to a dict of np arrays or something like that, which hurts perf a lot. This has much less impact and also doesn't hurt perf in the non-pyarrow case.
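The reason for the `np.asarray` suggestion: `np.array` copies its input by default, while `np.asarray` is a no-op for an existing ndarray, so the non-pyarrow path pays nothing extra; for a pandas Series it extracts the underlying ndarray, which is the form the converters expect. A small demonstration:

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
no_copy = np.asarray(arr) is arr      # True: same object, nothing copied
copied = np.array(arr) is arr         # False: np.array made a fresh copy

s = pd.Series([1.0, 2.0, 3.0])
vals = np.asarray(s)                  # ndarray of the Series' values,
                                      # usable by converters that need arrays
```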
        parser._engine = MyCParserWrapper(StringIO(data), **parser.options)

        result = parser.read()
        tm.assert_frame_equal(result, expected)


-def test_empty_decimal_marker(all_parsers):
+def test_empty_decimal_marker(all_parsers, pyarrow_xfail):
Instead of this, why don't you just define a new fixture (all_parsers_xpyarrow or something) and then just change the inputs to all of these functions? Otherwise this is awkward and unmaintainable.
And this `all_parsers_xpyarrow` fixture would be "all parsers but without pyarrow"?
If so, how is that necessarily more maintainable? You still need to update each test function that doesn't support pyarrow to change `all_parsers` to `all_parsers_xpyarrow`, while with the current approach you need to update this to `all_parsers, pyarrow_xfail`. That doesn't seem a big difference?
I might have a way to make this clearer.

At the moment the fixture is xfailing tests explicitly and not using an xfail marker. I have changed this, but there are some tests that seem to lock up, so we also need a skip fixture.

Am working through the failing/timing-out tests; will do a WIP commit for discussion. (Just revert if not happy.)
Hmm, frequent lock-ups on a full test run of tests/io/parser/, so several xfails changed to skips.
Hopefully this won't time out, and the decorators make it easier to review which functionality should be working.
macOS py37_macos didn't time out this time around, but it looks like Windows and Linux still have a problem.
@simonjayhawkins yeah, I had problems with it getting stuck on my computer too. I think the tests are actually failing but pytest just gets stuck or something. I think that some of the CI machines don't have pyarrow on them and skip the pyarrow tests, which is probably why they don't fail. Does it work with the xfail when you run that single test?
@jreback thanks a lot for the review! I added some comments/questions for further clarification.
pandas/io/parsers.py
Outdated
    currently more feature-complete.
engine : {{'c', 'python', 'pyarrow'}}, optional
    Parser engine to use. The C and pyarrow engines are faster, while the python engine
    is currently more feature-complete. The pyarrow engine requires ``pyarrow`` >= 0.15
Since our current project-wide min is 0.15, this can probably be removed.
Hi all, I still don't think I have the time/energy to finish this one, but I can commit the WIP code that I have and provide some guidance on the code if necessary.
Continued in #38370
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
ASVs: 100,000 rows, 5 columns, reading from BytesIO & StringIO buffers (running on a 2-core machine).