DOC, ENH: Support memory_map for Python engine #13381

gfyoung · 2016-06-06T14:21:05Z

Title is self-explanatory.

jreback · 2016-06-06T14:23:54Z

doc/source/io.rst

@@ -193,6 +193,10 @@ use_unsigned : boolean, default False

  If integer columns are being compacted (i.e. ``compact_ints=True``), specify whether
  the column should be compacted to the smallest signed or unsigned integer dtype.
+memory_map : boolean, default False


Is the docstring updated?

Yes? I just added it.

codecov-io · 2016-06-07T01:54:38Z

Current coverage is 84.24%

Merging #13381 into master will increase coverage by 0.01%

@@             master     #13381   diff @@
==========================================
  Files           138        138          
  Lines         50763      50775    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42756      42773    +17   
+ Misses         8007       8002     -5   
  Partials          0          0

Powered by Codecov. Last updated by 158ae5b...1f6aae9

shoyer · 2016-06-07T06:52:36Z

pandas/io/common.py

+    if memory_map and hasattr(f, 'fileno'):
+        try:
+            f = MMapWrapper(f)
+        except:  # fallback: leave file handler as is


Can you make this more specific? Which exceptions do you need to catch here?

It's really whatever can be thrown by the mmap constructor, and while I could certainly guess what some of them might be, I'm not really sure how safe that is. This is also a private method, so I wouldn't foresee much abuse of this catch-all statement.

I agree with @shoyer here, I think you shouldn't be catching things. Let the exception bubble up; this is a low-level handler and shouldn't hide things.

Tests would help narrow this down! E.g. the first test should be a nonexistent file, the second a file-like object which cannot be memory mapped, e.g. StringIO (I think).

That's inconsistent with what's done on the C engine side, where if the lower-level function calls fail (for whatever reason), we just return NULL for the source as you can see here. Note that you can pass in StringIO to read_csv with memory_map=True and nothing blows up on the C engine side.

If you're referring to the file object, I think I am already (I wrote a parameters section for that reason). Or are you referring to the catch-all try-except block?

@gfyoung If you scroll down in parser.pyx a little further, you'll notice that if the pointer is NULL we raise IOError.

I don't think we should ignore arbitrary errors, certainly not if this is disabled by default. Even then, never use blanket except: clauses -- except Exception: is much safer.

@shoyer : that's not the right NULL - this is the one I'm referring to here. The one you're pointing at is when all other attempts to create a source have failed.

OK, I guess you're right. Still, let's only catch Exception and make a note about why we do this.

I'm not sure that we need to consider the current "catch anything" behavior as API but I guess it's better safe than sorry.

However, I have no issues with adding except Exception, so I'll do that.
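
For readers following the thread, here is a minimal sketch of the fallback pattern being discussed, using `except Exception` rather than a bare `except:`. It is not the code merged in this PR: the helper name `maybe_memory_map` is hypothetical, and it calls `mmap.mmap` directly instead of going through the PR's `MMapWrapper`.

```python
import io
import mmap
import tempfile


def maybe_memory_map(f, memory_map=False):
    """Return a read-only mmap of ``f`` if possible, else ``f`` unchanged.

    Hypothetical helper for illustration only.
    """
    if memory_map and hasattr(f, "fileno"):
        try:
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except Exception:
            # Deliberate fallback: objects such as StringIO expose fileno()
            # but cannot be mapped, so leave the handle as it is.
            # ``except Exception`` still lets KeyboardInterrupt and
            # SystemExit propagate, unlike a bare ``except:``.
            pass
    return f


# An in-memory buffer has no real file descriptor, so it comes back unchanged.
buf = io.StringIO("a,b\n1,2\n")
assert maybe_memory_map(buf, memory_map=True) is buf

# A real file on disk can be mapped.
with tempfile.TemporaryFile() as fh:
    fh.write(b"a,b\n1,2\n")
    fh.flush()
    mapped = maybe_memory_map(fh, memory_map=True)
    assert isinstance(mapped, mmap.mmap)
    first_line = mapped.readline()
    mapped.close()
```

The comment inside the `except` block is the "note about why we do this" that the review asks for.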

shoyer · 2016-06-07T06:55:14Z

Do you have any benchmarks showing positive performance gains?

If this is an unambiguous win, we should enable it by default rather than adding yet another option.

gfyoung · 2016-06-07T08:38:51Z

I can run benchmarks, but remember that this is, AFAICT, a performance vs. memory tradeoff given what memory_map does. I am inclined to leave this as an option, though I am not sure how many people use it.

gfyoung · 2016-06-07T09:53:19Z

Ran basic %timeit comparisons with CSV files of 10000 rows and 1000000 rows, and the performance flip-flops between the two. I think the I/O overhead starts to count as the file sizes get larger and larger, but I don't know if there is a specific cutoff. I think this would indicate leaving it as an option until there are more conclusive results.
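
A comparison along these lines can be sketched with the standard library alone. This is not the benchmark that was run above: it times raw line iteration over a buffered file versus an `mmap`, not `read_csv`, and every function name here is hypothetical. Timings vary by machine and OS cache state, so no expected numbers are given.

```python
import mmap
import os
import tempfile
import timeit


def write_csv(path, n_rows):
    # Write a small three-column CSV with a header row.
    with open(path, "w") as out:
        out.write("a,b,c\n")
        for i in range(n_rows):
            out.write(f"{i},{i * 2},{i * 3}\n")


def count_lines_buffered(path):
    # Iterate lines through the ordinary buffered file object.
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_lines_mmap(path):
    # Iterate lines through a read-only memory map of the same file.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return sum(1 for _ in iter(mm.readline, b""))
        finally:
            mm.close()


def compare(n_rows, repeat=3):
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_csv(path, n_rows)
        buffered = timeit.repeat(
            lambda: count_lines_buffered(path), number=1, repeat=repeat
        )
        mapped = timeit.repeat(
            lambda: count_lines_mmap(path), number=1, repeat=repeat
        )
        return {
            "rows": n_rows,
            "lines": count_lines_mmap(path),
            "buffered_s": min(buffered),
            "mmap_s": min(mapped),
        }
    finally:
        os.remove(path)


result = compare(10_000)
```

Calling `compare` with both 10_000 and 1_000_000 rows gives a rough view of how the two access paths scale on a given machine.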

jreback · 2016-06-07T12:14:06Z

@shoyer this is an already existing option; @gfyoung is just documenting it. Point taken that it doesn't exist for the Python engine, but I think the consistency outweighs having to special-case options and think about which engine I am using.

jreback · 2016-06-07T12:42:44Z

pandas/io/common.py

+    def __getattr__(self, name):
+        return getattr(self.mmap, name)
+
+    def __next__(self):


If you are going to add this, then please set up some tests for it in io.common/tests/test_common.py

Tests added.
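
To illustrate what the `__getattr__` and `__next__` methods in the hunk above are for, here is a self-contained sketch of a wrapper that lets Python's `csv` module iterate over a memory-mapped file. It is an approximation written for this thread, not the PR's `MMapWrapper`: the class name `MMapLineIterator` and the `encoding` parameter are assumptions.

```python
import csv
import mmap
import tempfile


class MMapLineIterator:
    """Hypothetical wrapper that iterates over an mmap as text lines."""

    def __init__(self, f, encoding="utf-8"):
        self.mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self.encoding = encoding

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so everything
        # not defined here (read, seek, size, close, ...) is delegated.
        if name == "mmap":
            # Avoid infinite recursion if __init__ did not finish.
            raise AttributeError(name)
        return getattr(self.mmap, name)

    def __iter__(self):
        return self

    def __next__(self):
        line = self.mmap.readline()
        # mmap.readline() returns b"" at the end of the data instead of
        # raising, so translate that into StopIteration.
        if line == b"":
            raise StopIteration
        # csv.reader expects str in Python 3, while mmap yields bytes.
        return line.decode(self.encoding)


with tempfile.TemporaryFile() as fh:
    fh.write(b"a,b\n1,2\n3,4\n")
    fh.flush()
    wrapper = MMapLineIterator(fh)
    rows = list(csv.reader(wrapper))
    size = wrapper.size()  # delegated to the underlying mmap
    wrapper.close()        # also delegated
```

Tests for a wrapper like this would naturally cover the three behaviours shown: attribute delegation, line-by-line iteration, and stopping cleanly at the end of the mapped data.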

gfyoung · 2016-06-07T12:46:18Z

I would also add that this PR does provide support for memory_map in the Python engine via Python's mmap library.

shoyer · 2016-06-07T16:05:35Z

Just to clarify, this is already supported by the C engine via **kwargs, but is currently undocumented?

I'm OK with documenting it and adding it as another argument then -- we already have quite a few arguments for read_csv, adding another one won't make things much worse.

gfyoung · 2016-06-07T16:11:02Z

@shoyer : If you look at the signature here, you will see that it is already in the signature explicitly. However, no explanation of it is given. Also, read @jreback's comment above.

gfyoung · 2016-06-07T19:06:37Z

Travis was having build issues, so I'm rerunning tests. @jreback could you cancel this old build here?

jorisvandenbossche · 2016-06-08T08:37:04Z

pandas/io/parsers.py

+memory_map : boolean, default False
+    If a filepath is provided for `filepath_or_buffer`, map the file object
+    directly onto memory and access the data directly from there. Using this
+    option can improve performance because there is no longer any I/O overhead.
 Returns


Blank line is missing before 'Returns'

[ci skip]

gfyoung · 2016-06-08T10:05:11Z

@jreback : Added the except Exception block as @shoyer requested, and Travis is giving the green light. Ready to merge if there are no other concerns.

jreback · 2016-06-08T11:24:06Z

thanks!

gfyoung force-pushed the memory-map-python-engine branch from 65b128c to 8203fbd Compare June 6, 2016 14:22

jreback reviewed Jun 6, 2016

jreback added the IO CSV read_csv, to_csv label Jun 6, 2016

gfyoung force-pushed the memory-map-python-engine branch 3 times, most recently from 4c59975 to 1f6aae9 Compare June 7, 2016 00:03

shoyer reviewed Jun 7, 2016

jreback reviewed Jun 7, 2016

gfyoung force-pushed the memory-map-python-engine branch 2 times, most recently from 1526978 to 73e9ee8 Compare June 7, 2016 14:40

gfyoung force-pushed the memory-map-python-engine branch 2 times, most recently from f5e6f08 to 3a29331 Compare June 7, 2016 19:05

jorisvandenbossche reviewed Jun 8, 2016

DOC, ENH: Support memory_map for Python engine

5278fb5

[ci skip]

gfyoung force-pushed the memory-map-python-engine branch from 3a29331 to 5278fb5 Compare June 8, 2016 10:05

jreback added this to the 0.18.2 milestone Jun 8, 2016

jreback closed this in 5407249 Jun 8, 2016

gfyoung deleted the memory-map-python-engine branch June 8, 2016 11:24

kawochen mentioned this pull request Jun 8, 2016

ENH/DOC/CLN: Document arguments and reconcile C and Python engines for read_csv #12686

Open

22 tasks

jreback mentioned this pull request Jun 8, 2016

IO: memory_map kw in read_csv #7477

Closed
