Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC, ENH: Support memory_map for Python engine #13381

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 6, 2016

Title is self-explanatory.

@gfyoung gfyoung force-pushed the memory-map-python-engine branch from 65b128c to 8203fbd Compare June 6, 2016 14:22
@@ -193,6 +193,10 @@ use_unsigned : boolean, default False

If integer columns are being compacted (i.e. ``compact_ints=True``), specify whether
the column should be compacted to the smallest signed or unsigned integer dtype.
memory_map : boolean, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is doc-string updated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes? I just added it.

@jreback jreback added the IO CSV read_csv, to_csv label Jun 6, 2016
@gfyoung gfyoung force-pushed the memory-map-python-engine branch 3 times, most recently from 4c59975 to 1f6aae9 Compare June 7, 2016 00:03
@codecov-io
Copy link

codecov-io commented Jun 7, 2016

Current coverage is 84.24%

Merging #13381 into master will increase coverage by 0.01%

@@             master     #13381   diff @@
==========================================
  Files           138        138          
  Lines         50763      50775    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42756      42773    +17   
+ Misses         8007       8002     -5   
  Partials          0          0          

Powered by Codecov. Last updated by 158ae5b...1f6aae9

if memory_map and hasattr(f, 'fileno'):
try:
f = MMapWrapper(f)
except: # fallback: leave file handler as is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this more specific? Which exceptions do you need to catch here?

Copy link
Member Author

@gfyoung gfyoung Jun 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really whatever can be thrown by the mmap constructor, and while I could certainly guess what some of them might be, not really sure how safe that is. This is also a private method, so I wouldn't foresee much abuse of this catch-all statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @shoyer here, I think you shouldn't be catching things. Let the exception bubble up; this is a low-level handler and shouldn't hide things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests would help narrow this down! e.g. first test should be a non-existant file, 2nd a file-like which cannot be memory mapped, e.g. StringIO (I think).

Copy link
Member Author

@gfyoung gfyoung Jun 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's inconsistent with what's done on the C engine side, where if the lower-level function calls fail (for whatever reason), we just return NULL for the source as you can see here. Note that you can pass in StringIO to read_csv with memory_map=True and nothing blows up on the C engine side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're referring to the file object, I think I am already (I wrote a parameters section for that reason). Or are you referring to the catch-all try-except block?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung If you scroll down in parser.pyx a little further, you notice if the pointer is NULL we raise IOError.

I don't think we should ignore arbitrary errors, certainly not if this is disabled by default. Even then, never use blanket except: clauses -- except Exception: is much safer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer : that's not the right NULL - this is the one I'm referring to here. The one you're pointing at is when all other attempts to create a source have failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I guess you're right. Still, let's only catch Exception and make a note about why we do this.

I'm not sure that we need to consider the current "catch anything" behavior as API but I guess it's better safe than sorry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I have no issues with adding except Exception, so I'll do that.

@shoyer
Copy link
Member

shoyer commented Jun 7, 2016

Do you have any benchmarks showing positive performance gains?

If this is an unambiguous win, we should enable it by default rather than adding yet another option.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 7, 2016

I can run benchmarks but remember this is AFAICT a performance vs memory tradeoff given what memory_map does. I am inclined to leave this still as an option, though how many people use this option I am not sure.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 7, 2016

Ran basic %timeit with CSV file sizes of 10000 rows and 1000000 rows, and the performance flip-flops between the two. I think the I/O overhead starts to count as the file sizes get larger and larger, but I don't know if there is a specific cutoff. I think this would indicate leaving it as an option until there are more conclusive results.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2016

@shoyer this is an already existing option, @gfyoung is just documenting. Though point taken that it doesn't exist for the python engine, but I think the consistency outweight having to special case options and think about which engine I am using.

def __getattr__(self, name):
return getattr(self.mmap, name)

def __next__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are going to add this, then pls setup some tests for it in io.common/tests/test_common.py

Copy link
Member Author

@gfyoung gfyoung Jun 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests added.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 7, 2016

I would also add that this PR does provide support for memory_map in the Python engine via Python's mmap library.

@gfyoung gfyoung force-pushed the memory-map-python-engine branch 2 times, most recently from 1526978 to 73e9ee8 Compare June 7, 2016 14:40
@shoyer
Copy link
Member

shoyer commented Jun 7, 2016

Just to clarify, this is already supported by the C engine via **kwargs, but is currently undocumented?

I'm OK with documenting it and adding it as another argument then -- we already have quite a few arguments for read_csv, adding another one won't make things much worse.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 7, 2016

@shoyer : If you look at the signature here, you will see that it is already in the signature explicitly. However, no explanation of it is given. Also, read @jreback comment above.

@gfyoung gfyoung force-pushed the memory-map-python-engine branch 2 times, most recently from f5e6f08 to 3a29331 Compare June 7, 2016 19:05
@gfyoung
Copy link
Member Author

gfyoung commented Jun 7, 2016

Travis was having build issues, so I'm rerunning tests. @jreback could you cancel this old build here?

memory_map : boolean, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
Returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blank line is missing before 'Returns'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the memory-map-python-engine branch from 3a29331 to 5278fb5 Compare June 8, 2016 10:05
@gfyoung
Copy link
Member Author

gfyoung commented Jun 8, 2016

@jreback : Added the except Exception block as @shoyer requested, and Travis is giving the green light. Ready to merge if there are no other concerns.

@jreback jreback added this to the 0.18.2 milestone Jun 8, 2016
@jreback jreback closed this in 5407249 Jun 8, 2016
@jreback
Copy link
Contributor

jreback commented Jun 8, 2016

thanks!

@gfyoung gfyoung deleted the memory-map-python-engine branch June 8, 2016 11:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants