DOC, ENH: Support memory_map for Python engine #13381
@@ -193,6 +193,10 @@ use_unsigned : boolean, default False
    If integer columns are being compacted (i.e. ``compact_ints=True``), specify whether
    the column should be compacted to the smallest signed or unsigned integer dtype.
memory_map : boolean, default False
is doc-string updated?
Yes? I just added it.
Current coverage is 84.24%
@@             master   #13381   diff @@
==========================================
  Files           138      138
  Lines         50763    50775    +12
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          42756    42773    +17
+ Misses         8007     8002     -5
  Partials          0        0
if memory_map and hasattr(f, 'fileno'):
    try:
        f = MMapWrapper(f)
    except:  # fallback: leave file handler as is
Can you make this more specific? Which exceptions do you need to catch here?
It's really whatever can be thrown by the `mmap` constructor, and while I could certainly guess what some of them might be, I'm not really sure how safe that is. This is also a private method, so I wouldn't foresee much abuse of this catch-all statement.
I agree with @shoyer here, I think you shouldn't be catching things. Let the exception bubble up; this is a low-level handler and shouldn't hide things.
Tests would help narrow this down! E.g. the first test should be a nonexistent file, the second a file-like object which cannot be memory mapped, e.g. `StringIO` (I think).
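The cases suggested above can be probed with the standard library alone. The following is a minimal sketch, not the tests that were added to pandas; it assumes the wrapper ultimately calls `mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)`, and the file contents are made up for illustration. It shows why a `hasattr(f, 'fileno')` check alone does not filter out `StringIO`: the attribute exists, but calling it raises.

```python
import io
import mmap
import os
import tempfile

# Case 1: an in-memory text buffer. It HAS a fileno attribute,
# but calling it raises io.UnsupportedOperation.
buf = io.StringIO("a,b\n1,2\n")
has_fileno = hasattr(buf, "fileno")
try:
    buf.fileno()
    buf_error = None
except io.UnsupportedOperation as exc:
    buf_error = exc

# Case 2: a real file on disk can be memory mapped.
# Written in binary mode so the bytes are identical on every platform.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    path = tmp.name

with open(path, "rb") as fh:
    mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    first_line = mapped.readline()
    mapped.close()
os.remove(path)

# Case 3: the file no longer exists, so open() fails before mmap is reached.
try:
    open(path, "rb")
    missing_error = None
except OSError as exc:
    missing_error = exc
```

Note that in the third case the error comes from `open`, not from `mmap`, so a test for a missing path exercises the file-opening code rather than the wrapper.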
That's inconsistent with what's done on the C engine side, where if the lower-level function calls fail (for whatever reason), we just return `NULL` for the source, as you can see here. Note that you can pass in `StringIO` to `read_csv` with `memory_map=True` and nothing blows up on the C engine side.
If you're referring to the `file` object, I think I am already (I wrote a parameters section for that reason). Or are you referring to the catch-all `try-except` block?
@gfyoung If you scroll down in `parser.pyx` a little further, you'll notice that if the pointer is `NULL` we raise `IOError`.
I don't think we should ignore arbitrary errors, certainly not if this is disabled by default. Even then, never use blanket `except:` clauses -- `except Exception:` is much safer.
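To make the difference between the two clauses concrete, here is a small illustrative sketch (the `classify` helper is hypothetical and not part of pandas). It relies only on Python's built-in exception hierarchy, in which `KeyboardInterrupt` and `SystemExit` derive from `BaseException` but not from `Exception`.

```python
def classify(exc_type):
    """Report which kind of handler catches an exception of the given type."""
    try:
        raise exc_type("simulated")
    except Exception:
        # Ordinary errors (OSError, ValueError, ...) land here.
        return "caught by 'except Exception:'"
    except:  # noqa: E722 - deliberately bare, for demonstration only
        # Only a bare clause (equivalent to 'except BaseException:') gets here.
        return "only caught by a bare 'except:'"


results = {
    exc_type.__name__: classify(exc_type)
    for exc_type in (OSError, ValueError, KeyboardInterrupt, SystemExit)
}
```

A blanket clause around the `mmap` call would therefore also swallow a Ctrl-C arriving at that moment, which is rarely what a fallback is meant to do.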
OK, I guess you're right. Still, let's only catch `Exception` and make a note about why we do this.
I'm not sure that we need to consider the current "catch anything" behavior as API, but I guess it's better safe than sorry.
However, I have no issues with adding `except Exception`, so I'll do that.
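The outcome agreed on in this thread could look roughly like the sketch below. It is an assumption-laden illustration, not the code that was merged: `maybe_memory_map` is a hypothetical helper name, and it calls `mmap.mmap` directly as a stand-in for the PR's `MMapWrapper`.

```python
import io
import mmap
import os
import tempfile


def maybe_memory_map(handle, memory_map=False):
    """Return a memory-mapped view of `handle`, or `handle` itself on failure."""
    if memory_map and hasattr(handle, "fileno"):
        try:
            return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        except Exception:
            # Fallback: leave the file handle as is. This mirrors the lenient
            # behavior described for the C engine in the discussion above,
            # while still letting KeyboardInterrupt and SystemExit propagate.
            pass
    return handle


# An in-memory buffer cannot be mapped, so it comes back unchanged.
buffer = io.BytesIO(b"x,y\n1,2\n")
unchanged = maybe_memory_map(buffer, memory_map=True)

# A real file is replaced by an mmap object.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x,y\n1,2\n")
    path = tmp.name

with open(path, "rb") as fh:
    mapped = maybe_memory_map(fh, memory_map=True)
    mapped_is_mmap = isinstance(mapped, mmap.mmap)
    mapped.close()
os.remove(path)
```

The comment inside the handler is the "note about why we do this" that the reviewers asked for; without it the silent fallback looks like an oversight.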
Do you have any benchmarks showing positive performance gains? If this is an unambiguous win, we should enable it by default rather than adding yet another option.
I can run benchmarks, but remember this is AFAICT a performance vs. memory tradeoff given what `memory_map` does. I am inclined to leave this as an option, though I am not sure how many people use it.
Ran basic
def __getattr__(self, name):
    return getattr(self.mmap, name)

def __next__(self):
If you are going to add this, then pls set up some tests for it in io.common/tests/test_common.py
Tests added.
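For readers who want to see how a wrapper with `__getattr__` and `__next__` fits together, here is a self-contained sketch. Only the two method names and the `self.mmap` attribute come from the diff above; the class name `LineMMap`, the constructor, the `__iter__` method, the body of `__next__`, and the UTF-8 decoding are assumptions made for illustration. The idea is that Python's `csv.reader` accepts any iterator that yields strings, while a raw `mmap` object yields bytes from `readline`.

```python
import csv
import mmap
import os
import tempfile


class LineMMap:
    """Hypothetical line iterator over a memory-mapped file."""

    def __init__(self, f):
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        self.mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself,
        # so everything else (size, seek, close, ...) is delegated to mmap.
        return getattr(self.mmap, name)

    def __iter__(self):
        return self

    def __next__(self):
        line = self.mmap.readline()
        if line == b"":  # readline returns empty bytes at end of file
            raise StopIteration
        return line.decode("utf-8")  # assumed encoding


with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    path = tmp.name

with open(path, "rb") as fh:
    wrapped = LineMMap(fh)
    rows = list(csv.reader(wrapped))  # uses __iter__ / __next__
    size = wrapped.size()             # delegated through __getattr__
    wrapped.close()
os.remove(path)
```

Tests for such a wrapper would typically cover both paths: iteration until `StopIteration`, and attribute delegation to the underlying `mmap` object.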
I would also add that this PR does provide support for
Just to clarify, this is already supported by the C engine via I'm OK with documenting it and adding it as another argument then -- we already have quite a few arguments for
memory_map : boolean, default False
    If a filepath is provided for `filepath_or_buffer`, map the file object
    directly onto memory and access the data directly from there. Using this
    option can improve performance because there is no longer any I/O overhead.
Returns
Blank line is missing before 'Returns'
Done.
thanks!
Title is self-explanatory.