[ENH] to_orc #43860
pandas.io.orc.to_orc method definition
Hello @NickFillot! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-10-03 14:47:17 UTC
set to_orc to pandas.DataFrame
Thanks for the PR @NickFillot!
I just created an issue for to_orc; I didn't see an existing one related to it. Thank you.
Tests, please: follow the existing way we test to_parquet, for example with the fixtures that skip based on the version.
see comments
try:
    assert engine.__name__ == 'pyarrow', "engine must be 'pyarrow' module"
    assert hasattr(engine, 'orc'), "'pyarrow' module must have orc module"
except Exception as e:
Can you be more specific about the exception type?
Fixed
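For illustration, one way to address the point about the exception type is to drop the `assert` statements (which Python skips entirely when run with `-O`) and raise specific exceptions instead. This is only a sketch: `validate_orc_engine` is a hypothetical helper name, and the choice of `ValueError` and `ImportError` is an assumption, not what the PR ended up using.

```python
import types


def validate_orc_engine(engine):
    """Hypothetical helper: check that `engine` looks like pyarrow with ORC support."""
    # Explicit checks keep working under `python -O`, unlike assert statements.
    if getattr(engine, "__name__", None) != "pyarrow":
        raise ValueError("engine must be the 'pyarrow' module")
    if not hasattr(engine, "orc"):
        raise ImportError("'pyarrow' module must provide the 'orc' submodule")
    return engine


# Stand-in module objects so the sketch runs without pyarrow installed.
fake_pyarrow = types.ModuleType("pyarrow")
fake_pyarrow.orc = types.ModuleType("pyarrow.orc")
validated = validate_orc_engine(fake_pyarrow)
```

A caller can then catch `ValueError` or `ImportError` separately rather than a bare `Exception`.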
if path is None:
    # to bytes: tmp path, pyarrow auto closes buffers
    with tm.ensure_clean(os.path.join(gettempdir(), os.urandom(12).hex())) as path:
Why is this getting written to a file? I thought path=None would just return a byte string?
Yes, I do close the file-like object from my side by default in Arrow. That does seem to differ from the behavior of the Parquet writer in Arrow. If this is indeed an issue, I can discuss with the Arrow community whether we should change it.
Right now I use a PyArrow buffer and avoid creating a temp file.
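To make the underlying problem concrete, here is a small standard-library sketch of why a writer that closes its sink gets in the way of returning bytes. `closing_writer` is a stand-in for any writer that closes the file-like object it is given; it is not PyArrow's API, and `CapturingBytesIO` is a hypothetical helper, not what this PR does.

```python
import io


class CapturingBytesIO(io.BytesIO):
    """Hypothetical BytesIO that keeps a copy of its contents when closed."""

    def __init__(self):
        super().__init__()
        self.captured = None

    def close(self):
        # Snapshot the bytes before the underlying buffer is released.
        if not self.closed:
            self.captured = self.getvalue()
        super().close()


def closing_writer(sink, payload):
    """Stand-in for a writer that closes the sink it was handed."""
    sink.write(payload)
    sink.close()


# A plain BytesIO cannot be read back once the writer has closed it.
plain = io.BytesIO()
closing_writer(plain, b"orc-bytes")
try:
    plain.getvalue()
    plain_readable = True
except ValueError:
    plain_readable = False

# The capturing variant still exposes what was written.
sink = CapturingBytesIO()
closing_writer(sink, b"orc-bytes")
result = sink.captured
```

Either an in-memory sink that survives the close, as in the PyArrow buffer approach mentioned above, or a writer that leaves the sink open avoids the temporary file.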
Write a DataFrame to the orc/arrow format.
Parameters
----------
df : DataFrame
Can we reuse the docstring as opposed to copying and pasting it?
Hmm there isn't an orc/arrow format. Maybe it should be "Write a DataFrame to the ORC format using PyArrow"?
Thanks for the PR @NickFillot, comments above!
Working on tests; I'm trying to understand how pandas testing works.
@NickFillot Thanks for working on this! Note that your ordering actually doesn't work for write_table in pyarrow 4.0.0, so please either use the (path, table) ordering to accommodate that version or set the minimum version of pyarrow to 4.0.1.
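As a rough sketch of the first option in the comment above, the argument order could be chosen from the installed version. `write_table_compat` and `parse_version` are hypothetical helper names, the version parsing is deliberately simplified, and the writer is passed in as a plain callable so the sketch runs without pyarrow.

```python
import re


def parse_version(version):
    """Turn '4.0.1' into (4, 0, 1), keeping only leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def write_table_compat(write_table, table, where, version):
    """Call `write_table` with the ordering described in the comment above."""
    if parse_version(version) < (4, 0, 1):
        # pyarrow 4.0.0: (path, table) ordering.
        return write_table(where, table)
    # pyarrow 4.0.1 and later: (table, path) ordering.
    return write_table(table, where)


# Record the positional order using a fake writer instead of pyarrow.
calls = []


def fake_write_table(first, second):
    calls.append((first, second))


write_table_compat(fake_write_table, "TABLE", "PATH", "4.0.0")
write_table_compat(fake_write_table, "TABLE", "PATH", "4.0.1")
```

Setting the minimum pyarrow version to 4.0.1, the second option, removes the need for this branch altogether.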
Please add tests.
a bytes object is returned.
engine : {{'pyarrow'}}, default 'pyarrow'
Parquet library to use, or library it self, checked with 'pyarrow' name
and version > 4.0.0
Is it > 4.0.0, meaning >= 5.0? That would be more informative.
@NickFillot Do you mind me reopening it?
This PR has been reopened as #44554
Add pandas.io.orc.to_orc method definition