Skip to content

[ENH] to_orc #43860

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

[ENH] to_orc #43860

wants to merge 3 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Oct 3, 2021

Add pandas.io.orc.to_orc method definition

pandas.io.orc.to_orc method definition
@pep8speaks
Copy link

pep8speaks commented Oct 3, 2021

Hello @NickFillot! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 2737:1: W293 blank line contains whitespace
Line 2784:89: E501 line too long (90 > 88 characters)
Line 2811:1: W293 blank line contains whitespace

Line 102:1: W293 blank line contains whitespace
Line 111:1: W293 blank line contains whitespace

Comment last updated at 2021-10-03 14:47:17 UTC

NickFillot added 2 commits October 3, 2021 16:34
set to_orc to pandas.DataFrame
@debnathshoham
Copy link
Member

Thanks for the PR @NickFillot!
Is there an existing pandas issue number you're trying to address?
Do you mind creating one, if there isn't one?

@ghost
Copy link
Author

ghost commented Oct 3, 2021

Thanks for the PR @NickFillot! Is there an existing pandas issue number you're trying to address? Do you mind creating one, if there isn't one?

Just created one @ to_orc Issue didn't see one related to it

Thank you

@jreback
Copy link
Contributor

jreback commented Oct 3, 2021

tests pls

follow the existing way we test to_parquet for example with the fixtures that skip based in thr version

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@jreback jreback added IO Parquet parquet, feather Enhancement labels Oct 4, 2021
try:
assert engine.__name__ == 'pyarrow', "engine must be 'pyarrow' module"
assert hasattr(engine, 'orc'), "'pyarrow' module must have orc module"
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be more specific about the exception type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


if path is None:
# to bytes: tmp path, pyarrow auto closes buffers
with tm.ensure_clean(os.path.join(gettempdir(), os.urandom(12).hex())) as path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this getting written to a file? Thought path = None will just return byte string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I do close the file-like object from my side by default in Arrow. It does seem to be different from the behavior of the Parquet writer in Arrow. If this is indeed an issue I can discuss with the Arrow community whether we should change it.

Right now I use PyArrow buffer and avoid creating a temp file.

Write a DataFrame to the orc/arrow format.
Parameters
----------
df : DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse the docstring opposed to copy/paste

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm there isn't an orc/arrow format. Maybe it should be "Write a DataFrame to the ORC format using PyArrow"?

@alimcmaster1
Copy link
Member

Thanks for the PR @NickFillot comments above!

@ghost
Copy link
Author

ghost commented Oct 13, 2021

Working on tests, i'm trying to understand how pandas testing works

@iajoiner
Copy link
Contributor

@NickFillot Thanks for working on this! Note that your ordering actually doesn't work for write_table in pyarrow 4.0.0 so please either use the path, table ordering to accommodate that version or set the minimum version of pyarrow to 4.0.1.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add tests.

a bytes object is returned.
engine : {{'pyarrow'}}, default 'pyarrow'
Parquet library to use, or library it self, checked with 'pyarrow' name
and version > 4.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it > 4.0.0, meaning >= 5.0? would be more informative

@ghost ghost closed this Nov 12, 2021
@ghost ghost deleted the patch-2 branch November 12, 2021 10:44
@iajoiner
Copy link
Contributor

@NickFillot Do you mind me reopening it?

@iajoiner iajoiner mentioned this pull request Nov 21, 2021
4 tasks
@iajoiner
Copy link
Contributor

This PR has been reopened as #44554

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants