python docs: Add an example for opendal + pandas/polars #4270

Closed
Xuanwo opened this issue Feb 26, 2024 · 11 comments · Fixed by #4368
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@Xuanwo
Member

Xuanwo commented Feb 26, 2024

This example will help users use opendal with pandas/polars.

We can add a new ipynb like we do in https://github.com/apache/opendal/blob/main/bindings/python/examples/object.ipynb

@Xuanwo Xuanwo added good first issue Good for newcomers help wanted Extra attention is needed labels Feb 26, 2024
@Xuanwo
Member Author

Xuanwo commented Feb 27, 2024

Maybe we can integrate with polars directly?

@suyanhanx
Member

suyanhanx commented Feb 27, 2024

Seems that we could open a file and then pass the reader to pandas/polars?
https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html#polars.read_csv
https://pandas.pydata.org/docs/reference/api/pandas.read_table.html
They both accept file-like objects.
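To make "file-like object" concrete, here is a minimal sketch that uses only the standard library. The helper `read_csv_rows` is hypothetical and stands in for a reader such as `pandas.read_csv` or `polars.read_csv`; it only assumes that the object passed in exposes a `read()` method returning bytes.

```python
import csv
import io


def read_csv_rows(source):
    """Parse CSV rows from any object whose read() returns bytes.

    Hypothetical stand-in for a DataFrame reader; it relies on nothing
    but the read() method of the object it receives.
    """
    raw = source.read()
    text = io.StringIO(raw.decode("utf-8"), newline="")
    return list(csv.reader(text))


# io.BytesIO plays the role of a reader opened in "rb" mode.
rows = read_csv_rows(io.BytesIO(b"a,b\n1,2\n3,4\n"))
```

Any reader that follows the same `read()` contract as Python's built-in binary files should be interchangeable here.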

@Xuanwo
Member Author

Xuanwo commented Feb 28, 2024

Yep, it should just work.

@reswqa
Member

reswqa commented Mar 14, 2024

This is my first time in the opendal community 😄 Well, I just saw this issue, and I'm glad you also mentioned polars (I am one of the core devs of polars).

Seems that we could open a file and then pass the reader to pandas/polars?

I'm not familiar with opendal, but I tried to run the following code:

import opendal
import pandas
import polars

op = opendal.Operator("fs", root="xxx")

pandas.read_csv(open("xxx/a.csv", mode="rb"))  # this works
polars.read_csv(open("xxx/a.csv", mode="rb"))  # this works

with op.open("a.csv", mode="rb") as file:
    pandas.read_csv(file)  # this raises

with op.open("a.csv", mode="rb") as file:
    polars.read_csv(file)  # this raises

If the file is opened via opendal, both polars and pandas raise OSError: failed to fill whole buffer. IIUC, op.open("a.csv", mode="rb") returns a file-like object that we expect to work directly. Feel free to point out if I missed something. Thanks!

@Xuanwo
Member Author

Xuanwo commented Mar 14, 2024

Well, I just saw this issue, and I'm glad you also mentioned polars (I am one of the core devs of polars).

Thanks for coming!

Feel free to point out if I missed something. Thanks!

It seems to be a bug. We expect them to work the same way. Let me look into this.

Also cc @Zheaoli and @messense if you are interested.

@reswqa
Member

reswqa commented Mar 14, 2024

Thank you for your quick reply!

BTW, the a.csv is just a simple test file.

cat a.csv

a,b
1,2
3,4

@reswqa
Member

reswqa commented Mar 14, 2024

@Xuanwo Oops, the difference between the file-like object created via op.open and the one returned by the built-in open just occurred to me.

For the read(size) method of a file-like object, our implementation has "read exactly size bytes" semantics, but some of Python's own implementations have "read at most size bytes" semantics.

For example:

with open("./a.csv", mode="rb") as file:
    file.read(10000000)  # this works

with op.open("a.csv", mode="rb") as file:
    file.read(10000000)  # this raises: `OSError: ContentIncomplete (permanent) at , context: { expect: 10000000, actual: 5 } => reader got too little data`

polars & pandas will call the read(size) method with a size that works for them, which may not always be equal to the actual size of the file.
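The two semantics can be contrasted without opendal at all. The sketch below is only an illustration: `ExactReader` and `AtMostReader` are hypothetical classes written for this comparison and are not opendal's implementation. `io.BytesIO` shows how a standard-library reader behaves when asked for more bytes than are available.

```python
import io

DATA = b"a,b\n1,2\n3,4\n"


class ExactReader:
    """Hypothetical reader with 'read exactly size bytes' semantics."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def read(self, size: int = -1) -> bytes:
        remaining = len(self._data) - self._pos
        if size is None or size < 0:
            size = remaining
        if size > remaining:
            # Too little data left to satisfy the request: fail loudly.
            raise OSError(f"expect: {size}, actual: {remaining}")
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


class AtMostReader(ExactReader):
    """Same reader, but the request is clamped to what is left."""

    def read(self, size: int = -1) -> bytes:
        remaining = len(self._data) - self._pos
        if size is None or size < 0 or size > remaining:
            size = remaining
        return super().read(size)


def try_read(reader, size):
    """Return the bytes read, or the OSError raised by the reader."""
    try:
        return reader.read(size)
    except OSError as exc:
        return exc


builtin_result = io.BytesIO(DATA).read(10_000_000)
exact_result = try_read(ExactReader(DATA), 10_000_000)
at_most_result = try_read(AtMostReader(DATA), 10_000_000)
```

With the clamped version, an oversized request returns whatever is left, and a read at end of file returns an empty bytes object instead of raising, which is what callers that read in fixed-size chunks rely on.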

@Xuanwo
Member Author

Xuanwo commented Mar 14, 2024

Thank you so much for the research! And you are right, I must have implemented this wrong:

Otherwise, at most size characters (in text mode) or size bytes (in binary mode) are read and returned.

ref: https://docs.python.org/3/tutorial/inputoutput.html

Are you interested in fixing it?

@reswqa
Member

reswqa commented Mar 14, 2024

Another question: does it make sense to have File::read directly return &PyBytes instead of &PyMemoryView? The key point is that neither polars nor pandas seems to work well with memoryviews when dealing with file-like objects.
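As a rough illustration of why a consumer might trip over a memoryview, the standard-library sketch below compares the two types. It does not reproduce what polars or pandas do internally; that part is an assumption. It only shows that a `memoryview` is not a `bytes` instance and lacks the usual bytes methods, so code written against `bytes` can fail on it.

```python
payload = bytearray(b"a,b\n1,2\n3,4\n")
view = memoryview(payload)

# A memoryview is a distinct type: isinstance checks against bytes fail,
# and bytes methods such as split() or decode() are not available.
is_bytes = isinstance(view, bytes)
has_split = hasattr(view, "split")
has_decode = hasattr(view, "decode")

# Converting copies the data into an immutable bytes object,
# after which the usual bytes API works.
as_bytes = bytes(view)
lines = as_bytes.split(b"\n")
```

Returning `bytes` from `read()` costs a copy in this sketch but matches what Python's built-in binary files return.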

@Xuanwo
Member Author

Xuanwo commented Mar 14, 2024

Does it make sense to have File::read directly return &PyBytes instead of &PyMemoryView?

Our read API used to accept &mut [u8], but now we are returning Bytes. I think we can do this change now.

@reswqa
Member

reswqa commented Mar 14, 2024

Our read API used to accept &mut [u8], but now we are returning Bytes. I think we can do this change now.

Great! With these two changes, I can now have polars and pandas create a DataFrame from opendal locally.
