add support for writing parquet files #642
Thanks a lot! Very cool! Try adding pyarrow. Once the tests pass, I will merge to master.
chunk_size = 10000

# optional import of pyarrow
import pyarrow as pa
So CI is telling me that pyarrow is not installed. I don't see it in any dev-requirements-py*.yml either. Perhaps try adding pyarrow to Kipoi's setup.py?
The exact error is ModuleNotFoundError: No module named 'pyarrow'
Thanks! Yes, I have to add it as an optional dependency 👍
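One common way to treat pyarrow as an optional dependency is to guard the import and raise an error only when a parquet writer is actually used. A minimal sketch, assuming a hypothetical helper `require_pyarrow` that is not part of Kipoi:

```python
# Sketch only: `require_pyarrow` is a hypothetical helper for illustration,
# not an existing Kipoi function.
try:
    import pyarrow as pa
except ImportError:  # pyarrow is not installed
    pa = None


def require_pyarrow():
    """Return the pyarrow module, or raise a helpful error if it is missing."""
    if pa is None:
        raise ImportError(
            "pyarrow is required for writing parquet files. "
            "Install it with `pip install pyarrow`."
        )
    return pa
```

With this pattern, importing the writers module does not fail on machines without pyarrow. In setup.py, the package could then be listed under `extras_require` instead of `install_requires`.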
I will merge this and remove Coveralls for now. I will check tomorrow why the REPO token is not working anymore.
This PR adds support for writing parquet files.
In particular, it adds two additional parquet writer implementations:
- ParquetFileBatchWriter: Writes to a single parquet file and depends on PyArrow.
- ParquetDirBatchWriter: Writes a directory of parquet files using pandas' DataFrame.to_parquet(). Supports both PyArrow and fastparquet. This implementation is well suited to multi-processing / asynchronous writing, since it requires no file locking. That said, I'm not sure yet how Kipoi would take advantage of that.

My very long-term goal is to provide schemas directly as Arrow schemas, so that they can be passed around across languages without any kind of conversion/serialization:
https://arrow.apache.org/docs/python/api/datatypes.html
This PR is the first step in that direction.
Future work: