Implement lance.write_table API and test the simplest round trips
#23
add parameters
python/lance/__init__.py
Outdated
    ----------
    table : pa.Table
        Apache Arrow Table
    sink : str or `Path`
Match signature?
Fixed
    return ds.dataset(uri, format=fmt)


def write_table(table: pa.Table, destination: Union[str, Path], primary_key: str):
Maybe add a convenience to auto-generate a pk column?
should we push that to the application / db level?
If we want people to use it as a Python library, then it's probably a good idea to have it. It could live in a wrapper function or something similar, and that wrapper should also check the key column for uniqueness.
    return ds.dataset(uri, format=fmt)


def write_table(table: pa.Table, destination: Union[str, Path], primary_key: str):
So this requires holding everything in memory first, right? If we have a bunch of images on S3, does this mean we need to hold them all in Arrow memory to convert them to the lance format?
Right, so there will be a StreamWriter which basically opens a DatasetWriter and writes record batches one by one.
It is another set of interfaces, though.
Similar to Parquet: https://arrow.apache.org/docs/cpp/parquet.html#writing-parquet-files
lance.write_table() API
lance data and read it back
Closes #3