eddyxu (Member) commented on Jul 11, 2022

  • Added the lance.write_table() API
  • Added a round-trip test that writes data to the Lance format and reads it back

Closes #3

eddyxu requested a review from changhiskhan on Jul 11, 2022, 23:40
eddyxu self-assigned this on Jul 11, 2022
eddyxu added the python label on Jul 11, 2022
----------
table : pa.Table
Apache Arrow Table
sink : str or `Path`
Contributor

Should this match the signature (`destination` rather than `sink`)?

Member Author

Fixed

return ds.dataset(uri, format=fmt)


def write_table(table: pa.Table, destination: Union[str, Path], primary_key: str):
Contributor

Maybe add a convenience option to auto-generate a primary-key column?

Member Author

Should we push that to the application / DB level?

Contributor

If we want people to use it as a Python library, then it's probably a good idea to have it. It could live in a wrapper function or something similar. The wrapper should also check the key for uniqueness.

return ds.dataset(uri, format=fmt)


def write_table(table: pa.Table, destination: Union[str, Path], primary_key: str):
Contributor

So this requires holding everything in memory first, right? If we have a bunch of images on S3, does this mean we need to hold them all in Arrow memory to convert them to the Lance format?

Member Author

Right. There will be a StreamWriter, which opens a DatasetWriter and writes record batches one by one.

It is a separate set of interfaces, though, similar to Parquet's: https://arrow.apache.org/docs/cpp/parquet.html#writing-parquet-files

eddyxu merged commit d905bb3 into main on Jul 12, 2022
eddyxu deleted the lei/py_write branch on Jul 12, 2022, 00:07
Development

Successfully merging this pull request may close these issues.

Provide WriteTable API in Python.

3 participants