Implement lance.write_table API and test the simplest round trips
#23
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,28 @@ | ||
| # Copyright 2022 Lance Developers | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| from typing import Union | ||
| from pathlib import Path | ||
|
|
||
| import pyarrow as pa | ||
| import pyarrow.dataset as ds | ||
| - from lance.lib import LanceFileFormat | ||
| + from lance.lib import LanceFileFormat, WriteTable | ||
|
|
||
| __all__ = ["dataset", "write_table"] | ||
|
|
||
|
|
||
| - def dataset(uri: str): | ||
| + def dataset(uri: str) -> ds.Dataset: | ||
| """ | ||
| Create an Arrow Dataset from the given lance uri. | ||
|
|
||
| @@ -13,3 +33,18 @@ def dataset(uri: str): | ||
| """ | ||
| fmt = LanceFileFormat() | ||
| return ds.dataset(uri, format=fmt) | ||
|
|
||
|
|
||
| def write_table(table: pa.Table, destination: Union[str, Path], primary_key: str): | ||
|
Contributor
So this requires holding everything in memory first right? If we have a bunch of images on S3, does this mean we need to hold them all in Arrow memory to convert to lance format?
Member
Author
Right, so there will be a It is another set of interfaces tho. Similar to parquet https://arrow.apache.org/docs/cpp/parquet.html#writing-parquet-files |
||
| """Write an Arrow Table into the destination. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| table : pa.Table | ||
| Apache Arrow Table | ||
| destination : str or `Path` | ||
| The destination to write dataset to. | ||
| primary_key : str | ||
| The column name of the primary key. | ||
| """ | ||
| WriteTable(table, destination, primary_key) | ||
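The review thread on `write_table` raises the question of peak memory: the function takes a fully materialized `pa.Table`. An incremental writer, like the Parquet interfaces linked in the reply, would typically consume its input one batch at a time. The following is a minimal sketch of the batching side only, using just the standard library. `iter_batches` is a hypothetical helper for illustration; it is not part of lance or pyarrow, and the eventual lance writer interface may look different.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items from `rows`.

    Hypothetical helper: only one batch is held in memory at a time, so the
    source can be a lazy iterator (for example, objects fetched one by one).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Each batch could be handed to an incremental writer before the next is read.
for batch in iter_batches(range(7), 3):
    print(batch)
```

With a writer that accepts batches, the loop body would convert each batch to Arrow and append it, so memory use is bounded by the batch size rather than by the dataset size.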
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| # Copyright 2022 Lance Developers | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
|
|
||
| from pathlib import Path | ||
|
|
||
| import pandas as pd | ||
| import pyarrow as pa | ||
| from lance import write_table, dataset | ||
|
|
||
|
|
||
| def test_simple_round_trips(tmp_path: Path): | ||
| table = pa.Table.from_pandas(pd.DataFrame({"label": [123, 456, 789], "values": [22, 33, 2.24]})) | ||
| write_table(table, tmp_path / "test.lance", "label") | ||
|
|
||
| assert (tmp_path / "test.lance").exists() | ||
|
|
||
| ds = dataset(str(tmp_path / "test.lance")) | ||
| actual = ds.to_table() | ||
|
|
||
| assert (table == actual) |
maybe add a convenience to auto generate a pk column?
should we push that to the application / db level?
If we want people to use it as a python library then it's probably a good idea to have it. Could be in a wrapper function or something? Should also check for uniqueness there as well.
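The wrapper suggested in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the project's API: `ensure_primary_key` and the default column name `"pk"` are hypothetical, and the sketch works on a plain mapping of column names to lists so that it runs without any dependencies. It generates a sequential key when none is given and checks uniqueness when one is.

```python
from typing import Any, Dict, List, Optional, Tuple


def ensure_primary_key(
    columns: Dict[str, List[Any]],
    primary_key: Optional[str] = None,
    generated_name: str = "pk",  # hypothetical default name
) -> Tuple[Dict[str, List[Any]], str]:
    """Return (columns, key_name) with a usable primary key column.

    If `primary_key` is None, add a 0..n-1 integer column named
    `generated_name`. Otherwise verify the named column exists and
    holds unique values. The input mapping is never modified.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    num_rows = lengths.pop() if lengths else 0

    if primary_key is None:
        if generated_name in columns:
            raise ValueError(f"column {generated_name!r} already exists")
        result = dict(columns)
        result[generated_name] = list(range(num_rows))
        return result, generated_name

    if primary_key not in columns:
        raise KeyError(f"no column named {primary_key!r}")
    values = columns[primary_key]
    # Key values must be hashable for this set-based uniqueness check.
    if len(set(values)) != len(values):
        raise ValueError(f"column {primary_key!r} contains duplicate values")
    return dict(columns), primary_key


columns = {"label": [123, 456, 789], "values": [22, 33, 2.24]}
with_key, key_name = ensure_primary_key(columns)
print(key_name, with_key[key_name])
```

The returned mapping could then be converted with `pa.table(with_key)` and passed to `write_table` together with `key_name`. Whether this belongs in the library or at the application level is the open question in the thread.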