WIP Read as arrow #1831
base: master
ac514de to 644901e
21fd9fa to cc28399
Dear friend, great job! I have saved all my stock data in ArcticDB, and I would like to read it from SSD directly into an Arrow table. Then we can query it in memory with DuckDB, LanceDB, or even KuzuDB. As I understand it, the eventual zero-copy usage of ArcticDB might look like this?

import arcticdb
import duckdb
import lancedb
import kuzudb
......
arrow_table = lib.read("Symbol", output_format=OutputFormat.ARROW).data
duckdb.sql("SELECT * FROM arrow_table")
lancedb.create_table("Symbol", arrow_table, schema=schema)
kuzudb.execute("COPY Symbol FROM arrow_table")

Yes, that's exactly right. I'm very pleased to hear that you are excited about this piece of work!
Does the read_batch method support reading as Arrow too? Sometimes I want to analyse all symbols over a date range.

symbols = library.list_symbols()
batch_results = library.read_batch(symbols, date_range=date_range, output_format=OutputFormat.ARROW)

So every batch_results[i].data is an Arrow table? Then I let Claude 3.5 Sonnet code the rest.

import pyarrow as pa
from arcticdb import VersionedItem

def fast_concat_arrow_tables(batch_results):
    """
    Fast concatenation of Arrow tables from batch results

    Parameters:
    -----------
    batch_results : List[Union[VersionedItem, DataError]]
        Results from ArcticDB read_batch operation

    Returns:
    --------
    pyarrow.Table
        Concatenated table with added symbol column
    """
    tables = []
    for result in batch_results:
        # Skip DataError entries for symbols that failed to read
        if isinstance(result, VersionedItem):
            # With Arrow output, result.data is assumed to already be a pyarrow.Table
            table = result.data
            # Create the symbol array once per table
            symbol_array = pa.array([result.symbol] * len(table))
            # Store the table with the symbol column appended
            tables.append(table.append_column('symbol', symbol_array))
    # Concatenate all tables at once
    return pa.concat_tables(tables)
WIP read dataframe as Arrow arrays