Data retrieval is grindingly slow with DeltaTable().to_pandas #631
Comments
It's hard to say what it could be. My initial guess is the bottleneck is IO. Have you checked how many files you are retrieving?

```python
len(list(DeltaTable(url).to_pyarrow_dataset(partitions=partitions).get_fragments()))
```

By default, we use our internal filesystem, but you might get better performance with a pyarrow filesystem:

```python
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(...)  # configure
dataset = DeltaTable(url).to_pandas(partitions=partitions, columns=columns, filesystem=fs).to_dict()
```

To be clear, I don't think Delta Lake will ever have comparable latency to a dedicated OLAP database. I would guess at best latency will be in the 1 - 5 second range, unless you implement some caching.
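As an illustration of how such a filesystem might be configured explicitly, here is a minimal sketch; the credentials, region, endpoint, and table URI below are placeholders, not values from this thread:

```python
from pyarrow.fs import S3FileSystem
from deltalake import DeltaTable

# Placeholder configuration; real values depend on your account and region.
fs = S3FileSystem(
    access_key="<access-key>",
    secret_key="<secret-key>",
    region="us-east-1",           # assumed region
    endpoint_override=None,       # set this for S3-compatible stores (e.g. IBM COS)
)

# Hypothetical table URI; the filesystem is passed straight through to pyarrow.
df = DeltaTable("s3://bucket/table").to_pandas(filesystem=fs)
```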
@boonware - I wouldn't expect reading multiple files into a pandas DataFrame via delta-rs to provide low latency queries. Query engines that are optimized to read multiple files in parallel like Spark or Dask should be a lot faster. Even single node query engines like polars or DataFusion should provide better performance. Make sure to query the data and leverage partition filters, predicate pushdown filters, and column pruning. Or just use a database.
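As a minimal sketch of partition filtering and column pruning through delta-rs (the table URI, partition column and value, and column names are hypothetical):

```python
from deltalake import DeltaTable

# Hypothetical table; only for illustration.
dt = DeltaTable("s3://bucket/events")

df = dt.to_pandas(
    partitions=[("event_date", "=", "2022-01-01")],  # partition filter: prunes whole files
    columns=["user_id", "value"],                    # column pruning: reads fewer bytes
)
```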
Hopefully this thread is still being monitored after being closed.
... than delta-rs. The files were written with delta-rs but read back with the three packages above.
@someaveragepunter - can you please provide the sample data and the queries, so we can take a look? Are you using partition filtering/metadata filtering? Are you saying that DuckDB is faster than pandas for certain queries?
Let me try to narrow the use cases down. I was testing with partition filtering, but I recall that even non-partitioned reads were slower. I just wanted to check whether it was something the core contributors were prepared to look at before I spent the time cleaning up my example. One thing that would help is if you could recommend some public Delta table data readily available on S3 that I could test against, rather than needing to share my sample data. Thanks @MrPowers
@someaveragepunter - I've been running benchmarks with the h2o groupby dataset. Here are instructions on how to generate the dataset. Here's a talk I gave at the Data + AI Summit a couple of months back showing how Delta Lake made some queries run a lot faster. You need to formulate the queries properly to get good performance. Feel free to send your query code in the meantime and I can take a look.
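A rough sketch of how a generated h2o groupby file could be written out as a partitioned Delta table on S3; the file name, destination path, and partition column are assumptions, and this presumes a deltalake version that includes write_deltalake:

```python
import pandas as pd
from deltalake import write_deltalake

# Assumes the smaller 10M-row h2o groupby CSV has already been generated locally.
df = pd.read_csv("G1_1e7_1e2_0_0.csv")

write_deltalake(
    "s3://bucket/h2o/G1_1e7_1e2_0_0",  # hypothetical S3 destination
    df,
    partition_by=["id1"],              # hypothetical choice of partition column
    mode="overwrite",
)
```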
I think the key point I'm emphasizing here is reading from S3 (blob store) as opposed to a local or network filesystem. If you could write your h2o files as a Delta table and share them publicly on S3, I could try running some partitioned tests. Example code below:
deltalake.version, Python 3.11
@someaveragepunter - can you try to execute the query like this:

```python
import pathlib
import duckdb
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e9_1e2_0_0")
dataset = table.to_pyarrow_dataset()
quack = duckdb.arrow(dataset)
quack.filter("id1 = 'id016' and v2 > 10")
```

This notebook shows the performance gains you can get by using this approach. I think you're seeing bad performance because you're loading all the data into memory.
to_pyarrow_dataset() is still lazy and hasn't downloaded any data. How do I retrieve that data into a pandas or polars DataFrame? I still have to do table.to_pyarrow_dataset().to_table(), right? Which takes the same amount of (slow) time. Can I ask that you execute the code above, perform the necessary adjustments, and test the performance gain? Again, I have to stress that this is slowness I'm seeing using S3 blob storage; your example above suggests you're testing with a local filesystem.
No.

I'm still trying to figure out the benchmark you're running. There are query engines (pandas, Spark, DuckDB), Lakehouse storage systems (Delta Lake), and in-memory formats (Arrow). Shouldn't the benchmark comparison be something like "a DuckDB query run on Parquet files stored in S3 runs faster/slower than the same DuckDB query run on a Delta table with the same data stored in S3"?
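On the earlier question of getting the filtered result into a pandas DataFrame without materializing the whole table first, something along these lines should work. This is a sketch assuming the DuckDB relation API, reusing the same hypothetical local path as above:

```python
import pathlib
import duckdb
from deltalake import DeltaTable

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e9_1e2_0_0")
dataset = table.to_pyarrow_dataset()   # lazy: nothing is read yet

quack = duckdb.arrow(dataset)
# The filter is pushed down into the Arrow dataset scan, so only matching data
# is read; .df() then materializes just that result as a pandas DataFrame.
df = quack.filter("id1 = 'id016' and v2 > 10").df()
```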
The comparison I'm running at this point is a query without any filters, i.e. I genuinely want the entire contents of the Parquet files within that folder.
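One way to check whether the overhead for a full scan is in delta-rs itself or simply in pulling every file from S3 is to time the same read through delta-rs and directly as a raw Parquet dataset. A sketch, where the bucket/path and credential setup are assumptions:

```python
import time
import pyarrow.dataset as ds
from deltalake import DeltaTable
from pyarrow.fs import S3FileSystem

fs = S3FileSystem()  # assumes credentials and region come from the environment

start = time.time()
delta_scan = DeltaTable("s3://bucket/table").to_pyarrow_table(filesystem=fs)
print("delta-rs full scan:", time.time() - start)

start = time.time()
# pyarrow ignores paths starting with '_' by default, so _delta_log is skipped;
# note the raw scan would also include any files the Delta log has removed.
raw_scan = ds.dataset("bucket/table", format="parquet", filesystem=fs).to_table()
print("raw parquet full scan:", time.time() - start)
```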
If I may jump in, I have a pretty similar problem. I am using a Delta table (hosted on S3) within Databricks. I am aware that I shouldn't (theoretically) be loading all the data into memory and that this slows down execution, but due to some legacy code, at a certain point I need to cast my dataset to pandas. I ran a simple experiment: it's a 1.9 GB Parquet file resulting in a table of 8,885,805 rows × 133 columns. If I load the Parquet files directly into pandas through ... If I save my dataset as a Delta table (with the same partitioning) and I do:
This takes 2 minutes and 50 seconds, which is a significant worsening in performance. For bigger datasets, this result scales accordingly, with Delta Lake taking an insane amount of time to cast to pandas.
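A sketch of the comparison being described, since the exact snippets aren't included here; the paths are placeholders, and pd.read_parquet on an S3 URI assumes s3fs is installed:

```python
import time
import pandas as pd
from deltalake import DeltaTable

start = time.time()
df_raw = pd.read_parquet("s3://bucket/dataset/")                # plain partitioned Parquet
print("pd.read_parquet:", time.time() - start)

start = time.time()
df_delta = DeltaTable("s3://bucket/dataset_delta").to_pandas()  # same data as a Delta table
print("DeltaTable.to_pandas:", time.time() - start)
```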
I tested this; it's really much faster than converting to a pandas DataFrame (huge difference, thanks).
Environment
Delta-rs version: deltalake v0.5.7
Binding: Python
Environment: S3 / IBM COS as backend storage
Bug
What happened:
I have a Python Flask application that exposes a REST API to access the contents of a Delta Lake table for use in a web browser application. API requests to return records are grindingly slow, often in excess of 2 minutes. I have used a partitioned table with on the order of a few hundred (less than 1k) files. I have profiled the application and the bottleneck is in the to_pandas call. I see similar results if I replace the call with to_pyarrow_table or to_pyarrow_dataset. Why is this call so slow? Is there a recommended approach that I am missing here?
What you expected to happen:
Data could be returned from the table in a reasonable length of time, i.e. the latency expected of a reactive web application.
How to reproduce it:
Use the code above with S3 or IBM COS as backend storage, and a table with on the order of 100K records partitioned on a timestamp field, which creates on the order of 100 files.
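Since the snippet referred to as "the code above" isn't reproduced in this extract, the following is a hypothetical minimal endpoint matching the description; the table URI, partition column, and filter value are placeholders:

```python
from flask import Flask, jsonify
from deltalake import DeltaTable

app = Flask(__name__)
TABLE_URI = "s3://bucket/events"  # placeholder; could equally be an IBM COS endpoint

@app.route("/records")
def records():
    dt = DeltaTable(TABLE_URI)
    # The profiled bottleneck: materializing the (partition-filtered) table as pandas.
    df = dt.to_pandas(partitions=[("event_date", "=", "2022-01-01")])  # placeholder partition
    return jsonify(df.to_dict(orient="records"))
```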
More details:
As stated above, application profiling shows the delay is not at the REST controller level.