Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta+Azure is slow #104

Open
nfoerster opened this issue Oct 2, 2024 · 1 comment
Open

Delta+Azure is slow #104

nfoerster opened this issue Oct 2, 2024 · 1 comment

Comments

@nfoerster
Copy link

nfoerster commented Oct 2, 2024

We are doing some tests as now azure and delta plugin work together, however we have heavy problems to write performant queries on our deltalake test.

The deltalake has around 70 columns and 1,5 billion rows, it is partitioned by 2 layers, the first on serialnumber has around 270 partitions and the second layer around 10-20 based on year-month. All files are parquets, we have only one deltalake version, the data is compacted and vacuumed, the metadata history is almost clean.

We are performing the queries with duckdb 1.1.1 from azure vm inside same vnet as the blob store is.

This is the setup:

LOAD azure;
LOAD delta;

set azure_transport_option_type = 'curl'; -- this is needed as we otherwise get SSL execptions, stated also in the azure plugin section on duckdb docs

-- two partition selection 
SELECT uuid FROM delta_scan('abfss://q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3') WHERE SN='XXXX0005' and yyyymm='202304';

-- 717931 rows in 2m30s

-- one partition selection
SELECT uuid FROM delta_scan('abfss://q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3') WHERE SN='XXXX0005';

-- 5493265 rows in 2m35s

As you can see there is not much of a difference between one or two partition clauses although its far less data. I think it scans the whole deltatable instead of pushing down the filters to partitions.

Did you ever had similar observations? Any hints would be nice.

@nfoerster nfoerster changed the title D Delta+Azure is incredible slow Oct 2, 2024
@nfoerster nfoerster changed the title Delta+Azure is incredible slow Delta+Azure is slow Oct 2, 2024
@nfoerster
Copy link
Author

This is something else we tried, the query time is around 4-5s. It has a range query so takes much more data in consideration.

import fsspec
from azure.identity.aio import DefaultAzureCredential
import pyarrow.parquet as pq
import time

credentials = DefaultAzureCredential()
fs = fsspec.filesystem("abfs", credential=credentials, account_name="saweud")
filters = []
filter_ = [("SN", "==", "XXXX2987"), ("yyyymm", ">=", 202107), ("yyyymm", "<=", 202204)]
filters.append(filter_)
start = time.time()
# filters.append(('billing_date', '<=', de.strftime('%Y-%m-%d')))
dt = pq.read_table(
    "q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3",
    partitioning="hive",
    filters=filters,
    filesystem=fs,
    columns=[
        "SN",
        "Date",
        "OperatingMode",
        "OP"
    ],
).to_pandas()
print(dt)
print(time.time() - start)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant