Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

Closed
scimas opened this issue Sep 9, 2024 · 0 comments · Fixed by #2957
Assignees
Labels
bug Something isn't working

Comments

@scimas
Copy link

scimas commented Sep 9, 2024

Describe the bug

Exactly as stated in the title. If you're querying Athena using a qmark style query and use the query cache, and you change the actual parameter values between two query runs, aws-sdk-pandas/awswrangler retrieves the result of the previous query run with incorrect parameter values.

How to Reproduce

import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id1", "id2"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df)

df2 = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id3", "id4"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df2)

this with a database and table should be enough to reproduce the issue.

Expected behavior

Since the parameters to the query have changed, a fresh query should be run.

Your project

No response

Screenshots

No response

OS

Amazon Linux 2023

Python version

3.12

AWS SDK for pandas version

layer:AWSSDKPandas-Python312-Arm64:12

Additional context

My guess is that this is happening due to aws-sdk-pandas using a hand-rolled cache implementation rather than just letting Athena handle the cache through the ResultReuseConfiguration of StartQueryExecution. As far as I know, Athena does not populate the query parameter values in the GetQueryExecution output. So you get the parametrized query without the parameter values and check the current query against it, which is incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants