Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: implement native_iceberg_drop #292

Merged
merged 28 commits into from
Jun 12, 2023

Conversation

brabster
Copy link
Contributor

@brabster brabster commented May 9, 2023

… dev and testing

Description

Remove need for direct access to S3 & Glue to drop an Iceberg table. I figured that there needs to be a way for me and others to test the solution works with the restricted permissions so I have the start of that here.

Models used to test - Optional

Checklist

  • [/] You followed contributing section
  • [/] You kept your Pull Request small and focused on a single feature or bug fix.
  • You added unit testing when necessary
  • You added functional testing when necessary

@brabster brabster changed the title Implement native_iceberg_drop fix: implement native_iceberg_drop May 9, 2023
@nicor88
Copy link
Contributor

nicor88 commented May 15, 2023

@brabster shall we call this native_drop? as we can use this behavior also for not iceberg tables ;)

@Jrmyy Jrmyy added this to the v1.5.1 milestone May 15, 2023
@brabster
Copy link
Contributor Author

@nicor88 native_drop sounds fine to me.

@brabster
Copy link
Contributor Author

brabster commented May 16, 2023

🤔 when I committed the restrictive role contained in the CloudFormation template for convenience) my thinking was to document a specific set of permissions in a role that can be used by anyone (or automation) for testing that the base functionality works without direct access to restricted APIs - obvs. could loosen the restrictions eg. to allow direct access to certain Glue APis.

Maybe it's a separate thread/issue, but having a way to explicitly test that basic dbt functionality works under a more restrictive set of permissions than the one currently used seems like the only way to really say that dbt-athena can be used under those permissions?

Otherwise at this point I'm the only one who can really say that a fix works and what the permissions I'm using are

@nicor88
Copy link
Contributor

nicor88 commented May 16, 2023

Maybe it's a separate thread/issue, but having a way to explicitly test that basic dbt functionality works under a more restrictive set of permissions than the one currently used seems like the only way to really say that dbt-athena can be used under those permissions?
that's indeed the right thought.

What do we need to achieve this in a more scalable way:

IMHO, in this specific case you are not really blocked, as you can prepare the feature without all the testing setup - to see if it works we can check if drops query hits athena, and you can turn debug mode on to see if the s3 cleanup path is called.

@brabster
Copy link
Contributor Author

brabster commented May 17, 2023

It might make sense to combine this and the native seed thing - they are a bit coupled together which makes me a bit unsure whether it's good as a whole. Added logging to make behaviour clear. I'll move the role definition before finalising

Evidence

�[0m21:37:10.808696 [debug] [Thread-1  ]: Began executing node seed.ephemeral.base
�[0m21:37:10.881750 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
�[0m21:37:11.224528 [debug] [Thread-1  ]: Dropping relation via SQL only
�[0m21:37:11.226145 [debug] [Thread-1  ]: Using athena connection "seed.ephemeral.base"
�[0m21:37:11.227143 [debug] [Thread-1  ]: On seed.ephemeral.base: drop table if exists `awsdatacatalog`.`test16843558268112747602_test_basic_iceberg_native_drop`.`base__dbt_tmp`
    
�[0m21:37:12.694506 [debug] [Thread-1  ]: SQL status: OK -1 in 1.0 seconds
�[0m21:37:12.714119 [debug] [Thread-1  ]: Using athena connection "seed.ephemeral.base"
�[0m21:37:12.715985 [debug] [Thread-1  ]: On seed.ephemeral.base: 
    create external table `awsdatacatalog`.`test16843558268112747602_test_basic_iceberg_native_drop`.`base__dbt_tmp` (id string, name string, some_date string)
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location 's3://843328850426-athena-results/staging/tables/test16843558268112747602_test_basic_iceberg_native_drop/base__dbt_tmp/afd3b34c-0d4c-4581-82d4-da916d517f73'
    tblproperties (
      'skip.header.line.count'='1'
    )
  
�[0m21:37:14.046299 [debug] [Thread-1  ]: SQL status: OK -1 in 1.0 seconds
�[0m21:37:14.362476 [debug] [Thread-1  ]: Config native_drop enabled, skipping direct S3 delete
�[0m21:37:14.363567 [debug] [Thread-1  ]: Using athena connection "seed.ephemeral.base"

@brabster
Copy link
Contributor Author

brabster commented May 22, 2023

Test run output, unit_tests all passing too. The native iceberg drop test is skipped as we hit DDL timeout, see issue #312

pytest -n auto tests/functional
====================================================================================== test session starts ======================================================================================
platform linux -- Python 3.8.13, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/paul/projects/brabster/dbt-athena
configfile: pytest.ini
plugins: cov-4.0.0, dotenv-0.5.2, xdist-3.3.0
8 workers [57 items]    
ssssss....ssss.ss.s.....ss........s......................                                                                                                                                 [100%]
======================================================================================= warnings summary ========================================================================================
tests/functional/adapter/test_basic_hive.py: 31 warnings
tests/functional/adapter/test_basic_hive_native_drop.py: 15 warnings
tests/functional/adapter/test_basic_iceberg.py: 15 warnings
tests/functional/adapter/test_basic_iceberg_native_drop.py: 6 warnings
tests/functional/adapter/utils/test_utils.py: 60 warnings
tests/functional/adapter/test_snapshot.py: 36 warnings
  /home/paul/projects/brabster/dbt-athena/venv/lib/python3.8/site-packages/jinja2/lexer.py:650: DeprecationWarning: invalid escape sequence '\('

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================== 41 passed, 16 skipped, 163 warnings in 593.48s (0:09:53) ====================================================================

@brabster brabster marked this pull request as ready for review May 22, 2023 20:07
Copy link
Contributor

@nicor88 nicor88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment.

dbt/include/athena/macros/adapters/relation.sql Outdated Show resolved Hide resolved
@brabster brabster requested a review from svdimchenko May 31, 2023 13:29
@brabster
Copy link
Contributor Author

Not sure how to close off the 2x requested changes - they have been implemented

svdimchenko
svdimchenko previously approved these changes May 31, 2023
Jrmyy
Jrmyy previously approved these changes Jun 2, 2023
@Jrmyy Jrmyy requested a review from svdimchenko June 2, 2023 08:53
@brabster brabster mentioned this pull request Jun 6, 2023
@brabster
Copy link
Contributor Author

brabster commented Jun 7, 2023

I've been trying out these changes in my real project and looking at the logs I've got a suspicion that native_drop might be taking effect when it shouldn't... super difficult to be confident without tests and it's unclear how to write/augment tests that ensure the correct behaviour in on/off cases.

I did wonder about asserting the logs do or do not contain specific text (this is how I'm checking by hand) but I'm not sure how to assert against the caplog for the existing test cases that are extending dbt base tests, any ideas?

@brabster brabster dismissed stale reviews from svdimchenko and Jrmyy via de00c89 June 7, 2023 18:54
@brabster
Copy link
Contributor Author

brabster commented Jun 7, 2023

There was a bug, the way I was trying to get the value of 'native_drop' as a boolean was always true and that meant that any iceberg table would get the native drop behaviour regardless of the flag.

{%- set native_drop = config.get('native_drop', default='False') | as_bool -%}

Swapped to

{%- set native_drop = config.get('native_drop', default=false) -%}

Based on similar bool config option in sqlserver adapter and seems to work correctly based on log inspection. Also removed a few messy log messages that I'd left in there.

Gotta love booleans in text-based config 🙄

@svdimchenko
Copy link
Contributor

There was a bug, the way I was trying to get the value of 'native_drop' as a boolean was always true and that meant that any iceberg table would get the native drop behaviour regardless of the flag.

{%- set native_drop = config.get('native_drop', default='False') | as_bool -%}

Swapped to

{%- set native_drop = config.get('native_drop', default=false) -%}

Based on similar bool config option in sqlserver adapter and seems to work correctly based on log inspection. Also removed a few messy log messages that I'd left in there.

Gotta love booleans in text-based config 🙄

I would prefer to have all such configs in format

{%- set native_drop = config.get('native_drop', False) -%}

just try to run the following code in python console:

d = {'a': 'b'}
d.get('c', default=None)  # raise TypeError: dict.get() takes no keyword arguments
d.get('c')  # return None
d.get('c', 'test')  # return string 'test'
d.get('c', False)  # return boolean False  

@brabster
Copy link
Contributor Author

brabster commented Jun 8, 2023

@svdimchenko thanks for the review - my adjustments are based on several trials of different ways of expressing a boolean such that it worked in the iceberg test cases and in my actual model dbt config. I was really struggling to work out what the heck type was actually coming out in the Jinja template until I found that sqlserver example and this SO post which refers to Jinja docs and suggests that the bool types are lowercased in Jinja:

The special constants true, false, and none are indeed lowercase. Because that caused confusion in the past, (True used to expand to an undefined variable that was considered false), all three can now also be written in title case (True, False, and None). However, for consistency, (all Jinja identifiers are lowercase) you should use the lowercase versions.

So i'm confused now - bool casing should be lower, and config.get('foo', default='bar') form is used everywhere else in the jinja templates? Maybe I'm misunderstanding?

svdimchenko
svdimchenko previously approved these changes Jun 9, 2023
README.md Show resolved Hide resolved
Jrmyy
Jrmyy previously approved these changes Jun 9, 2023
@Jrmyy Jrmyy added the enable-functional-tests Label to trigger functional testing label Jun 9, 2023
@svdimchenko svdimchenko dismissed stale reviews from Jrmyy and themself via d4f873d June 12, 2023 10:53
@Jrmyy Jrmyy merged commit 1b44a9a into dbt-labs:main Jun 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enable-functional-tests Label to trigger functional testing
Projects
None yet
5 participants