-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add Snowflake table last updated timestamp extractor #348
feat: Add Snowflake table last updated timestamp extractor #348
Conversation
Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
5356e2a
to
b046f93
Compare
cc @feng-tao |
snowflake-sqlalchemy | ||
""" | ||
# TODO: SELECT statement from snowflake information_schema to extract table last update time | ||
SQL_STATEMENT = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alagappan, thanks for the contribution, I know we have a hive last updated extractor long time ago. I wonder whether it will make more sense to use sqlchemey extractor with the model directly instead of building an extractor for each sub metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Could you elaborate a bit more? I am all for getting all metadata in a single extractor, if possible. Are you suggesting, we extract this last_updated_timestamp
along with table and column metadata. If so, it should be pretty straightforward to do that for snowflake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying instead of another extractor for this metadata, we could just do sqlachemy extractor + table last updated model with snowflake connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see. I agree with what you are saying. For folks like us who have seen Amundsen in production, it seems straightforward to do that but people who are new to Amundsen it doesn't appear to be clear what metadata pieces we support on the table details page. I added this extractor to make it easy for Snowflake users to deploy Amundsen and have couple of metadata pieces already wired up to give a good feel for the product.
Let me know if you are still opposed to adding this in. Happy to discuss further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see. I agree with what you are saying. For folks like us who have seen Amundsen in production, it seems straightforward to do that but people who are new to Amundsen it doesn't appear to be clear what metadata pieces we support on the table details page. I added this extractor to make it easy for Snowflake users to deploy Amundsen and have couple of metadata pieces already wired up to give a good feel for the product.
Let me know if you are still opposed to adding this in. Happy to discuss further.
I think there's a value on this extractor which encapsulates last timestamp SQL. WDYT @feng-tao ?
@feng-tao Please let me know what you think. Would like to get this merged if we are okay with adding this extractor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @Alagappan , I synced with @jinhyukchang , so am convinced with this change. I put a few comments. Let me know what you think.
README.md
Outdated
job_config = ConfigFactory.from_dict({ | ||
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.SNOWFLAKE_DATABASE_KEY): 'YourDbName', | ||
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix, | ||
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.USE_CATALOG_AS_CLUSTER_NAME): True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the indentation seems off.
README.md
Outdated
|
||
It uses same configs as the `SnowflakeMetadataExtractor` described above. | ||
|
||
The SQL query driving the extraction is defined [here](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py#L25) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to ping it to a sha as the Line number would change for a given sql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will just point it to the master just like rest of the links.
README.md
Outdated
@@ -314,6 +314,27 @@ job = DefaultJob( | |||
job.launch() | |||
``` | |||
|
|||
#### [SnowflakeTableLastUpdatedExtractor](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py "SnowflakeTableLastUpdatedExtractor") | |||
An extractor that table last updated timestamp from a Snowflake database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An extractor that extracts the table last updated timestamp
lower({cluster_source}) AS cluster, | ||
lower(t.table_schema) AS schema, | ||
lower(t.table_name) AS table_name, | ||
DATA_PART(EPOCH, t.last_altered) AS last_updated_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a comment on the SQL from snowflake on where / how it defines last updated time?
lower(t.table_name) AS table_name, | ||
DATA_PART(EPOCH, t.last_altered) AS last_updated_time | ||
FROM | ||
{database}.INFORMATION_SCHEMA.TABLES t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need table alias t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a convenience. As the table is being referred in lines 27-29.
Test when USE_CATALOG_AS_CLUSTER_NAME is true (CLUSTER_KEY should be ignored) | ||
""" | ||
def setUp(self) -> None: | ||
logging.basicConfig(level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason we need this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove all the logging from tests. Doesn't look like we are using it. I based it off of the existing unit tests.
Test when USE_CATALOG_AS_CLUSTER_NAME is false and CLUSTER_KEY is NOT specified | ||
""" | ||
def setUp(self) -> None: | ||
logging.basicConfig(level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Test with DATABASE_KEY config specified | ||
""" | ||
def setUp(self) -> None: | ||
logging.basicConfig(level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Test with 'USE_CATALOG_AS_CLUSTER_NAME' is false and 'CLUSTER_KEY' is specified | ||
""" | ||
def setUp(self) -> None: | ||
logging.basicConfig(level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Test 'where_cluser' config key in extractor | ||
""" | ||
def setUp(self) -> None: | ||
logging.basicConfig(level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
@feng-tao PR updated. Please take a look and let me know what you think. Thanks! |
Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
Summary of Changes
Related issue : amundsen-io/amundsen#664
Adding a new extractor to extract table last updated timestamp for tables in Snowflake.
Tests
Added new unit test
test_snowflake_last_updated_timestamp_extractor.py
to improve coverage.Documentation
Updated the README.md file to update extractor list and added some notes on how to use the new extractor
CheckList
Make sure you have checked all steps below to ensure a timely review.
make test