Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Snowflake table last updated timestamp extractor #348

Merged

Conversation

Alagappan
Copy link
Contributor

Summary of Changes

Related issue : amundsen-io/amundsen#664
Adding a new extractor to extract table last updated timestamp for tables in Snowflake.

Tests

Added new unit test test_snowflake_last_updated_timestamp_extractor.py to improve coverage.

Documentation

Updated the README.md file to update extractor list and added some notes on how to use the new extractor

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
@Alagappan Alagappan force-pushed the snowflake_last_updated_extractor branch from 5356e2a to b046f93 Compare August 27, 2020 09:14
@Alagappan
Copy link
Contributor Author

cc @feng-tao

@Alagappan
Copy link
Contributor Author

cc @jinhyukchang

snowflake-sqlalchemy
"""
# TODO: SELECT statement from snowflake information_schema to extract table last update time
SQL_STATEMENT = """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alagappan, thanks for the contribution, I know we have a hive last updated extractor long time ago. I wonder whether it will make more sense to use sqlchemey extractor with the model directly instead of building an extractor for each sub metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Could you elaborate a bit more? I am all for getting all metadata in a single extractor, if possible. Are you suggesting, we extract this last_updated_timestamp along with table and column metadata. If so, it should be pretty straightforward to do that for snowflake.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am saying instead of another extractor for this metadata, we could just do sqlachemy extractor + table last updated model with snowflake connection.

Copy link
Contributor Author

@Alagappan Alagappan Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see. I agree with what you are saying. For folks like us who have seen Amundsen in production, it seems straightforward to do that but people who are new to Amundsen it doesn't appear to be clear what metadata pieces we support on the table details page. I added this extractor to make it easy for Snowflake users to deploy Amundsen and have couple of metadata pieces already wired up to give a good feel for the product.

Let me know if you are still opposed to adding this in. Happy to discuss further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see. I agree with what you are saying. For folks like us who have seen Amundsen in production, it seems straightforward to do that but people who are new to Amundsen it doesn't appear to be clear what metadata pieces we support on the table details page. I added this extractor to make it easy for Snowflake users to deploy Amundsen and have couple of metadata pieces already wired up to give a good feel for the product.

Let me know if you are still opposed to adding this in. Happy to discuss further.

I think there's a value on this extractor which encapsulates last timestamp SQL. WDYT @feng-tao ?

@Alagappan
Copy link
Contributor Author

@feng-tao Please let me know what you think. Would like to get this merged if we are okay with adding this extractor.

Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @Alagappan , I synced with @jinhyukchang , so am convinced with this change. I put a few comments. Let me know what you think.

README.md Outdated
job_config = ConfigFactory.from_dict({
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.SNOWFLAKE_DATABASE_KEY): 'YourDbName',
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the indentation seems off.

README.md Outdated

It uses same configs as the `SnowflakeMetadataExtractor` described above.

The SQL query driving the extraction is defined [here](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py#L25)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to ping it to a sha as the Line number would change for a given sql

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I will just point it to the master just like rest of the links.

README.md Outdated
@@ -314,6 +314,27 @@ job = DefaultJob(
job.launch()
```

#### [SnowflakeTableLastUpdatedExtractor](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py "SnowflakeTableLastUpdatedExtractor")
An extractor that table last updated timestamp from a Snowflake database.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An extractor that extracts the table last updated timestamp

lower({cluster_source}) AS cluster,
lower(t.table_schema) AS schema,
lower(t.table_name) AS table_name,
DATA_PART(EPOCH, t.last_altered) AS last_updated_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a comment on the SQL from snowflake on where / how it defines last updated time?

lower(t.table_name) AS table_name,
DATA_PART(EPOCH, t.last_altered) AS last_updated_time
FROM
{database}.INFORMATION_SCHEMA.TABLES t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need table alias t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a convenience. As the table is being referred in lines 27-29.

Test when USE_CATALOG_AS_CLUSTER_NAME is true (CLUSTER_KEY should be ignored)
"""
def setUp(self) -> None:
logging.basicConfig(level=logging.INFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we need this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove all the logging from tests. Doesn't look like we are using it. I based it off of the existing unit tests.

Test when USE_CATALOG_AS_CLUSTER_NAME is false and CLUSTER_KEY is NOT specified
"""
def setUp(self) -> None:
logging.basicConfig(level=logging.INFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Test with DATABASE_KEY config specified
"""
def setUp(self) -> None:
logging.basicConfig(level=logging.INFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Test with 'USE_CATALOG_AS_CLUSTER_NAME' is false and 'CLUSTER_KEY' is specified
"""
def setUp(self) -> None:
logging.basicConfig(level=logging.INFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Test 'where_cluser' config key in extractor
"""
def setUp(self) -> None:
logging.basicConfig(level=logging.INFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
@Alagappan
Copy link
Contributor Author

@feng-tao PR updated. Please take a look and let me know what you think. Thanks!

Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>
@feng-tao feng-tao merged commit 0bac11b into amundsen-io:master Sep 2, 2020
szczeles pushed a commit to szczeles/amundsendatabuilder that referenced this pull request Nov 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants