feat: Add Snowflake table last updated timestamp extractor #348

Alagappan · 2020-08-27T09:11:22Z

Summary of Changes

Related issue : amundsen-io/amundsen#664
Adding a new extractor to extract table last updated timestamp for tables in Snowflake.

Tests

Added new unit test test_snowflake_last_updated_timestamp_extractor.py to improve coverage.

Documentation

Updated README.md to extend the extractor list and added notes on how to use the new extractor.

CheckList

Make sure you have checked all steps below to ensure a timely review.

- PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  - In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
- PR includes a summary of changes.
- PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
- In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain docstrings that explain what it does
- PR passes make test

Signed-off-by: Alagappan Sethuraman <alagappan.als@gmail.com>

Alagappan · 2020-08-28T18:28:42Z

cc @feng-tao

Alagappan · 2020-08-28T21:06:40Z

cc @jinhyukchang

feng-tao · 2020-08-28T20:09:56Z

databuilder/extractor/snowflake_table_last_updated_extractor.py

+        snowflake-sqlalchemy
+    """
+    # TODO: SELECT statement from snowflake information_schema to extract table last update time
+    SQL_STATEMENT = """


Alagappan, thanks for the contribution. I know we added a Hive last updated extractor a long time ago. I wonder whether it would make more sense to use the SQLAlchemy extractor with the model directly instead of building an extractor for each piece of sub-metadata.

Interesting. Could you elaborate a bit more? I am all for getting all metadata in a single extractor, if possible. Are you suggesting we extract this last_updated_timestamp along with the table and column metadata? If so, it should be pretty straightforward to do that for Snowflake.

I am saying that instead of adding another extractor for this metadata, we could just use the SQLAlchemy extractor plus the table last updated model with a Snowflake connection.
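To make the suggestion concrete, here is a rough sketch of what that generic configuration might look like. It is only an illustration under assumptions: the `extractor.sqlalchemy.*` key names, the `TableLastUpdated` model path, the column aliases the model expects, and the connection string are all assumed rather than taken from this PR, so check them against the databuilder source before relying on them. The config is shown as a plain dict so the snippet runs without databuilder installed.

```python
# Hypothetical sketch of the "generic SQLAlchemy extractor + model" approach.
# The database name and credentials below are placeholders.
SNOWFLAKE_DATABASE = "YourDbName"

# Assumption: the model is built from the row's column names, so the aliases
# must match the model's constructor arguments.
EXTRACT_SQL = """
SELECT
    'snowflake' AS db,
    lower(table_catalog) AS cluster,
    lower(table_schema) AS schema,
    lower(table_name) AS table_name,
    DATE_PART(EPOCH, last_altered) AS last_updated_time_epoch
FROM {database}.INFORMATION_SCHEMA.TABLES
""".format(database=SNOWFLAKE_DATABASE)

# Assumption: these key names mirror the generic SQLAlchemy extractor's scope
# and config keys; they are not confirmed by this thread.
job_config_dict = {
    "extractor.sqlalchemy.conn_string": "snowflake://user:password@account/YourDbName",
    "extractor.sqlalchemy.extract_sql": EXTRACT_SQL,
    "extractor.sqlalchemy.model_class": "databuilder.models.table_last_updated.TableLastUpdated",
}
```

The trade-off is that every user has to write and maintain this SQL themselves, whereas a dedicated extractor ships it ready to use.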

Ahh I see. I agree with what you are saying. For folks like us who have seen Amundsen in production, it seems straightforward to do that, but for people who are new to Amundsen it isn't clear which metadata pieces we support on the table details page. I added this extractor to make it easy for Snowflake users to deploy Amundsen and have a couple of metadata pieces already wired up to give a good feel for the product.

Let me know if you are still opposed to adding this in. Happy to discuss further.

I think there's value in this extractor, since it encapsulates the last updated timestamp SQL. WDYT @feng-tao ?

Alagappan · 2020-08-31T18:41:17Z

@feng-tao Please let me know what you think. Would like to get this merged if we are okay with adding this extractor.

feng-tao

Hey @Alagappan, I synced with @jinhyukchang, and I am convinced by this change. I left a few comments. Let me know what you think.

feng-tao · 2020-08-31T19:38:33Z

README.md

+job_config = ConfigFactory.from_dict({
+	'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.SNOWFLAKE_DATABASE_KEY): 'YourDbName',
+	'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
+    'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,


the indentation seems off.

feng-tao · 2020-08-31T19:39:19Z

README.md

+
+It uses same configs as the `SnowflakeMetadataExtractor` described above.
+
+The SQL query driving the extraction is defined [here](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py#L25)


I would suggest pinning it to a SHA, as the line number for a given SQL statement will change over time.

Sounds good. I will just point it to master, just like the rest of the links.

feng-tao · 2020-08-31T19:40:58Z

README.md

@@ -314,6 +314,27 @@ job = DefaultJob(
 job.launch()
 ```

+#### [SnowflakeTableLastUpdatedExtractor](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/snowflake_table_last_updated_extractor.py "SnowflakeTableLastUpdatedExtractor")
+An extractor that table last updated timestamp from a Snowflake database.


An extractor that extracts the table last updated timestamp

feng-tao · 2020-08-31T19:41:38Z

databuilder/extractor/snowflake_table_last_updated_extractor.py

+            lower({cluster_source}) AS cluster,
+            lower(t.table_schema) AS schema,
+            lower(t.table_name) AS table_name,
+            DATE_PART(EPOCH, t.last_altered) AS last_updated_time


Could you add a comment on the SQL explaining where and how Snowflake defines the last updated time?
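For illustration, a commented version of the statement could look roughly like the sketch below. The wording of the SQL comment is my own reading of Snowflake's `LAST_ALTERED` column and should be checked against the Snowflake documentation. The `build_sql` helper and its defaults are hypothetical and are not part of this PR.

```python
# Sketch only: the SQL comment text and the build_sql helper are assumptions.
SQL_TEMPLATE = """
-- last_altered comes from INFORMATION_SCHEMA.TABLES; as far as I understand,
-- Snowflake updates it when the table is changed by DML or DDL.
-- DATE_PART(EPOCH, ...) converts that timestamp to seconds since the Unix epoch.
SELECT
    lower({cluster_source}) AS cluster,
    lower(t.table_schema) AS schema,
    lower(t.table_name) AS table_name,
    DATE_PART(EPOCH, t.last_altered) AS last_updated_time
FROM
    {database}.INFORMATION_SCHEMA.TABLES t
{where_clause_suffix};
"""


def build_sql(database, cluster_source="t.table_catalog", where_clause_suffix=""):
    """Fill in the template placeholders (hypothetical helper)."""
    return SQL_TEMPLATE.format(
        cluster_source=cluster_source,
        database=database,
        where_clause_suffix=where_clause_suffix,
    )


sql = build_sql("YourDbName", where_clause_suffix="WHERE t.table_schema = 'PUBLIC'")
```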

feng-tao · 2020-08-31T19:41:53Z

databuilder/extractor/snowflake_table_last_updated_extractor.py

+            lower(t.table_name) AS table_name,
+            DATE_PART(EPOCH, t.last_altered) AS last_updated_time
+        FROM
+            {database}.INFORMATION_SCHEMA.TABLES t


do we need table alias t?

Just a convenience, as the table is referred to in lines 27-29.

feng-tao · 2020-08-31T19:51:59Z