feat(ingest): add snowflake-queries source #10835
Walkthrough
The recent updates to the metadata ingestion module focus on enhancing Snowflake integration. Key changes include adding dependencies for Snowflake queries, refining lineage mapping, and improving schema and usage statistics extraction. Additionally, new entities and test cases have been introduced to support these enhancements, ensuring robust functionality and better handling of external lineage information and query details.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (21 hunks)
- metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (1 hunks)
- metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Additional comments not posted (33)
metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (5)
18-18: Well-documented field: queryCount. The field queryCount is well-documented and includes a TimeseriesField annotation for time series data.
24-24: Well-documented field: queryCost. The field queryCost is well-documented and includes a TimeseriesField annotation for time series data.
30-30: Well-documented field: lastExecutedAt. The field lastExecutedAt is well-documented and includes a TimeseriesField annotation for time series data.
36-36: Well-documented field: uniqueUserCount. The field uniqueUserCount is well-documented and includes a TimeseriesField annotation for time series data.
42-42: Well-documented field: userCounts. The field userCounts is well-documented and includes a TimeseriesFieldCollection annotation for time series data collection.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (9)
49-69: Good use of configuration mixins and default values. The SnowflakeQueriesConfig class effectively uses configuration mixins and provides sensible default values for its fields.
73-77: Well-structured report class. The SnowflakeQueriesReport class is well-structured and extends SourceReport.
80-107: Comprehensive initialization. The __init__ method provides a comprehensive initialization of the SnowflakeQueriesSource class, setting up the context, configuration, report, and aggregator.
108-112: Factory method for creating instances. The create method is a factory method that parses configuration and creates an instance of SnowflakeQueriesSource.
113-123: Efficient use of cached property for local temp path. The local_temp_path method efficiently uses the cached_property decorator to manage the local temporary path.
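As a rough illustration of the cached_property pattern mentioned in the comment on local_temp_path, the sketch below creates a temporary directory lazily and only once per instance. The class name QueryLogWorkspace and the counter attribute are hypothetical and are not taken from the PR; only functools.cached_property and the property name come from the review text.

```python
import pathlib
import tempfile
from functools import cached_property


class QueryLogWorkspace:
    """Hypothetical holder for a lazily created scratch directory."""

    def __init__(self) -> None:
        # Counts how many times the directory was actually created.
        self.temp_dirs_created = 0

    @cached_property
    def local_temp_path(self) -> pathlib.Path:
        # Runs on first access only; the result is then stored on the instance.
        self.temp_dirs_created += 1
        return pathlib.Path(tempfile.mkdtemp(prefix="query-log-"))


workspace = QueryLogWorkspace()
first = workspace.local_temp_path
second = workspace.local_temp_path  # served from the instance cache
```

Because the value is stored on the instance itself, it is released together with the instance, which is one reason this decorator is often preferred over a module-level cache for per-object resources.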
124-151: Efficient handling of audit log and work units. The get_workunits_internal method efficiently handles the audit log and generates metadata work units.
152-196: Detailed method for fetching audit log. The fetch_audit_log method is detailed and includes TODO comments for future enhancements.
202-293: Comprehensive audit log response parsing. The _parse_audit_log_response method provides comprehensive parsing of audit log responses, converting them into PreparsedQuery objects.
295-296: Simple method for retrieving the report. The get_report method is simple and straightforward, returning the SnowflakeQueriesReport instance.
metadata-models/src/main/resources/entity-registry.yml (1)
507-507: Correctly added queryUsageStatistics aspect. The queryUsageStatistics aspect has been correctly added to the list of aspects for the query entity.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (6)
33-33: Correctly imported KnownLineageMapping. The KnownLineageMapping class has been correctly imported from datahub.sql_parsing.sql_parsing_aggregator.
268-271: Correctly updated method to return Iterable[KnownLineageMapping]. The _populate_external_lineage_from_copy_history method has been correctly updated to return an iterable of KnownLineageMapping objects.
277-280: Correctly updated method to return Iterable[KnownLineageMapping]. The _populate_external_lineage_from_show_query method has been correctly updated to return an iterable of KnownLineageMapping objects.
Line range hint 355-371: Correctly updated method to return Optional[KnownLineageMapping]. The _process_external_lineage_result_row method has been correctly updated to return an optional KnownLineageMapping object.
268-281: Efficient handling of external upstreams. The _populate_external_upstreams method efficiently handles the addition of external upstreams using the updated methods.
268-281: Efficient handling of work units. The get_workunits method efficiently handles the addition of work units using the updated methods.
metadata-ingestion/setup.py (2)
412-412: Addition of snowflake-queries plugin dependency looks good. The new snowflake-queries plugin dependency has been added correctly.
665-665: Addition of snowflake-queries entry point looks good. The new entry point for snowflake-queries has been added correctly.
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (10)
439-450: New Method: Verify the usage of add method. The add method is new. Ensure it is used appropriately throughout the code.
17-17: New Imports: Ensure they are necessary and used correctly. The imports from datahub.configuration.time_window_config, datahub.sql_parsing.sql_parsing_common, and datahub.sql_parsing.sqlglot_lineage are new. Ensure these are necessary and used correctly in the code.
Also applies to: 34-34
Verification successful
Verified: The new imports are necessary and used correctly.
- get_time_bucket is used in multiple files, including sql_parsing_aggregator.py.
- QueryTypeProps is used in sql_parsing_aggregator.py and other related files.
- ColumnLineageInfo is used extensively in sql_parsing_aggregator.py and other files.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the new imports are necessary and used in the code.
# Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
rg --type python $'get_time_bucket'
rg --type python $'QueryTypeProps'
rg --type python $'ColumnLineageInfo'
Length of output: 198
Script:
#!/bin/bash
# Description: Verify that the new imports are necessary and used in the code.
# Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
rg --type py 'get_time_bucket'
rg --type py 'QueryTypeProps'
rg --type py 'ColumnLineageInfo'
Length of output: 14733
147-169: New Class: Verify the usage of PreparsedQuery. The PreparsedQuery class is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified Usage: PreparsedQuery class
The PreparsedQuery class is used appropriately throughout the codebase, as evidenced by its presence in multiple files.
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `PreparsedQuery` class.
# Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
rg --type python $'PreparsedQuery'
Length of output: 64
Script:
#!/bin/bash
# Description: Verify the usage of the new `PreparsedQuery` class.
# Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
rg --type py 'PreparsedQuery'
Length of output: 1100
140-145: New Class: Verify the usage of KnownLineageMapping. The KnownLineageMapping class is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified Usage: The KnownLineageMapping class is used appropriately throughout the codebase.
The KnownLineageMapping class is found in multiple files, indicating proper integration and usage:
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `KnownLineageMapping` class.
# Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
rg --type python $'KnownLineageMapping'
Length of output: 69
Script:
#!/bin/bash
# Description: Verify the usage of the new `KnownLineageMapping` class.
# Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
rg 'KnownLineageMapping'
Length of output: 1374
249-249: New Parameters and Attributes: Verify the usage of generate_query_usage_statistics and _query_usage_counts in SqlParsingAggregator. The generate_query_usage_statistics parameter and _query_usage_counts attribute are new. Ensure they are used appropriately throughout the code.
Also applies to: 263-263, 270-271, 368-376
Verification successful
New Parameters and Attributes: Verified the usage of generate_query_usage_statistics and _query_usage_counts in SqlParsingAggregator.
The generate_query_usage_statistics parameter and _query_usage_counts attribute are used appropriately throughout the codebase.
- generate_query_usage_statistics is used in sql_parsing_aggregator.py and snowflake_queries.py.
- _query_usage_counts is used in sql_parsing_aggregator.py.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.
# Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
rg --type python $'generate_query_usage_statistics'
# Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
rg --type python $'_query_usage_counts'
Length of output: 151
Script:
#!/bin/bash
# Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.
# Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
rg 'generate_query_usage_statistics'
# Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
rg '_query_usage_counts'
Length of output: 1859
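The review does not show how the aggregator tracks query usage, so the following is only a minimal sketch of one way per-bucket counts could be rolled up into the queryCount, uniqueUserCount, and userCounts fields named in QueryUsageStatistics.pdl. The class QueryUsageCounter, the day_bucket helper, and the dictionary layout are all assumptions made for illustration; they are not the PR's implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterator, Tuple


def day_bucket(ts: datetime) -> datetime:
    """Hypothetical helper: truncate a timestamp to the start of its day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


class QueryUsageCounter:
    """Hypothetical per-query, per-bucket usage counter."""

    def __init__(self) -> None:
        # (query_id, bucket start) -> Counter mapping user -> executions
        self._counts: Dict[Tuple[str, datetime], Counter] = defaultdict(Counter)

    def add(self, query_id: str, timestamp: datetime, user: str) -> None:
        self._counts[(query_id, day_bucket(timestamp))][user] += 1

    def stats(self) -> Iterator[dict]:
        # One summary row per (query, bucket), using the PDL field names.
        for (query_id, bucket), users in sorted(self._counts.items()):
            yield {
                "query_id": query_id,
                "bucket": bucket,
                "queryCount": sum(users.values()),
                "uniqueUserCount": len(users),
                "userCounts": dict(users),
            }


counter = QueryUsageCounter()
start = datetime(2024, 7, 1, 9, 30, tzinfo=timezone.utc)
counter.add("q1", start, "alice")
counter.add("q1", start.replace(hour=17), "bob")
counter.add("q1", start.replace(hour=18), "alice")
rows = list(counter.stats())
```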
202-202: New Attributes: Verify the usage of num_preparsed_queries and num_query_usage_stats_generated in SqlAggregatorReport. The num_preparsed_queries and num_query_usage_stats_generated attributes are new. Ensure they are used appropriately throughout the code.
Also applies to: 225-225
Verification successful
New Attributes: Verified usage of num_preparsed_queries and num_query_usage_stats_generated in SqlAggregatorReport.
The attributes num_preparsed_queries and num_query_usage_stats_generated are used appropriately in the codebase:
- num_preparsed_queries: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py lines 202 and 225.
- num_query_usage_stats_generated: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py lines 225 and 226.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.
# Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
rg --type python $'num_preparsed_queries'
# Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
rg --type python $'num_query_usage_stats_generated'
Length of output: 153
Script:
#!/bin/bash
# Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.
# Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
rg 'num_preparsed_queries'
# Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
rg 'num_query_usage_stats_generated'
Length of output: 535
107-107: New Attribute: Verify the usage of column_usage in QueryMetadata. The column_usage attribute is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified: The column_usage attribute is used appropriately throughout the codebase.
The column_usage attribute is integrated and utilized correctly in various parts of the code.
- File: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
- File: metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.
# Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
rg --type python $'column_usage'
Length of output: 62
Script:
#!/bin/bash
# Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.
# Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
rg 'column_usage'
Length of output: 1746
Script:
#!/bin/bash
# Description: Extract relevant code segments where `column_usage` is used for manual review.
# Extract lines containing `column_usage` in the relevant files for manual inspection.
rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Length of output: 4412
1140-1152: New Methods: Verify the usage of _gen_remaining_queries and _gen_query methods. The _gen_remaining_queries and _gen_query methods are new. Ensure they are used appropriately throughout the code.
Also applies to: 1156-1240
Verification successful
New Methods: Verified the usage of _gen_remaining_queries and _gen_query methods.
The _gen_remaining_queries and _gen_query methods are defined and used appropriately within metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py.
- _gen_remaining_queries is defined at line 1140 and used at line 1131.
- _gen_query is defined at line 1156 and used at lines 1160 and 1170.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.
# Test: Search for the usage of the `_gen_remaining_queries` method. Expect: Usage of the method in the code.
rg --type python $'def _gen_remaining_queries'
# Test: Search for the usage of the `_gen_query` method. Expect: Usage of the method in the code.
rg --type python $'def _gen_query'
Length of output: 145
Script:
#!/bin/bash
# Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.
# Test: Search for the definition of the `_gen_remaining_queries` method.
rg 'def _gen_remaining_queries'
# Test: Search for the definition of the `_gen_query` method.
rg 'def _gen_query'
# Test: Search for the usage of the `_gen_remaining_queries` method.
rg '_gen_remaining_queries'
# Test: Search for the usage of the `_gen_query` method.
rg '_gen_query'
Length of output: 880
659-669: New Method: Verify the usage of add_preparsed_query method. The add_preparsed_query method is new. Ensure it is used appropriately throughout the code.
Verification successful
New Method: Verify the usage of add_preparsed_query method.
The add_preparsed_query method is used appropriately in the codebase to handle PreparsedQuery objects.
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py: Lines around 650-670
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the new `add_preparsed_query` method.
# Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
rg --type python $'def add_preparsed_query'
Length of output: 75
Script:
#!/bin/bash
# Description: Verify the usage of the new `add_preparsed_query` method.
# Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
rg 'add_preparsed_query'
Length of output: 337
Script:
#!/bin/bash
# Description: Extract the context of `add_preparsed_query` method usage.
# Extract lines around the usage of `add_preparsed_query` method for deeper inspection.
rg -A 5 -B 5 'add_preparsed_query' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 1414
133-133: New Attribute: Verify the usage of column_usage in KnownQueryLineageInfo. The column_usage attribute is new. Ensure it is used appropriately throughout the code.
A couple of mypy issues there.
Actionable comments posted: 1
Outside diff range, codebase verification and nitpick comments (5)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (1)
Line range hint 42-54: Ensure SQL query safety. The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security.
sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample ({sample_size} rows);'
cursor.execute(sql)
You can use a parameterized query to avoid SQL injection:
sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample (%s rows);'
cursor.execute(sql, (sample_size,))
metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)
71-73: Ensure Correct Description for database and log_schema. The descriptions for database and log_schema fields should clearly explain their purpose related to the Fivetran connector log.
- database: str = Field(description="The fivetran connector log database.")
- log_schema: str = Field(description="The fivetran connector log schema.")
+ database: str = Field(description="The database where the Fivetran connector logs are stored.")
+ log_schema: str = Field(description="The schema within the Fivetran connector log database.")
metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (3)
Line range hint 34-50: Consider initializing known_urns in the constructor. To ensure all attributes are initialized in the constructor, consider initializing self.known_urns in the __init__ method.
- self.known_urns: Set[str] = set() # will be set later
+ self.known_urns: Set[str] = set()
Line range hint 290-293: Consider adding a detailed TODO comment. The TODO comment should provide more details on what needs to be implemented.
- # TODO actor
+ # TODO: Implement actor extraction for lineage rows.
Line range hint 295-303: Improve logging for filtered targets. Consider adding more details to the log message for better debugging.
- logger.debug(
-     f"Skipping lineage for {target.urn()} as it is not in known_urns"
- )
+ logger.debug(
+     f"Skipping lineage for target URN: {target.urn()} as it is not in known_urns"
+ )
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (36)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/api/source.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (12 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (14 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (6 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (10 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (20 hunks)
- metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (4 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (24 hunks)
- metadata-ingestion/tests/integration/snowflake/common.py (2 hunks)
- metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13 hunks)
- metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2 hunks)
- metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (3 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (3 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)
- metadata-ingestion/tests/unit/test_snowflake_source.py (9 hunks)
Files skipped from review due to trivial changes (1)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json
Additional context used
Ruff
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py
48-48: Local variable mock_connect is assigned to but never used. Remove assignment to unused variable mock_connect. (F841)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py
195-203: Return the negated condition directly. Inline condition. (SIM103)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py
119-119: Use key not in dict instead of key not in dict.keys(). Remove .keys(). (SIM118)
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
452-452: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks. (B019)
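For readers unfamiliar with these Ruff rules, the sketch below shows what the SIM118, SIM103, and B019 findings typically ask for. The function and class names are hypothetical stand-ins and do not reproduce the flagged code in snowflake_connection.py, snowflake_utils.py, or sql_parsing_aggregator.py.

```python
from typing import Dict, Iterable, List, Set


def missing_keys(required: Iterable[str], params: Dict[str, object]) -> List[str]:
    # SIM118: test membership on the dict itself, not on params.keys().
    return [key for key in required if key not in params]


def is_allowed(name: str, denied: Set[str]) -> bool:
    # SIM103: instead of
    #     if name in denied:
    #         return False
    #     return True
    # return the negated condition directly.
    return name not in denied


class Resolver:
    """B019 alternative: keep the cache on the instance.

    functools.lru_cache on a method stores `self` in a cache owned by the
    class-level function, which can keep every instance alive. A per-instance
    dict is released together with the instance.
    """

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of uncached computations, for illustration

    def resolve(self, name: str) -> str:
        if name not in self._cache:
            self.calls += 1
            self._cache[name] = name.lower()
        return self._cache[name]


resolver = Resolver()
resolver.resolve("DB.SCHEMA.TABLE")
resolver.resolve("DB.SCHEMA.TABLE")  # second call is served from the cache
```

Whether B019 is worth fixing depends on object lifetime: for a long-lived singleton the retained reference may be harmless, so suppressing the rule with a comment is also a common choice.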
Additional comments not posted (172)
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1)
84-100: Verify the structure and format of new entities. Ensure that the new entities added to the subjects array follow the correct structure and format.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (2)
Line range hint 63-78: Ensure SQL query safety and verify processing logic. The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.
Line range hint 103-121: Verify row processing logic. Ensure that the row processing logic correctly handles the data and generates the appropriate metadata change proposals.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (2)
Line range hint 67-127: Ensure SQL query safety and verify processing logic. The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.
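One caveat on the SQL-safety comments: bind parameters cover values, while database, schema, and table names generally cannot be passed as bind parameters. A minimal sketch of a complementary approach follows, assuming standard SQL identifier quoting (double quotes, with embedded double quotes doubled) and strict validation of the numeric sample size. The helper names quote_identifier and build_sample_query are hypothetical and are not part of the PR or of any Snowflake library.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quote."""
    if not name or "\x00" in name:
        raise ValueError(f"invalid identifier: {name!r}")
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(
    db_name: str, schema_name: str, table_name: str, sample_size: int
) -> str:
    """Build a sampling query with quoted identifiers and a checked row count."""
    # bool is a subclass of int, so reject it explicitly.
    if (
        isinstance(sample_size, bool)
        or not isinstance(sample_size, int)
        or sample_size <= 0
    ):
        raise ValueError(f"sample_size must be a positive integer: {sample_size!r}")
    table = ".".join(
        quote_identifier(part) for part in (db_name, schema_name, table_name)
    )
    return f"select * from {table} sample ({sample_size} rows);"


query = build_sample_query("db", "public", 'odd"name', 10)
```

The resulting string can be passed to cursor.execute as before; the difference is that a quote character inside a table name can no longer terminate the identifier early.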
Line range hint 129-131: Verify method for correctness. Ensure that the method correctly returns the summary report.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1)
116-129: Ensure Consistent Usage of URNs. The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.
Verification successful
Ensure Consistent Usage of URNs.
The URNs in the JSON file follow the expected format and refer to the correct entities. The patterns for urn:li:dataset and urn:li:schemaField are consistent throughout the file.
Datasets:
- urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD)
- urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD)
Schema Fields:
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),a)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),b)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),c)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),a)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),b)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),c)
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistency of URNs in the JSON file.
# Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json
Length of output: 1670
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1)
109-125: Ensure Consistent Usage of URNs. The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.
Verification successful
URNs are consistent and valid.
The URNs for datasets and schema fields in the specified JSON file follow the expected patterns and refer to the correct entities.
- urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD),<field>)
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistency of URNs in the JSON file.
# Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
Length of output: 1861
Script:
#!/bin/bash
# Description: Verify the consistency and validity of URNs in the JSON file.
# Search for dataset URNs and validate their patterns.
rg --type json 'urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
# Search for schemaField URNs and validate their patterns.
rg --type json 'urn:li:schemaField:\(urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\),[^)]*\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
Length of output: 3184
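The same pattern check can be expressed in Python, which is convenient inside a unit test. The sketch below mirrors the two URN shapes listed in the verification output; the regular expressions are a simplification written for this illustration (they assume names and field paths contain no commas or parentheses) and the function classify_urn is hypothetical, not a DataHub API.

```python
import re

# Simplified pattern for urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<ENV>)
_DATASET = r"urn:li:dataset:\(urn:li:dataPlatform:[^,()]+,[^,()]+,[A-Z]+\)"

DATASET_URN = re.compile(_DATASET)
# A schema field URN wraps a dataset URN and appends a field path.
SCHEMA_FIELD_URN = re.compile(
    r"urn:li:schemaField:\((" + _DATASET + r"),([^()]+)\)"
)


def classify_urn(urn: str) -> str:
    """Return 'dataset', 'schemaField', or 'unknown' for a URN string."""
    if DATASET_URN.fullmatch(urn):
        return "dataset"
    if SCHEMA_FIELD_URN.fullmatch(urn):
        return "schemaField"
    return "unknown"


dataset = "urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD)"
field = f"urn:li:schemaField:({dataset},a)"
kinds = [classify_urn(dataset), classify_urn(field), classify_urn("urn:li:dataset:(foo)")]
```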
metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)
16-18: Verify Import of SnowflakeConnectionConfig. Ensure that the SnowflakeConnectionConfig import is necessary and correctly used in the file.
Verification successful
Verified Import of SnowflakeConnectionConfig.
The import of SnowflakeConnectionConfig is necessary and correctly used in the file metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py. It is utilized in the definition of the SnowflakeDestinationConfig class, which inherits from SnowflakeConnectionConfig.
- Line 1: SnowflakeConnectionConfig is imported.
- Line 2: SnowflakeDestinationConfig class extends SnowflakeConnectionConfig.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `SnowflakeConnectionConfig` in the file.
# Test: Search for `SnowflakeConnectionConfig` in the file. Expect: The import is necessary and correctly used.
rg 'SnowflakeConnectionConfig' metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py
Length of output: 189
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1)
Line range hint
167-171
:
Ensure Proper Handling of Database and Schema Selection.The
callable_for_db_connection
method should handle database and schema selection correctly, especially when the PUBLIC schema is absent.Verification successful
Verified: Proper Handling of Database and Schema Selection.
The
callable_for_db_connection
method correctly handles the database and schema selection, explicitly managing the case when the PUBLIC schema is absent by checking for a schema name and executing theUSE SCHEMA
command if necessary.
- The method retrieves the native connection.
- Executes the
USE DATABASE
command.- Checks for a schema name and executes the
USE SCHEMA
command if provided.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the handling of database and schema selection in the `callable_for_db_connection` method.
# Test: Search for the method in the file. Expect: Proper handling of database and schema selection, especially when the PUBLIC schema is absent.
rg 'callable_for_db_connection' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py
Length of output: 241
Script:
#!/bin/bash
# Description: Retrieve the full implementation of the `callable_for_db_connection` method to verify the handling of database and schema selection.
# Use ast-grep to extract the implementation of `callable_for_db_connection` method from the file.
ast-grep --lang python --pattern 'def callable_for_db_connection(self, db_name: str) -> Callable: $$$' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py
Length of output: 2040
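The three steps listed in the verification above (get the connection, run `USE DATABASE`, run `USE SCHEMA` only when a schema name is available) can be sketched roughly as follows. This is not the profiler's actual code: `FakeConnection` and `make_db_connection_factory` are hypothetical names, and the connection only records the SQL it receives so the example runs without Snowflake.

```python
from typing import Callable, List, Optional


class FakeConnection:
    """Hypothetical stand-in for a native connection; it only records SQL."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def execute(self, sql: str) -> None:
        self.executed.append(sql)


def make_db_connection_factory(
    get_connection: Callable[[], FakeConnection],
    db_name: str,
    schema_name: Optional[str] = None,
) -> Callable[[], FakeConnection]:
    """Return a callable that yields a connection scoped to db_name."""

    def get_db_connection() -> FakeConnection:
        conn = get_connection()
        conn.execute(f'USE DATABASE "{db_name}"')
        # Only select a schema when one is known, so a missing PUBLIC
        # schema does not cause a failure here.
        if schema_name:
            conn.execute(f'USE SCHEMA "{schema_name}"')
        return conn

    return get_db_connection
```

Calling the returned factory with a schema name issues both statements; without one, only `USE DATABASE` is issued.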
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2)
96-112: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
160-182: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2)
121-137: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
185-201: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2)
84-100: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
199-215: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (3)
11-13: Correct mixin replacements.
The new mixins `EnvConfigMixin` and `PlatformInstanceConfigMixin` are correctly imported and used.
Line range hint 34-62: New class `SQLFilterConfig` looks good.
The new class `SQLFilterConfig` and its fields are correctly defined and adhere to best practices.
63-76: Updates to `SQLCommonConfig` class look good.
The updates to the `SQLCommonConfig` class and its fields are correctly defined and adhere to best practices.
metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (8)
4-9: Imports look good!
The added imports are relevant for the tests in this file.
76-79: Test case for missing role access looks good!
The test correctly checks for the `PipelineInitError` when the role is not granted.
4-4: Test case for missing warehouse access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for no databases with access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for no tables access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for listing columns error looks good!
The test correctly simulates the condition and asserts the expected warning message.
Also applies to: 76-79
4-4: Test case for listing primary keys error looks good!
The test correctly simulates the condition and asserts the expected warning message.
Also applies to: 76-79
4-4: Test cases for missing permissions look good!
The tests correctly simulate the conditions and assert the expected failure messages.
Also applies to: 76-79
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (5)
2-2: Imports look good!
The added import `Iterable` is relevant for the tests in this file.
Line range hint 28-49: Fixture setup looks good!
The `stateful_source` fixture correctly sets up the `SnowflakeV2Source` with the necessary configurations.
Tools
Ruff
48-48: Local variable `mock_connect` is assigned to but never used
Remove assignment to unused variable `mock_connect` (F841)
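One common way to clear an F841 finding on a patch context manager is to drop the `as` binding when the mock object is never read. The sketch below illustrates the pattern only; it patches `os.getcwd` as a self-contained stand-in, not the target the actual test patches.

```python
import os
from unittest.mock import patch


def cwd_under_patch() -> str:
    # No `as mock_getcwd` binding: the patch is still active inside the
    # block, but no unused local variable is created.
    with patch("os.getcwd", return_value="/mocked"):
        return os.getcwd()
```

If the mock is needed for assertions later (for example, to check call counts), keeping the binding and using it is the other way to resolve the warning.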
47-49: Test case for redundant run job IDs looks good!
The test correctly validates the job IDs for both lineage and usage extractors.
47-49: Test case for redundant run skip handler looks good!
The test correctly covers multiple scenarios and validates the skip logic and suggested time windows.
47-49: Utility functions and checkpoint tests look good!
The functions and tests correctly validate the checkpoint creation logic using mocks and assertions.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (1)
84-100: JSON data for SQL parsing test cases looks good!
The structure and values are correct and consistent with the expected schema.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (7)
22-33: Class `SnowflakeStructuredReportMixin` looks good!
The methods correctly use the `structured_reporter` for reporting warnings and errors.
Line range hint 36-63: Class `SnowflakeCommonProtocol` looks good!
The class defines essential methods and properties for Snowflake integration.
Line range hint 65-141: Class `SnowsightUrlBuilder` looks good!
The methods are well-structured and handle various scenarios for building URLs.
Line range hint 143-225: Class `SnowflakeFilterMixin` looks good!
The methods correctly implement the filtering logic based on the configurations.
Tools
Ruff
195-203: Return the negated condition directly
Inline condition (SIM103)
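For readers unfamiliar with SIM103, the rule targets an `if` that exists only to return `True` or `False`. A minimal sketch of the before and after shapes, using hypothetical function names and a made-up prefix check rather than the code at lines 195-203:

```python
def is_allowed_verbose(name: str, deny_prefix: str) -> bool:
    # Shape that SIM103 flags: the branches only return boolean literals.
    if name.startswith(deny_prefix):
        return False
    return True


def is_allowed(name: str, deny_prefix: str) -> bool:
    # Equivalent form: return the negated condition directly.
    return not name.startswith(deny_prefix)
```

Both functions return the same result for every input; the second simply removes the branch.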
227-258: Class `SnowflakeIdentifierMixin` looks good!
The methods correctly handle identifiers based on the configurations.
Line range hint 259-283: Class `SnowflakeCommonMixin` looks good!
The methods correctly combine the functionalities of the mixins and provide additional utilities.
Line range hint 259-283: Method `warn_if_stateful_else_error` looks good!
The method correctly checks the configuration and logs appropriately.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (1)
179-193: Ensure consistency in entity representation.
The JSON structure looks correct. However, ensure that all entities are consistently represented across the dataset.
metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (5)
Line range hint 271-288: Ensure proper exception handling and logging.
The method handles exceptions and logs warnings. Ensure that the logging provides sufficient context for debugging.
Line range hint 305-320: Handle missing DDL in STL scan entries.
The method logs a warning for missing DDL. Ensure that the warning provides sufficient context for debugging.
Line range hint 322-333: Ensure consistent handling of DDL in view lineage.
The method handles DDL for views. Ensure that the handling is consistent with other methods.
Line range hint 335-348: Ensure proper handling of source and target URNs.
The method handles source and target URNs for copy commands. Ensure that the handling is consistent and correct.
Line range hint 391-393: Ensure consistent generation of metadata work units.
The method generates metadata work units. Ensure that the generation is consistent and follows best practices.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (8)
Line range hint 135-155: Ensure comprehensive validation for OAuth configuration.
The method provides detailed validation for OAuth configuration. Ensure all edge cases are covered.
197-200: Ensure correct generation of SQLAlchemy URL.
The method correctly generates the SQLAlchemy URL with the provided parameters.
Line range hint 225-263: Ensure proper handling of private key in connection arguments.
The method correctly handles private key for connection arguments.
Line range hint 263-301: Ensure proper handling of OAuth connection.
The method correctly handles OAuth connection generation.
305-314: Ensure proper handling of key pair connection.
The method correctly handles key pair connection generation.
Line range hint 318-342: Ensure proper handling of native connection.
The method correctly handles native connection generation.
349-362: Ensure proper exception handling for connection generation.
The method handles exceptions correctly when generating a connection.
114-114: Remove unnecessary `.keys()` call.
Use `key not in dict` instead of `key not in dict.keys()`.
- if v not in _VALID_AUTH_TYPES.keys():
+ if v not in _VALID_AUTH_TYPES:
Likely invalid or redundant comment.
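The suggestion relies on the fact that membership tests on a Python `dict` already check its keys, so `.keys()` adds nothing. A small sketch, assuming a hypothetical mapping (the real contents of `_VALID_AUTH_TYPES` are not shown in this review):

```python
# Hypothetical mapping used only for illustration.
VALID_AUTH_TYPES = {
    "password": "username and password",
    "key_pair": "private key",
}


def is_valid_auth_type(value: str) -> bool:
    # `in` on a dict tests its keys; calling .keys() first is redundant.
    return value in VALID_AUTH_TYPES
```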
metadata-ingestion/src/datahub/ingestion/api/source.py (5)
117-119: Ensure context is truncated correctly.
The method correctly truncates the context if it exceeds the maximum length.
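As a rough illustration of the truncation behaviour described above, the sketch below cuts a context string at a maximum length and appends a marker. The limit, the `...` suffix, and the function name are all assumptions made for the example; the review does not show the values used in `source.py`.

```python
MAX_CONTEXT_LENGTH = 20  # hypothetical limit, chosen for the example


def truncate_context(context: str, max_length: int = MAX_CONTEXT_LENGTH) -> str:
    """Return context unchanged if short enough, else cut it and mark the cut."""
    if len(context) > max_length:
        return context[:max_length] + "..."
    return context
```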
Line range hint 142-146: Ensure correct retrieval of log entries.
The method correctly retrieves log entries of the specified type.
Line range hint 166-188: Ensure correct reporting of work units.
The method correctly reports work units and updates the relevant metrics.
Line range hint 194-199: Ensure correct reporting of warnings.
The method correctly reports warnings using structured logs.
Line range hint 225-232: Ensure correct computation of statistics.
The method correctly computes statistics for the report.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (9)
7-7: Imports look good.
The new imports from `pydantic` and `datahub.configuration.source_common` are appropriate for the added functionality.
Also applies to: 12-16
84-96: New fields in `SnowflakeFilterConfig` look good.
The added fields for `database_pattern`, `schema_pattern`, and `match_fully_qualified_names` are appropriate for filtering configurations.
103-125: Root validator logic is sound but check backward compatibility.
The root validator ensures proper configuration for schema patterns and maintains backward compatibility. Verify if the deprecation warning is communicated effectively to users.
128-134: New field in `SnowflakeIdentifierConfig` looks good.
The `convert_urns_to_lowercase` field with a default value of `True` is appropriate for identifier configurations.
146-167: New fields in `SnowflakeConfig` look good.
The added fields for including table and view lineage are appropriate for lineage configurations.
158-168: Root validator logic is sound but check dependency on `include_table_lineage`.
The root validator ensures that `include_table_lineage` is set to `True` when `include_view_lineage` is enabled. Verify if this dependency is clearly documented and communicated to users.
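The dependency between the two flags can be sketched with a plain dataclass. According to the review, the actual config uses a pydantic root validator; the class name, defaults, and error message below are assumptions, and `__post_init__` is used only to keep the example free of third-party imports.

```python
from dataclasses import dataclass


@dataclass
class LineageConfigSketch:
    # Hypothetical defaults; the real defaults are not shown in this review.
    include_table_lineage: bool = True
    include_view_lineage: bool = True

    def __post_init__(self) -> None:
        # View lineage depends on table lineage, so reject the one
        # combination where views are requested without tables.
        if self.include_view_lineage and not self.include_table_lineage:
            raise ValueError(
                "include_table_lineage must be True when "
                "include_view_lineage is enabled"
            )
```

Disabling both flags, or disabling only view lineage, passes validation; only view lineage without table lineage is rejected.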
Line range hint 170-365: New fields and validators in `SnowflakeV2Config` look good.
The added fields and validators for usage statistics, technical schema, primary and foreign keys, column lineage, lazy schema resolver, tags, and other configurations are appropriate for Snowflake V2.
327-330: Method `get_sql_alchemy_url` looks good.
The method constructs a SQLAlchemy URL for Snowflake using the connection configuration.
Line range hint 371-417: Method `validate_shares` looks good.
The method validates the `shares` configuration, ensuring that platform instances and databases are correctly configured.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (9)
10-10: Import looks good.
The new import from `datahub.ingestion.source.snowflake.snowflake_connection` is appropriate for the added functionality.
Line range hint 186-229: New methods in `SnowflakeDataDictionary` look good.
The added methods for showing databases, getting databases, and getting schemas for a database are appropriate for data dictionary operations.
Line range hint 270-299: Method `get_tables_for_database` looks good but verify error handling.
The method retrieves tables for a given database. Verify if the error handling for the query is sufficient.
Line range hint 303-313: Method `get_tables_for_schema` looks good.
The method retrieves tables for a given schema in a database.
Line range hint 331-361: Method `get_views_for_database` looks good but verify pagination logic.
The method retrieves views for a given database with pagination. Verify if the pagination logic handles large result sets correctly.
Line range hint 424-438: Method `get_pk_constraints_for_schema` looks good.
The method retrieves primary key constraints for a given schema in a database.
Line range hint 443-471: Method `get_fk_constraints_for_schema` looks good.
The method retrieves foreign key constraints for a given schema in a database.
Line range hint 475-496: Method `get_tags_for_database_without_propagation` looks good.
The method retrieves tags for a database without propagation.
Line range hint 530-541: Method `get_tags_on_columns_for_table` looks good.
The method retrieves tags on columns for a given table in a database.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (8)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (8)
1-12: Imports look good.
The new imports from `pydantic`, `pathlib`, and `typing_extensions` are appropriate for the added functionality.
57-87: New fields in `SnowflakeQueriesExtractorConfig` look good.
The added fields for window, deny usernames, temporary tables pattern, and local temp path are appropriate for query extraction configurations.
92-94: New field in `SnowflakeQueriesSourceConfig` looks good.
The added field for connection configuration is appropriate for Snowflake queries.
108-148: New methods in `SnowflakeQueriesExtractor` look good.
The added methods for initializing the extractor, handling configurations, and managing temporary paths are appropriate for query extraction.
175-203: Method `get_workunits_internal` looks good but verify caching logic.
The method retrieves work units for Snowflake queries. Verify if the caching logic for the audit log is sufficient.
205-258: Method `fetch_audit_log` looks good but verify error handling.
The method fetches the audit log for Snowflake queries. Verify if the error handling for parsing audit log rows is sufficient.
259-365: Method `_parse_audit_log_row` looks good but verify JSON parsing logic.
The method parses a row from the audit log. Verify if the JSON parsing logic for specific fields is sufficient.
402-501: Function `_build_enriched_audit_log_query` looks good.
The function constructs a query for fetching enriched audit logs with appropriate filters and pagination.
metadata-ingestion/tests/unit/test_snowflake_source.py (5)
27-27: Import looks good.
The new import from `datahub.ingestion.source.snowflake.snowflake_utils` is appropriate for the added functionality.
448-460: Function `test_aws_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to AWS cloud region.
470-472: Function `test_google_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to Google Cloud region.
Line range hint 482-492: Function `test_azure_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to Azure cloud region.
502-504: Function `test_unknown_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the handling of unknown Snowflake region IDs.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (17)
10-10: Import Statement for Closeable Interface Added
The `Closeable` interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.
18-19: Import Statement for SnowflakeConnection and SnowflakePermissionError Added
The `SnowflakeConnection` and `SnowflakePermissionError` imports were added, which are essential for handling Snowflake connections and related errors.
32-32: Import Statement for KnownLineageMapping Added
The `KnownLineageMapping` import was added. This is crucial for handling known lineage mappings in the lineage extraction process.
104-104: Class SnowflakeLineageExtractor Now Implements Closeable
The `SnowflakeLineageExtractor` class now implements the `Closeable` interface. This is important for ensuring that resources are properly released.
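The value of a `close` method is that callers can release resources deterministically. The sketch below does not use DataHub's `Closeable` class; it shows the general pattern with a hypothetical `ExtractorSketch` and the standard library's `contextlib.closing`, which works with any object that exposes `close()`.

```python
from contextlib import closing


class ExtractorSketch:
    """Hypothetical extractor that owns a connection-like resource."""

    def __init__(self) -> None:
        self.connection_open = True

    def close(self) -> None:
        # Release whatever the extractor holds; here this is just a flag.
        self.connection_open = False


def run_once() -> ExtractorSketch:
    # closing() calls extractor.close() on exit, even if the body raises.
    with closing(ExtractorSketch()) as extractor:
        assert extractor.connection_open
    return extractor
```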
121-121: Connection Initialization in Constructor
The `SnowflakeConnection` is now initialized in the constructor, which aligns with the PR objectives of initializing the connection in the constructor.
130-130: Use of SnowflakeConnection
The `SnowflakeConnection` is now assigned to `self.connection` in the constructor, which ensures that the connection is available throughout the class methods.
262-265: Use of KnownLineageMapping in _populate_external_upstreams
The `_populate_external_upstreams` method now uses `KnownLineageMapping`. This improves how external lineage data is processed and aggregated.
271-275: Use of KnownLineageMapping in _populate_external_upstreams
The `_populate_external_upstreams` method now uses `KnownLineageMapping` for show queries as well. This ensures consistency in handling external lineage data.
287-287: Return Type Changed to Iterable[KnownLineageMapping]
The `_populate_external_lineage_from_show_query` method now returns `Iterable[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
321-321: Return Type Changed to Iterable[KnownLineageMapping]
The `_populate_external_lineage_from_copy_history` method now returns `Iterable[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
329-334: Use of KnownLineageMapping in _populate_external_lineage_from_copy_history
The `_populate_external_lineage_from_copy_history` method now uses `KnownLineageMapping`. This improves how external lineage data is processed and aggregated.
349-349: Return Type Changed to Optional[KnownLineageMapping]
The `_process_external_lineage_result_row` method now returns `Optional[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
355-355: Return None for Non-discovered Tables
The `_process_external_lineage_result_row` method now returns `None` if the table is not in `discovered_tables`. This ensures that only relevant tables are processed.
362-368: Use of KnownLineageMapping in _process_external_lineage_result_row
The `_process_external_lineage_result_row` method now uses `KnownLineageMapping` for creating lineage mappings. This improves the consistency and clarity of the lineage data.
423-428: Added Dataset Pattern Validation in map_query_result_upstreams
The `map_query_result_upstreams` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
509-514: Added Dataset Pattern Validation in build_finegrained_lineage_upstreams
The `build_finegrained_lineage_upstreams` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
565-567: Added close Method to Implement Closeable
The `close` method was added to the `SnowflakeLineageExtractor` class to fulfill the `Closeable` interface requirements. This method should ensure that any resources are properly released.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13)
12-12: Import Statement for Closeable Interface Added
The `Closeable` interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.
17-18: Import Statement for SnowflakeConnection and SnowflakePermissionError Added
The `SnowflakeConnection` and `SnowflakePermissionError` imports were added, which are essential for handling Snowflake connections and related errors.
109-109: Class SnowflakeUsageExtractor Now Implements Closeable
The `SnowflakeUsageExtractor` class now implements the `Closeable` interface. This is important for ensuring that resources are properly released.
114-114: Connection Initialization in Constructor
The `SnowflakeConnection` is now initialized in the constructor, which aligns with the PR objectives of initializing the connection in the constructor.
122-122: Use of SnowflakeConnection
The `SnowflakeConnection` is now assigned to `self.connection` in the constructor, which ensures that the connection is available throughout the class methods.
203-203: Use of SnowflakeConnection in _get_workunits_internal
The `_get_workunits_internal` method now uses `self.connection.query` for querying Snowflake. This ensures consistency in how queries are executed.
235-235: Added Dataset Pattern Validation in _get_workunits_internal
The `_get_workunits_internal` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
289-289: Added Warning for Failed Usage Statistics Parsing
A warning is logged if parsing usage statistics fails. This helps in identifying issues during the ingestion process.
372-373: Assertion for Connection in _get_snowflake_history
An assertion is added to ensure that `self.connection` is not `None` before querying. This prevents potential runtime errors.
395-396: Assertion for Connection in _check_usage_date_ranges
An assertion is added to ensure that `self.connection` is not `None` before querying. This prevents potential runtime errors.
505-505: Added Warning for Failed Operation History Parsing
A warning is logged if parsing operation history fails. This helps in identifying issues during the ingestion process.
564-564: Added Dataset Pattern Validation in _is_object_valid
The `_is_object_valid` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
590-592: Added close Method to Implement Closeable
The `close` method was added to the `SnowflakeUsageExtractor` class to fulfill the `Closeable` interface requirements. This method should ensure that any resources are properly released.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1)
195-195: Handle URNs with Different Lengths in from_urn Method
The `from_urn` method now handles URNs with different lengths. This ensures that both standard and non-standard URNs are processed correctly.
metadata-ingestion/tests/integration/snowflake/common.py (3)
531-531: LGTM! The query condition is correctly handled.
The inclusion of view lineage and exclusion of column lineage is correctly implemented in the query.
607-610: LGTM! The query condition is correctly handled.
The time window for copying lineage history is correctly implemented in the query.
608-610: LGTM! The query condition is correctly handled.
The time window for copying lineage history is correctly implemented in the query.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (9)
131-132: Improvement: Initialize connection in the constructor.
The connection initialization in the constructor is a good practice for better resource management.
141-141: Refactor: Use composition for connection.
Using composition for the connection (i.e., `self.connection`) improves code readability and reusability.
236-238: Update: Use SnowflakeConnectionConfig for connection parsing.
Using `SnowflakeConnectionConfig` for connection parsing aligns with the new connection handling approach.
264-264: Update: Use SnowflakeConnection in check_capabilities.
The function now uses `SnowflakeConnection` for querying the Snowflake database, which aligns with the new connection handling approach.
426-426: Improvement: Reinitialize connection at the start.
Reinitializing the connection at the start of the function ensures that the latest connection settings are used.
432-434: Improvement: Use SnowsightUrlBuilder for external URL generation.
Using `SnowsightUrlBuilder` for external URL generation improves the handling of external URLs.
538-538: Update: Use SnowflakeConnection for session metadata queries.
The function now uses `SnowflakeConnection` for querying the Snowflake database for session metadata, which aligns with the new connection handling approach.
567-570: Update: Use SnowflakeConnection for Snowsight URL generation.
The function now uses `SnowflakeConnection` for querying the Snowflake database to generate the Snowsight URL, which aligns with the new connection handling approach.
Line range hint 618-618: Improvement: Ensure proper resource management.
The function ensures that the connection and extractors are properly closed, which improves resource management.
metadata-ingestion/setup.py (2)
414-414: Addition: Include `snowflake-queries` dependency.
The `snowflake-queries` dependency has been added, which is necessary for the new Snowflake queries source.
667-667: Addition: Register `snowflake-queries` source in entry points.
The `snowflake-queries` source has been added to the entry points, making it discoverable.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (2)
363-363: LGTM!
The `upstreams_deny_pattern` parameter addition is appropriate and the function logic is intact.
414-414: LGTM!
The `downstreams_deny_pattern` parameter addition and its usage in `create_deny_regex_sql_filter` are appropriate.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (6)
Line range hint 146-176: LGTM!
The `snowsight_url_builder` parameter addition and its assignment are appropriate and consistent with the rest of the initialization logic.
365-368: LGTM!
The addition of the `match_fully_qualified_names` parameter in the `is_schema_allowed` function call is appropriate and consistent with the method's logic.
479-479: LGTM!
The addition of the `view_pattern.allowed` check is appropriate and ensures only allowed views are processed.
512-512: LGTM!
The addition of the `table_pattern.allowed` check is appropriate and ensures only allowed tables are processed.
Line range hint 789-799: LGTM!
The usage of `snowsight_url_builder` for generating external URLs is appropriate and consistent with the method's logic.
984-987: LGTM!
The usage of `snowsight_url_builder` for generating external URLs is appropriate and consistent with the method's logic.
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (15)
108-108: Verify the usage of `column_usage` field.
Ensure that the addition of `column_usage` to the `QueryMetadata` dataclass is correctly utilized and doesn't introduce issues.
134-134: Verify the usage of `column_usage` field.
Ensure that the addition of `column_usage` to the `KnownQueryLineageInfo` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
Verify the usage of `column_usage` field.
The `column_usage` field has been integrated into the `sql_parsing_aggregator.py` file and is used in various contexts such as assignments and function calls. The integration appears thorough and appropriate for tracking column usage without introducing issues.
- Instances of `column_usage` are found in the following lines:
  - Assignments: Lines where `column_usage` is assigned values.
  - Function calls: Lines where `column_usage` is passed as an argument.
  - Conditionals: Lines where `column_usage` is used in conditions.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `column_usage` field in the file.
# Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
rg --type python 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 131
Script:
#!/bin/bash
# Description: Verify the usage of `column_usage` field in the file.
# Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
rg 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 714
203-203: Verify the usage of `num_preparsed_queries` field.
Ensure that the addition of `num_preparsed_queries` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
The `num_preparsed_queries` field is correctly utilized for tracking preparsed queries.
- The field is declared and initialized to 0.
- It is incremented within the code, indicating it is used to count occurrences of preparsed queries.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `num_preparsed_queries` field in the file.
# Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
rg --type python 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 140
Script:
#!/bin/bash
# Description: Verify the usage of `num_preparsed_queries` field in the file.
# Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
rg --type py 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 191
222-222: Verify the usage of `num_lineage_skipped_due_to_filters` field.
Ensure that the addition of `num_lineage_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
Verified the usage of `num_lineage_skipped_due_to_filters` field.
The `num_lineage_skipped_due_to_filters` field is correctly utilized for tracking skipped lineage due to filters.
- Field declaration: Line 222
- Field usage: Incremented within the file
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
rg --type python 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 153
Script:
#!/bin/bash
# Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
rg 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 220
257-259: Verify the usage of `generate_query_usage_statistics` parameter.
Ensure that the addition of `generate_query_usage_statistics` to the `SqlParsingAggregator` class constructor is correctly utilized and doesn't introduce issues.
Verification successful
The `generate_query_usage_statistics` parameter is correctly utilized.
- The parameter is declared, assigned to an instance variable, and used in logical conditions within the `SqlParsingAggregator` class to control the generation of query usage statistics.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.
# Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg --type python 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 150
Script:
#!/bin/bash
# Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.
# Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg -t py 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 459
230-230: Verify the usage of `num_query_usage_stats_generated` field.
Ensure that the addition of `num_query_usage_stats_generated` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
The `num_query_usage_stats_generated` field is correctly utilized.
The field `num_query_usage_stats_generated` is properly used for tracking the generation of query usage statistics. It is incremented appropriately, indicating that this field is being utilized as intended.
- Declaration and initialization: `num_query_usage_stats_generated: int = 0`
- Increment operation: `self.report.num_query_usage_stats_generated += 1`
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `num_query_usage_stats_generated` field in the file.
# Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
rg --type python 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 150
Script:
#!/bin/bash
# Description: Verify the usage of `num_query_usage_stats_generated` field in the file.
# Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
rg --type py 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 225
18-18: Verify the usage of `get_time_bucket`.

Ensure that the usage of `get_time_bucket` aligns with the intended functionality and handles time-based operations correctly.

Verification successful

The usage of `get_time_bucket` aligns with the intended functionality.

The function is correctly used to convert timestamps into bucketed time formats based on the specified bucket duration.
- Location: `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.
# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type python 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 134

Script:
#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.
# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type py 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 206

Script:
#!/bin/bash
# Description: Retrieve more lines of context around the usage of `get_time_bucket` to ensure it aligns with the intended functionality.
# Test: Display lines around the usage of `get_time_bucket`.
rg --type py -A 5 -B 5 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1171
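To illustrate the timestamp bucketing that the `get_time_bucket` comment describes, here is a minimal sketch of truncating a timestamp to the start of its bucket. The `BucketDuration` enum and `floor_to_bucket` function are hypothetical names invented for this example; they are not DataHub's actual `get_time_bucket` implementation, which the review does not show.

```python
from datetime import datetime, timezone
from enum import Enum


class BucketDuration(Enum):
    """Hypothetical bucket sizes, for illustration only."""

    HOUR = "HOUR"
    DAY = "DAY"


def floor_to_bucket(ts: datetime, duration: BucketDuration) -> datetime:
    """Truncate a timestamp to the start of the bucket that contains it."""
    if duration is BucketDuration.HOUR:
        return ts.replace(minute=0, second=0, microsecond=0)
    if duration is BucketDuration.DAY:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"Unsupported bucket duration: {duration}")


ts = datetime(2024, 7, 3, 14, 37, 12, tzinfo=timezone.utc)
hour_bucket = floor_to_bucket(ts, BucketDuration.HOUR)
day_bucket = floor_to_bucket(ts, BucketDuration.DAY)
```

Two queries whose timestamps fall into the same bucket would then be counted together when usage statistics are aggregated.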
273-275: Verify the usage of `self.generate_query_usage_statistics`.

Ensure that the assignment of `generate_query_usage_statistics` to `self.generate_query_usage_statistics` is correctly utilized in the class methods.

Verification successful

The assignment and usage of `self.generate_query_usage_statistics` are correctly utilized in the class methods.
- The variable is assigned appropriately and used in multiple conditional checks to control the generation of query usage statistics.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.
# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg --type python 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 155

Script:
#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.
# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 403
262-263: Verify the usage of `is_temp_table` and `is_allowed_table` parameters.

Ensure that the addition of `is_temp_table` and `is_allowed_table` parameters to the `SqlParsingAggregator` class constructor is correctly utilized and doesn't introduce issues.

Verification successful

Verified the usage of `is_temp_table` and `is_allowed_table` parameters.

The parameters `is_temp_table` and `is_allowed_table` have been correctly integrated into the `SqlParsingAggregator` class. They are assigned to instance variables and utilized in methods to control table filtering.
- The parameters are assigned to `self._is_temp_table` and `self._is_allowed_table`.
- Methods `is_temp_table` and `is_allowed_table` use these instance variables to perform their logic.
- These methods are used in various parts of the class to appropriately filter tables.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.
# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg --type python 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 151

Script:
#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.
# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 964
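The pattern described in this comment, where optional predicates are passed to a constructor, stored on the instance, and wrapped by methods of the same name, could look roughly like the sketch below. `TableFilter` is a hypothetical stand-in rather than the real `SqlParsingAggregator`, and the fallback behaviour when no predicate is given (nothing is temporary, everything is allowed) is an assumption made for the example.

```python
from typing import Callable, List, Optional


class TableFilter:
    """Hypothetical stand-in for the aggregator's table-filtering hooks."""

    def __init__(
        self,
        is_temp_table: Optional[Callable[[str], bool]] = None,
        is_allowed_table: Optional[Callable[[str], bool]] = None,
    ) -> None:
        self._is_temp_table = is_temp_table
        self._is_allowed_table = is_allowed_table

    def is_temp_table(self, name: str) -> bool:
        # Assumed default: with no predicate, nothing is treated as temporary.
        if self._is_temp_table is None:
            return False
        return self._is_temp_table(name)

    def is_allowed_table(self, name: str) -> bool:
        # Assumed default: with no predicate, every table is allowed.
        if self._is_allowed_table is None:
            return True
        return self._is_allowed_table(name)

    def keep(self, names: List[str]) -> List[str]:
        """Return only the allowed, non-temporary table names."""
        return [
            n for n in names if self.is_allowed_table(n) and not self.is_temp_table(n)
        ]


table_filter = TableFilter(
    is_temp_table=lambda name: name.split(".")[-1].startswith("tmp_"),
    is_allowed_table=lambda name: name.startswith("analytics."),
)
tables = [
    "analytics.public.orders",
    "analytics.public.tmp_orders",
    "raw.public.events",
]
kept = table_filter.keep(tables)
```

Injecting the predicates keeps source-specific rules (naming conventions, known URNs) outside the filtering class itself.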
381-389: Verify the usage of `query_usage_counts` field.

Ensure that the addition of `query_usage_counts` to the `SqlParsingAggregator` class is correctly utilized and doesn't introduce issues.

Verification successful

The `query_usage_counts` field is correctly utilized.

The `query_usage_counts` field added to the `SqlParsingAggregator` class is appropriately used for tracking query usage statistics.
- Lines involving `query_usage_counts` include initialization, checks for its presence, and usage in various operations.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.
# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg --type python 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 137

Script:
#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.
# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 685
281-283: Verify the usage of the `usage_config` check.

Ensure that the check for `usage_config` when `generate_usage_statistics` or `generate_query_usage_statistics` is enabled is correctly utilized and doesn't introduce issues.

Verification successful

The usage of the `usage_config` check is correctly utilized.

The code contains several assertions to ensure that `usage_config` is not `None` when necessary, indicating that the check is correctly handled and does not introduce issues.
- Line defining `usage_config` as `Optional[BaseUsageConfig] = None`
- Line assigning `usage_config` to `self.usage_config`
- Line checking if `self.usage_config` is `None`
- Several assertions `assert self.usage_config is not None`

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.
# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type python 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 131

Script:
#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.
# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type py 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 662
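A constructor-time check of the kind this comment verifies might be sketched as follows. `UsageConfig` and `Aggregator` are hypothetical placeholders, and raising `ValueError` is an assumption; the review only states that the real code checks for `None` and asserts that `self.usage_config` is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageConfig:
    """Hypothetical placeholder for a usage configuration object."""

    top_n_queries: int = 10


class Aggregator:
    """Hypothetical class showing a fail-fast check in the constructor."""

    def __init__(
        self,
        generate_usage_statistics: bool = False,
        generate_query_usage_statistics: bool = False,
        usage_config: Optional[UsageConfig] = None,
    ) -> None:
        self.generate_usage_statistics = generate_usage_statistics
        self.generate_query_usage_statistics = generate_query_usage_statistics
        self.usage_config = usage_config

        # Either usage feature needs a config, so reject the bad combination early.
        needs_config = generate_usage_statistics or generate_query_usage_statistics
        if needs_config and usage_config is None:
            raise ValueError(
                "usage_config is required when usage statistics are enabled"
            )


ok = Aggregator(generate_query_usage_statistics=True, usage_config=UsageConfig())

try:
    Aggregator(generate_usage_statistics=True)
    failed_fast = False
except ValueError:
    failed_fast = True
```

Checking once in the constructor means later code paths can assert the config is present rather than re-validating it.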
4-4: Verify the usage of `functools`.

Ensure that the usage of `functools` doesn't introduce memory leaks or other issues as warned by static analysis tools.

Verification successful

Verified the appropriate usage of `functools`.

The `functools.lru_cache` decorator is correctly applied to the `_name_from_urn` method, providing caching benefits without introducing memory leaks or other issues.
- `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`:
  - `import functools`
  - `@functools.lru_cache(maxsize=128)`

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `functools` in the file.
# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg --type python 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 128

Script:
#!/bin/bash
# Description: Verify the usage of `functools` in the file.
# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 138

Script:
#!/bin/bash
# Description: Gather context around the usage of `functools.lru_cache` in the file.
# Test: Search for the function where `functools.lru_cache` is applied and get surrounding lines to understand the context.
rg -A 10 '@functools.lru_cache' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 555
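For context on the memory concern raised in the `functools` comment: when `functools.lru_cache` decorates a method, `self` becomes part of each cache key, so cached entries keep the instance referenced until they are evicted. A bounded `maxsize` caps how many such entries can accumulate. The sketch below uses a hypothetical `UrnNamer` class with a simplified name-extraction rule; it is not the aggregator's `_name_from_urn`.

```python
import functools


class UrnNamer:
    """Hypothetical class; the real method lives on the aggregator."""

    @functools.lru_cache(maxsize=128)
    def name_from_urn(self, urn: str) -> str:
        # Simplified rule for illustration: take the text after the last colon.
        return urn.rsplit(":", 1)[-1]


namer = UrnNamer()
first = namer.name_from_urn("urn:li:corpuser:alice")
second = namer.name_from_urn("urn:li:corpuser:alice")  # served from the cache

# The cache is shared across instances and keyed on (self, urn).
info = UrnNamer.name_from_urn.cache_info()
```

Because the cache lives on the class-level function, `cache_info()` reports one miss for the first call and one hit for the repeat.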
148-169: Verify the usage of `PreparsedQuery` dataclass.

Ensure that the `PreparsedQuery` dataclass is correctly utilized and its fields align with the intended functionality.

Verification successful

Verified the usage of `PreparsedQuery` dataclass.

The `PreparsedQuery` dataclass is correctly utilized in the codebase. The fields and usage align with the intended functionality of representing pre-parsed queries.
- Defined as a dataclass.
- Used in type annotations.
- Instantiated in the method `add_preparsed_query`.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.
# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg --type python 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 133

Script:
#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.
# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 303

Script:
#!/bin/bash
# Description: Gather surrounding context of `PreparsedQuery` usages in the file.
# Test: Search for the surrounding context of `PreparsedQuery` usages.
rg 'PreparsedQuery' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1751
141-146: Verify the usage of `KnownLineageMapping` dataclass.

Ensure that the `KnownLineageMapping` dataclass is correctly utilized and its fields align with the intended functionality.

Verification successful

The `KnownLineageMapping` dataclass is correctly utilized.

The `KnownLineageMapping` dataclass is used appropriately in type hints and type checks to handle lineage information, ensuring it aligns with the intended functionality.
- Definition of `KnownLineageMapping`: lines 141-146
- Usage in `add` method: lines following the definition

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.
# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg --type python 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 138

Script:
#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.
# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 258

Script:
#!/bin/bash
# Description: Inspect the context of `KnownLineageMapping` usages in the file.
# Test: Extract the context lines around the usages of `KnownLineageMapping` to verify alignment with the expected functionality.
rg 'KnownLineageMapping' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1015
226-226: Verify the usage of `num_queries_skipped_due_to_filters` field.

Ensure that the addition of `num_queries_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.

Verification successful

Verified the usage of `num_queries_skipped_due_to_filters` field.

The addition of `num_queries_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized to track the number of queries skipped due to filters.
- The field is incremented appropriately within the `_gen_query` method when a query doesn't involve any allowed tables.

Scripts executed
The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg --type python 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 153

Script:
#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 220

Script:
#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the context where `self.report.num_queries_skipped_due_to_filters += 1` is used.
rg 'self\.report\.num_queries_skipped_due_to_filters \+= 1' -A 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 355

Script:
#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the full context of the method or function where `self.report.num_queries_skipped_due_to_filters += 1` occurs.
rg 'def ' -A 20 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py | rg -B 20 'self\.report\.num_queries_skipped_due_to_filters \+= 1'

Length of output: 987
metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2)

3895-3959: LGTM! Schema fields are correctly added to the `querySubjects` aspect.

The schema fields are correctly specified with appropriate field paths, types, and other properties.

4174-4238: LGTM! Schema fields are correctly added to the `querySubjects` aspect.

The schema fields are correctly specified with appropriate field paths, types, and other properties.
metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13)

Line range hint 1-1: Approved: Addition of new dataset entity.

The addition of the new dataset entity `urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.test_schema.table_2,PROD)` is consistent with the PR summary.

4524-4566: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_1` is consistent with the PR summary.

5138-5174: Approved: Addition of multiple schemaField entities.

The addition of multiple schemaField entities related to `test_db.test_schema.table_2` is consistent with the PR summary.

5202-5243: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_10` is consistent with the PR summary.

5755-5796: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_4` is consistent with the PR summary.

5988-6029: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_5` is consistent with the PR summary.

6421-6462: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_3` is consistent with the PR summary.

6860-6901: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_6` is consistent with the PR summary.

7043-7084: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_8` is consistent with the PR summary.

7096-7167: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_1` and `test_db.test_schema.view_1` is consistent with the PR summary.

7268-7309: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_9` is consistent with the PR summary.

7435-7476: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_7` is consistent with the PR summary.

7488-7559: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_2` and `test_db.test_schema.view_2` is consistent with the PR summary.
self.aggregator._is_temp_table = (
    lambda name: DatasetUrn.create_from_ids(
        self.platform,
        name,
        env=self.config.env,
        platform_instance=self.config.platform_instance,
    ).urn()
    not in self.known_urns
)
Improve readability by using a named function for the lambda.
Using a named function improves readability and maintainability.
- self.aggregator._is_temp_table = (
- lambda name: DatasetUrn.create_from_ids(
- self.platform,
- name,
- env=self.config.env,
- platform_instance=self.config.platform_instance,
- ).urn()
- not in self.known_urns
- )
+ def is_temp_table(name: str) -> bool:
+ return DatasetUrn.create_from_ids(
+ self.platform,
+ name,
+ env=self.config.env,
+ platform_instance=self.config.platform_instance,
+ ).urn() not in self.known_urns
+ self.aggregator._is_temp_table = is_temp_table
Actionable comments posted: 0

Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)

Additional comments not posted (2)

metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (2)

2-2: Correctly updated return type to `Iterable[SnowflakeV2Source]`

The return type of the `stateful_source` fixture has been correctly updated to `Iterable[SnowflakeV2Source]`, aligning with the changes in the source code.

47-49: Proper use of context manager for `SnowflakeV2Source`

The addition of a context manager to manage the `SnowflakeV2Source` instance ensures proper resource management and cleanup.
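The fixture shape this comment describes, a generator that yields the source from inside a `with` block so that cleanup runs after the test, can be sketched without pytest as shown below. `FakeSource` is a hypothetical stand-in for `SnowflakeV2Source`, and the generator is driven by hand here; in a real test suite it would carry the `@pytest.fixture` decorator instead.

```python
from typing import Iterator, List

events: List[str] = []


class FakeSource:
    """Hypothetical stand-in for a source that owns a closable resource."""

    def __enter__(self) -> "FakeSource":
        events.append("open")
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        events.append("close")


def stateful_source() -> Iterator[FakeSource]:
    # The with-block stays open while the test body runs and closes afterwards.
    with FakeSource() as source:
        yield source


# Drive the generator manually, the way a test runner would drive a fixture.
gen = stateful_source()
source = next(gen)          # setup: enters the context manager
events.append("test body")  # the test would use `source` here
next(gen, None)             # teardown: resumes past the yield, exiting the with-block
```

Yielding instead of returning is what turns the function into a generator, which is why the fixture's return annotation becomes an iterable of the source type.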
Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (2)

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2)

71-77: Local temporary path configuration for audit logs.

The `local_temp_path` field is configured to store the audit log locally. The TODO comments suggest improvements for caching and local testing.

Ensure the improvements for caching and local testing are tracked and implemented.

402-502: Build enriched audit log query.

The `_build_enriched_audit_log_query` function correctly builds the SQL query to fetch enriched audit logs from Snowflake. The TODO comments suggest improvements for generating better query fingerprints, adding table filter clauses, and dropping unnecessary columns.

Ensure the improvements suggested in the TODO comments are tracked and implemented.

Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- docs-website/vercel-setup.sh (2 hunks)
- metadata-ingestion/scripts/install_deps.sh (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
Files skipped from review due to trivial changes (1)
- docs-website/vercel-setup.sh

Additional comments not posted (17)

metadata-ingestion/scripts/install_deps.sh (1)

21-22: Addition of `krb5-devel` dependency for `yum` systems.

The addition of `krb5-devel` is correctly placed under the `yum` package manager section. This ensures that Kerberos development libraries are available for systems using `yum`.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (16)

92-94: Snowflake connection configuration.

The `connection` field is correctly defined to configure the Snowflake connection.

96-99: Snowflake Queries Extractor Report fields.

The fields for the time window and SQL aggregator report are correctly defined.

103-106: Snowflake Queries Source Report field.

The field for the queries extractor report is correctly defined.

108-120: Initialization of SnowflakeQueriesExtractor.

The constructor initializes the connection, configuration, reports, and SQL aggregator.

155-165: Local temporary path for audit logs.

The `local_temp_path` method ensures a temporary directory is created for storing audit logs. It logs the path being used.

166-170: Check for temporary tables.

The `is_temp_table` method checks if a table name matches any of the temporary table patterns.
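A pattern-based temporary-table check of the kind described for `is_temp_table` might look like the sketch below. The pattern list and the case-insensitive matching are assumptions made for this example; the review does not show the source's actual patterns or matching rules.

```python
import re
from typing import Sequence

# Hypothetical patterns; the source's real defaults are not shown in the review.
TEMP_TABLE_PATTERNS: Sequence[str] = (
    r".*\.tmp_.*",
    r".*\.staging_.*",
)


def is_temp_table(name: str, patterns: Sequence[str] = TEMP_TABLE_PATTERNS) -> bool:
    """Return True if the qualified table name matches any temporary-table pattern."""
    return any(re.match(pattern, name, flags=re.IGNORECASE) for pattern in patterns)


checks = {
    name: is_temp_table(name)
    for name in (
        "db.schema.tmp_orders",
        "DB.SCHEMA.STAGING_EVENTS",
        "db.schema.orders",
    )
}
```

Tables flagged this way can be treated as intermediate steps when lineage is resolved rather than reported as datasets in their own right.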
172-174: Check for allowed tables.

The `is_allowed_table` method checks if a table name is allowed based on dataset patterns.

175-203: Generate work units from queries.

The `get_workunits_internal` method generates work units from the queries. It handles the audit log caching and iterates through the queries to add them to the SQL aggregator.

204-257: Fetch audit logs from Snowflake.

The `fetch_audit_log` method fetches audit logs from Snowflake. It includes TODO comments for fetching additional information and handling errors.

259-262: Generate dataset identifier from qualified name.

The `get_dataset_identifier_from_qualified_name` method generates a dataset identifier from a qualified name.

263-365: Parse audit log row.

The `_parse_audit_log_row` method parses a row from the audit log and generates a `PreparsedQuery` object. It includes TODO comments for filtering table names and mapping email addresses.

368-373: Initialization of SnowflakeQueriesSource.

The constructor initializes the context, configuration, reports, and queries extractor.

385-388: Create SnowflakeQueriesSource from config.

The `create` method creates a `SnowflakeQueriesSource` instance from a configuration dictionary and pipeline context.

390-392: Generate work units from queries.

The `get_workunits_internal` method generates work units from the queries using the queries extractor.

394-395: Get report for SnowflakeQueriesSource.

The `get_report` method returns the report for the SnowflakeQueriesSource.

504-515: Snowflake query type mappings.

The `SNOWFLAKE_QUERY_TYPE_MAPPING` constant correctly maps Snowflake query types to internal query types.
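As a rough picture of what a query-type mapping constant like the one in the last comment does, the sketch below translates warehouse-reported type strings into an internal enum, with a fallback for anything unrecognized. The `QueryType` members, the mapping entries, and the `UNKNOWN` fallback are all illustrative assumptions; the actual contents of `SNOWFLAKE_QUERY_TYPE_MAPPING` are not shown in the review.

```python
from enum import Enum
from typing import Dict


class QueryType(Enum):
    """Hypothetical internal query types, for illustration only."""

    UNKNOWN = "UNKNOWN"
    SELECT = "SELECT"
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    MERGE = "MERGE"


# Illustrative entries only; not the real mapping from the pull request.
QUERY_TYPE_MAPPING: Dict[str, QueryType] = {
    "SELECT": QueryType.SELECT,
    "INSERT": QueryType.INSERT,
    "UPDATE": QueryType.UPDATE,
    "DELETE": QueryType.DELETE,
    "MERGE": QueryType.MERGE,
}


def to_internal_query_type(raw_type: str) -> QueryType:
    """Map a raw query-type string to the internal enum, defaulting to UNKNOWN."""
    return QUERY_TYPE_MAPPING.get(raw_type.upper(), QueryType.UNKNOWN)


mapped = [to_internal_query_type(t) for t in ("select", "MERGE", "SHOW")]
```

A dictionary lookup with an explicit default keeps unrecognized statement types from raising while still making them visible as `UNKNOWN`.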
class SnowflakeQueriesExtractorConfig(SnowflakeIdentifierConfig, SnowflakeFilterConfig):
    # TODO: Support stateful ingestion for the time windows.
    window: BaseTimeWindowConfig = BaseTimeWindowConfig()
Add support for stateful ingestion for the time windows.
The TODO comment indicates that support for stateful ingestion for the time windows is pending.
Do you want me to generate the implementation for stateful ingestion or open a GitHub issue to track this task?
    # TODO: make this a proper allow/deny pattern
    deny_usernames: List[str] = []
Consider making this a proper allow/deny pattern.
The TODO comment suggests that the `deny_usernames` field should be converted to a proper allow/deny pattern.
Consider refactoring this field to support a proper allow/deny pattern.
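One possible shape for such an allow/deny pattern is sketched below. `AllowDeny` is a hypothetical class written for this example, with two assumed behaviours: deny rules take precedence over allow rules, and matching is case-insensitive. It is not the refactor the maintainers chose.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllowDeny:
    """Hypothetical allow/deny matcher; deny rules take precedence."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # A value matching any deny pattern is rejected outright.
        if any(re.match(p, value, flags=re.IGNORECASE) for p in self.deny):
            return False
        # Otherwise it must match at least one allow pattern.
        return any(re.match(p, value, flags=re.IGNORECASE) for p in self.allow)


usernames = AllowDeny(deny=[r"svc_.*"])
results = {name: usernames.allowed(name) for name in ("alice", "svc_etl", "SVC_LOADER")}
```

Compared with a flat `deny_usernames` list, this lets a recipe express both "only these users" and "everyone except these users" with the same field.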
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
- metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Files skipped from review due to trivial changes (2)
- metadata-ingestion/setup.py
- metadata-models/src/main/resources/entity-registry.yml
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py
* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813) * fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924) * fix(build): fix lint fix web react (datahub-project#10896) * fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912) * feat(ingest): report extractor failures more loudly (datahub-project#10908) * feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905) * fix(ingest): fix docs build (datahub-project#10926) * fix(ingest/snowflake): fix test connection (datahub-project#10927) * fix(ingest/lookml): add view load failures to cache (datahub-project#10923) * docs(slack) overhauled setup instructions and screenshots (datahub-project#10922) Co-authored-by: John Joyce <john@acryl.io> * fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903) * fix(entityservice): fix merging sideeffects (datahub-project#10937) * feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938) Co-authored-by: John Joyce <john@Johns-MBP.lan> * chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925) Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal> Co-authored-by: John Joyce <john@Johns-MBP.lan> * Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. 
(datahub-project#10939) * docs: add learning center to docs (datahub-project#10921) * doc: Update hubspot form id (datahub-project#10943) * chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941) * fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895) * fix(ingest/abs): split abs utils into multiple files (datahub-project#10945) * doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950) * fix(ingest/setup): feast and abs source setup (datahub-project#10951) * fix(connections) Harden adding /gms to connections in backend (datahub-project#10942) * feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952) * fix(docs): make graphql doc gen more automated (datahub-project#10953) * feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723) * fix(spark-lineage): default timeout for future responses (datahub-project#10947) * feat(datajob/flow): add environment filter using info aspects (datahub-project#10814) * fix(ui/ingest): correct privilege used to show tab (datahub-project#10483) Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com> * feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955) * add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956) * fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966) * fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965) * fix(airflow/build): Pinning mypy (datahub-project#10972) * Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. 
(datahub-project#10974) * fix(ingest/test): Fix for mssql integration tests (datahub-project#10978) * fix(entity-service) exist check correctly extracts status (datahub-project#10973) * fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982) * bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986) * fix(ui): Remove ant less imports (datahub-project#10988) * feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987) * feat(ingest/cli): init does not actually support environment variables (datahub-project#10989) * fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991) * feat(ingest/spark): Promote beta plugin (datahub-project#10881) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967) * feat(ingest): add `check server-config` command (datahub-project#10990) * feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466) Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig. This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig. I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled. 
Fixes datahub-project#9705 * fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993) * fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771) * feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985) * feat(ingest): improve `ingest deploy` command (datahub-project#10944) * fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920) - allow excluding soft-deleted entities in relationship-queries - exclude soft-deleted members of groups * fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996) * doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984) Co-authored-by: John Joyce <john@acryl.io> Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Pedro Silva <pedro@acryl.io> * fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006) * fix(ui/ingest): Support invalid cron jobs (datahub-project#10998) * fix(ingest): fix graph config loading (datahub-project#11002) Co-authored-by: Pedro Silva <pedro@acryl.io> * feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011) * feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999) * feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935) * fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858) * feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016) Co-authored-by: Chris Collins 
<chriscollins3456@gmail.com> * docs: standardize terminology to DataHub Cloud (datahub-project#11003) * fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013) * docs(slack) troubleshoot docs (datahub-project#11014) * feat(propagation): Add graphql API (datahub-project#11030) Co-authored-by: Chris Collins <chriscollins3456@gmail.com> * feat(propagation): Add models for Action feature settings (datahub-project#11029) * docs(custom properties): Remove duplicate from sidebar (datahub-project#11033) * feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997) Co-authored-by: John Joyce <john@Johns-MBP.lan> Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal> * feat(propagation): Add Documentation Propagation Settings (datahub-project#11038) * fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040) * fix(ci): smoke test lint failures (datahub-project#11044) * docs: fix learning center color scheme & typo (datahub-project#11043) * feat: add cloud main page (datahub-project#11017) Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com> * feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662) Co-authored-by: John Joyce <john@acryl.io> * docs: fix typo (datahub-project#11046) * fix(lint): apply spotless (datahub-project#11050) * docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034) * feat(cli): Add run-id option to put sub-command (datahub-project#11023) Adds an option to assign run-id to a given put command execution. This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline. 
* fix(ingest): improve sql error reporting calls (datahub-project#11025) * fix(airflow): fix CI setup (datahub-project#11031) * feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039) * fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971) * (chore): Linting fix (datahub-project#11015) * chore(ci): update deprecated github actions (datahub-project#10977) * Fix ALB configuration example (datahub-project#10981) * chore(ingestion-base): bump base image packages (datahub-project#11053) * feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051) * fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910) * feat(ingest/tableau): add retry on timeout (datahub-project#10995) * change generate kafka connect properties from env (datahub-project#10545) Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> * fix(ingest): fix oracle cronjob ingestion (datahub-project#11001) Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> * chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062) * feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041) * build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878) Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> * fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063) * docs(ingest): update developing-a-transformer.md (datahub-project#11019) * feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056) * feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009) Co-authored-by: 
Harshal Sheth <hsheth2@gmail.com> * docs(airflow): update min version for plugin v2 (datahub-project#11065) * doc(ingestion/tableau): doc update for derived permission (datahub-project#11054) Co-authored-by: Pedro Silva <pedro.cls93@gmail.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * fix(py): remove dep on types-pkg_resources (datahub-project#11076) * feat(ingest/mode): add option to exclude restricted (datahub-project#11081) * fix(ingest): set lastObserved in sdk when unset (datahub-project#11071) * doc(ingest): Update capabilities (datahub-project#11072) * chore(vulnerability): Log Injection (datahub-project#11090) * chore(vulnerability): Information exposure through a stack trace (datahub-project#11091) * chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089) * chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088) * chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059) * chore(vulnerability): Overly permissive regex range (datahub-project#11061) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * fix: update customer data (datahub-project#11075) * fix(models): fixing the datasetPartition models (datahub-project#11085) Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal> * fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084) Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal> * feat(docs-site): hiding learn more from cloud page (datahub-project#11097) * fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082) Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com> * fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098) * docs: Refactor customer 
stories section (datahub-project#10869) Co-authored-by: Jeff Merrick <jeff@wireform.io> * fix(release): fix full/slim suffix on tag (datahub-project#11087) * feat(config): support alternate hashing algorithm for doc id (datahub-project#10423) Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> Co-authored-by: John Joyce <john@acryl.io> * fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007) * fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * chore: Update contributors list in PR labeler (datahub-project#11105) * feat(ingest): tweak stale entity removal messaging (datahub-project#11064) * fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104) * fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080) * feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069) * docs: update graphql docs on forms & structured properties (datahub-project#11100) * test(search): search openAPI v3 test (datahub-project#11049) * fix(ingest/tableau): prevent empty site content urls (datahub-project#11057) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(entity-client): implement client batch interface (datahub-project#11106) * fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114) * fix(ingest): downgrade column type mapping warning to info (datahub-project#11115) * feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118) * fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111) * fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122) * fix(sql-parser): prevent bad urns from alter table lineage 
(datahub-project#11092) * fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121) * fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366) * feat(ui): Changes to allow editable dataset name (datahub-project#10608) Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com> * fix: remove saxo (datahub-project#11127) * feat(mcl-processor): Update mcl processor hooks (datahub-project#11134) * fix(openapi): fix openapi v2 endpoints & v3 documentation update * Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update" This reverts commit 573c1cb. * docs(policies): updates to policies documentation (datahub-project#11073) * fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139) * feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116) * fix(mutator): mutator hook fixes (datahub-project#11140) * feat(search): support sorting on multiple fields (datahub-project#10775) * feat(ingest): various logging improvements (datahub-project#11126) * fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079) Co-authored-by: Harshal Sheth <hsheth2@gmail.com> * feat(docs-site) cloud page spacing and content polishes (datahub-project#11141) * feat(ui) Enable editing structured props on fields (datahub-project#11042) * feat(tests): add md5 and last computed to testResult model (datahub-project#11117) * test(openapi): openapi regression smoke tests (datahub-project#11143) * fix(airflow): fix tox tests + update docs (datahub-project#11125) * docs: add chime to adoption stories (datahub-project#11142) * fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158) * fix(kafka-setup): add missing script to image (datahub-project#11190) * fix(config): fix hash algo config (datahub-project#11191) * test(smoke-test): updates to smoke-tests (datahub-project#11152) * fix(elasticsearch): 
refactor idHashAlgo setting (datahub-project#11193) * chore(kafka): kafka version bump (datahub-project#11211) * readd UsageStatsWorkUnit * fix merge problems * change logo --------- Co-authored-by: Chris Collins <chriscollins3456@gmail.com> Co-authored-by: John Joyce <john@acryl.io> Co-authored-by: John Joyce <john@Johns-MBP.lan> Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal> Co-authored-by: dushayntAW <158567391+dushayntAW@users.noreply.github.com> Co-authored-by: sagar-salvi-apptware <159135491+sagar-salvi-apptware@users.noreply.github.com> Co-authored-by: Aseem Bansal <asmbansal2@gmail.com> Co-authored-by: Kevin Chun <kevin1chun@gmail.com> Co-authored-by: jordanjeremy <72943478+jordanjeremy@users.noreply.github.com> Co-authored-by: skrydal <piotr.skrydalewicz@gmail.com> Co-authored-by: Harshal Sheth <hsheth2@gmail.com> Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com> Co-authored-by: sid-acryl <155424659+sid-acryl@users.noreply.github.com> Co-authored-by: Julien Jehannet <80408664+aviv-julienjehannet@users.noreply.github.com> Co-authored-by: Hendrik Richert <github@richert.li> Co-authored-by: Hendrik Richert <hendrik.richert@swisscom.com> Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com> Co-authored-by: Felix Lüdin <13187726+Masterchen09@users.noreply.github.com> Co-authored-by: Pirry <158024088+chardaway@users.noreply.github.com> Co-authored-by: Hyejin Yoon <0327jane@gmail.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: cburroughs <chris.burroughs@gmail.com> Co-authored-by: ksrinath <ksrinath@users.noreply.github.com> Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com> Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com> Co-authored-by: Shirshanka Das <shirshanka@apache.org> Co-authored-by: ipolding-cais <155455744+ipolding-cais@users.noreply.github.com> 
Co-authored-by: Tamas Nemeth <treff7es@gmail.com> Co-authored-by: Shubham Jagtap <132359390+shubhamjagtap639@users.noreply.github.com> Co-authored-by: haeniya <yanik.haeni@gmail.com> Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com> Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io> Co-authored-by: 808OVADOZE <52988741+shtephlee@users.noreply.github.com> Co-authored-by: noggi <anton.kuraev@acryl.io> Co-authored-by: Nicholas Pena <npena@foursquare.com> Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com> Co-authored-by: ethan-cartwright <ethan.cartwright.m@gmail.com> Co-authored-by: Ethan Cartwright <ethan.cartwright@acryl.io> Co-authored-by: Nadav Gross <33874964+nadavgross@users.noreply.github.com> Co-authored-by: Patrick Franco Braz <patrickfbraz@poli.ufrj.br> Co-authored-by: pie1nthesky <39328908+pie1nthesky@users.noreply.github.com> Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <130968841+joelmataKPN@users.noreply.github.com> Co-authored-by: Ellie O'Neil <110510035+eboneil@users.noreply.github.com> Co-authored-by: Ajoy Majumdar <ajoymajumdar@hotmail.com> Co-authored-by: deepgarg-visa <149145061+deepgarg-visa@users.noreply.github.com> Co-authored-by: Tristan Heisler <tristankheisler@gmail.com> Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io> Co-authored-by: Davi Arnaut <davi.arnaut@acryl.io> Co-authored-by: Pedro Silva <pedro@acryl.io> Co-authored-by: amit-apptware <132869468+amit-apptware@users.noreply.github.com> Co-authored-by: Sam Black <sam.black@acryl.io> Co-authored-by: Raj Tekal <varadaraj_tekal@optum.com> Co-authored-by: Steffen Grohsschmiedt <gitbhub@steffeng.eu> Co-authored-by: jaegwon.seo <162448493+wornjs@users.noreply.github.com> Co-authored-by: Renan F. 
Lima <51028757+lima-renan@users.noreply.github.com> Co-authored-by: Matt Exchange <xkollar@users.noreply.github.com> Co-authored-by: Jonny Dixon <45681293+acrylJonny@users.noreply.github.com> Co-authored-by: Pedro Silva <pedro.cls93@gmail.com> Co-authored-by: Pinaki Bhattacharjee <pinakipb2@gmail.com> Co-authored-by: Jeff Merrick <jeff@wireform.io> Co-authored-by: skrydal <piotr.skrydalewicz@acryl.io> Co-authored-by: AndreasHegerNuritas <163423418+AndreasHegerNuritas@users.noreply.github.com> Co-authored-by: jayasimhankv <145704974+jayasimhankv@users.noreply.github.com> Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com> Co-authored-by: David Leifker <david.leifker@acryl.io>
- Replace `snowflake.connector.SnowflakeConnection` with a new `SnowflakeConnection` type.
- […] `SnowflakeConnection` around using composition, removing some mixin classes like `SnowflakeQueryMixin` and `SnowflakeConnectionMixin`.
- `self.query(...)` is now `self.connection.query(...)`. As part of this, the connection is initialized in the constructor instead of in the `get_workunits_internal` method.
- […] `SnowflakeCommonMixin` by introducing the `SnowflakeFilterMixin` and `SnowflakeIdentifierMixin` instead. I'm not fully convinced this is the best design - something with composition like `SnowsightUrlBuilder` might actually be better, but would increase the size of the diff even more.

Follow up TODOs:
- […] `include_view_lineage` flag.

Checklist
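For readers skimming the description, here is a rough sketch of the mixin-to-composition move described above: `self.query(...)` becoming `self.connection.query(...)`, with the connection set up in the constructor. Every class name below (`StubNativeConnection`, `ConnectionWrapper`, `QueryMixin`, `MixinExtractor`, `ComposedExtractor`) is hypothetical and only illustrates the pattern. None of it is the actual DataHub code.

```python
from typing import Any, Dict, List


class StubNativeConnection:
    """Hypothetical stand-in for a database driver connection."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def run(self, sql: str) -> List[Dict[str, Any]]:
        # Record the statement and echo it back as a single fake row.
        self.executed.append(sql)
        return [{"query": sql}]


class ConnectionWrapper:
    """Hypothetical wrapper type: owns the native connection, exposes query()."""

    def __init__(self, native: StubNativeConnection) -> None:
        self._native = native

    def query(self, sql: str) -> List[Dict[str, Any]]:
        return self._native.run(sql)


class QueryMixin:
    """Before: query() is injected into the extractor through inheritance."""

    native: StubNativeConnection

    def query(self, sql: str) -> List[Dict[str, Any]]:
        return self.native.run(sql)


class MixinExtractor(QueryMixin):
    def get_workunits(self, native: StubNativeConnection) -> List[Dict[str, Any]]:
        # The connection is only attached once work starts.
        self.native = native
        return self.query("SELECT 1")


class ComposedExtractor:
    """After: the extractor holds a connection object set in the constructor."""

    def __init__(self, connection: ConnectionWrapper) -> None:
        self.connection = connection

    def get_workunits(self) -> List[Dict[str, Any]]:
        return self.connection.query("SELECT 1")
```

In this sketch the composed variant can be built with any object that offers `query(...)`, which is what makes it easier to share one connection across several extractors or to substitute a stub in tests.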
Summary by CodeRabbit

New Features
- Added `QueryUsageStatistics` to track dataset usage statistics.

Improvements

Bug Fixes

Tests

Chores