fix(ingestion/glue): Add support for missing config options for profiling in Glue #10858
Walkthrough
The recent changes significantly enhance the AWS Glue source for metadata ingestion by introducing comprehensive support for data profiling. Key updates include refined configuration options for profiling, enabling management at both table and partition levels, and improved handling through streamlined methods. Additionally, new test cases and profiling-related data stubs have been created to validate these enhancements effectively, ensuring robust performance and functionality.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
Additional comments not posted (8)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (3)
23-26: Add tests for the new `max_workers` field.
The `max_workers` field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.
Verification successful
Tests for the `max_workers` field are present. The following test files contain references to the `max_workers` field, confirming that it is covered by tests:
metadata-ingestion/tests/integration/trino/test_trino.py
metadata-ingestion/tests/integration/snowflake/test_snowflake.py
metadata-ingestion/tests/unit/utilities/test_partition_executor.py
metadata-ingestion/tests/integration/snowflake/test_snowflake_classification.py
metadata-ingestion/tests/unit/utilities/test_backpressure_aware_executor.py
metadata-ingestion/tests/integration/bigquery_v2/test_bigquery.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that tests for the `max_workers` field are present.
# Test: Search for the usage of the `max_workers` field in test files. Expect: Test cases covering the `max_workers` field.
rg --type python --word-regexp --glob '*test*.py' $'max_workers'
Length of output: 94
Script:
#!/bin/bash
# Description: Verify that tests for the `max_workers` field are present.
# Test: Search for the usage of the `max_workers` field in test files. Expect: Test cases covering the `max_workers` field.
rg --type py --word-regexp --glob '*test*.py' max_workers
Length of output: 1334
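The two scripts above differ only in the file-type name passed to `rg`; the first run produced very little output, most likely because ripgrep names its Python type `py` rather than `python`. As a rough illustration of what the successful search does, the sketch below approximates it in Python. The function name is hypothetical, and `\b` only approximates ripgrep's `--word-regexp` semantics.

```python
import re
from pathlib import Path
from typing import List


def find_test_files_mentioning(root: Path, identifier: str) -> List[Path]:
    """Return test files under `root` that contain `identifier` as a whole word."""
    # \b word boundaries approximate ripgrep's --word-regexp flag.
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")
    matches = []
    # rglob("*test*.py") approximates --glob '*test*.py'.
    for path in sorted(root.rglob("*test*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(text):
            matches.append(path)
    return matches
```

A match only shows that the identifier appears somewhere in a test file. It does not show that the Glue configuration field itself is exercised, which is worth keeping in mind when reading the file lists in this review.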
15-18: Add tests for the new `profile_table_level_only` field.
The `profile_table_level_only` field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.
Verification successful
Tests for the new `profile_table_level_only` field are present. The `profile_table_level_only` field is covered in the following test files:
smoke-test/tests/privileges/test_privileges.py
metadata-ingestion/tests/unit/test_ge_profiling_config.py
metadata-ingestion/tests/integration/snowflake/test_snowflake.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that tests for the `profile_table_level_only` field are present.
# Test: Search for the usage of the `profile_table_level_only` field in test files. Expect: Test cases covering the `profile_table_level_only` field.
rg --type python --word-regexp --glob '*test*.py' $'profile_table_level_only'
Length of output: 107
Script:
#!/bin/bash
# Description: Verify that tests for the `profile_table_level_only` field are present.
# Test: Search for the usage of the `profile_table_level_only` field in test files. Expect: Test cases covering the `profile_table_level_only` field.
rg --word-regexp --glob '*test*.py' 'profile_table_level_only'
Length of output: 943
11-14: Add tests for the new `enabled` field.
The `enabled` field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.
Verification successful
Tests for the new `enabled` field are present. The `enabled` field is covered in multiple test files, ensuring that this new configuration option is adequately tested.
smoke-test/tests/test_stateful_ingestion.py
smoke-test/test_e2e.py
smoke-test/tests/managed_ingestion/managed_ingestion_test.py
smoke-test/tests/privileges/test_privileges.py
metadata-ingestion/tests/integration/superset/test_superset.py
metadata-ingestion/tests/integration/snowflake/test_snowflake_stateful.py
metadata-ingestion/tests/integration/unity/test_unity_catalog_ingest.py
metadata-ingestion/tests/integration/salesforce/test_salesforce.py
metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py
metadata-ingestion/tests/integration/s3/test_s3.py
metadata-ingestion/tests/integration/trino/test_trino.py
metadata-ingestion/tests/integration/okta/test_okta.py
metadata-ingestion/tests/integration/snowflake/test_snowflake.py
metadata-ingestion/tests/integration/snowflake/test_snowflake_classification.py
metadata-ingestion/tests/integration/powerbi/test_profiling.py
metadata-ingestion/tests/integration/powerbi/test_stateful_ingestion.py
metadata-ingestion/tests/integration/ldap/test_ldap_stateful.py
metadata-ingestion/tests/integration/qlik_sense/test_qlik_sense.py
metadata-ingestion/tests/unit/test_unity_catalog_config.py
metadata-ingestion/tests/integration/metabase/test_metabase.py
metadata-ingestion/tests/integration/kafka/test_kafka_state.py
metadata-ingestion/tests/integration/lookml/test_lookml.py
metadata-ingestion/tests/integration/looker/test_looker.py
metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py
metadata-ingestion/tests/unit/test_glue_source.py
metadata-ingestion/tests/integration/iceberg/test_iceberg.py
metadata-ingestion/tests/unit/test_redash_source.py
metadata-ingestion/tests/unit/test_ge_profiling_config.py
metadata-ingestion/tests/unit/test_classification.py
metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py
metadata-ingestion/tests/unit/test_datahub_source.py
metadata-ingestion/tests/integration/dynamodb/test_dynamodb.py
metadata-ingestion/tests/integration/dbt/test_dbt.py
metadata-ingestion/tests/unit/stateful_ingestion/state/test_stateful_ingestion.py
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py
metadata-ingestion/tests/unit/test_elasticsearch_source.py
metadata-ingestion/tests/unit/stateful_ingestion/test_configs.py
metadata-ingestion/tests/unit/stateful_ingestion/provider/test_provider.py
metadata-ingestion/tests/integration/azure_ad/test_azure_ad.py
metadata-ingestion/tests/integration/bigquery_v2/test_bigquery.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that tests for the `enabled` field are present.
# Test: Search for the usage of the `enabled` field in test files. Expect: Test cases covering the `enabled` field.
rg --type python --word-regexp --glob '*test*.py' $'enabled'
Length of output: 90
Script:
#!/bin/bash
# Description: Verify that tests for the `enabled` field are present.
# Test: Search for the usage of the `enabled` field in test files. Expect: Test cases covering the `enabled` field.
rg --type py --word-regexp --glob '*test*.py' enabled
Length of output: 10014
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (5)
162-165: Add tests for the updated `profiling` field.
The `profiling` field is updated to use a default factory. Ensure that tests are added to cover this change.
Verification successful
Tests for the `profiling` field are already present. The search results indicate that multiple test files reference and test the `profiling` field. No additional tests are necessary.
smoke-test/tests/privileges/test_privileges.py
smoke-test/tests/managed_ingestion/managed_ingestion_test.py
metadata-ingestion/tests/unit/test_elasticsearch_source.py
metadata-ingestion/tests/unit/test_iceberg.py
metadata-ingestion/tests/unit/test_unity_catalog_config.py
metadata-ingestion/src/datahub/ingestion/source/unity/connection_test.py
metadata-ingestion/tests/integration/salesforce/test_salesforce.py
metadata-ingestion/tests/integration/s3/test_s3.py
metadata-ingestion/tests/integration/powerbi/test_profiling.py
metadata-ingestion/tests/integration/snowflake/test_snowflake.py
metadata-ingestion/tests/integration/unity/test_unity_catalog_ingest.py
metadata-ingestion/tests/integration/trino/test_trino.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that tests for the `profiling` field are present.
# Test: Search for the usage of the `profiling` field in test files. Expect: Test cases covering the `profiling` field.
rg --type python --word-regexp --glob '*test*.py' $'profiling'
Length of output: 92
Script:
#!/bin/bash
# Description: Verify that tests for the `profiling` field are present.
# Test: Search for the usage of the `profiling` field in test files. Expect: Test cases covering the `profiling` field.
rg --type py --word-regexp --glob '*test*.py' profiling
Length of output: 4010
176-179: Add tests for the new `is_profiling_enabled` method.
The `is_profiling_enabled` method is correctly defined to check if profiling is enabled. Ensure that tests are added to cover this new method.
5-5: Remove unused imports if not needed.
Ensure that the newly added imports for `ThreadPoolExecutor` and `as_completed` are used in the code. If not, remove the unused imports.
Line range hint 840-884: Add tests for the refactored `get_profile_if_enabled` method.
The `get_profile_if_enabled` method has been refactored to use a `ThreadPoolExecutor` for processing partitions concurrently and added error handling. Ensure that tests are added to cover these changes.
788-820: Add tests for the new profiling logic in `_create_profile_mcp`.
The `_create_profile_mcp` method has been updated to include logic for handling profiling settings. Ensure that tests are added to cover these changes.
3c389ea to f2cff22
Actionable comments posted: 2
Outside diff range and nitpick comments (1)
metadata-ingestion/tests/unit/test_glue_source.py (1)
90-123: Add docstring to the `glue_source_with_profiling` function.
To improve readability and maintainability, add a docstring describing the purpose and usage of the function.
def glue_source_with_profiling(
    platform_instance: Optional[str] = None,
    use_s3_bucket_tags: bool = False,
    use_s3_object_tags: bool = False,
    extract_delta_schema_from_parameters: bool = False,
) -> GlueSource:
    """
    Returns a GlueSource object configured for table-level data profiling.

    Args:
        platform_instance (Optional[str]): The platform instance.
        use_s3_bucket_tags (bool): Whether to use S3 bucket tags.
        use_s3_object_tags (bool): Whether to use S3 object tags.
        extract_delta_schema_from_parameters (bool): Whether to extract delta schema from parameters.

    Returns:
        GlueSource: Configured GlueSource object.
    """
    profiling_config = GlueProfilingConfig(
        enabled=True,
        profile_table_level_only=False,
        row_count="row_count",
        column_count="column_count",
        unique_count="unique_count",
        unique_proportion="unique_proportion",
        null_count="null_count",
        null_proportion="null_proportion",
        min="min",
        max="max",
        mean="mean",
        median="median",
        stdev="stdev",
    )
    return GlueSource(
        ctx=PipelineContext(run_id="glue-source-test"),
        config=GlueSourceConfig(
            aws_region="us-west-2",
            extract_transforms=False,
            platform_instance=platform_instance,
            use_s3_bucket_tags=use_s3_bucket_tags,
            use_s3_object_tags=use_s3_object_tags,
            extract_delta_schema_from_parameters=extract_delta_schema_from_parameters,
            profiling=profiling_config,
        ),
    )
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files not summarized due to errors (1)
- metadata-ingestion/tests/unit/glue/glue_mces_golden.json: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (2)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
Additional context used
Biome
metadata-ingestion/tests/unit/glue/glue_mces_golden.json
[error] 17-17: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 20-20: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 22-22: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 29-29: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 36-36: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 43-43: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 57-57: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 69-69: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 73-73: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 77-77: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 88-88: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 108-108: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 109-109: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 121-121: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 122-122: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 134-134: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 135-135: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 147-147: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
[error] 148-148: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 160-160: JSON standard does not allow single quoted strings
Use double quotes to escape the string.
(parse)
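The Biome output above reports two classes of problem: a `}` where a property was expected, which is typically what a trailing comma produces, and single-quoted strings. Python's standard `json` module rejects both as well, so a golden file can be checked before committing. The fragments and the helper name below are hypothetical illustrations, not content from the PR's golden file.

```python
import json

# Hypothetical fragments mirroring the two error classes reported above.
TRAILING_COMMA = '{"name": "flights-database", }'
SINGLE_QUOTED = "{'name': 'flights-database'}"
VALID = '{"name": "flights-database"}'


def is_strict_json(text: str) -> bool:
    """Return True if `text` parses as standard JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Running a check like this over the golden file before pushing would surface the same class of errors without waiting for the linter.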
Additional comments not posted (5)
metadata-ingestion/tests/unit/test_glue_source.py (1)
473-518: Ensure proper cleanup in `test_glue_ingest_with_profiling`.
Add cleanup code to ensure that resources are properly released after the test execution.
@freeze_time(FROZEN_TIME)
def test_glue_ingest_with_profiling(
    tmp_path: Path,
    pytestconfig: PytestConfig,
    platform_instance: str,
    mce_file: str,
    mce_golden_file: str,
) -> None:
    glue_source_instance = glue_source_with_profiling(
        platform_instance=platform_instance
    )
    with Stubber(glue_source_instance.glue_client) as glue_stubber:
        glue_stubber.add_response("get_databases", get_databases_response_profiling, {})
        glue_stubber.add_response(
            "get_tables",
            get_tables_response_profiling_1,
            {"DatabaseName": "flights-database-profiling"},
        )
        glue_stubber.add_response(
            "get_table",
            {"Table": tables_profiling_1[0]},
            {"DatabaseName": "flights-database-profiling", "Name": "avro-profiling"},
        )
        mce_objects = [wu.metadata for wu in glue_source_instance.get_workunits()]
        glue_stubber.assert_no_pending_responses()
        write_metadata_file(tmp_path / mce_file, mce_objects)

    # Verify the output.
    test_resources_dir = pytestconfig.rootpath / "tests/unit/glue"
    mce_helpers.check_golden_file(
        pytestconfig,
        output_path=tmp_path / mce_file,
        golden_path=test_resources_dir / mce_golden_file,
    )
metadata-ingestion/tests/unit/test_glue_source_stubs.py (4)
883-902: LGTM!
The `get_databases_response_profiling` dictionary is correctly structured and consistent with existing database responses. Profiling-related parameters are appropriately included.
903-986: LGTM!
The `tables_profiling_1` list and `get_tables_response_profiling_1` dictionary are correctly structured and consistent with existing table responses. Profiling-related parameters for table columns are appropriately included.
Line range hint 987-1000: LGTM!
The `mock_get_object_response` function is correctly implemented to mock S3 client responses. It encodes the provided raw body and creates a `StreamingBody` object.
Line range hint 1001-1018: LGTM!
The `get_object_response_1`, `get_object_response_2`, `get_bucket_tagging`, and `get_object_tagging` functions are correctly implemented to return mock S3 responses. They utilize the `mock_get_object_response` helper function and provide appropriate content for testing.
f2cff22 to bb3659e
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1)
19-26: Field `max_workers` is correctly added but consider updating the description.
The `max_workers` field specifies the number of worker threads for profiling. The default value and description are appropriate. However, the description could be clearer by mentioning that the default value is based on the number of CPU cores.
- description="Number of worker threads to use for profiling. Set to 1 to disable."
+ description="Number of worker threads to use for profiling. Default is 5 times the number of CPU cores. Set to 1 to disable."
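For readers unfamiliar with a CPU-derived default, here is a minimal sketch of how a "5 times the number of CPU cores" default could be computed. It uses a standard-library dataclass as a stand-in for the real config class. `ProfilingThreadsSketch`, `default_max_workers`, and the fallback of 4 cores are assumptions for illustration and do not come from the PR.

```python
import os
from dataclasses import dataclass, field


def default_max_workers() -> int:
    # os.cpu_count() may return None, so fall back to an assumed 4 cores.
    return 5 * (os.cpu_count() or 4)


@dataclass
class ProfilingThreadsSketch:
    # Hypothetical stand-in for the profiling config; only one field is shown.
    max_workers: int = field(default_factory=default_max_workers)
```

Because the default depends on the host, documenting the formula in the field description (as the suggested diff does) tells users what value to expect on their machine.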
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/tests/unit/test_glue_source.py
- metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional comments not posted (15)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2)
11-14: Field `enabled` is correctly added.
The `enabled` field allows toggling profiling on or off. The default value and description are appropriate.
15-18: Field `profile_table_level_only` is correctly added.
The `profile_table_level_only` field allows limiting profiling to table-level. The default value and description are appropriate.
metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (13)
1-21: Container properties are correctly added.
The container properties include custom properties, name, and qualified name. All fields are appropriately set.
22-33: Status aspect is correctly added.
The status aspect indicates that the container is not removed. The field is appropriately set.
34-44: Data platform instance is correctly added.
The data platform instance specifies that the platform is Glue. The field is appropriately set.
45-57: SubTypes aspect is correctly added.
The subTypes aspect indicates that the container is a Database. The field is appropriately set.
58-91: Dataset properties are correctly added.
The dataset properties include various custom properties, such as schema versions, average record size, classification, compression type, object count, record count, size, data type, location, input and output formats, compression status, number of buckets, and serde info. All fields are appropriately set.
92-95: Dataset name and qualified name are correctly added.
The dataset name and qualified name fields are appropriately set.
96-210: Schema metadata is correctly added.
The schema metadata includes schema name, platform, version, creation and modification times, hash, platform schema, and field details. All fields are appropriately set.
211-215: Data platform instance is correctly added.
The data platform instance specifies that the platform is Glue. The field is appropriately set.
216-230: Ownership aspect is correctly added.
The ownership aspect includes owner details and last modification time. All fields are appropriately set.
231-233: Snapshot aspect is correctly added.
The snapshot aspect is correctly formatted.
234-247: SubTypes aspect is correctly added.
The subTypes aspect indicates that the dataset is a Table. The field is appropriately set.
248-258: Container aspect is correctly added.
The container aspect specifies the container URN. The field is appropriately set.
259-287: Dataset profile is correctly added.
The dataset profile includes timestamp, partition specification, and field profiles. All fields are appropriately set.
91b113a to 467b489
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
- metadata-ingestion/tests/unit/test_glue_source.py
Additional comments not posted (3)
metadata-ingestion/tests/unit/test_glue_source_stubs.py (3)
883-901: LGTM! The `get_databases_response_profiling` structure is consistent with existing database response structures.
The added data structure aligns well with the expected schema and usage.
903-986: LGTM! The `tables_profiling_1` structure is consistent with existing table response structures.
The added data structure aligns well with the expected schema and usage.
986-987: LGTM! The `get_tables_response_profiling_1` structure is consistent with existing table list response structures.
The added data structure aligns well with the expected schema and usage.
d4e767c to a13f2ad
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- docs/how/updating-datahub.md (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Additional comments not posted (12)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1)
11-18: New configuration options added to `GlueProfilingConfig`.
The fields `enabled` and `profile_table_level_only` have been added with appropriate default values and descriptions. This is a positive change as it enhances configurability and provides clear documentation for each option.
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (6)
5-5: Addition of ThreadPoolExecutor and as_completed imports
The imports for `ThreadPoolExecutor` and `as_completed` have been added to support the new multi-threading features for profiling. This change is consistent with the PR objectives and summary.
162-163: Change in default value handling for profiling configuration
The `profiling` field in `GlueSourceConfig` now uses `default_factory` instead of a direct assignment. This is a Pythonic way to ensure that mutable default values are handled correctly, preventing potential bugs where the default value is shared across instances.
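The `default_factory` point can be illustrated with standard-library dataclasses, which expose the same keyword. This is an analogy under stated assumptions, not the PR's code: the class names are hypothetical, and the config library used by the project may treat plain defaults differently from dataclasses.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProfilingSketch:
    # Hypothetical nested config with a mutable member.
    enabled: bool = False
    column_params: Dict[str, str] = field(default_factory=dict)


@dataclass
class SourceConfigSketch:
    # default_factory builds a fresh ProfilingSketch for every instance.
    profiling: ProfilingSketch = field(default_factory=ProfilingSketch)


first = SourceConfigSketch()
second = SourceConfigSketch()
first.profiling.enabled = True  # Only affects `first`.
```

With a factory, mutating the nested object on one config instance leaves every other instance untouched, which is the sharing bug the comment refers to.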
176-176: Modification in `is_profiling_enabled` method
The method now checks if profiling is enabled based on the new `profiling` configuration. This change aligns with the added profiling capabilities and ensures that profiling is only performed when configured.
788-820: Enhanced profiling logic in `_create_profile_mcp`
This section has been updated to include new profiling metrics such as unique count, unique proportion, null count, null proportion, min, max, mean, median, and standard deviation. These changes enhance the profiling capabilities of the system and align with the PR's objectives to improve profiling features.
Line range hint 840-882: Refactoring of `get_profile_if_enabled` to use ThreadPoolExecutor
The method has been refactored to use `ThreadPoolExecutor` for handling partition profiling in a multi-threaded manner. This optimization is crucial for performance improvement when dealing with large datasets and aligns with the PR's goal to optimize the profiling process.
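The pattern this comment describes (submit one task per partition, collect results as they finish, and record failures per partition) can be sketched as follows. This is a generic illustration, not the PR's implementation: `profile_partitions` and the `profile_one` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def profile_partitions(
    partitions: List[str],
    profile_one: Callable[[str], Dict[str, int]],
    max_workers: int = 4,
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, str]]:
    """Profile partitions concurrently; return (profiles, errors) keyed by partition."""
    profiles: Dict[str, Dict[str, int]] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its partition so results can be attributed.
        future_to_partition = {
            executor.submit(profile_one, partition): partition
            for partition in partitions
        }
        for future in as_completed(future_to_partition):
            partition = future_to_partition[future]
            try:
                profiles[partition] = future.result()
            except Exception as exc:
                # One failing partition should not abort the others.
                errors[partition] = str(exc)
    return profiles, errors
```

Collecting errors in a separate mapping keeps a single bad partition from discarding profiles that completed successfully.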
892-905: Addition of `_create_partition_profile_mcp` method
This new method handles the creation of partition profiles. It is a direct response to the PR objectives to add missing configuration options and enhance profiling at the partition level. As previously noted in the outdated comments, tests for this method should be verified or added.
metadata-ingestion/tests/unit/test_glue_source_stubs.py (5)
883-901: Review: Added profiling database stub.
The added database stub for profiling (`flights-database-profiling`) appears correctly structured and includes comprehensive metadata. This aligns with the PR's objective to enhance profiling capabilities.
Line range hint 987-1002: Review: Utility function for mocking S3 responses.
The `mock_get_object_response` function is well-documented and serves its purpose of simulating S3 `get_object` responses for testing. This is a good practice for unit tests, ensuring that tests do not rely on actual S3 interactions.
Line range hint 1003-1005: Review: Specific object response functions.
Functions `get_object_response_1` and `get_object_response_2` correctly utilize the `mock_get_object_response` to simulate specific S3 object responses. This modular approach enhances test readability and maintainability.
Also applies to: 1011-1013
Line range hint 1014-1016: Review: Tagging response functions.
The functions `get_bucket_tagging` and `get_object_tagging` provide mocked responses for S3 tagging. This is essential for testing any logic that depends on S3 tags, ensuring the system's robustness in handling tag-related features.
Also applies to: 1018-1020
903-986: Review: Added profiling table stub.
The table stub for `avro-profiling` is detailed, including extensive column metadata and storage configurations. This is crucial for accurate profiling and testing. However, ensure that the `Parameters` field for each column, such as `unique_proportion`, `min`, `median`, etc., is being utilized as expected in the profiling logic to avoid redundancy or misconfiguration.
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (outdated, resolved)
a13f2ad to a2fbece
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- docs/how/updating-datahub.md (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
- docs/how/updating-datahub.md
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
- metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional comments not posted (2)
metadata-ingestion/tests/unit/test_glue_source.py (2)
103-136: LGTM! Ensure profiling feature is thoroughly tested.
The new function `glue_source_with_profiling` is well-structured and integrates the profiling configurations correctly. However, ensure that all aspects of the profiling feature are thoroughly tested.
689-734: LGTM! Ensure comprehensive test coverage for profiling.
The new test function `test_glue_ingest_with_profiling` is well-structured and verifies the profiling feature against a golden file. However, ensure that all profiling metrics and edge cases are covered in the tests.
a2fbece to 1b2f7a6
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- docs/how/updating-datahub.md (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
- metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
- docs/how/updating-datahub.md
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
- metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
- metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
5-5: `concurrent.futures.ThreadPoolExecutor` imported but unused
Remove unused import
(F401)
5-5: `concurrent.futures.as_completed` imported but unused
Remove unused import
(F401)
Additional comments not posted (5)
metadata-ingestion/tests/unit/test_glue_source.py (2)
103-136: LGTM!
The `glue_source_with_profiling` function correctly sets up the profiling configuration and returns a `GlueSource` instance.
689-734: LGTM!
The `test_glue_ingest_with_profiling` function correctly tests the profiling functionality within the glue source ingestion process.
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (3)
Line range hint 171-190: LGTM!
The changes to the `GlueSourceConfig` class, including the `profiling` configuration and the `is_profiling_enabled` method, are correct.
871-903: LGTM!
The changes to the `_create_profile_mcp` method, which conditionally handles profiling based on `profile_table_level_only`, are correct.
923-923: LGTM!
The refactoring of the `get_profile_if_enabled` method simplifies the check for profiling status and improves readability.
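As a rough illustration of the behaviour these comments describe, the sketch below gates column-level statistics on `profile_table_level_only` and skips profiling entirely when it is disabled. The classes `ProfilingOptions` and `SourceConfig`, the function `build_profile`, and the dict-shaped inputs and outputs are hypothetical stand-ins chosen for this example; they are not the actual `GlueSourceConfig` or `_create_profile_mcp` code from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ProfilingOptions:
    """Hypothetical stand-in for the profiling options named in the review."""

    enabled: bool = False
    profile_table_level_only: bool = False


@dataclass
class SourceConfig:
    """Hypothetical source config holding an optional profiling section."""

    profiling: Optional[ProfilingOptions] = None

    def is_profiling_enabled(self) -> bool:
        # Profiling is on only when the section exists and is enabled.
        return self.profiling is not None and self.profiling.enabled


def build_profile(
    config: SourceConfig,
    row_count: int,
    column_stats: Dict[str, Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    # Return None when profiling is disabled so callers can skip emission.
    if not config.is_profiling_enabled():
        return None
    assert config.profiling is not None  # guaranteed by the check above

    profile: Dict[str, Any] = {
        "rowCount": row_count,
        "columnCount": len(column_stats),
    }
    # Column-level details are attached only when table-level-only mode is off.
    if not config.profiling.profile_table_level_only:
        profile["fieldProfiles"] = [
            {"fieldPath": name, **stats}
            for name, stats in sorted(column_stats.items())
        ]
    return profile
```

With `profile_table_level_only=True`, the returned dict carries only the table-level counts; with the default `False`, one entry per column is added as well.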
1b2f7a6 to 49290f7
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
Minor comment on tests. Otherwise LGTM
49290f7 to 5646c18
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
- metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
Files skipped from review as they are similar to previous changes (2)
- metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
- metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
Additional comments not posted (4)
metadata-ingestion/tests/unit/test_glue_source.py (4)
16-20: Imports look good!
The new import `GlueProfilingConfig` is necessary for the profiling configurations.
Line range hint 45-66: Imports for profiling stub functions look good!
The new imports for profiling stub functions are necessary for the added test cases.
103-136: Function `glue_source_with_profiling` looks good!
The function correctly sets up the profiling configurations and returns the `GlueSource` instance.
689-724: Function `test_glue_ingest_with_profiling` looks good!
The function correctly sets up the test environment, mocks necessary responses, and verifies the output against the golden file.
The Airflow failure is unrelated to this change.
Summary:
Glue Source Profiling Configuration:
- `enabled`: Flag to enable or disable profiling (default: `false`).
- `profile_table_level_only`: Flag to enable profiling at the table level only, excluding column-level profiling (default: `false`).
- `max_workers`: Number of worker threads to use for profiling, defaulting to 5 times the CPU count.
Test Cases:
Documentation Update:
- `updating-datahub.md`: file to document the breaking change related to the profiling configuration for the Glue source under the "Breaking Changes" section.
QA:
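The defaults listed in the summary could be modelled roughly as follows. This is a minimal sketch, assuming a plain dataclass: the name `GlueProfilingOptionsSketch`, the helper `default_max_workers`, and the lower-bound check on `max_workers` are illustrative assumptions and do not reproduce the real `GlueProfilingConfig` class.

```python
import os
from dataclasses import dataclass, field


def default_max_workers() -> int:
    # os.cpu_count() may return None, so fall back to a single CPU.
    return 5 * (os.cpu_count() or 1)


@dataclass
class GlueProfilingOptionsSketch:
    # Hypothetical model; field names follow the summary, not the real class.
    enabled: bool = False
    profile_table_level_only: bool = False
    max_workers: int = field(default_factory=default_max_workers)

    def __post_init__(self) -> None:
        # Assumed validation, added for illustration only.
        if self.max_workers < 1:
            raise ValueError("max_workers must be at least 1")
```

Using a default factory means the worker count is computed on the machine that runs ingestion, rather than being fixed when the module is imported.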
Checklist
Summary by CodeRabbit
New Features
Improvements
Tests