feat(ingest/gc): Add dataflow and soft deleted entities cleanup #11102
Conversation
Review skipped: auto reviews are disabled on this repository; check the settings in the CodeRabbit UI.

Walkthrough: The changes introduce a new metadata cleanup capability to the DataHub ingestion framework. A new entry point is added in `metadata-ingestion/setup.py`.
Force-pushed from `ba8200c` to `d4204da`.
Actionable comments posted: 7
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/setup.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py
- 1-1: `json` imported but unused; remove unused import (F401)
- 3-3: `os` imported but unused; remove unused import (F401)
- 8-8: `typing.Set` imported but unused; remove unused import (F401)
- 14-14: `datahub.configuration.source_common.EnvConfigMixin` imported but unused; remove unused import (F401)
- 15-15: `datahub.configuration.source_common.PlatformInstanceConfigMixin` imported but unused; remove unused import (F401)
- 18-18: `datahub.emitter.mce_builder.make_dataset_urn_with_platform_instance` imported but unused; remove unused import (F401)
- 19-19: `datahub.emitter.mce_builder.make_user_urn` imported but unused; remove unused import (F401)
- 21-21: `datahub.emitter.sql_parsing_builder.SqlParsingBuilder` imported but unused; remove unused import (F401)
- 25-25: `datahub.ingestion.api.decorators.capability` imported but unused; remove unused import (F401)
- 33-33: `datahub.ingestion.api.source.SourceCapability` imported but unused; remove unused import (F401)
- 39-39: `datahub.ingestion.source.usage.usage_common.BaseUsageConfig` imported but unused; remove unused import (F401)
- 40-40: `datahub.sql_parsing.schema_resolver.SchemaResolver` imported but unused; remove unused import (F401)
- 41-41: `datahub.sql_parsing.sqlglot_lineage.sqlglot_lineage` imported but unused; remove unused import (F401)
- 242-243: Use a single `if` statement instead of nested `if` statements; combine them using `and` (SIM102)
- 372-372: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Additional comments not posted (7)
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (6)
- 123-155: LGTM! The `MetadataCleanupConfig` class is well-defined and follows best practices.
- 158-163: LGTM! The `DataFlowEntity` class is well-defined and follows best practices.
- 166-173: LGTM! The `DataJobEntity` class is well-defined and follows best practices.
- 176-179: LGTM! The `MetadataCleanupSourceReport` class is well-defined and follows best practices.
- 182-204: LGTM! The `MetadataCleanupSource` class is well-defined and follows best practices.
- 46-99: LGTM! The GraphQL queries are well-defined and follow best practices.

metadata-ingestion/setup.py (1)

- 687-687: LGTM! The new entry point for the `MetadataCleanupSource` class is well-defined and follows best practices.
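For background, source entry points in `metadata-ingestion/setup.py` are typically registered roughly as sketched below. This is a hedged illustration only: the plugin name (`datahub-gc`) and module path are assumptions based on this PR's naming, not copied from the diff.

```python
# Sketch of a DataHub source plugin registration in setup.py.
# Plugin name and module path below are illustrative assumptions.
from setuptools import setup

setup(
    name="acryl-datahub",
    # ... other arguments elided ...
    entry_points={
        "datahub.ingestion.source.plugins": [
            "datahub-gc = datahub.ingestion.source.metadata_cleanup:MetadataCleanupSource",
        ],
    },
)
```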
```python
    make_dataset_urn_with_platform_instance,
    make_user_urn,
)
from datahub.emitter.sql_parsing_builder import SqlParsingBuilder
```
Remove unused imports.
The following imports are not used and should be removed: `make_dataset_urn_with_platform_instance`, `make_user_urn`, `SqlParsingBuilder`.
```diff
-from datahub.emitter.mce_builder import (
-    make_dataset_urn_with_platform_instance,
-    make_user_urn,
-)
-from datahub.emitter.sql_parsing_builder import SqlParsingBuilder
```
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from functools import partial
from typing import Dict, Iterable, List, Optional, Set
```
Remove unused import.
The `Set` import from `typing` is not used and should be removed.

```diff
-from typing import Set
```
```python
    EnvConfigMixin,
    PlatformInstanceConfigMixin,
)
```
Remove unused imports.
The following imports are not used and should be removed: `EnvConfigMixin`, `PlatformInstanceConfigMixin`.

```diff
-from datahub.configuration.source_common import (
-    EnvConfigMixin,
-    PlatformInstanceConfigMixin,
-)
```
```python
import json
import logging
import os
```
Remove unused imports.
The following imports are not used and should be removed: `json`, `os`.

```diff
-import json
-import os
```
```python
    capability,
    config_class,
    platform_name,
    support_status,
)
from datahub.ingestion.api.source import (
    MetadataWorkUnitProcessor,
    Source,
    SourceCapability,
```
Remove unused imports.
The following imports are not used and should be removed: `capability`, `SourceCapability`, `BaseUsageConfig`, `SchemaResolver`, `sqlglot_lineage`.

```diff
-from datahub.ingestion.api.decorators import (
-    capability,
-)
-from datahub.ingestion.api.source import (
-    SourceCapability,
-)
-from datahub.ingestion.source.usage.usage_common import BaseUsageConfig
-from datahub.sql_parsing.schema_resolver import SchemaResolver
-from datahub.sql_parsing.sqlglot_lineage import sqlglot_lineage
```
```python
        if self.config.keep_last_n:
            if len(dpis) >= self.config.keep_last_n:
```
Combine nested `if` statements.
Use a single `if` statement instead of nested `if` statements.

```diff
-        if self.config.keep_last_n:
-            if len(dpis) >= self.config.keep_last_n:
+        if self.config.keep_last_n and len(dpis) >= self.config.keep_last_n:
```
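To make the combined condition concrete, here is a minimal illustrative sketch of a keep-last-N truncation. It is not the PR's actual implementation; the shape of `dpis` (newest-first, dicts with a `urn` key) and the `delete_entity` callback are assumptions.

```python
from typing import Any, Callable, Dict, List, Optional

def keep_last_n_dpi(
    dpis: List[Dict[str, Any]],          # assumed ordered newest-first
    keep_last_n: Optional[int],          # None/0 disables truncation
    delete_entity: Callable[[str, str], None],  # hypothetical delete helper
) -> None:
    # Short-circuiting `and` replaces the nested ifs flagged by SIM102:
    # nothing runs unless a limit is set and at least that many DPIs exist.
    if keep_last_n and len(dpis) >= keep_last_n:
        # Keep the first N entries, delete everything past the limit.
        for dpi in dpis[keep_last_n:]:
            delete_entity(dpi["urn"], "dataProcessInstance")
```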
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (1)
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (1)
- 104-136: Fix typos in field descriptions. There are typos in the descriptions of `delete_empty_data_jobs` and `delete_empty_data_flows`.

```diff
-    description="Wether to delete Data Jobs without runs"
+    description="Whether to delete Data Jobs without runs"
-    description="Wether to delete Data Flows without runs"
+    description="Whether to delete Data Flows without runs"
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py
- 358-358: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Additional comments not posted (13)
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (13)
- 139-144: LGTM! The `DataFlowEntity` class is well-defined.
- 147-154: LGTM! The `DataJobEntity` class is well-defined.
- 157-163: LGTM! The `MetadataCleanupSourceReport` class is well-defined.
- 175-182: LGTM! The `__init__` method is well-defined and handles errors appropriately.
- 184-187: LGTM! The `create` method is well-defined and follows best practices.
- 189-190: LGTM! The `get_report` method is well-defined.
- 192-193: LGTM! The `get_workunit_processors` method is well-defined and follows best practices.
- 195-217: LGTM! The `fetch_dpis` method is well-defined, handles errors appropriately, and follows best practices.
- 219-232: LGTM! The `keep_last_n_dpi` method is well-defined, handles errors appropriately, and follows best practices.
- 233-244: LGTM! The `delete_entity` method is well-defined, handles errors appropriately, and follows best practices.
- 246-263: LGTM! The `delete_dpi_from_datajobs` method is well-defined, handles errors appropriately, and follows best practices.
- 264-288: LGTM! The `remove_old_dpis` method is well-defined, handles errors appropriately, and follows best practices.
- 290-314: LGTM! The `get_data_flows` method is well-defined, handles errors appropriately, and follows best practices.
```python
import logging
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from functools import partial
from typing import Dict, Iterable, List, Optional

from pydantic import Field

from datahub.configuration import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.decorators import (
    SupportStatus,
    config_class,
    platform_name,
    support_status,
)
from datahub.ingestion.api.source import MetadataWorkUnitProcessor, Source, SourceReport
from datahub.ingestion.api.source_helpers import auto_workunit_reporter
from datahub.ingestion.api.workunit import MetadataWorkUnit
from datahub.ingestion.graph.client import DataHubGraph
from datahub.utilities.lossy_collections import LossyList
from datahub.utilities.stats_collections import TopKDict
```
Remove unused imports.
The following imports are not used and should be removed: `defaultdict`, `partial`, `Field`, `datetime`, `timezone`, `LossyList`.

```diff
-from collections import defaultdict
-from functools import partial
-from pydantic import Field
-from datetime import datetime, timezone
-from datahub.utilities.lossy_collections import LossyList
```
```python
    def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
        assert self.ctx.graph

        dataFlows: Dict[str, DataFlowEntity] = {}
        for flow in self.get_data_flows():
            dataFlows[flow.urn] = flow

        scroll_id: Optional[str] = None
        dataJobs: Dict[str, List[DataJobEntity]] = defaultdict(list)

        while True:
            result = self.ctx.graph.execute_graphql(
                DATAJOB_QUERY,
                {"query": "*", scroll_id: scroll_id if scroll_id else "null"},
            )
            scrollAcrossEntities = result.get("scrollAcrossEntities")
            if not scrollAcrossEntities:
                raise ValueError("Missing scrollAcrossEntities in response")

            scroll_id = scrollAcrossEntities.get("nextScrollId")
            for job in scrollAcrossEntities.get("searchResults"):
                datajob_entity = DataJobEntity(
                    urn=job.get("entity").get("urn"),
                    flow_urn=job.get("entity").get("dataFlow").get("urn"),
                    lastIngested=job.get("entity").get("lastIngested"),
                    jobId=job.get("entity").get("jobId"),
                    dataPlatformInstance=job.get("entity").get("dataPlatformInstance"),
                    total_runs=job.get("entity").get("runs").get("total"),
                )
                if datajob_entity.total_runs > 0:
                    self.delete_dpi_from_datajobs(datajob_entity)
                if (
                    datajob_entity.total_runs == 0
                    and self.config.delete_empty_data_jobs
                ):
                    logger.info(
                        f"Deleting datajob {datajob_entity.urn} because there are no runs"
                    )
                    self.delete_entity(datajob_entity.urn, "dataJob")
                else:
                    dataJobs[datajob_entity.flow_urn].append(datajob_entity)

            for key in dataFlows.keys():
                if (
                    not dataJobs.get(key) or len(dataJobs[key]) == 0
                ) and self.config.delete_empty_data_flows:
                    logger.info(
                        f"Deleting dataflow {key} because there are not datajobs"
                    )
                    self.delete_entity(key, "dataFlow")

            if not scroll_id:
                break
```
Use `key in dict` instead of `key in dict.keys()`.
Remove `.keys()` for better readability and performance.

```diff
-            for key in dataFlows.keys():
+            for key in dataFlows:
```
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py
- 363-363: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Additional comments not posted (16)
metadata-ingestion/src/datahub/ingestion/source/metadata_cleanup.py (16)
- 104-136: LGTM! The `MetadataCleanupConfig` class is well-defined with appropriate fields and descriptions.
- 140-144: LGTM! The `DataFlowEntity` class is well-defined with appropriate fields.
- 148-154: LGTM! The `DataJobEntity` class is well-defined with appropriate fields.
- 158-163: LGTM! The `MetadataCleanupSourceReport` class is well-defined with appropriate fields.
- 180-187: LGTM! The `__init__` method is well-defined with appropriate initialization steps.
- 189-192: LGTM! The `create` method is well-defined with appropriate steps to create an instance of `MetadataCleanupSource`.
- 194-195: LGTM! The `get_report` method is well-defined with appropriate steps to return the report.
- 197-198: LGTM! The `get_workunit_processors` method is well-defined with appropriate steps to return the work unit processors.
- 200-222: LGTM! The `fetch_dpis` method is well-defined with appropriate steps to fetch data process instances for a given job URN.
- 224-236: LGTM! The `keep_last_n_dpi` method is well-defined with appropriate steps to keep the last N data process instances for a given job.
- 238-249: LGTM! The `delete_entity` method is well-defined with appropriate steps to delete an entity with a given URN and type.
- 251-267: LGTM! The `delete_dpi_from_datajobs` method is well-defined with appropriate steps to delete data process instances from data jobs.
- 269-293: LGTM! The `remove_old_dpis` method is well-defined with appropriate steps to remove old data process instances based on retention days.
- 295-319: LGTM! The `get_data_flows` method is well-defined with appropriate steps to fetch data flows.
- 321-374: LGTM! The `get_workunits_internal` method is well-defined with appropriate steps to fetch internal work units.
- 2-6: Remove unused imports. The following imports are not used and should be removed: `defaultdict`, `partial`.

```diff
-from collections import defaultdict
-from functools import partial
```

Likely invalid or redundant comment.
```python
from datahub.utilities.lossy_collections import LossyList
from datahub.utilities.stats_collections import TopKDict
```
Remove unused import.
The following import is not used and should be removed: `LossyList`.

```diff
-from datahub.utilities.lossy_collections import LossyList
```
```python
from functools import partial
from typing import Dict, Iterable, List, Optional

from pydantic import Field
```
Remove unused import.
The `Field` import from `pydantic` is not used and should be removed.

```diff
-from pydantic import Field
```
```python
                else:
                    dataJobs[datajob_entity.flow_urn].append(datajob_entity)

            for key in dataFlows.keys():
```
Use `key in dict` instead of `key in dict.keys()`.
Remove `.keys()` for better readability and performance.

```diff
-            for key in dataFlows.keys():
+            for key in dataFlows:
```
```python
import logging
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
```
Remove unused imports.
The following imports are not used and should be removed: `datetime`, `timezone`.

```diff
-from datetime import datetime, timezone
```
```diff
 @dataclass
-class DataHubGcSourceReport(SourceReport):
+class DataHubGcSourceReport(DataProcessCleanupReport, SoftDeletedEntitiesReport):
```
In general, I tend to prefer nesting / composition instead of inheritance, e.g.:

```python
class DataHubGcSourceReport:
    data_process_cleanup: DataProcessCleanupReport
    soft_deleted_entities_cleanup: SoftDeletedEntitiesReport
```

Same for the configs. However, I know it's a larger change, so we can definitely defer it to later as well.
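A runnable sketch of the composition pattern the reviewer describes (purely illustrative; the sub-report fields and their dataclass shapes are assumptions, not the PR's actual types):

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the real report classes.
@dataclass
class DataProcessCleanupReport:
    num_aspects_removed: int = 0

@dataclass
class SoftDeletedEntitiesReport:
    num_soft_deleted_entity_removed: int = 0

# Composition: the GC report holds the sub-reports as fields
# instead of inheriting from both of them.
@dataclass
class DataHubGcSourceReport:
    data_process_cleanup: DataProcessCleanupReport = field(
        default_factory=DataProcessCleanupReport
    )
    soft_deleted_entities_cleanup: SoftDeletedEntitiesReport = field(
        default_factory=SoftDeletedEntitiesReport
    )
```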
```python
    @classmethod
    def create(cls, config_dict, ctx):
        config = DataHubGcSourceConfig.parse_obj(config_dict)
        return cls(ctx, config)

    def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
        return [partial(auto_workunit_reporter, self.get_report())]
```
Add a comment here; this disables auto status and a couple of other things on purpose.
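A sketch of what the requested comment might look like as a method body (the comment wording and the exact list of disabled behaviors are assumptions based on the reviewer's remark):

```python
from functools import partial
from typing import List, Optional

from datahub.ingestion.api.source import MetadataWorkUnitProcessor
from datahub.ingestion.api.source_helpers import auto_workunit_reporter

def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
    # Intentionally return only the reporter: this skips the automatic
    # status aspect (and related default processors) on purpose, since a
    # cleanup source should not emit status metadata for entities it deletes.
    return [partial(auto_workunit_reporter, self.get_report())]
```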
```python
        entity_urn = Urn.create_from_string(urn)
        self.report.num_soft_deleted_entity_removed += 1
        self.report.num_soft_deleted_entity_removed_by_type[entity_urn.entity_type] = (
```
Can use a `defaultdict` here and for other types in the report.
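For illustration, a minimal sketch of the defaultdict version (field names borrowed from the snippet above; the report shape is an assumption, not the PR's actual class):

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SoftDeletedEntitiesReport:
    num_soft_deleted_entity_removed: int = 0
    # defaultdict(int) removes the need to check for a missing key
    # before incrementing the per-type counter.
    num_soft_deleted_entity_removed_by_type: Dict[str, int] = field(
        default_factory=lambda: defaultdict(int)
    )

report = SoftDeletedEntitiesReport()
report.num_soft_deleted_entity_removed += 1
report.num_soft_deleted_entity_removed_by_type["dataset"] += 1  # no KeyError
```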
```python
            entity_urn.entity_type
        ].append(urn)

        self.ctx.graph.delete_entity(urn, True)
```
Suggested change (pass the boolean as a keyword argument for readability):

```diff
-        self.ctx.graph.delete_entity(urn, True)
+        self.ctx.graph.delete_entity(urn, hard=True)
```
Should this be in the `gc` directory?
Add a way to clean up dataflows, datajobs, dataprocessinstances, and soft-deleted entities.
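For orientation, a hypothetical end-to-end invocation of a cleanup source like this one is sketched below. The source type name (`datahub-gc`) and the config keys are assumptions drawn from names seen in this review, not a confirmed interface.

```python
# Hypothetical recipe: source type name and config keys are assumptions.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "datahub-gc",
            "config": {
                "delete_empty_data_jobs": True,
                "delete_empty_data_flows": True,
                "keep_last_n": 5,
            },
        },
        # Cleanup emits little metadata; a console sink is enough for a dry look.
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()
```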