deterministic table name collision resolution for normalization #2206
Conversation
One thing, I think it would be more helpful if the format was:
{stream_name}_{lineage_hash}_{submodel_name}
Otherwise it is very hard to understand what is in the table with just hashes
On the Hubspot catalog, you get these fields for example:
- root stream name: contacts
- Lineage: contacts/properties/
- Nested Property name: hs_content_membership_registration_domain_sent_to

On Postgres destinations, we already truncate the name of that table to hs_content_membership___tration_domain_sent_to to save a few characters, but if we were to include contacts or contacts/properties on top of it, then it would become even more difficult to fit... But yes, if we can at least fit in contacts (maybe in cases where we don't hit the character limits at least?).
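For illustration only, the format being proposed could be assembled roughly like this (the helper name and the hash function below are assumptions for the sketch, not the PR's actual implementation):

```python
import hashlib


def proposed_table_name(stream_name: str, lineage: str, submodel_name: str, hash_len: int = 6) -> str:
    # Deterministic short hash of the JSON lineage, e.g. "contacts/properties"
    lineage_hash = hashlib.md5(lineage.encode("utf-8")).hexdigest()[:hash_len]
    return f"{stream_name}_{lineage_hash}_{submodel_name}"


# Hubspot example from the comment above:
print(proposed_table_name("contacts", "contacts/properties", "hs_content_membership_registration_domain_sent_to"))
# -> contacts_<6-char hash>_hs_content_membership_registration_domain_sent_to
```

The readable root stream name stays up front, the hash disambiguates identical submodel names, and the result may still need truncation on destinations with short identifier limits.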
```diff
@@ -418,7 +421,7 @@ def generate_final_model(self, from_table: str, column_names: Dict[str, Tuple[st
     def list_fields(self, column_names: Dict[str, Tuple[str, str]]) -> List[str]:
         return [column_names[field][0] for field in column_names]

-    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = "") -> str:
+    def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: str = None) -> str:
```
If you run mypy you'll get this warning because of this change:

```
airbyte-integrations/bases/base-normalization/normalization/transform_catalog/stream_processor.py:424:77: error: Incompatible default for argument "suffix" (default has type "None", argument has type "str")
```

So it was on purpose that suffix defaulted to the empty string "".
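If a None default is actually wanted, the usual way to keep mypy happy would be an Optional annotation. A signature sketch only (the `...` marks the elided body; keeping the original "" default may well be the simpler choice for this PR):

```python
from typing import Optional


def add_to_outputs(self, sql: str, is_intermediate: bool, suffix: Optional[str] = None) -> str:
    # Fall back to the original empty-string behaviour when no suffix is given.
    suffix = suffix if suffix is not None else ""
    ...
```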
FYI, when running the tests I included here: #2211, an exception is raised with your code because it is generating duplicated table names on the Hubspot catalog.json. I don't know why... is it because 6-character hashes are still generating collisions? Another remark: intermediate tables are also using the lineage_hash in their naming... Should this be kept aligned, with the lineage hash used only on nested tables/intermediate tables?
Looking closer at the destination limits, BigQuery (Snowflake too?) has pretty generous limits on identifier length. Would it make sense not to include the not-so-user-friendly lineage hash by default in those destinations, since we would more rarely need to truncate anything?
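For reference, destinations do differ a lot here (Postgres truncates identifiers at 63 bytes, Snowflake allows 255 characters, BigQuery table IDs can be much longer). The numbers and helper below are illustrative only; the normalization code keeps its own per-destination constants:

```python
# Illustrative only: rough per-destination identifier length limits.
MAX_IDENTIFIER_LENGTH = {
    "postgres": 63,     # identifiers are truncated at 63 bytes
    "snowflake": 255,
    "bigquery": 1024,   # table IDs can be much longer than on other warehouses
}


def should_include_lineage_hash(destination: str, readable_name: str) -> bool:
    # Hypothetical helper: only fall back to hash-based naming when the
    # fully readable name would not fit the destination's limit.
    return len(readable_name) > MAX_IDENTIFIER_LENGTH[destination]
```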
PTAL, should be working as expected if you look at the test cases.
```diff
@@ -26,6 +26,7 @@

 import pendulum as pendulum
 from base_python import BaseClient
```
the new annoying formatting disagreement between spotless and black? 🙄
```python
def get_table_name(
    name_transformer: DestinationNameTransformer, root_table: str, base_table_name: str, suffix: str, lineage_hash: str
) -> str:
    prefix = f"{name_transformer.normalize_table_name(root_table)[:10]}_{lineage_hash}_" if root_table else ""
```
Shouldn't we truncate only if we run into character limit problems?
If we have space to spare, why wouldn't we include the full name for clarity?
agreed. we should be degrading only if needed.
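A sketch of that "degrade only if needed" idea, reusing the shape of the get_table_name snippet above (the 10-character root truncation and the lineage-hash fallback come from that snippet; everything else is illustrative):

```python
def get_table_name_sketch(name_transformer, root_table: str, base_table_name: str, lineage_hash: str) -> str:
    # name_transformer is the DestinationNameTransformer used elsewhere in the PR.
    # Prefer the fully readable name; only degrade to the truncated-root + lineage-hash
    # form when the readable name would exceed the destination's limit.
    readable = f"{root_table}_{base_table_name}" if root_table else base_table_name
    if len(readable) <= name_transformer.get_name_max_length():
        return name_transformer.normalize_table_name(readable)
    prefix = f"{name_transformer.normalize_table_name(root_table)[:10]}_{lineage_hash}_"
    return name_transformer.normalize_table_name(f"{prefix}{base_table_name}")
```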
Looks good, modulo the comment above and adding a comment in the code to explain the truncation logic.
```python
    # Add extra characters '__', signaling a truncate in identifier
    print(f"Truncating {input_name} (#{len(input_name)}) to {prefix}__{suffix} (#{2 + len(prefix) + len(suffix)})")
    input_name = f"{prefix}__{suffix}"
else:
```
Shouldn't it be that if it is not there we should assume no need for truncation?
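For context, the kind of middle truncation these lines implement could look roughly like this as a standalone sketch (not the exact code in DestinationNameTransformer; the 46-character limit in the usage example is hypothetical):

```python
def middle_truncate(input_name: str, limit: int) -> str:
    # Nothing to do when the identifier already fits the destination's limit.
    if len(input_name) <= limit:
        return input_name
    # Keep the head and tail of the identifier and mark the cut with '__'.
    budget = limit - 2  # reserve two characters for the '__' marker
    prefix = input_name[: budget // 2]
    suffix = input_name[-(budget - budget // 2):]
    return f"{prefix}__{suffix}"


# With a hypothetical effective limit of 46 characters this reproduces the
# example from the thread above:
print(middle_truncate("hs_content_membership_registration_domain_sent_to", 46))
# hs_content_membership___tration_domain_sent_to
```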
```diff
@@ -463,18 +464,10 @@ def generate_new_table_name(self, is_intermediate: bool, suffix: str) -> str:
         # see alias in dbt: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases/
         pass
         pass
     elif self.parent is None:
```
`not self.parent`
```python
new_table_name = self.name_transformer.normalize_table_name(f"{table_name}_{i}")
if new_table_name not in tables_registry:
    break
new_table_name = get_table_name(self.name_transformer, self.json_path[0], new_table_name, suffix, self.json_path)
```
While keeping your logic in `get_table_name`, we could make it even easier for the user to follow through the multiple levels of nesting. WDYT of `"_".join(self.json_path)` instead of `self.json_path[0]`? The json_path will still be truncated in priority (keeping the first parent within the `MINIMUM_PARENT_LENGTH` characters) to fit in the proper limits when necessary, but we would expose as much of the JSON path as possible?
done in 64df137, just needed to exclude the final element of the path (itself)
Looks great!
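A sketch of the resulting naming, assuming json_path is the list of nested field names ending with the current field itself (the helper name is made up for the example):

```python
from typing import List


def lineage_prefix(json_path: List[str]) -> str:
    # Join every parent level but exclude the final element (the field itself), e.g.
    # ["contacts", "properties", "hs_content_membership_registration_domain_sent_to"] -> "contacts_properties".
    return "_".join(json_path[:-1])
```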
```python
    @param input_name is the identifier name to middle truncate
    @param custom_limit uses a custom length as the max instead of the destination max length
    """
    limit = custom_limit if custom_limit > 0 else self.get_name_max_length()
```
shouldn't we use None instead of a placeholder value?
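The None-instead-of-placeholder suggestion could look like this signature sketch (the method name is only assumed from the docstring above, and the `...` marks the rest of the method's logic):

```python
from typing import Optional


def truncate_identifier_name(self, input_name: str, custom_limit: Optional[int] = None) -> str:
    # None makes "no custom limit" explicit and avoids the `custom_limit > 0` sentinel check.
    limit = custom_limit if custom_limit is not None else self.get_name_max_length()
    ...
```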
Resolves #2055
Curious what you think of this approach of having hashes for all nested tables instead of just duplicates.
My reasoning was that it seemed un-intuitive that you could go from a schema with just `B -> x`, which is rendered as subtable `x`, and then maybe update the schema to have both `B -> x` AND `A -> x`, which would label the one from `B` as `x_hash` and the new one from `A` as `x`, if the method used to decide which is the duplicate defines `A < B`.

With this, it isn't ergonomic to refer to the tables with the hashes, but at the same time it may be more ergonomic than trying to figure out how to rename usages downstream?
I want to confirm this first. Then I'll add more unit testing.
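To make the A/B example concrete, here is a minimal sketch of why a deterministic, lineage-based suffix avoids that order-dependence (the hash function and its 6-character length are assumptions; the PR may use something different):

```python
import hashlib


def nested_table_name(lineage: str, field: str, hash_len: int = 6) -> str:
    # The suffix depends only on the field's own lineage, so adding A -> x later
    # never changes the name that B -> x was already given.
    lineage_hash = hashlib.md5(lineage.encode("utf-8")).hexdigest()[:hash_len]
    return f"{field}_{lineage_hash}"


print(nested_table_name("B/x", "x"))  # stable name for the original B -> x table
print(nested_table_name("A/x", "x"))  # a different, equally stable name for the new A -> x table
```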