
[Asset Inventory] Dataflow pipeline is using a deprecated SDK version (2.41.0) #1374

Closed
MarcFundenberger opened this issue Oct 23, 2024 · 5 comments

Comments

@MarcFundenberger

The GCP Dataflow console is alerting us that the Dataflow SDK version used by the Asset Inventory tool is deprecated (2.41.0).
Can you please modify the template to use an up-to-date SDK?

bmenasha added a commit to bmenasha/professional-services that referenced this issue Nov 15, 2024
…1374 and other performance/bug fixes

Issue #1374:
    Use the latest Dataflow SDK version.

Issue #1373: Unable to deal with new cloudbuild.googleapis.com/Build assets:

    The core issue is that the discovery_name of this new asset type is
    incorrectly reported as cloudbuild.googleapis.com/Build rather than 'Build'.
    Work around this by correcting any discovery_name that contains a '/' (a
    sketch follows this paragraph). Other fixes were also needed to speed up
    processing.
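A minimal sketch of that correction, assuming a dict-shaped asset and a hypothetical helper name (the real pipeline's code may differ):

```python
def fix_discovery_name(asset):
    """Hypothetical sketch: strip an asset-type prefix from discovery_name.

    Some asset types (e.g. cloudbuild.googleapis.com/Build) report the full
    asset type in discovery_name instead of the bare message name ('Build').
    """
    name = asset.get('resource', {}).get('discovery_name')
    if name and '/' in name:
        asset['resource']['discovery_name'] = name.rsplit('/', 1)[-1]
    return asset
```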

Other performance/bug fixes:

- Prefer the discovery-document-generated schema, when we have one, over any
  resource-generated schema. This is a big performance improvement because
  deriving the schema from the resource is time consuming, and it's also
  unproductive: if we have an API resource schema, it should always match the
  resource JSON anyway.

- Add ancestors, update_time, location, and json_data to the discovery-generated
  schema. This prevents those properties from being dropped now that we always
  rely on it.

- Sanitize discovery-document-generated schemas. If we always rely on them, they
  could be invalid, so enforce the BigQuery rules on them as well.

- Use copy.deepcopy less: only when copying a source into a destination field.

- Prevent BigQuery columns with BQ_FORBIDDEN_PREFIXES from being created. Some
  Bigtable resources can include these prefixes.

- Some BigQuery model resources had NaN and Infinity values in numeric fields.
  Handle those during sanitization.

- When merging schemas, stop once we have BQ_MAX_COLUMNS fields. This ends the
  merge earlier; it can take forever when there are many unique fields and many
  elements.

- When enforcing the schema on a resource, recognize when we are handling
  additional properties and add the additional-property fields to the value of
  the additional-property key/value list in push_down_additional_properties.
  This produces more regular schemas.

- Add ignore_unknown_values to the load job so we don't fail when a resource
  contains fields not present in the schema (sketched after this list).

- Accept and pass --add-load-date-suffix via main.py.

- Better naming of some local variables for readability.

- Some formatting changes suggested by IntelliJ.
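As a hedged illustration of the ignore_unknown_values change (a sketch only; the bucket, dataset, and table names here are hypothetical and the pipeline's actual load code may differ), the BigQuery Python client exposes the flag on the load job configuration:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # Don't fail the load when a resource contains fields absent from the schema.
    ignore_unknown_values=True,
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/assets/*.json",   # hypothetical staging files
    "my-project.asset_inventory.resource",    # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```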
@bmenasha
Contributor

This pull request should resolve it: #1394.
I'll update the latest pipeline after it's merged.
Thanks.

agold-rh added a commit that referenced this issue Nov 20, 2024
…1394)


Co-authored-by: Andrew Gold <41129777+agold-rh@users.noreply.github.com>
@bmenasha
Contributor

The change is merged and the Dataflow template has been updated. Let me know if there are any problems.
Thanks.

@MarcFundenberger
Author

Hello,
Thanks for your work. The pipeline is working as intended.
However, the column resource.json_data changed type from STRING to JSON with the new pipeline.
I agree it's better this way, but it has the unforeseen consequence that you can't query it across dates that were generated with the old and the new versions of the pipeline.
E.g.:

select resource.json_data from `dfdp-sre-data.shardedAdeoAssetInventory.container_googleapis_com_Cluster_202411*`

crashes with the error: Cannot read field of type STRING as JSON Field: resource.json_data
But the column resource.json_data holds the same data as the rest of the columns, so it's basically redundant...
Anyway, I think it's bad practice to use our inventory dataset this way, so I'll educate the users who try this (and complain).
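One possible workaround (just a sketch: the split between the two wildcards below is hypothetical, would need to match the date the new pipeline actually took over, and earlier days would need their own wildcard) is to query the STRING-typed and JSON-typed shards separately and normalize both sides to STRING before combining them:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical split: shards matching 2024111* were written by the old pipeline
# (json_data is STRING), shards matching 2024112* by the new one (json_data is JSON).
query = """
SELECT resource.json_data AS json_data
FROM `dfdp-sre-data.shardedAdeoAssetInventory.container_googleapis_com_Cluster_2024111*`
UNION ALL
SELECT TO_JSON_STRING(resource.json_data) AS json_data
FROM `dfdp-sre-data.shardedAdeoAssetInventory.container_googleapis_com_Cluster_2024112*`
"""
for row in client.query(query).result():
    print(row.json_data)
```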

Perhaps next time we could work out a way for us to test a pre-release version of the pipeline?

Thanks again.

@bmenasha
Contributor

Yes, this was a backwards-incompatible change. When this code was originally written, BigQuery didn't have a JSON datatype. Honestly, if the JSON datatype had existed back then, I doubt we would have written this tool :)

Perhaps next time we could work out a way for us to test a pre-release version of the pipeline?

The Dataflow template gs://professional-services-tools-asset-inventory/test/import_pipeline is typically updated before the latest Dataflow template, if you want to pick up changes sooner.

so it's basically redundant...

There are situations where the BigQuery schema can't represent the data, for example when the BigQuery column count is exceeded, the depth of nested records exceeds BigQuery's limits, or a column name is too long. We just truncate the data in such cases, but I didn't want to lose the original data in that event.
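For reference, a toy check of the limits being described (the constant names are hypothetical and the values reflect BigQuery's documented limits as of this writing; the tool's real constants and truncation logic may differ):

```python
# Documented BigQuery limits (check the BigQuery docs for current values):
# 10,000 columns per table, 15 levels of nested RECORDs, 300 characters per
# column name.
BQ_MAX_COLUMNS = 10000
BQ_MAX_NESTED_DEPTH = 15
BQ_MAX_COLUMN_NAME_LENGTH = 300


def fits_bigquery(field_name: str, depth: int, column_count: int) -> bool:
    """Return True if a field can be kept without hitting a BigQuery limit."""
    return (column_count < BQ_MAX_COLUMNS
            and depth <= BQ_MAX_NESTED_DEPTH
            and len(field_name) <= BQ_MAX_COLUMN_NAME_LENGTH)
```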

@MarcFundenberger
Author

We just truncate the data in such cases, but I didn't want to lose the original data in that event.

Yes, you're right, of course.
Thank you for your answers.
