Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(registry): add remove latest partition job #44795

Merged
merged 1 commit into from
Aug 27, 2024

Conversation

bnchrch
Copy link
Contributor

@bnchrch bnchrch commented Aug 26, 2024

What

Provides a job to delete all latest partitions.

Why?

If you remove the partitions they will be readded triggering a reprocess of latest

Review guide

User Impact

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Aug 26, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Aug 26, 2024 8:55pm

Copy link
Contributor Author

bnchrch commented Aug 26, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @bnchrch and the rest of your teammates on Graphite Graphite

@bnchrch bnchrch marked this pull request as ready for review August 26, 2024 20:55
@bnchrch bnchrch requested a review from a team as a code owner August 26, 2024 20:55
Copy link
Contributor

@erohmensing erohmensing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I watched ben test this new job (and subsequent materialization) with mine own eyes. Let's do it

@@ -1,6 +1,6 @@
[tool.poetry]
name = "orchestrator"
version = "0.4.0"
version = "0.4.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
version = "0.4.1"
version = "0.5.0"

Since we added a new job

@op(required_resource_keys={"latest_metadata_file_blobs"})
def remove_latest_metadata_partitions_op(context):
"""
This op is responsible for removing for latest metadata files. (Generally used to reprocess metadata files).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This op is responsible for removing for latest metadata files. (Generally used to reprocess metadata files).
This op is responsible for removing for latest metadata partitions. (Generally used to reprocess metadata files).

The difference between these 2 is what made it so hard for me to comprehend for so long, so keeping it correct :D

Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three questions:

1. Why do yo want to reprocess latest?

My understanding of the system is that partitions are per etag, so if latest changed new metadata partition will get added. If latest did not change I don't get why we'd want to recreate the partition and retrrigger all downstream operations.

2. What is scheduling the job?

I'm not seeing any code change related to scheduling this remove_latest_metadata_partitions. There's probably an implicit logic I'm unaware of.

3. Would this be useful for my release candidate PR

I want to rematerialize an asset on blob deletion.
The asset is a list of available release candidates.

I took an approach which I'm not super satisfied of here.

Your change makes me think I could take the following approach:

  • Consider release_candidate metadata as partitions
  • Make the job which is remove stale metadata partition trigger asset rematerialization on partition deletion with AssetMaterialization(asset_key=AssetKey("my_asset"), description="Rematerializing my asset")

Does it sound better?

@bnchrch
Copy link
Contributor Author

bnchrch commented Aug 27, 2024

@alafanechere Great questions.

@erohmensing and I added this so in the event we updated the registry entry generation logic, we could manually retrigger registry entry processing.

How we do that is simply by deleting the partition keys for all latest metadata files

as it will be picked up on the next "add partition run" and processed again.

@erohmensing
Copy link
Contributor

To add - for example if we add a new (generated) field in the metadata, we want to be able to populate that info for the connectors without having to re-release them

@alafanechere
Copy link
Contributor

Sounds definitely legit. @erohmensing @bnchrch will you manually run the job to delete the partitions?

@erohmensing
Copy link
Contributor

@alafanechere Yep, the plan is to do that in the evening in SF so as not to create a backlog that slows connectors being published during the day!

Copy link
Contributor Author

bnchrch commented Aug 27, 2024

@alafanechere As the new DAG master Ill merge when we have your approval :)

@alafanechere
Copy link
Contributor

alafanechere commented Aug 27, 2024

@alafanechere As the new DAG master Ill merge when we have your approval :)

I'm on GitHub mobile so I don't think I can 👍 from there. But I do approve! Feel free to merge

@bnchrch bnchrch merged commit 6d4ed9d into master Aug 27, 2024
37 checks passed
@bnchrch bnchrch deleted the 08-26-feat_registry_add_remove_latest_partition_job branch August 27, 2024 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants