AIP-62: add lineage support for Object Store #40829

Merged 1 commit on Jul 23, 2024

@mobuchowski commented Jul 16, 2024

This PR is based on #40819

This adds support for getting lineage directly from Object Store's ObjectStoragePath.

Not every operation is tracked: only operations that read or modify file contents are, not operations that touch only metadata.

Copy, rename, and move operations are tracked internally. Reads and writes are tracked through TrackingFileWrapper, a proxy around the opened file that collects them.
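To illustrate the proxy idea, here is a minimal sketch, not the code in this PR. `RecordingFileProxy`, the `events` list, and the `mem://example.bin` name are hypothetical; the sketch only assumes that recording happens on `read` and `write` while every other call passes through untouched.

```python
import io


class RecordingFileProxy:
    """Hypothetical stand-in for a tracking wrapper; not Airflow's TrackingFileWrapper."""

    def __init__(self, wrapped, name, events):
        self._wrapped = wrapped
        self._name = name
        self._events = events  # shared list of (operation, name) tuples

    def read(self, *args, **kwargs):
        # Record the read, then delegate to the real file object.
        self._events.append(("read", self._name))
        return self._wrapped.read(*args, **kwargs)

    def write(self, *args, **kwargs):
        # Record the write, then delegate to the real file object.
        self._events.append(("write", self._name))
        return self._wrapped.write(*args, **kwargs)

    def __getattr__(self, attr):
        # Anything else (seek, tell, close, ...) is delegated without recording.
        return getattr(self._wrapped, attr)


events = []
proxy = RecordingFileProxy(io.BytesIO(), "mem://example.bin", events)
proxy.write(b"hello")
proxy.seek(0)  # positioning call: passed through, not recorded
data = proxy.read()
```

After this runs, `events` holds one write entry and one read entry for the wrapped object, while the `seek` call leaves no trace.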

This also allows tracking data reads and writes made by other systems that accept file APIs. For example, this code from the Object Store tutorial in Airflow:

base = ObjectStoragePath("s3://aws_default@airflow-tutorial-data/")
(...)

path = base / f"air_quality_{formatted_date}.parquet"

df = pd.DataFrame(response.json()).astype(aq_fields)
with path.open("wb") as file:
    df.to_parquet(file)

can generate lineage.
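The same pattern can be tried without Airflow, S3, or pandas. The sketch below is an assumption-laden imitation: `TrackedPath`, `TrackedFile`, and the `outputs` list are hypothetical names, a local temporary directory stands in for the bucket, and `json.dump` stands in for `df.to_parquet` as the third-party writer that only needs a file-like object.

```python
import json
import tempfile
from pathlib import Path


class TrackedFile:
    """Hypothetical proxy: notes the first write, then delegates to the real file."""

    def __init__(self, fileobj, uri, outputs):
        self._fileobj = fileobj
        self._uri = uri
        self._outputs = outputs

    def write(self, data):
        # Register the target once, no matter how many chunks are written.
        if self._uri not in self._outputs:
            self._outputs.append(self._uri)
        return self._fileobj.write(data)

    # Context-manager methods must be defined explicitly:
    # special-method lookup does not go through __getattr__.
    def __enter__(self):
        self._fileobj.__enter__()
        return self

    def __exit__(self, *exc_info):
        return self._fileobj.__exit__(*exc_info)

    def __getattr__(self, attr):
        return getattr(self._fileobj, attr)


class TrackedPath:
    """Hypothetical path type whose open() hands out tracked file objects."""

    def __init__(self, path, outputs):
        self._path = Path(path)
        self._outputs = outputs

    def open(self, mode="r"):
        return TrackedFile(self._path.open(mode), self._path.as_uri(), self._outputs)


outputs = []
with tempfile.TemporaryDirectory() as tmp:
    path = TrackedPath(Path(tmp) / "air_quality.json", outputs)
    with path.open("w") as file:
        # json.dump only knows it was given an object with a write() method.
        json.dump([{"pm10": 12.5}], file)
    written = json.loads((Path(tmp) / "air_quality.json").read_text())
```

The writer never learns it is being observed, yet `outputs` ends up holding the file URI, which is the kind of information a lineage collector would need.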

FileTransferOperator already has OpenLineage (OL) support, the one implemented under AIP-53.

@mobuchowski force-pushed the aip-62/object-storage branch from f277448 to 26d257b on July 17, 2024 13:16
@mobuchowski added the AIP-62 label (Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation) on Jul 17, 2024
@mobuchowski force-pushed the aip-62/object-storage branch 4 times, most recently from 2313f98 to b4dd98d, on July 18, 2024 14:20
@potiuk force-pushed the aip-62/object-storage branch from b4dd98d to dea8071 on July 18, 2024 18:30
@mobuchowski force-pushed the aip-62/object-storage branch 3 times, most recently from 15f6920 to ccdee06, on July 22, 2024 09:01
@mobuchowski added the full tests needed label (We need to run full set of tests for this PR to merge) on Jul 22, 2024
@uranusjr left a comment

Looks good to me aside from the store property (which I don’t have enough knowledge on)

airflow/io/path.py (outdated, resolved)
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski force-pushed the aip-62/object-storage branch from ccdee06 to 2c3c34b on July 23, 2024 12:22
@mobuchowski merged commit 6adae0b into main on Jul 23, 2024
74 of 76 checks passed
@ephraimbuddy added the changelog:skip label (Changes that should be skipped from the changelog (CI, tests, etc..)) on Jul 24, 2024
@mobuchowski deleted the aip-62/object-storage branch on August 2, 2024 13:40
Labels
AIP-62 (Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation)
area:dev-tools
area:lineage
area:providers
changelog:skip (Changes that should be skipped from the changelog (CI, tests, etc..))
full tests needed (We need to run full set of tests for this PR to merge)
provider:amazon-aws (AWS/Amazon - related issues)
provider:common-io
4 participants