Support loading manifest from S3/GCS/Azure BlobStorage #448

Closed · Tracked by #1103 · Fixed by #1109
tatiana opened this issue Aug 8, 2023 · 8 comments
Labels: area:parsing (Related to parsing DAG/DBT improvement, issues, or fixes)

@tatiana (Collaborator)

tatiana commented Aug 8, 2023

Community member Edgaras Navickas mentioned it would be great if users could reference a manifest stored in an S3 bucket. This was a follow-up to issues reported in a Slack thread.

Example:

```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="s3://path/to/manifest.json",
        manifest_conn_id="aws_conn",
    ),
    render_config=RenderConfig(
        load_mode=LoadMode.DBT_MANIFEST,
    ),
    # ...
)
```

We can have separate tickets to support loading manifests from other cloud providers.

@tatiana tatiana added this to the 1.2.0 milestone Sep 6, 2023
@tatiana tatiana modified the milestones: 1.2.0, 1.3.0 Sep 28, 2023
@tatiana tatiana added the area:parsing label Nov 8, 2023
@MrBones757 (Contributor)

MrBones757 commented Nov 23, 2023

I'd like to add some additional ideas / comments here.

Rather than supporting the S3 URI directly, would it be worth creating a set of classes similar to the way we handle profiles?

I.e. I'm thinking something like:

ManifestSourceBase
-> S3ManifestSource
-> ArtifactoryManifestSource
-> NexusManifestSource
etc.

This would make it really modular and allow us to source from numerous artifact stores, supporting more than AWS-specific S3: think Azure Blob Storage, Google Cloud Storage, and Cloudflare R2, as well as those above.

Thinking about integration, it would be fairly easy to allow ManifestSourceBase as a possible type for the manifest path arg, and to make ManifestSourceBase an abstract base.

This also relates to #570; they are somewhat related and competing ideas, as both deal with a remotely sourced manifest (and indeed profiles.yml, for which the same logic could be used; perhaps ManifestSourceBase -> CosmosFileSource or something that would accept injected creds or Airflow conns).
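
For illustration, a minimal sketch of what such a hierarchy could look like. All names here (`ManifestSourceBase`, `S3ManifestSource`, `load`) are hypothetical, not Cosmos's actual API; the S3 example assumes the Amazon provider's `S3Hook` for credential resolution via an Airflow connection:

```
from abc import ABC, abstractmethod


class ManifestSourceBase(ABC):
    """Hypothetical abstract source of a dbt manifest.json."""

    @abstractmethod
    def load(self) -> str:
        """Return the manifest contents as a JSON string."""


class S3ManifestSource(ManifestSourceBase):
    """Hypothetical S3-backed source; other stores would subclass similarly."""

    def __init__(self, bucket: str, key: str, aws_conn_id: str = "aws_default"):
        self.bucket = bucket
        self.key = key
        self.aws_conn_id = aws_conn_id

    def load(self) -> str:
        # S3Hook resolves credentials from the given Airflow connection.
        from airflow.providers.amazon.aws.hooks.s3 import S3Hook

        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        return hook.read_key(key=self.key, bucket_name=self.bucket)
```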

@idealopamp

> [Quotes @MrBones757's comment above in full.]

This would be great. Our team would love to see this for Google Cloud Storage.

@tatiana tatiana modified the milestones: 1.3.0, 1.4.0 Dec 7, 2023
@dosubot dosubot bot added the stale label Mar 9, 2024

dosubot bot commented Mar 9, 2024

Hi, @tatiana,

I'm helping the Cosmos team manage their backlog and am marking this issue as stale. The issue involves adding support for referencing a manifest in an S3 bucket, with additional suggestions for creating a modular set of classes to handle various artifact stores. It seems that the issue is still unresolved, and I'd like to confirm if it's still relevant to the latest version of the Cosmos repository. If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution, and I look forward to hearing from you.

Dosu

@dosubot dosubot bot closed this as not planned Mar 16, 2024
@dosubot dosubot bot removed the stale label Mar 16, 2024
@tatiana tatiana modified the milestones: 1.4.0, Cosmos 1.6.0 Jul 1, 2024
@tatiana tatiana reopened this Jul 1, 2024
@tatiana tatiana changed the title Support loading manifest from S3 Support loading manifest from S3/GCS/Azure BlobStorage Jul 1, 2024
@tatiana (Collaborator, Author)

tatiana commented Jul 2, 2024

There is interest from Astro customers in this feature.

@Pawel-Drabczyk

Are there any estimates of when this feature can be released? This change will help our team a lot.

@pankajkoti (Contributor)

@Pawel-Drabczyk I am analysing this issue at the moment and we would ideally like to have this in the upcoming Cosmos 1.6.0 release.

@pankajkoti (Contributor)

I have created a draft PR to support this: #1109. I tested the implementation with AWS S3 and GCP GCS. I need some help testing with the Azure store, with respect to the right resources and access.

@pankajkoti (Contributor)

PR #1109 is ready for review and I have addressed the review comments so far.

dwreeves pushed a commit to dwreeves/astronomer-cosmos that referenced this issue Jul 31, 2024
…t Store (astronomer#1109)

## Summary

This PR introduces the capability to load `manifest.json` files from
various cloud storage services using Airflow's Object Store integration.
The supported cloud storages include AWS S3, Google Cloud Storage (GCS),
and Azure Blob Storage. The feature allows seamless integration with
remote paths, providing enhanced flexibility and scalability for
managing DBT projects.

### Key Changes

1. Parameters in `DbtDag` and `DbtTaskGroup`:
   - `manifest_path`: accepts both local paths and remote URLs (e.g., S3, GCS, Azure Blob Storage).
   - `manifest_conn_id`: (optional) an Airflow connection ID for accessing the remote path.
2. Automatic detection of storage type: the system identifies the storage service based on the scheme of the provided URL (e.g., `s3://`, `gs://`, `abfs://`) by integrating with the Airflow Object Store.
3. If a `manifest_conn_id` is provided, it is used to fetch the necessary credentials.
4. If no `manifest_conn_id` is provided, the default connection ID for the identified scheme is used (see the sketch below).
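
As a hedged sketch of this resolution flow (the helper name `load_manifest_text` is illustrative, not the actual Cosmos function; it assumes Airflow 2.8's `ObjectStoragePath`):

```
from __future__ import annotations

from airflow.io.path import ObjectStoragePath


def load_manifest_text(manifest_path: str, manifest_conn_id: str | None = None) -> str:
    # ObjectStoragePath picks the storage backend from the URL scheme
    # (s3://, gs://, abfs://). When conn_id is None, the scheme's default
    # Airflow connection is used.
    path = ObjectStoragePath(manifest_path, conn_id=manifest_conn_id)
    return path.read_text()
```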

### Validation and Error Handling

1. Validates the existence of the `manifest.json` file when a path is specified.
2. Raises appropriate errors if a remote `manifest_path` is given but the required minimum Airflow version, 2.8 (which introduced the Object Store feature), is not available; a sketch of these checks follows.
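
A minimal sketch of these checks, assuming Airflow's reported version string and the Object Store API (the helper name, constant, and error wording are illustrative):

```
from __future__ import annotations

from airflow import __version__ as airflow_version
from packaging.version import Version

MIN_OBJECT_STORE_AIRFLOW = Version("2.8.0")  # illustrative constant name


def validate_remote_manifest(manifest_path: str, manifest_conn_id: str | None = None) -> None:
    # Remote paths need Airflow >= 2.8, which introduced the Object Store API.
    if Version(airflow_version) < MIN_OBJECT_STORE_AIRFLOW:
        raise RuntimeError("Remote manifest paths require Airflow 2.8 or later.")

    from airflow.io.path import ObjectStoragePath

    path = ObjectStoragePath(manifest_path, conn_id=manifest_conn_id)
    if not path.exists():
        raise FileNotFoundError(f"manifest.json not found at {manifest_path}")
```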


### Backward Compatibility
Ensures compatibility with existing workflows that use local paths for
the manifest.json.


### How to Use

1. Local Path:
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="/path/to/local/manifest.json",
    ),
    # ...
)
```
2. Remote Path (e.g., S3):
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="s3://bucket/path/to/manifest.json",
        manifest_conn_id="aws_s3_conn",
    ),
    ...
)
```
3. Remote Path without Explicit Connection ID:
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="gs://bucket/path/to/manifest.json",
        # No manifest_conn_id provided, will use default Airflow GCS connection `google_cloud_default`
    ),
    # ...
)
```


### Additional Notes

1. Ensure that the required Airflow version (2.8 or later) is used to
take advantage of the Object Store features.
2. Review the updated documentation for detailed usage instructions and
examples.


### Testing

1. Added unit tests to cover various scenarios including local paths,
remote paths with and without manifest_conn_id.
2. Verified integration with different cloud storage services (AWS S3,
GCS, Azure Blob Storage).
3. Ensured backward compatibility with existing local path workflows.

## Related Issue(s)

closes: astronomer#448

## Breaking Change?
No.

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works

---------

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>