
Expand @schedule to trigger based on external events, such as changes to AWS S3 bucket #468

Closed
rapuckett opened this issue Apr 5, 2021 · 7 comments

Comments

@rapuckett

Right now, it's possible to trigger a Metaflow SFN run by manually creating a Lambda triggered by an EventBridge rule (in my current use case, this happens when new data is uploaded to an S3 bucket). This process is manual and potentially involves a lot of boilerplate code, so a FlowSpec-level @schedule decorator would be great for setting this up for each Flow that gets deployed to production.

The decorator would ideally be abstract enough to work on a variety of resource events (e.g. S3, RDS) and not be bound specifically to AWS, likely by leveraging plugins.
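For concreteness, the EventBridge half of that manual wiring might look like the following boto3 sketch (the rule name, bucket, and Lambda ARN are placeholders, and it assumes the bucket has EventBridge notifications enabled):

import json
import boto3

events = boto3.client('events')

# Match object-created events from the data bucket and route them to the Lambda.
events.put_rule(
    Name='my-flow-s3-trigger',
    EventPattern=json.dumps({
        'source': ['aws.s3'],
        'detail-type': ['Object Created'],
        'detail': {'bucket': {'name': ['my-data-bucket']}},
    }),
    State='ENABLED',
)
events.put_targets(
    Rule='my-flow-s3-trigger',
    Targets=[{'Id': 'trigger-my-flow',
              'Arn': 'arn:aws:lambda:us-east-1:xxxxxxxxxxxx:function:trigger-my-flow'}],
)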

@savingoyal
Collaborator

In addition to expanding the @schedule decorator for triggering, we can also add methods that a user can invoke in their flows to trigger other flows deployed on Step Functions.
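Such a method could be a thin wrapper over boto3; a minimal sketch of the idea (the helper name and target ARN are illustrative, not an existing Metaflow API):

import uuid
import boto3

def trigger_flow(state_machine_arn):
    # Kick off another flow already deployed on Step Functions.
    sfn = boto3.client('stepfunctions')
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name='triggered-%s' % uuid.uuid4().hex[:10],
    )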

@rapuckett Can you also elaborate a bit on your specific use case so that we can make sure it's covered in any enhancements we make to @schedule?

@rapuckett
Author

rapuckett commented Apr 8, 2021

My specific use case involves triggering an SFN-based run when new data is deposited into some location in an S3 bucket (e.g. s3://my-data-bucket/ocr.csv). To accomplish this, I've created an S3 event notification on the s3:ObjectCreated:* event, which then invokes my Lambda function. The Lambda invokes a given SFN (currently a hardcoded ARN) by creating a random-ish execution name and then calling the appropriate boto3 method.

import random
import string
import boto3

sfn = boto3.client('stepfunctions')
# Execution names must be unique per state machine, hence the random suffix.
name = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(10))

response = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:xxxxxxxxxxxx:stateMachine:MyFlow',
    name=name
)
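For completeness, wrapped in a handler this could also forward the new object's location as execution input; a sketch assuming the standard S3 notification payload (how the flow consumes the input is a separate question):

import json
import random
import string
import boto3

sfn = boto3.client('stepfunctions')

def lambda_handler(event, context):
    # The S3 notification delivers one record per created object.
    record = event['Records'][0]['s3']
    name = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(10))
    return sfn.start_execution(
        stateMachineArn='arn:aws:states:us-east-1:xxxxxxxxxxxx:stateMachine:MyFlow',
        name=name,
        input=json.dumps({'bucket': record['bucket']['name'], 'key': record['object']['key']}),
    )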

The idea is that the Lambda and S3 event trigger (or EventBridge rule) would be created automatically when an SFN flow gets created, removing the need for the ML engineer to repeatedly create these resources by hand.
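That automation could boil down to a couple of API calls at deploy time; a rough sketch using an S3 bucket notification (bucket and ARN are placeholders, and the Lambda would additionally need a resource policy allowing S3 to invoke it, omitted here):

import boto3

s3 = boto3.client('s3')

# Point s3:ObjectCreated:* events at the trigger Lambda for this flow.
s3.put_bucket_notification_configuration(
    Bucket='my-data-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:xxxxxxxxxxxx:function:trigger-my-flow',
            'Events': ['s3:ObjectCreated:*'],
        }]
    },
)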

I'm not sure how the details might change for, say, GCP or Azure, but an example @schedule could look something like:

@schedule(datasource='<URI>', triggered_at_most='{hourly|daily|weekly|monthly}')

where the URI would be 's3://my-data-bucket/ocr.csv' in my case, and triggered_at_most would safeguard against the datasource changing often (intentionally or accidentally) by limiting the number of times the SFN gets executed. This would be nice to have, but basic triggering would be great for now.
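Applied to a flow, the hypothetical decorator might read like this (the datasource/triggered_at_most arguments are the proposal above, not an existing Metaflow API):

from metaflow import FlowSpec, schedule, step

# Hypothetical: these @schedule arguments are the proposed extension.
@schedule(datasource='s3://my-data-bucket/ocr.csv', triggered_at_most='daily')
class MyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass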

Hope this makes sense.

@savingoyal
Collaborator

Makes sense. I will follow up with a design proposal.

@tuulos
Collaborator

tuulos commented Apr 16, 2021

related ticket: #280

@talebzeghmi

talebzeghmi commented May 5, 2021

This is awesome! It'd be great if the trigger were "meta" and could work for other orchestrators such as Argo and KFP.

related: #245

@leeyh20

leeyh20 commented Nov 22, 2022

Hello! Is this feature being worked on? It sounds like a very useful feature.

@savingoyal
Collaborator

#1271 introduces the basics to support this feature.
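For anyone finding this later: the event-triggering support that followed exposes a flow-level @trigger decorator; a minimal sketch, assuming a recent Metaflow release (the event name is illustrative):

from metaflow import FlowSpec, step, trigger

# Runs whenever an event named 'data_updated' is published to Metaflow's event backend.
@trigger(event='data_updated')
class FreshDataFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass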
