add ability to set expiration time period for messages in the crawler queue #627

Open
elrayle opened this issue Dec 9, 2024 · 0 comments

elrayle commented Dec 9, 2024

Description

Update the CD crawler so that timeToLive for queue messages is configurable. It currently uses the default of 7 days, which is too short: messages that are not processed within that arbitrary window expire and are lost. This impacts our internal harvester and the missing license backfill process.

If this can't be fixed, the DAG will have to reduce the number of packages it sends to the harvester, which will likely slow down processing. Processing currently averages only 125k per day, though it reaches closer to 500k on some days. The primary driver of throughput is the number of files being scanned by scancode. Some thought is needed on how best to keep the process running without missing packages that drop off the queue after expiring.

Rationale

The backfill DAG puts messages on the queue faster than the GH CD harvester can process them. If these messages expire and drop off the queue unprocessed, the packages will still appear to be missing their license, which may be incorrect.
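
To make the rate mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a strictly FIFO queue that starts empty and constant daily rates, which is a simplification of real harvester behavior. The function names and the 200k/day inflow figure are hypothetical; the 125k/day throughput and the 7-day expiration come from the description above.

```python
import math


def max_safe_backlog(throughput_per_day: float, ttl_days: float) -> float:
    """Largest FIFO backlog that can be drained before its newest message expires."""
    return throughput_per_day * ttl_days


def days_until_first_expiry(
    inflow_per_day: float, throughput_per_day: float, ttl_days: float
) -> float:
    """Days of sustained enqueueing before a newly queued message would expire.

    Assumes an initially empty FIFO queue and constant rates (a simplification).
    Returns math.inf when the consumer keeps up with the producer.
    """
    if inflow_per_day <= throughput_per_day:
        return math.inf
    # A message enqueued on day t waits (inflow - throughput) * t / throughput days.
    # It expires once that wait exceeds ttl_days; solve for t.
    return ttl_days * throughput_per_day / (inflow_per_day - throughput_per_day)


if __name__ == "__main__":
    # 125k/day and 7 days are from the issue; 200k/day inflow is a made-up example.
    print(max_safe_backlog(125_000, 7))
    print(days_until_first_expiry(200_000, 125_000, 7))
```

Under these assumptions, at 125k per day and a 7-day expiration the queue can hold at most 875,000 unprocessed messages before the newest ones are at risk of expiring. A longer configured expiration raises that ceiling proportionally.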

Definition of Done

  • There is a new config setting for message expiration, and messages in the queue show the configured expiration.
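
One possible shape for the new setting, sketched in Python purely to illustrate the logic (the crawler itself may implement this differently). The environment variable name `CRAWLER_QUEUE_MESSAGE_TTL_SECONDS` and the function name are hypothetical, and the sketch assumes the queue client accepts a per-message time-to-live in seconds. It falls back to the current 7-day default when nothing is configured.

```python
import os

# The 7-day default mentioned in this issue, expressed in seconds.
DEFAULT_TTL_SECONDS = 7 * 24 * 60 * 60


def resolve_message_ttl_seconds(
    env=None, var_name="CRAWLER_QUEUE_MESSAGE_TTL_SECONDS"
):
    """Return the message time-to-live in seconds from configuration.

    `var_name` is a hypothetical setting name, not an existing crawler option.
    """
    env = os.environ if env is None else env
    raw = env.get(var_name)
    if raw is None or str(raw).strip() == "":
        # Nothing configured: keep today's behavior.
        return DEFAULT_TTL_SECONDS
    try:
        ttl = int(raw)
    except ValueError:
        raise ValueError(
            f"{var_name} must be an integer number of seconds, got {raw!r}"
        )
    if ttl <= 0:
        raise ValueError(f"{var_name} must be positive, got {ttl}")
    return ttl


if __name__ == "__main__":
    # Example: a 30-day expiration supplied through a dict standing in for the environment.
    print(resolve_message_ttl_seconds({"CRAWLER_QUEUE_MESSAGE_TTL_SECONDS": "2592000"}))
```

The resolved value would then be passed along with each message when it is enqueued, so the configured expiration is visible on messages in the queue.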