FEATURE: Split xref into separate sub-tasks #3814
Labels: backend, feature-request, Moderate
Is your feature request related to a problem? Please describe.
In Aleph, a task is a single unit of background work. Aleph tracks progress of background jobs at the task-level (i.e. it stores how many tasks have been processed so far and how many are still pending). If a task fails, the entire task is retried.
Cross-referencing an entire collection is implemented as a single task. This causes multiple problems:
Aleph currently isn’t able to display any information about the progress of the cross-referencing process (besides the fact that it is still running).
If the task fails, it is retried from the beginning, re-computing matches for entities that have already been processed. Depending on the size of the Aleph instance and the size of the collection, cross-referencing can take hours, sometimes even days.
Aleph uses the Elasticsearch scroll API to iterate over all entities in the collection in batches. The scroll API enforces a timeout on the maximum time between requesting two consecutive batches. If fetching candidates and computing similarity scores for a batch of entities takes longer than this timeout, Elasticsearch raises an error when Aleph requests the next batch of entities.
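For context, the scroll pattern looks roughly like this with the elasticsearch-py 7.x client; the index name, query, and the compute_xref_matches stub are illustrative, not Aleph’s actual code:

```python
from elasticsearch import Elasticsearch

def compute_xref_matches(entity_hit):
    """Hypothetical stand-in for fetching candidates and scoring them."""

es = Elasticsearch()

# "scroll" sets how long Elasticsearch keeps the scroll context alive
# between two consecutive requests (here: 5 minutes).
resp = es.search(
    index="aleph-entities",  # illustrative index name
    scroll="5m",
    size=500,
    body={"query": {"match_all": {}}},
)
scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    for hit in hits:
        # If processing one batch takes longer than the scroll timeout,
        # the es.scroll() call below fails because the search context
        # has expired on the Elasticsearch side.
        compute_xref_matches(hit)
    resp = es.scroll(scroll_id=scroll_id, scroll="5m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

es.clear_scroll(scroll_id=scroll_id)
```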
Describe the solution you'd like
Computing cross-reference matches for a collection should be split into two types of background tasks:
An initial task should use the ES scroll API to iterate over all entities in the collection in batches, similar to what happens right now. However, it shouldn’t actually compute the xref matches for the entities as part of that same task. Instead, this task should merely enqueue a separate task for every batch.
These tasks then compute the actual xref matches (see the sketch after the example below).
For example, if a collection contains 1000 entities and given a batch size of 500:
The initial task iterates over the collection and enqueues two batch tasks, each with the IDs of 500 entities.
Each batch task then computes the xref matches for its 500 entities.
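A minimal sketch of what this split could look like, assuming hypothetical iter_entity_ids, queue_task, and compute_xref_matches helpers rather than Aleph’s actual task API:

```python
BATCH_SIZE = 500

def iter_entity_ids(collection_id):
    """Hypothetical: yields entity IDs via the scroll API (see above)."""
    yield from []

def queue_task(stage, collection_id, **payload):
    """Hypothetical: enqueues a background task with the given payload."""

def compute_xref_matches(collection_id, entity_id):
    """Hypothetical: fetches candidates and scores them for one entity."""

def xref_collection(collection_id):
    """Initial task: iterate over entity IDs in batches and enqueue one
    follow-up task per batch instead of scoring entities inline."""
    batch = []
    for entity_id in iter_entity_ids(collection_id):
        batch.append(entity_id)
        if len(batch) >= BATCH_SIZE:
            queue_task("xref-entities", collection_id, entity_ids=batch)
            batch = []
    if batch:  # the final batch may be smaller than BATCH_SIZE
        queue_task("xref-entities", collection_id, entity_ids=batch)

def xref_entities(collection_id, entity_ids):
    """Follow-up task: compute xref matches for one batch of entities."""
    for entity_id in entity_ids:
        compute_xref_matches(collection_id, entity_id)
```

Because each batch task carries the entity IDs it needs in its payload, a failed batch can be retried in isolation, without touching batches that have already been processed.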
This has several advantages:
Aleph can display the progress of cross-referencing at the granularity of batch tasks, i.e. how many batches have been processed and how many are still pending.
If a batch task fails, only that batch is retried; matches that have already been computed for other batches are not re-computed.
The initial task does nothing but fetch entity IDs between scroll requests, so the time between two consecutive batch requests stays short and the scroll timeout is much less likely to be exceeded.
The disadvantage is that it adds a lot of tasks to the queue with possibly large payloads (the task payload would need to include the IDs for one batch of entities). However, we’re currently migrating from Redis to RabbitMQ as the primary data store for queued tasks. In contrast to Redis, RabbitMQ stores tasks on disk, so this should be less of a problem.
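To put the payload size in perspective (assuming entity IDs of roughly 100 bytes each, which is an illustrative figure rather than Aleph’s actual ID format): a batch of 500 IDs amounts to a payload on the order of 50 KB per task, which is negligible on disk but could add up in an in-memory queue when many large collections are cross-referenced at once.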
Describe alternatives you've considered
Increasing scroll timeouts: We’ve done this before, but it is only a short-term solution, as it cannot be repeated indefinitely.
Additional context