Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEATURE: Split xref into separate sub-tasks #3814

Open
tillprochaska opened this issue Jul 16, 2024 · 3 comments
Open

FEATURE: Split xref into separate sub-tasks #3814

tillprochaska opened this issue Jul 16, 2024 · 3 comments
Assignees
Labels
backend Issues related to Aleph’s backend, API, CLI etc. feature-request Requests for new features or enhancements of existing features Moderate Issue that may require attention

Comments

@tillprochaska
Copy link
Contributor

tillprochaska commented Jul 16, 2024

Is your feature request related to a problem? Please describe.
In Aleph, a task is a single unit of background work. Aleph tracks progress of background jobs at the task-level (i.e. it stores how many tasks have been processed so far and how many are still pending). If a task fails, the entire task is retried.

Cross-referencing an entire collection is implemented as a single task. This causes multiple problems:

  • It means that Aleph currently isn’t able to display information about the progress of the cross-referencing process (besides the fact that it is still running).

  • If the task fails, it is retried from the beginning, re-computing cross-referencings for entities that have already been processed. Depending on the size of the Aleph instance and the size of the collection, computing cross-referencings can take hours, sometimes even days.

  • Aleph uses the Elasticsearch scroll API to iterate over all entities in the collection in batches. The scroll API has a timeout for the maximum time between requesting two batches of entities. If fetching candidates and computing the similarity score for a batch of entities takes longer than the timeout, Elasticsearch will raise an error when Aleph tries to request the next batch of entities.

Describe the solution you'd like
Computing cross-referencings for a collection should be split into two types of background tasks:

  • An initial task should use the ES scroll API to iterate over all entities in the collection in batches, similar to what happens right now. However, it shouldn’t actually compute the xref matches for the entities as part of that same task. Instead, this task should merely enqueue a separate task for every batch.

  • These tasks then compute the actual xref matches.

For example, if a collection contains 1000 entities and given a batch size of 500:

  • Aleph fetches first batch of 500 entities from ES and enqueues a separate background task to compute xref for entities 1 to 500.
  • Aleph fetches second batch of 500 entities from ES and enqueues a separate background task to compute xref for entities 501 to 1000.
  • A worker computes xref for entities 1 to 500.
  • A worker computes xref for entities 501 to 1000.

This has several advantages:

  • If one of the subtasks fails, the failure is limited to one batch of entities (if the batch size is 500, only xref matches for 500 entities would have to be recomputed).
  • Aleph will display detailed progress for the cross-referencing. (For example, if a collection contains 5000 entities and the batch size is 500, Aleph would be able to report progress in 10% increments.)
  • The problems with ES scroll timeouts are gone. The process that uses the scroll API to iterate over all entities doesn’t actually compute the xref matches (which takes a lot of time and can easily exceed the scroll timeouts), it only adds another task to the queue (which is very quick).
  • Xref can be parallelized. Each batch of entities can be handled by a separate worker. (For example, if you have 10 workers that are otherwise idle, xref could take only 1/10th of the time to complete compared to the current implementation.)

The disadvantage is that it adds a lot of tasks to the queue with possibly large payloads (the task payload would need to include the IDs for one batch of entities).

Describe alternatives you've considered
Increasing scroll timeouts: We’ve done this before, but it is only a short-term solution, as it cannot be repeated indefinitely. However, we’re currently migrating to from Redis to RabbitMQ as the primary data store for queued tasks. In contrast to Redis, RabbitMQ stores tasks on disk, so this should be less of a problem.

Additional context

  • This article gives an overview about how cross-referencing is implemented in Aleph.
@tillprochaska tillprochaska added feature-request Requests for new features or enhancements of existing features triage These issues need to be reviewed by the Aleph team backend Issues related to Aleph’s backend, API, CLI etc. and removed triage These issues need to be reviewed by the Aleph team labels Jul 16, 2024
@tillprochaska tillprochaska changed the title FEATURE: Split up xref into separate sub-tasks. FEATURE: Split xref into separate sub-tasks. Jul 16, 2024
@tillprochaska tillprochaska changed the title FEATURE: Split xref into separate sub-tasks. FEATURE: Split xref into separate sub-tasks Jul 16, 2024
@tillprochaska tillprochaska added the Moderate Issue that may require attention label Jul 16, 2024
@TheApeMachine
Copy link

Hi,

I had a meeting today where this issue, as one of two, was proposed as something I could work on? If so, I'd be happy to. I went through the contributing guidelines, code of conduct, etc. and read that I should assign this issue to myself, however, unless I am missing it, I don't think I can.

In any case, I forked the develop branch, and have a local setup going, so I'll just take a look :) Let me know if there is anything I should be doing to follow the process correctly!

@stchris
Copy link
Contributor

stchris commented Jul 26, 2024

Hi @TheApeMachine , we are thankful for you working on this! As a hint: we are working on the 4.0.0 release of aleph which changes the task queuing system from a Redis-based one to one based on RabbitMQ. So in order to future-proof any changes you make you might want to consider targeting the release/4.0.0 branch of aleph.

@TheApeMachine
Copy link

@stchris Thanks for that, I would have branched off develop otherwise :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Issues related to Aleph’s backend, API, CLI etc. feature-request Requests for new features or enhancements of existing features Moderate Issue that may require attention
Projects
None yet
Development

No branches or pull requests

3 participants