Add where-i-left-off functionality #4

Open
5 tasks
radu-gheorghe opened this issue Aug 16, 2023 · 0 comments

Scenario: large reindex job that may fail for whatever reason (reindexer crashes, source/destination Solr crashes, peak load incoming and we need to interrupt the job...). I don't want to start over again.

Trouble is, if one only sorts by ID, resuming from the ID where we left off is meaningless. But some people have a "last modified timestamp" in documents, which might be very useful here:

  • we call the reindexer sorting by that timestamp ascending and then by ID (the ID has to be in the sort for cursors to work)
  • reindexer writes to a file where it left off. This would be new functionality, more on it below
  • when we stop/crash, we can see where we left off. When we restart, we can add a range query like [WHERE-I-LEFT-OFF TO *] so that the new cursor will take only data that either wasn't processed or was processed but modified later
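
As a rough illustration of the restart step, here is a minimal Python sketch that builds the Solr request parameters for a resumed run. The function name `build_resume_params` and the field names `last_modified` and `id` are made up for this example; `q`, `fq`, `sort`, `rows` and `cursorMark` are regular Solr query parameters. How the reindexer would actually pass them is not decided here.

```python
from typing import Optional


def build_resume_params(
    left_off: Optional[str],
    timestamp_field: str = "last_modified",  # hypothetical field name
    id_field: str = "id",                    # hypothetical unique key field
    rows: int = 1000,
) -> dict:
    """Build Solr parameters for a cursor that starts where we left off."""
    params = {
        "q": "*:*",
        # Timestamp first, then the unique key as the tie-breaker for cursors.
        "sort": f"{timestamp_field} asc, {id_field} asc",
        "rows": rows,
        # A fresh cursor always starts with "*".
        "cursorMark": "*",
    }
    if left_off is not None:
        # Inclusive range: documents at exactly the saved timestamp are
        # read again, which is safe but may reindex a few of them twice.
        params["fq"] = f"{timestamp_field}:[{left_off} TO *]"
    return params
```

Calling `build_resume_params("2023-08-16T10:00:00Z")` gives a filter of `last_modified:[2023-08-16T10:00:00Z TO *]`; passing `None` (first run, no file yet) leaves the filter out.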

Bonus: this can be used for large datasets where reindexing can take days, so you can do an initial import and then a delta (i.e. what changed since the initial import started)

To implement "where I left off", we need:

  • a new parameter - the location of this file with "where I left off"
  • on every page read, we look into the batch and see the last timestamp. Keep the last N of those in a list (N would be queue_size/rows + write_threads)
  • we check the queue size and, based on that plus the maximum number of in-flight requests (which depends on the number of write threads and the current queue size), judge what's the last timestamp where we actually processed the data. If we can't easily get the queue size, we can just take the first item from the list as the worst-case scenario
  • write where we left off (and maybe some debug info: the last seen timestamp + current queue size?) to the target file and fsync() it
  • document this functionality and how we can use it
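
To make the steps above more concrete, here is a minimal Python sketch of the worst-case variant (take the first item from the list). The names `window_size` and `CheckpointTracker` are made up for this example, and it assumes queue_size is counted in documents. Writing to a temporary file and renaming it is an addition of this sketch, not part of the proposal, so that a crash mid-write doesn't leave a truncated file. Whether N needs an extra page of margin depends on when the reader records a page relative to enqueueing it, which is still open.

```python
import math
import os
from collections import deque
from typing import Optional


def window_size(queue_size: int, rows: int, write_threads: int) -> int:
    """N = pages that fit in the queue + pages held by the write threads."""
    return math.ceil(queue_size / rows) + write_threads


class CheckpointTracker:
    """Keeps the last N page-end timestamps and persists the oldest one."""

    def __init__(self, path: str, window: int):
        self.path = path
        self.recent = deque(maxlen=window)

    def page_read(self, last_timestamp: str) -> Optional[str]:
        """Record the last timestamp of a page that was just read."""
        self.recent.append(last_timestamp)
        if len(self.recent) < self.recent.maxlen:
            # Fewer than N pages read so far: nothing is known to be done.
            return None
        # Worst case: everything newer than the oldest entry is in flight.
        checkpoint = self.recent[0]
        self._write(checkpoint, last_seen=last_timestamp)
        return checkpoint

    def _write(self, checkpoint: str, last_seen: str) -> None:
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            # First line is the checkpoint, the second is debug info.
            f.write(f"{checkpoint}\n# last seen: {last_seen}\n")
            f.flush()
            os.fsync(f.fileno())
        # Swap the new file in only after its contents are on disk.
        os.replace(tmp, self.path)
```

On restart, the first line of the file would be the WHERE-I-LEFT-OFF value for the range query. With queue_size=1000, rows=100 and 2 write threads, `window_size` gives N=12, so the file always trails the reader by 11 pages.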