Add where-i-left-off functionality #4

radu-gheorghe · 2023-08-16T13:59:33Z

Scenario: a large reindex job may fail or have to stop for whatever reason (the reindexer crashes, the source/destination Solr crashes, peak load is incoming and we need to interrupt the job...). I don't want to start over again.

Trouble is, if one only sorts by ID, resuming from the ID where we left off is meaningless. But some people have a "last modified timestamp" in documents, which might be very useful here:

we call the reindexer sorting by that timestamp ascending AND by the ID (the ID has to be in the sort for cursors to work)
the reindexer writes where it left off to a file. This would be new functionality, more on it below
when we stop/crash, we can see where we left off. When we restart, we can add a range query on the timestamp like [WHERE-I-LEFT-OFF TO *], so that the new cursor only picks up data that either wasn't processed, or was processed but modified later
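
The restart step above could look roughly like the sketch below. It only builds the Solr request parameters; the helper name `build_resume_params` and the field names `last_modified` and `id` are made up for illustration and are not part of the reindexer.

```python
from typing import Dict, Optional, Union


def build_resume_params(
    timestamp_field: str,
    id_field: str,
    left_off: Optional[str] = None,
    rows: int = 1000,
) -> Dict[str, Union[str, int]]:
    """Build Solr query parameters for a (possibly resumed) cursor walk.

    Hypothetical helper: the name and arguments are assumptions.
    """
    params: Dict[str, Union[str, int]] = {
        "q": "*:*",
        # Timestamp first, then the ID: cursors need the unique key in the sort.
        "sort": f"{timestamp_field} asc, {id_field} asc",
        "rows": rows,
        # "*" starts a new cursor.
        "cursorMark": "*",
    }
    if left_off is not None:
        # Inclusive range, so documents sharing the checkpoint timestamp
        # are read again rather than skipped.
        params["fq"] = f"{timestamp_field}:[{left_off} TO *]"
    return params


if __name__ == "__main__":
    print(build_resume_params("last_modified", "id", "2023-08-16T13:59:33Z"))
```

Because the range is inclusive, a few documents around the checkpoint get reindexed twice, which should be harmless as long as writes to the destination overwrite by ID.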

Bonus: this can be used for large datasets where reindexing can take days, so you can do an initial import and then a delta (i.e. what changed since the initial import started)

To implement "where I left off", we need:

a new parameter - the location of this file with "where I left off"
on every page read, we look into the batch and see the last timestamp. Keep the last N of those in a list (N would be queue_size/rows + write_threads)
we check the queue size and, based on that plus the maximum number of in-flight requests (which depends on the number of write threads and the current queue size), work out the last timestamp for which the data was actually processed. If we can't easily get the queue size, we can just take the first item from the list as the worst-case scenario
write where we left off (and maybe the debug info - the last seen timestamp + current queue size?) in the target file and fsync() it
document this functionality and how we can use it
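
A minimal sketch of the bookkeeping described in the steps above, to make the arithmetic concrete. Everything here is an assumption for illustration: the class name `CheckpointTracker`, the JSON file layout, and the choice to keep one entry more than the maximum number of in-flight pages (so the oldest entry in a full list is always behind anything still queued or being written). The write goes to a temporary file, is fsync()-ed and then renamed over the target, so a crash mid-write doesn't leave a half-written checkpoint.

```python
import json
import math
import os
from collections import deque
from typing import Optional


class CheckpointTracker:
    """Hypothetical helper that tracks the last timestamp of each page read."""

    def __init__(self, path: str, rows: int, max_queue_size: int, write_threads: int):
        self.path = path
        self.rows = rows
        self.write_threads = write_threads
        # Pages that may be queued or being written at any one time.
        max_in_flight = math.ceil(max_queue_size / rows) + write_threads
        # Assumption: keep one extra entry beyond the in-flight window.
        self.page_timestamps = deque(maxlen=max_in_flight + 1)

    def page_read(self, last_timestamp: str) -> None:
        """Record the last timestamp seen in the page just read."""
        self.page_timestamps.append(last_timestamp)

    def safe_timestamp(self, queue_size: Optional[int] = None) -> Optional[str]:
        """Return the newest timestamp that is certainly processed, if any."""
        if queue_size is None:
            # Worst case: trust only the oldest entry of a full list.
            if len(self.page_timestamps) < self.page_timestamps.maxlen:
                return None
            return self.page_timestamps[0]
        in_flight = math.ceil(queue_size / self.rows) + self.write_threads
        index = len(self.page_timestamps) - 1 - in_flight
        return self.page_timestamps[index] if index >= 0 else None

    def write(self, queue_size: Optional[int] = None) -> Optional[str]:
        """Persist the checkpoint plus debug info; return what was written."""
        safe = self.safe_timestamp(queue_size)
        if safe is None:
            return None  # nothing is known to be processed yet
        record = {
            "left_off": safe,
            "last_seen": self.page_timestamps[-1],
            "queue_size": queue_size,
        }
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)  # atomic swap of the checkpoint file
        return safe
```

With `rows=100`, `max_queue_size=200` and `write_threads=1`, at most 3 pages are in flight, so the list holds 4 entries. After the reader has seen 5 pages, the worst-case checkpoint is the last timestamp of page 2; if the queue is known to hold only 100 documents, the checkpoint can move up to page 3.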
