Scenario: a large reindex job may fail or need to be stopped for any number of reasons (the reindexer crashes, the source or destination Solr crashes, peak load is incoming and we need to interrupt the job...). I don't want to start over again.
Trouble is, if one only sorts by ID, resuming from the ID where we left off is meaningless. But some people have a "last modified timestamp" in documents, which might be very useful here:
- we call the reindexer sorting by that timestamp ascending AND by the ID (which has to be in the sort for cursors to work)
- the reindexer writes where it left off to a file. This would be new functionality, more on it below
- when we stop or crash, we can see where we left off. When we restart, we can add a range query like [WHERE-I-LEFT-OFF TO *] on the timestamp, so that the new cursor only picks up data that either wasn't processed, or was processed but modified later
Bonus: this can be used for large datasets where reindexing can take days, so you can do an initial import and then a delta (i.e. what changed since the initial import started)
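The restart step could look roughly like the sketch below, which only builds the Solr request parameters. The helper name `build_resume_params` and the field names `last_modified` and `id` are made up for illustration; `q`, `fq`, `sort`, `rows` and `cursorMark` are standard Solr parameters. Putting the range into `fq` rather than `q` is just one option.

```python
from urllib.parse import urlencode


def build_resume_params(timestamp_field, id_field, left_off=None, rows=1000):
    """Build Solr query parameters for a fresh or resumed cursor run.

    left_off is the checkpointed timestamp string, or None for a first run.
    (Hypothetical helper, not part of the reindexer.)
    """
    params = {
        "q": "*:*",
        "rows": rows,
        # Cursors need the uniqueKey field in the sort as a tie-breaker.
        "sort": f"{timestamp_field} asc, {id_field} asc",
        # A new cursor always starts at "*", even when resuming.
        "cursorMark": "*",
    }
    if left_off is not None:
        # Inclusive lower bound: documents at exactly the checkpoint
        # timestamp are read again, which is safe but slightly redundant.
        params["fq"] = f"{timestamp_field}:[{left_off} TO *]"
    return params


if __name__ == "__main__":
    # Field names and the timestamp value are placeholders.
    resumed = build_resume_params("last_modified", "id", "2024-01-15T10:30:00Z")
    print(urlencode(resumed))
```

The same helper would cover the delta case: run once without `left_off`, then run again with the time the initial import started.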
To implement the "where I left off" tracking, we need:
- a new parameter: the location of the "where I left off" file
- on every page read, we look into the batch and take its last timestamp. We keep the last N of those in a list (N would be queue_size/rows + write_threads)
- we check the queue size and, based on that plus the maximum number of in-flight requests (which depends on the number of write threads), work out the last timestamp up to which the data was actually processed. If we can't easily get the queue size, we can just take the first (oldest) item from the list as the worst-case scenario
- write where we left off (and maybe some debug info: the last seen timestamp and the current queue size?) to the target file and fsync() it
- document this functionality and how we can use it
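A minimal sketch of the bookkeeping described above, taking the worst-case route (oldest retained timestamp) instead of inspecting the queue. `CheckpointTracker` and `read_checkpoint` are hypothetical names. Two things here go beyond the proposal and are assumptions: the window holds N + 1 entries, so that the oldest one belongs to a page behind every page that could still be pending, and the file is written to a temporary name and renamed into place.

```python
import os
from collections import deque


class CheckpointTracker:
    """Tracks a conservative 'where I left off' timestamp (hypothetical helper)."""

    def __init__(self, path, queue_size, rows, write_threads):
        # N from the proposal: pages that can sit in the queue plus pages in flight.
        n = queue_size // rows + write_threads
        self.path = path
        # Assumption: one extra slot beyond N as a safety margin.
        self.recent = deque(maxlen=n + 1)

    def safe_timestamp(self):
        # Until the window is full, every page read so far may still be pending.
        if len(self.recent) < self.recent.maxlen:
            return None
        return self.recent[0]

    def on_page_read(self, last_timestamp_in_batch, queue_len=None):
        """Record the last timestamp of a page and persist the safe one, if any."""
        self.recent.append(last_timestamp_in_batch)
        safe = self.safe_timestamp()
        if safe is not None:
            self._write(safe, last_timestamp_in_batch, queue_len)
        return safe

    def _write(self, safe, last_seen, queue_len):
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            # First line is the resume point; the rest is debug info.
            f.write(f"{safe}\nlast_seen={last_seen}\nqueue_size={queue_len}\n")
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old file so a crash mid-write cannot truncate it.
        os.replace(tmp, self.path)


def read_checkpoint(path):
    """Return the saved resume timestamp, or None if there is no checkpoint."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.readline().strip() or None
    except FileNotFoundError:
        return None
```

On restart, the value returned by `read_checkpoint` would go into the lower bound of the timestamp range query. Because the checkpoint is deliberately behind the real progress, some documents get reindexed twice, which should be harmless as long as writes overwrite by ID.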