Implement RMap re-indexing application #139

Open
emetsger opened this issue Oct 23, 2017 · 0 comments
emetsger commented Oct 23, 2017

As part of the production release, RMap needs to be able to (re)populate its Solr index from the triplestore. The need to re-index could arise for any number of reasons, but two common scenarios are:

  • recovery from a corrupt index
  • change in index schema, which would require updating existing documents in the index

The re-indexer could be implemented as a Spring Boot application.

Option 1

At a high level, re-indexing from the triplestore would involve:

  1. retrieving the RMap events from the triplestore
  2. publishing those events to Kafka
  3. letting the indexer consume those events normally

Note that step 3 could re-use the existing code path currently used by the indexer. It is steps 1 and 2 that differ from the normal code path (normally ORMapEventMgr produces events in response to user actions; in a re-indexing scenario, separate logic would be used to produce events).
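A minimal sketch of steps 1 and 2, assuming a SPARQL-accessible triplestore (queried here with RDF4J) and a String-serialized payload on a hypothetical `rmap-reindex-events` topic; the actual query, event serialization, and topic name would have to match whatever the existing indexer consumes:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class Reindexer {

    public static void main(String[] args) {
        // Hypothetical SPARQL endpoint; adjust for the actual deployment.
        SPARQLRepository repo =
            new SPARQLRepository("http://localhost:8080/rdf4j-server/repositories/rmap");
        repo.init();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (RepositoryConnection conn = repo.getConnection();
             Producer<String, String> producer = new KafkaProducer<>(props)) {

            // Step 1: retrieve event IRIs from the triplestore.  The type IRI is
            // illustrative; the real query would also order events chronologically.
            TupleQuery query = conn.prepareTupleQuery(
                "SELECT ?event WHERE { ?event a <http://purl.org/ontology/rmap#Event> }");
            try (TupleQueryResult result = query.evaluate()) {
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    String eventIri = bs.getValue("event").stringValue();
                    // Step 2: publish each event to Kafka.  The real application would
                    // serialize the full RMapEvent in the form the indexer expects.
                    producer.send(new ProducerRecord<>("rmap-reindex-events", eventIri, eventIri));
                }
            }
            producer.flush();
        } finally {
            repo.shutDown();
        }
    }
}
```

Step 3 then happens for free: the indexer consumes the re-published events exactly as it consumes events produced by normal user actions.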

Option 2

An alternative workflow (sketched below) could be:

  1. retrieve the RMap events from the triplestore
  2. create Solr documents for the events
  3. deposit the documents directly into Solr (bypassing Kafka)
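A minimal sketch of the direct-to-Solr path using SolrJ, with an entirely hypothetical core name and field names; the real documents would have to match the schema the indexer produces today:

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectSolrDepositor {

    // Hypothetical Solr base URL and core name.
    private final SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/rmap").build();

    /**
     * Builds a Solr document for a single event and deposits it directly,
     * bypassing Kafka.  Field names here are placeholders.
     */
    public void deposit(String eventIri, String discoIri, String status)
            throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", eventIri);
        doc.addField("disco_uri", discoIri);
        doc.addField("disco_status", status);
        solr.add(doc);
    }

    /** Commit once at the end of the run rather than per document. */
    public void commit() throws SolrServerException, IOException {
        solr.commit();
    }
}
```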

Considerations

  1. The triplestore, as @karenhanson has warned, can exhibit performance issues when dealing with large result sets, and response times may be slow. Maybe the triplestore can fall over! I don't know. The re-indexer should be prepared to deal with a triplestore that may (appear to) become unavailable during the course of re-indexing, and be able to resume indexing from a particular event id (or timestamp?).
  2. The indexer stores Kafka metadata with Solr documents. This allows the indexer to resume reading the event stream without receiving duplicate events or missing events (i.e. KafkaMetadata is part of the implementation of exactly-once messaging). If Option 2 is implemented, we would need to be sure the indexer can still properly resume consuming the event stream. This should be fine as long as the DEFAULT_SEEK_BEHAVIOR is Seek.EARLIEST in SaveOffsetOnRebalance.
  3. If RMap can be placed in a "read-only" mode while re-indexing is taking place, then the RMap application and the indexer don't have to be concerned with handling indexing events coming from the RMap API/UI and from the re-indexing process at the same time. If we want RMap to stay live (i.e. in read/write mode) while re-indexing takes place, then we'll need something fancier, and I would lean towards Option 1 for that kind of implementation. For example, there could be two topics: one for receiving events from the re-indexing application, and one for receiving events from RMap. Consumption from the RMap topic could be paused (Kafka supports this) while the indexer consumes from the re-indexing topic instead; when the re-index is finished, consumption from the RMap topic can be resumed (see the sketch after this list).
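For the two-topic variant of Option 1, pausing and resuming are standard Kafka consumer operations; a rough sketch, with `rmap-events` as a hypothetical name for the live topic:

```java
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TopicSwitcher {

    /** Pauses the live RMap topic so the indexer only consumes re-indexing events. */
    static void pauseLiveTopic(KafkaConsumer<String, String> consumer) {
        consumer.pause(partitionsOf(consumer, "rmap-events"));
    }

    /** Resumes the live RMap topic once the re-index has finished. */
    static void resumeLiveTopic(KafkaConsumer<String, String> consumer) {
        consumer.resume(partitionsOf(consumer, "rmap-events"));
    }

    /** Collects the currently assigned partitions belonging to the given topic. */
    private static Set<TopicPartition> partitionsOf(KafkaConsumer<String, String> consumer,
                                                    String topic) {
        return consumer.assignment().stream()
            .filter(tp -> tp.topic().equals(topic))
            .collect(Collectors.toSet());
    }
}
```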