Implement RMap re-indexing application #139

Open
emetsger opened this issue Oct 23, 2017 · 0 comments
emetsger commented Oct 23, 2017

As part of the production release, RMap needs to be able to (re)populate its Solr index from the triplestore. The need to re-index could arise for any number of reasons, but two common scenarios are:

  • recovery from a corrupt index
  • change in index schema, which would require updating existing documents in the index

The re-indexer could be implemented as a Spring Boot application.

Option 1

At a high level, re-indexing from the triplestore would involve:

  1. retrieving the RMap events from the triplestore
  2. publishing those events to Kafka
  3. letting the indexer consume those events normally

Note that step 3 could re-use the existing code path currently used by the indexer. It is steps 1 and 2 that differ from the normal code path (normally ORMapEventMgr produces events in response to user actions; in a re-indexing scenario, separate logic would be used to produce events).
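A minimal sketch of steps 1 and 2, assuming a SPARQL-accessible triplestore (queried here with RDF4J) and a String-serialized payload on a hypothetical `rmap-reindex-events` topic; the actual query, event serialization, and topic name would have to match whatever the existing indexer consumes:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class Reindexer {

    public static void main(String[] args) {
        // Hypothetical SPARQL endpoint; adjust for the actual deployment.
        SPARQLRepository repo =
            new SPARQLRepository("http://localhost:8080/rdf4j-server/repositories/rmap");
        repo.init();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (RepositoryConnection conn = repo.getConnection();
             Producer<String, String> producer = new KafkaProducer<>(props)) {

            // Step 1: retrieve event IRIs from the triplestore.  The type IRI is
            // illustrative; the real query would also order events chronologically.
            TupleQuery query = conn.prepareTupleQuery(
                "SELECT ?event WHERE { ?event a <http://purl.org/ontology/rmap#Event> }");
            try (TupleQueryResult result = query.evaluate()) {
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    String eventIri = bs.getValue("event").stringValue();
                    // Step 2: publish each event to Kafka.  The real application would
                    // serialize the full RMapEvent in the form the indexer expects.
                    producer.send(new ProducerRecord<>("rmap-reindex-events", eventIri, eventIri));
                }
            }
            producer.flush();
        } finally {
            repo.shutDown();
        }
    }
}
```

Step 3 then happens for free: the indexer consumes the re-published events exactly as it consumes events produced by normal user actions.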

Option 2

An alternative workflow (sketched below) could be:

  1. retrieve the RMap events from the triplestore
  2. create Solr documents for the events
  3. deposit the documents directly into Solr (bypassing Kafka)
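A minimal sketch of the direct-to-Solr path using SolrJ, with an entirely hypothetical core name and field names; the real documents would have to match the schema the indexer produces today:

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectSolrDepositor {

    // Hypothetical Solr base URL and core name.
    private final SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/rmap").build();

    /**
     * Builds a Solr document for a single event and deposits it directly,
     * bypassing Kafka.  Field names here are placeholders.
     */
    public void deposit(String eventIri, String discoIri, String status)
            throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", eventIri);
        doc.addField("disco_uri", discoIri);
        doc.addField("disco_status", status);
        solr.add(doc);
    }

    /** Commit once at the end of the run rather than per document. */
    public void commit() throws SolrServerException, IOException {
        solr.commit();
    }
}
```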

Considerations

  1. The triplestore, as @karenhanson has warned, can exhibit performance issues when dealing with large result sets, and response times may be slow. Maybe the triplestore can fall over! I don't know. The re-indexer should be prepared to deal with a triplestore that may (appear to) become unavailable during the course of re-indexing, and be able to resume indexing from a particular event id (or timestamp?).
  2. The indexer stores Kafka metadata with Solr documents. This allows the indexer to resume reading the event stream without receiving duplicate events or missing events (i.e. KafkaMetadata is part of the implementation of exactly-once messaging). If Option 2 is implemented, we would need to be sure the indexer can still properly resume consuming the event stream. This should be fine as long as the DEFAULT_SEEK_BEHAVIOR is Seek.EARLIEST in SaveOffsetOnRebalance.
  3. If RMap can be placed in a "read-only" mode while re-indexing is taking place, then the RMap application and the indexer don't have to be concerned with handling indexing events coming from the RMap API/UI and from the re-indexing process at the same time. If we want RMap to stay live (i.e. in read/write mode) while re-indexing takes place, then we'll need something fancier, and I would lean towards Option 1 for that kind of implementation. For example, there could be two topics: one for receiving events from the re-indexing application, and one for receiving events from RMap. Consumption from the RMap topic could be paused (Kafka supports this) while the indexer consumes from the re-indexing topic instead; when the re-index is finished, consumption from the RMap topic can be resumed (see the sketch after this list).
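For the two-topic variant of Option 1, pausing and resuming are standard Kafka consumer operations; a rough sketch, with `rmap-events` as a hypothetical name for the live topic:

```java
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TopicSwitcher {

    /** Pauses the live RMap topic so the indexer only consumes re-indexing events. */
    static void pauseLiveTopic(KafkaConsumer<String, String> consumer) {
        consumer.pause(partitionsOf(consumer, "rmap-events"));
    }

    /** Resumes the live RMap topic once the re-index has finished. */
    static void resumeLiveTopic(KafkaConsumer<String, String> consumer) {
        consumer.resume(partitionsOf(consumer, "rmap-events"));
    }

    /** Collects the currently assigned partitions belonging to the given topic. */
    private static Set<TopicPartition> partitionsOf(KafkaConsumer<String, String> consumer,
                                                    String topic) {
        return consumer.assignment().stream()
            .filter(tp -> tp.topic().equals(topic))
            .collect(Collectors.toSet());
    }
}
```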