Manual incremental updating of the Lucene Full-Text-Index #925

Closed
subotic opened this issue Jul 5, 2018 · 15 comments · Fixed by #1362

@subotic (Collaborator) commented Jul 5, 2018

As we don't use the GraphDB Lucene Connector, we need to manually update the Lucene Full-Text-Index.

We do this by running the following SPARQL query after each update:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

According to the docs, we should instead (or additionally) use the following SPARQL query:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {  luc:fullTextSearchIndex luc:addToIndex <resourceIRI> . }

The first query adds all new resources to the index. If an update only changes an existing resource, the index entry for that resource will not be updated. At least, this is how I understand the documentation.
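
For illustration, here is a minimal sketch of how the per-resource update could be sent over HTTP from Python, using the standard SPARQL 1.1 update endpoint that GraphDB exposes under /repositories/<repo>/statements. The host, repository name, and resource IRI below are placeholders, not values taken from this project:

import requests

# Placeholder endpoint: http://<graphdb-host>:7200/repositories/<repository-id>/statements
GRAPHDB_UPDATE_ENDPOINT = "http://localhost:7200/repositories/knora-prod/statements"

def add_resource_to_index(resource_iri: str) -> None:
    # Ask GraphDB's Lucene FTS plugin to (re)index a single resource.
    update = f"""
        PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
        INSERT DATA {{ luc:fullTextSearchIndex luc:addToIndex <{resource_iri}> . }}
    """
    response = requests.post(
        GRAPHDB_UPDATE_ENDPOINT,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
    )
    response.raise_for_status()

# Hypothetical resource IRI, purely for illustration:
add_resource_to_index("http://rdfh.ch/0001/example-resource")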

@subotic (Collaborator, Author) commented Jun 21, 2019

@benjamingeer I don't know if you are aware of this issue.

Also, the first query (updating the index for all new resources) should be automatically run on startup. With the BEOL project, we always have some issues where the index needs to be updated manually. Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore, I load the data by hand using the GraphDB Workbench, and in those cases it would also be helpful if webapi updated the search index at startup.

@subotic added the enhancement (improve existing code or new feature) label on Jun 21, 2019
@benjamingeer commented Jun 21, 2019

Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore

Why not?

it would also be helpful if webapi updated the search index at startup.

Should this be the same thing that HttpTriplestoreConnector.sparqlHttpUpdate does after each update?

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

Should we switch to using the GraphDB Lucene Connector?

@subotic (Collaborator, Author) commented Jun 21, 2019

Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore

Why not?

Because we have data that we cannot delete. Changing data in production means adding or exchanging project-specific graphs. The scripts we have delete everything and load test data. Good for development, but not for production. Also, using the scripts requires cloning the repository, which shouldn't be necessary for production. Everything that is necessary for running the stack should be packaged as versioned Docker images. This way we can eliminate a few sources of errors.

it would also be helpful if webapi updated the search index at startup.

Should this be the same thing that HttpTriplestoreConnector.sparqlHttpUpdate does after each update?

Yes, exactly.

Should we switch to using the GraphDB Lucene Connector?

This is what Ontotext recommends. A few years ago, I tried using the GraphDB Lucene Connector but wasn't very successful. We should try it out again. It has a different syntax for search queries, so it is not a quick thing to do.
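
For comparison, here is a rough sketch of what a connector-based search could look like, based on the GraphDB Lucene connector documentation: queries go through a connector instance instead of luc:fullTextSearchIndex. The index name knora_index, the repository URL, and the search term are placeholders, and the connector itself would have to be created and configured first:

import requests

GRAPHDB_QUERY_ENDPOINT = "http://localhost:7200/repositories/knora-prod"  # placeholder

# Connector-style full-text query; "knora_index" is a hypothetical connector instance.
connector_query = """
    PREFIX lucene: <http://www.ontotext.com/connectors/lucene#>
    PREFIX lucene-index: <http://www.ontotext.com/connectors/lucene/instance#>
    SELECT ?entity WHERE {
        ?search a lucene-index:knora_index ;
                lucene:query "Leibniz" ;
                lucene:entities ?entity .
    }
"""

response = requests.post(
    GRAPHDB_QUERY_ENDPOINT,
    data={"query": connector_query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["entity"]["value"])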

@benjamingeer commented:

The scripts we have delete everything and load test data. Good for development, but not for production.

How about if we make better scripts? The repository update framework could be a good basis for that. There's already code there for downloading and uploading named graphs. Would you like to make a list of requirements? :)

A few years ago, I tried using the GraphDB Lucene Connector but wasn't very successful. We should try it out again. It has a different syntax for search queries, so it is not a quick thing to do.

For now I can make a PR so Knora updates the Lucene index on startup, and later we can try again to use the GraphDB Lucene Connector.

@subotic (Collaborator, Author) commented Jun 21, 2019

How about if we make better scripts? The repository update framework could be a good basis for that. There's already code there for downloading and uploading named graphs. Would you like to make a list of requirements? :)

I've started to create a command line tool for this: https://github.com/dhlab-basel/knoractl

But what would be very helpful, and I'm not sure how to do it, is to add an admin route which returns all data (all ontology graphs, the data graph, project, user, permissions) for a certain project as a TriG file. The goal would be to create a project and import data with knora-py, and then download a single TriG, allowing easy deployment.
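
If such a route existed, the download side could look roughly like this. Everything here is hypothetical (route path, host, project IRI, and credentials); it is only meant to illustrate the intended workflow:

import requests
from urllib.parse import quote

KNORA_API = "http://localhost:3333"           # placeholder webapi host
PROJECT_IRI = "http://rdfh.ch/projects/0001"  # hypothetical project IRI

# Hypothetical admin route returning all of a project's graphs as a single TriG file.
url = f"{KNORA_API}/admin/projects/iri/{quote(PROJECT_IRI, safe='')}/AllData"

response = requests.get(url, auth=("admin@example.com", "password"))  # placeholder credentials
response.raise_for_status()

with open("project-export.trig", "wb") as f:
    f.write(response.content)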

We will need to figure out how to also get all the stuff from Sipi.

@benjamingeer commented:

I've started to create a command line tool for this

Wow, it's in C++. Have fun! 😃

add an admin route which returns all data (all ontology graphs, the data graph, project, user, permissions) for a certain project as a TriG file

This I can do.

@benjamingeer commented:

Add admin route to dump project data #1358

@loicjaouen (Contributor) commented:

My experience is that I run the aforementioned sparqlHttpUpdate query after each data upload to our prod server:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

I must not have specifically run into the issue of modifying an existing value by hand.


Having said that, the content of the comments has changed quite a bit since the issue's title was set.

When it comes to modifying production data, it is not easy to know what the good practices are, which queries and tools exist, which ones are safe, and which ones are still works in progress.

Uploading new projects was not a problem, since we worked out the consistency issues in our existing data corpus when upgrading to Knora v6 (which has a stricter checker).

Updating an existing project still sometimes leads to an exception that blocks the whole GraphDB/Knora stack. So for now, while we can still afford it, I always dump the whole prod GraphDB, generate a replica, test my changes, and make sure that I have a dump I can start from before either applying the changes on prod or deleting prod and reinstalling from the dump.

Last example was:

-               rdfs:subClassOf knora-base:DocumentRepresentation ,
+               rdfs:subClassOf knora-base:Resource ,

Doing this by hand in prod (workbench edit) led to an unrecoverable exception, even though there was no instance of that class.

The story goes like this: you make a change, and at the next restart of Knora you get:

api      | [2019-04-05 00:02:15,507] ERROR - OneForOneStrategy - Failed to connect to triplestore

caused by:

api      | Caused by: java.net.SocketTimeoutException: Read timed out

I didn't get to the bottom of it and did not find the reason for that timeout.

The whole Knora stack is down until you roll back to a previous backup or tested dump (about 15 minutes).

So it would be nice to have:

  • trusted tools for modifying an existing corpus that avoid this situation
  • isolation of projects, so that changing one project cannot freeze everything

But we can live for quite some time with a full dump as a safe rollback.

@benjamingeer commented:

Doing this by hand in prod (workbench edit) led to an unrecoverable exception, even though there was no instance of that class.

If you change something in the triplestore by hand while Knora is running, Knora will get confused. If you want to do this, you have to stop Knora, make the change in the triplestore, then start Knora again.

api | Caused by: java.net.SocketTimeoutException: Read timed out

Was it trying to connect to the triplestore, or to Sipi? Either way, this doesn't necessarily indicate a problem with your data. There are a lot of reasons why the triplestore could take a long time to do something. Try increasing triplestore.query-timeout in application.conf.

@loicjaouen (Contributor) commented:

@benjamingeer: Knora was down; it was at restart, when Knora was trying to initialize itself from data read from the triplestore, that we got this.
Doing the same through a GraphDB dump and a sed command worked without touching the query-timeout settings.
Next time it happens we might dig further; we were already a version of Knora behind and under time constraints ;)

Dumping a whole graph is already a step toward project isolation in the handling of operational data maintenance.

Having a way to clean a whole project (to replace it with a new trig) would also help.

When you dump the data, beware of users that belong to multiple projects.

@benjamingeer commented Jun 24, 2019

Doing the same through a GraphDB dump and a sed command worked without touching the query-timeout settings.

Maybe your SPARQL did something unexpected. For the sake of predictability, I've found it easier to update data by modifying dump files than by writing SPARQL by hand. That's why Knora's repository update framework (in upgrade) works that way: it's a Python script that dumps the repository to files and parses the files using rdflib to modify their contents.
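
To illustrate the pattern (this is only a sketch with placeholder file names, not the actual upgrade framework code), the rdfs:subClassOf change from the example above could be applied to a TriG dump with rdflib like this:

from rdflib import ConjunctiveGraph, Namespace
from rdflib.namespace import RDFS

KNORA_BASE = Namespace("http://www.knora.org/ontology/knora-base#")

# Parse the whole repository dump, including named graphs.
dump = ConjunctiveGraph()
dump.parse("knora-prod-dump.trig", format="trig")  # placeholder file name

# Replace rdfs:subClassOf knora-base:DocumentRepresentation with
# rdfs:subClassOf knora-base:Resource, in whatever named graph it occurs.
to_fix = list(dump.quads((None, RDFS.subClassOf, KNORA_BASE.DocumentRepresentation, None)))
for subj, pred, obj, ctx in to_fix:
    dump.remove((subj, pred, obj, ctx))
    dump.add((subj, RDFS.subClassOf, KNORA_BASE.Resource, ctx))

dump.serialize(destination="knora-prod-dump-fixed.trig", format="trig")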

When you dump the data, beware of users that belong to multiple projects.

Yes, we have this in mind.

@subotic (Collaborator, Author) commented Jun 24, 2019

Having a way to clean a whole project (to replace it with a new trig) would also help.

In the short term, this will be the job of knoractl. Later, there will be an Admin WebApp that should provide more convenience.

Unfortunately, the fastest way I have found to replace all the graphs of a single project is by doing these steps (a scripted sketch follows below):

  1. use GraphDB workbench to dump all data WITHOUT the graphs of the project in question
  2. use GraphDB workbench to clear the repository (not remove)
  3. upload the dump from step 1
  4. upload the TriG with the project ontology and data
  5. restart webapi. With a "refresh ontology caches" route, it wouldn't even be necessary to restart webapi; this would be simple to implement.

Also, as step 0, it would be nice to have a GraphDB route to turn on a maintenance mode that doesn't allow any connections. Then, as step 6, the maintenance mode could be turned off. This could actually be implemented fairly easily.

I will do it when I make more progress with knoractl.
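
Here is the scripted sketch mentioned above: a rough, unofficial version of steps 1-4 against GraphDB's RDF4J-style REST API instead of the Workbench. The host, repository name, graph IRIs, and file names are placeholders, and authentication and error handling are left out:

import requests

BASE = "http://localhost:7200/repositories/knora-prod"  # placeholder repository URL
GRAPHS_TO_REPLACE = {
    # Hypothetical graph IRIs of the project being replaced:
    "http://www.knora.org/ontology/0801/beol",
    "http://www.knora.org/data/0801/beol",
}

# Step 1: export every named graph except the ones being replaced.
contexts = requests.get(f"{BASE}/contexts", headers={"Accept": "application/sparql-results+json"})
contexts.raise_for_status()
graphs_to_keep = [
    b["contextID"]["value"]
    for b in contexts.json()["results"]["bindings"]
    if b["contextID"]["value"] not in GRAPHS_TO_REPLACE
]

with open("backup.trig", "wb") as out:
    for graph_iri in graphs_to_keep:
        dump = requests.get(
            f"{BASE}/statements",
            params={"context": f"<{graph_iri}>", "infer": "false"},
            headers={"Accept": "application/x-trig"},
        )
        dump.raise_for_status()
        out.write(dump.content)

# Step 2: clear the repository (drops all statements but keeps the repository itself).
requests.delete(f"{BASE}/statements").raise_for_status()

# Steps 3 and 4: re-upload the backup, then the new project TriG.
for trig_file in ("backup.trig", "new-project.trig"):
    with open(trig_file, "rb") as f:
        upload = requests.post(
            f"{BASE}/statements",
            data=f,
            headers={"Content-Type": "application/x-trig"},
        )
        upload.raise_for_status()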

@loicjaouen (Contributor) commented:

@subotic :

  • How do you do step 2? Dropping graphs sometimes took a ridiculous amount of time (>3h), so now I run graphdb-se-docker-init-knora-prod.sh with the line that loads the data commented out.
  • Actually, how do you do step 1 as well? Is the granularity a graph, or do you go inside the graphs with smart SPARQL requests?

@benjamingeer: it was not even a SPARQL request per se, just editing the class definition in GraphDB's Workbench. Anyway, we come to the same conclusion: edit the dump. That will be fine for a while, but it is not a scalable solution.

@subotic (Collaborator, Author) commented Jun 24, 2019

In the GraphDB Workbench, go to "Explore -> Graphs overview". There you can select what you want to export. Here I select all but the graphs I want to replace and download them as TriG.

[screenshot: "Graphs overview" with all graphs except the ones to replace selected for export]

After that, you press the "Clear repository" button. It drops all graphs immediately. Then load the exported graphs from before.

Go to "Import -> RDF -> Server files". I have mounted an external directory to this folder inside the Docker container. Now I can use an SFTP client to upload the TRIGs to the server and start the import.

[screenshot: "Import -> RDF -> Server files" with the uploaded TriG files]

This process works but is error-prone as each manual step can (and sooner or later will) be messed up.

@loicjaouen (Contributor) commented:

Yes, that's more or less the way I do it too. Even though our whole prod corpus takes minutes to dump and 15 minutes to load, it sometimes takes ages to drop graphs.

To dump the whole corpus, I do:

curl -X GET -H "Accept:application/x-trig" "http://user:pwd@localhost:7200/repositories/knora-prod/statements?infer=false" -o knora-prod-$(date +'%Y-%m-%d').trig
