Manual incremental updating of the Lucene Full-Text-Index #925

Closed
subotic opened this issue Jul 5, 2018 · 15 comments · Fixed by #1362

@subotic (Collaborator) commented Jul 5, 2018

As we don't use the GraphDB Lucene Connector, we need to manually update the Lucene Full-Text-Index.

We do this by running the following SPARQL query after each update:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

According to the docs, we should instead (or additionally) use the following SPARQL query:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {  luc:fullTextSearchIndex luc:addToIndex <resourceIRI> . }

The first query adds all new resources to the index. If an update only changes an existing resource, the index entry for that resource will not be updated. At least, this is how I understand the documentation.
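
For illustration, here is a minimal sketch of how the per-resource update could be sent over HTTP from Python, using the standard SPARQL 1.1 update endpoint that GraphDB exposes under /repositories/<repo>/statements. The host, repository name, and resource IRI below are placeholders, not values taken from this project:

import requests

# Placeholder endpoint: http://<graphdb-host>:7200/repositories/<repository-id>/statements
GRAPHDB_UPDATE_ENDPOINT = "http://localhost:7200/repositories/knora-prod/statements"

def add_resource_to_index(resource_iri: str) -> None:
    # Ask GraphDB's Lucene FTS plugin to (re)index a single resource.
    update = f"""
        PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
        INSERT DATA {{ luc:fullTextSearchIndex luc:addToIndex <{resource_iri}> . }}
    """
    response = requests.post(
        GRAPHDB_UPDATE_ENDPOINT,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
    )
    response.raise_for_status()

# Hypothetical resource IRI, purely for illustration:
add_resource_to_index("http://rdfh.ch/0001/example-resource")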

@subotic (Collaborator, Author) commented Jun 21, 2019

@benjamingeer I don't know if you are aware of this issue.

Also, the first query (updating the index for all new resources) should be automatically run on startup. With the BEOL project, we always have some issues where the index needs to be updated manually. Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore, I load the data by hand using the GraphDB Workbench, and in those cases it would also be helpful if webapi updated the search index at startup.

@subotic added the enhancement (improve existing code or new feature) label on Jun 21, 2019
@benjamingeer commented Jun 21, 2019

Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore

Why not?

it would also be helpful if webapi updated the search index at startup.

Should this be the same thing that HttpTriplestoreConnector.sparqlHttpUpdate does after each update?

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

Should we switch to using the GraphDB Lucene Connector?

@subotic (Collaborator, Author) commented Jun 21, 2019

Also, now that the scripts (graphdb-se-docker-...) for loading the data cannot be used anymore

Why not?

Because we have data that we cannot delete. Changing data in production means adding or exchanging project-specific graphs. The scripts we have delete everything and load test data. Good for development, but not for production. Also, using the scripts requires cloning the repository, which shouldn't be necessary for production. Everything that is necessary for running the stack should be packaged as versioned Docker images. This way we can eliminate a few sources of errors.

it would also be helpful if webapi updated the search index at startup.

Should this be the same thing that HttpTriplestoreConnector.sparqlHttpUpdate does after each update?

Yes, exactly.

Should we switch to using the GraphDB Lucene Connector?

This is what Ontotext recommends. A few years ago, I tried using the GraphDB Lucene Connector but wasn't very successful. We should try it out again. It has a different syntax for search queries, so it is not a quick thing to do.
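
For comparison, here is a rough sketch of what a connector-based search could look like, based on the GraphDB Lucene connector documentation: queries go through a connector instance instead of luc:fullTextSearchIndex. The index name knora_index, the repository URL, and the search term are placeholders, and the connector itself would have to be created and configured first:

import requests

GRAPHDB_QUERY_ENDPOINT = "http://localhost:7200/repositories/knora-prod"  # placeholder

# Connector-style full-text query; "knora_index" is a hypothetical connector instance.
connector_query = """
    PREFIX lucene: <http://www.ontotext.com/connectors/lucene#>
    PREFIX lucene-index: <http://www.ontotext.com/connectors/lucene/instance#>
    SELECT ?entity WHERE {
        ?search a lucene-index:knora_index ;
                lucene:query "Leibniz" ;
                lucene:entities ?entity .
    }
"""

response = requests.post(
    GRAPHDB_QUERY_ENDPOINT,
    data={"query": connector_query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["entity"]["value"])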

@benjamingeer commented:

The scripts we have delete everything and load test data. Good for development, but not for production.

How about if we make better scripts? The repository update framework could be a good basis for that. There's already code there for downloading and uploading named graphs. Would you like to make a list of requirements? :)

A few years ago, I tried using the GraphDB Lucene Connector but wasn't very successful. We should try it out again. It has a different syntax for search queries, so it is not a quick thing to do.

For now I can make a PR so Knora updates the Lucene index on startup, and later we can try again to use the GraphDB Lucene Connector.

@subotic (Collaborator, Author) commented Jun 21, 2019

How about if we make better scripts? The repository update framework could be a good basis for that. There's already code there for downloading and uploading named graphs. Would you like to make a list of requirements? :)

I've started to create a command line tool for this: https://github.com/dhlab-basel/knoractl

But what would be very helpful, and I'm not sure how to do it, is to add an admin route which returns all data (all ontology graphs, the data graph, project, user, permissions) for a certain project as a TriG file. The goal would be to create a project and import data with knora-py, and then download a single TriG, allowing easy deployment.
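
If such a route existed, the download side could look roughly like this. Everything here is hypothetical (route path, host, project IRI, and credentials); it is only meant to illustrate the intended workflow:

import requests
from urllib.parse import quote

KNORA_API = "http://localhost:3333"           # placeholder webapi host
PROJECT_IRI = "http://rdfh.ch/projects/0001"  # hypothetical project IRI

# Hypothetical admin route returning all of a project's graphs as a single TriG file.
url = f"{KNORA_API}/admin/projects/iri/{quote(PROJECT_IRI, safe='')}/AllData"

response = requests.get(url, auth=("admin@example.com", "password"))  # placeholder credentials
response.raise_for_status()

with open("project-export.trig", "wb") as f:
    f.write(response.content)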

We will need to figure out how to also get all the stuff from Sipi.

@benjamingeer commented:

I've started to create a command line tool for this

Wow, it's in C++. Have fun! 😃

add an admin route which returns all data (all ontology graphs, the data graph, project, user, permissions) for a certain project as a TriG file

This I can do.

@benjamingeer commented:

Add admin route to dump project data #1358

@loicjaouen (Contributor) commented:

My experience is that I run the aforementioned sparqlHttpUpdate query after each data upload to our prod server:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }

I must not have specifically run into the issue of modifying an existing value by hand.


Having said that, the content of the comments has changed quite a bit since the issue's title was set.

When it comes to modifying production data, it is not easy to know what the good practices are, which queries and tools exist, which ones are safe, and which ones are still works in progress.

Uploading new projects was not a problem, since we worked out the consistency issues in our existing data corpus when upgrading to Knora v6 (which has a stricter checker).

Updating an existing project still sometimes leads to an exception that blocks the whole GraphDB/Knora stack. So for now, while we can still afford it, I always dump the whole prod GraphDB, generate a replica, test my changes, and make sure that I have a dump I can start from before either applying the changes on prod or deleting prod and reinstalling from the dump.

Last example was:

-               rdfs:subClassOf knora-base:DocumentRepresentation ,
+               rdfs:subClassOf knora-base:Resource ,

Doing this by hand in prod (workbench edit) led to an unrecoverable exception, even though there was no instance of that class.

The story goes like this: you make a change, and at the next restart of Knora you get:

api      | [2019-04-05 00:02:15,507] ERROR - OneForOneStrategy - Failed to connect to triplestore

caused by:

api      | Caused by: java.net.SocketTimeoutException: Read timed out

I didn't get to the bottom of it and did not find the reason for that timeout.

The whole Knora stack is down until you roll back to a previous backup or tested dump (about 15 minutes).

So it would be nice to have:

  • trusted tools for modifying an existing corpus that avoid this situation
  • isolation of projects, so that changing one project cannot freeze everything

But we can live for quite some time with a full dump as a safe rollback.

@benjamingeer commented:

Doing this by hand in prod (workbench edit) led to an unrecoverable exception, even though there was no instance of that class.

If you change something in the triplestore by hand while Knora is running, Knora will get confused. If you want to do this, you have to stop Knora, make the change in the triplestore, then start Knora again.

api | Caused by: java.net.SocketTimeoutException: Read timed out

Was it trying to connect to the triplestore, or to Sipi? Either way, this doesn't necessarily indicate a problem with your data. There are a lot of reasons why the triplestore could take a long time to do something. Try increasing triplestore.query-timeout in application.conf.

@loicjaouen (Contributor) commented:

@benjamingeer: Knora was down; it was at restart, when Knora was trying to initialize itself from data read from the triplestore, that we got this.
Doing the same through a GraphDB dump and a sed command worked without touching the query-timeout settings.
Next time it happens we might dig further; we were already a version of Knora behind and under time constraints ;)

Dumping a whole graph is already a step toward project isolation in the handling of operational data maintenance.

Having a way to clean a whole project (to replace it with a new trig) would also help.

When you dump the data, beware of users that belong to multiple projects.

@benjamingeer commented Jun 24, 2019

Doing the same through a GraphDB dump and a sed command worked without touching the query-timeout settings.

Maybe your SPARQL did something unexpected. For the sake of predictability, I've found it easier to update data by modifying dump files than by writing SPARQL by hand. That's why Knora's repository update framework (in upgrade) works that way: it's a Python script that dumps the repository to files and parses the files using rdflib to modify their contents.
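
To illustrate the pattern (this is only a sketch with placeholder file names, not the actual upgrade framework code), the rdfs:subClassOf change from the example above could be applied to a TriG dump with rdflib like this:

from rdflib import ConjunctiveGraph, Namespace
from rdflib.namespace import RDFS

KNORA_BASE = Namespace("http://www.knora.org/ontology/knora-base#")

# Parse the whole repository dump, including named graphs.
dump = ConjunctiveGraph()
dump.parse("knora-prod-dump.trig", format="trig")  # placeholder file name

# Replace rdfs:subClassOf knora-base:DocumentRepresentation with
# rdfs:subClassOf knora-base:Resource, in whatever named graph it occurs.
to_fix = list(dump.quads((None, RDFS.subClassOf, KNORA_BASE.DocumentRepresentation, None)))
for subj, pred, obj, ctx in to_fix:
    dump.remove((subj, pred, obj, ctx))
    dump.add((subj, RDFS.subClassOf, KNORA_BASE.Resource, ctx))

dump.serialize(destination="knora-prod-dump-fixed.trig", format="trig")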

When you dump the data, beware of users that belong to multiple projects.

Yes, we have this in mind.

@subotic (Collaborator, Author) commented Jun 24, 2019

Having a way to clean a whole project (to replace it with a new trig) would also help.

In the short term, this will be the job of knoractl. Later, there will be an Admin WebApp that should provide more convenience.

Unfortunately, the fastest way I have found to replace all the graphs of a single project is by doing these steps (a scripted sketch follows below):

  1. use GraphDB workbench to dump all data WITHOUT the graphs of the project in question
  2. use GraphDB workbench to clear the repository (not remove)
  3. upload the dump from step 1
  4. upload the TriG with the project ontology and data
  5. restart webapi. With a "refresh ontology caches" route, it wouldn't even be necessary to restart webapi; this would be simple to implement.

Also, as step 0, it would be nice to have a GraphDB route to turn on a maintenance mode that doesn't allow any connections. Then, as step 6, the maintenance mode could be turned off. This could actually be implemented fairly easily.

I will do it when I make more progress with knoractl.
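
Here is the scripted sketch mentioned above: a rough, unofficial version of steps 1-4 against GraphDB's RDF4J-style REST API instead of the Workbench. The host, repository name, graph IRIs, and file names are placeholders, and authentication and error handling are left out:

import requests

BASE = "http://localhost:7200/repositories/knora-prod"  # placeholder repository URL
GRAPHS_TO_REPLACE = {
    # Hypothetical graph IRIs of the project being replaced:
    "http://www.knora.org/ontology/0801/beol",
    "http://www.knora.org/data/0801/beol",
}

# Step 1: export every named graph except the ones being replaced.
contexts = requests.get(f"{BASE}/contexts", headers={"Accept": "application/sparql-results+json"})
contexts.raise_for_status()
graphs_to_keep = [
    b["contextID"]["value"]
    for b in contexts.json()["results"]["bindings"]
    if b["contextID"]["value"] not in GRAPHS_TO_REPLACE
]

with open("backup.trig", "wb") as out:
    for graph_iri in graphs_to_keep:
        dump = requests.get(
            f"{BASE}/statements",
            params={"context": f"<{graph_iri}>", "infer": "false"},
            headers={"Accept": "application/x-trig"},
        )
        dump.raise_for_status()
        out.write(dump.content)

# Step 2: clear the repository (drops all statements but keeps the repository itself).
requests.delete(f"{BASE}/statements").raise_for_status()

# Steps 3 and 4: re-upload the backup, then the new project TriG.
for trig_file in ("backup.trig", "new-project.trig"):
    with open(trig_file, "rb") as f:
        upload = requests.post(
            f"{BASE}/statements",
            data=f,
            headers={"Content-Type": "application/x-trig"},
        )
        upload.raise_for_status()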

@loicjaouen (Contributor) commented:

@subotic :

  • How do you do step 2? Dropping graphs sometimes took a ridiculous amount of time (>3h), so now I run graphdb-se-docker-init-knora-prod.sh with the line that loads the data commented out.
  • Actually, how do you do step 1 as well? Is the granularity a graph, or do you go inside the graphs with smart SPARQL requests?

@benjamingeer: it was not even a SPARQL request per se, just editing the class definition in GraphDB's Workbench. Anyway, we come to the same conclusion: edit the dump. That will be fine for a while, but it is not a scalable solution.

@subotic (Collaborator, Author) commented Jun 24, 2019

In the GraphDB Workbench, go to "Explore -> Graphs overview". There you can select what you want to export. Here I select all but the graphs I want to replace and download them as TriG.

[screenshot: "Graphs overview" with all graphs except the ones to replace selected for export]

After that, you press the "Clear repository" button. It drops all graphs immediately. Then load the exported graphs from before.

Go to "Import -> RDF -> Server files". I have mounted an external directory to this folder inside the Docker container. Now I can use an SFTP client to upload the TRIGs to the server and start the import.

[screenshot: "Import -> RDF -> Server files" with the uploaded TriG files]

This process works but is error-prone as each manual step can (and sooner or later will) be messed up.

@loicjaouen (Contributor) commented:

Yes, that's more or less the way I do it too. Even though our whole prod corpus takes minutes to dump and 15 minutes to load, it sometimes takes ages to drop graphs.

To dump the whole corpus, I do:

curl -X GET -H "Accept:application/x-trig" "http://user:pwd@localhost:7200/repositories/knora-prod/statements?infer=false" -o knora-prod-$(date +'%Y-%m-%d').trig
