Manual incremental updating of the Lucene Full-Text-Index #925
@benjamingeer I don't know if you are aware of this issue. Also, the first query (updating the index for all new resources) should be run automatically on startup. With the BEOL project, we always have issues where the index needs to be updated manually. Also, now that the scripts (…)
Why not?
Should this be the same thing as this query?

```sparql
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }
```

Should we switch to using the GraphDB Lucene Connector?
Because we have data that we cannot delete. Changing data in production means adding or exchanging project-specific graphs. The scripts we have delete everything and load test data. That is good for development, but not for production. Also, using the scripts requires cloning the repository, which shouldn't be necessary for production. Everything that is necessary for running the stack should be packaged as versioned Docker images. This way we can eliminate a few sources of errors.
Yes, exactly.
This is what Ontotext recommends. A few years ago, I tried using the GraphDB Lucene Connector but wasn't very successful. We should try it out again. It has a different syntax for search queries, so it is not a quick thing to do.
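For comparison, a search with the connector looks roughly like this, following the GraphDB Lucene connector documentation (the connector instance name `my_index` and the search term are made up for illustration):

```sparql
PREFIX conn: <http://www.ontotext.com/connectors/lucene#>
PREFIX conn-inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Instead of matching the luc:fullTextSearchIndex predicate directly,
# the connector is queried through a search "request" resource.
SELECT ?entity ?score WHERE {
  ?search a conn-inst:my_index ;        # my_index is a hypothetical connector instance
          conn:query "pumpernickel" ;   # Lucene query string
          conn:entities ?entity .       # entities matching the query
  ?entity conn:score ?score .           # relevance score of each match
}
```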
How about if we make better scripts? The repository update framework could be a good basis for that. There's already code there for downloading and uploading named graphs. Would you like to make a list of requirements? :)
For now I can make a PR so Knora updates the Lucene index on startup, and later we can try again to use the GraphDB Lucene Connector.
I've started to create a command line tool for this: https://github.com/dhlab-basel/knoractl. But what would be very helpful, and I'm not sure how to do it, is to add an admin route that returns all data (all ontology graphs, the data graph, project, users, permissions) for a certain project as a TriG file. The goal would be to create a project and import its data with it. We will need to figure out how to also get all the stuff from Sipi.
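Such a route doesn't exist yet; the shape of it could be something like the following sketch (the path, parameter, and port are hypothetical, not an existing Knora API):

```bash
# Hypothetical admin route: export all named graphs of one project as TriG.
curl -X GET -H "Accept: application/trig" \
  "http://localhost:3333/admin/projects/<projectIri>/data" \
  -o project-dump.trig
```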
Wow, it's in C++. Have fun! 😃
This I can do.
Add admin route to dump project data #1358
My experience is to run the said command:

```sparql
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }
```

I must not have touched specifically the issue of modifying an existing value by hand. Having said that, the content of the comments has changed quite a bit since the issue's title was set. When it comes to modifying production data, it is not easy to know what the good practices are, what queries or tools exist, and which ones are safe as opposed to works in progress. Uploading new projects was not a problem, since we had worked out the consistency issues in our existing data corpus when upgrading to Knora v6 (which has a stricter checker). Updating an existing project still sometimes leads to an exception that blocks the whole GraphDB/Knora stack, so for now, as we can still afford it, I always dump the whole prod GraphDB, generate a replica, test my changes, and make sure that I have a dump I can start from, before either applying the changes on prod or deleting prod and reinstalling from the dump. The last example was:

```diff
- rdfs:subClassOf knora-base:DocumentRepresentation ,
+ rdfs:subClassOf knora-base:Resource ,
```

Doing this by hand in prod (Workbench edit) led to an unrecoverable exception, even though there was no instance of that class. The story goes like this: you make a change, and at the next restart of Knora you get an exception caused by a timeout. I didn't dig to the bottom of the story and did not find the reason for that timeout. The whole Knora is down until you roll back to the previous backup or a tested dump (15 min). So it would be nice to have safer tools for this.
But we can live for quite some time with a full dump as a safe roll-back.
If you change something in the triplestore by hand while Knora is running, Knora will get confused. If you want to do this, you have to stop Knora, make the change in the triplestore, then start Knora again.
Was it trying to connect to the triplestore, or to Sipi? Either way, this doesn't necessarily indicate a problem with your data. There are a lot of reasons why the triplestore could take a long time to do something. Try increasing the timeout.
@benjamingeer: Knora was down; it was at restart, when Knora was trying to initialise itself from data read from the triple store, that we got this. Dumping a whole graph is already a step toward project isolation in the handling of operational data maintenance. Having a way to clear a whole project (to replace it with a new TriG) would also help. When you dump the data, beware of users that belong to multiple projects.
Maybe your SPARQL did something unexpected. For the sake of predictability, I've found it easier to update data by modifying dump files than by writing SPARQL by hand. That's why Knora's repository update framework works by transforming a downloaded dump rather than by updating the live repository in place.
Yes, we have this in mind.
In the short term, this will be the job of knoractl. Later, there will be an admin web app that should be more convenient. Unfortunately, the fastest way I have found to replace all the graphs of a single project is by doing these steps:

0. Export all graphs except the ones belonging to the project I want to replace.
1. Clear the repository.
2. Re-import the exported graphs together with the project's new graphs.
Also, it would be nice to be able to do step 0 with a GraphDB route. I will do it when I make more progress with knoractl.
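As a sketch, GraphDB's RDF4J-style REST API can already export named graphs, which could serve as that route (the repository name and graph IRI below are examples; one call per named graph):

```bash
# Export one named graph as TriG via the RDF4J statements endpoint.
# The context parameter takes a URL-encoded IRI in angle brackets.
curl -X GET -H "Accept: application/x-trig" \
  "http://localhost:7200/repositories/knora-prod/statements?infer=false&context=%3Chttp%3A%2F%2Fwww.knora.org%2Fdata%2F0001%2Fsomeproject%3E" \
  -o someproject.trig
```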
@subotic: How exactly do you do these steps in GraphDB?
@benjamingeer: it was not even a SPARQL request per se, just editing the class definition in GraphDB's Workbench. Anyway, we come to the same conclusion: edit the dump. That will be fine for a while, but it is not a scalable solution.
In the GraphDB Workbench, go to "Explore -> Graphs overview". There you can select what you want to export. Here I select all but the graphs I want to replace and download them as TriG. After that, you press the "Clear repository" button, which drops all graphs immediately. Then load the exported graphs from before: go to "Import -> RDF -> Server files". I have mounted an external directory to this folder inside the Docker container, so I can use an SFTP client to upload the TriG files to the server and start the import. This process works but is error-prone, as each manual step can (and sooner or later will) be messed up.
Yes, that's more or less the way I do it too, and even though our whole prod corpus takes minutes to dump and 15 minutes to load, it sometimes takes ages to drop graphs. When I dump the whole corpus, I do:

```bash
curl -X GET -H "Accept:application/x-trig" \
  "http://user:pwd@localhost:7200/repositories/knora-prod/statements?infer=false" \
  -o knora-prod-$(date +'%Y-%m-%d').trig
```
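The reverse direction works over the same RDF4J-style endpoints; a sketch using the same repository, where the dump file name follows the dump command above:

```bash
# Delete all statements in the repository (equivalent to "Clear repository").
curl -X DELETE "http://user:pwd@localhost:7200/repositories/knora-prod/statements"

# Re-import a TriG dump; the named graphs are taken from the TriG file itself.
curl -X POST -H "Content-Type: application/x-trig" \
  --data-binary @knora-prod-2019-06-01.trig \
  "http://user:pwd@localhost:7200/repositories/knora-prod/statements"
```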
As we don't use the GraphDB Lucene Connector, we need to manually update the Lucene Full-Text-Index.
We do this by running the following SPARQL query after each update:
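That query (quoted verbatim in the discussion above) is:

```sparql
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
# Index all resources that are not yet in the index.
INSERT DATA { luc:fullTextSearchIndex luc:updateIndex _:b1 . }
```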
According to the docs we should instead/also use the following SPARQL query:
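Presumably this refers to the per-resource variant from GraphDB's full-text-search documentation, something like the following, where `<resourceIri>` is a placeholder for the IRI of the resource to (re)index:

```sparql
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
# <resourceIri> is a placeholder; GraphDB re-indexes the given resource.
INSERT DATA { <resourceIri> luc:addToIndex luc:fullTextSearchIndex . }
```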
The first query adds all new resources to the index. If an update only changes an existing resource, the index entry for that resource will not be updated. At least, this is how I understand the documentation.