-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hibernate-search] Implement indexing and search #6283
base: hibernate-search
Are you sure you want to change the base?
Conversation
Great work! Can you elaborate (short high level overview) how the metadata is indexed now in general? (For reference: #4266) |
Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext() | ||
.getRequestParameterMap(); | ||
if (parameters.containsKey("input") && StringUtils.isBlank(filterInEditMode)) { | ||
filterInEditMode = parameters.get("input"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are taking over any argument from the outside without validating that the given value is correct or at least is in range of the expectations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the user can enter any string into the search slot. The input is not “checked”. But of course, if the user has entered nonsense, the search will not find any results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this clarification but I expected something different as I read the method, method content and the used written class variable filterInEditMode
. Handling of this class variable filterInEditMode
is strange in this class but this is a different story.
The metadata is stored in a text field that is indexed. The search is later carried out on the text field. "Pseudo words" are used to search on certain metadata fields, i.e. if the word Berlin is in the TitleDocMain field in the metadata, a pseudo word "titledocmainqberlin" is also indexed. If a user later enters "TitleDocMain:Berlin" in the search field, this is converted into the pseudo word and searched for. Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this). |
c052fc4
to
5bee9cb
Compare
Thanks a lot, this is helpful! Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label? <key id="docType" use="docType">
<label>Document Type</label>
<label lang="de">Dokumenttyp</label>
</key> And those combined terms (LABEL+VALUE) are all stored in the index? What happens if somebody changes the german label of a metadata key in the ruleset? |
* Creates the keywords for searching in correction messages. This is a | ||
* double-truncated search, i.e. any substring. | ||
* | ||
* @param comments | ||
* the comments of a process | ||
* @return keywords | ||
*/ | ||
private static final Set<String> initCommentKeywords(List<Comment> comments) { | ||
Set<String> tokens = new HashSet<>(); | ||
for (Comment comment : comments) { | ||
String message = comment.getMessage(); | ||
if (StringUtils.isNotBlank(message)) { | ||
for (String splitWord : splitValues(message)) { | ||
String word = normalize(splitWord); | ||
for (int i = 0; i < word.length(); i++) { | ||
for (int j = i + 1; j <= word.length(); j++) { | ||
tokens.add(word.substring(i, j)); | ||
} | ||
} | ||
} | ||
} | ||
} | ||
return tokens; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the reasoning here? Do i read this correctly and you are indexing every possible substring of all the words in a comment? Is this really necessary? Normal indexing should make it possible to search by word. Is this not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wiki page requires:
Eine Suche nach
Meldung
findet alle Vorgänge, die einen Kommentar mit einer Nachricht wieDas ist eine Korrekturmeldung!
haben.
To achieve this, we need both-side subword search. I only follow the expected behaviour here. I would prefer a simple-word search, too, but this is what the requirement actually looks like. Maybe we can discuss this, if it is really needed, as it inflates the index. I don’t know a clever way of splitting a German compound-word to its single words, to index just these. (This is possible, but it would be a sophisticated task of computational linguistics, typically involving a dictionary file, too, that I don’t have.)
Yes.
The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!) |
I understand. I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label. I do not want to increase your workload too much, but would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that. |
A more general remark here. I have not really tested anything so far, but it appears to me that almost everything gets indexed. My assumption would be that functionality like displaying list of processes (with all the necessary icons, which require lookups) is faster from the index since we can store data there in denormalized form. It is nice, that we can run Kitodo without a running index for certain functionality but my intuition would be to always use the index if that gives us an edge over database-based retrieval. Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval) This is not necessarily adressed at the current PR, but more a general remark which can be considered. |
In my experience, rulesets are rarely changed at all during a running project, except perhaps the addition of a field; and I know librarians are very good at not making typos.
Perhaps it would be interesting to discuss this case among users. Such a change can be introduced later. |
Yes, a lot of indexing is still going on at the moment and I don't think that's necessary. But the task here was not to make more than necessary changes, but to keep the application as most as it was before. But I think that in a separate development, it will be possible to clean this up so that in the end only processes and tasks are indexed, as only these contain metadata. The other objects are never searched, so they don’t do something sensible in the index by now.
No, it's exactly the other way round: In Production 3.7, the display is taken from the index alone, as you describe here as a wish. And it is slow. In this version, this behavior was exactly inverted and now everything except the keyword search is taken from the database. On my developer laptop, the application has become significantly faster as a result. Don't ask me why it behaves like this. This is another example of how performance is less about well-intentioned programming than about actual measurements. I suspect that both database queries and restoring Java in-memory objects from database rows are so much more sophisticated, having been researched for several decades longer, that it beats out indexing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a code only reading review. I'm not sure if the word splitting for any kind of indexing type and there content is really needed to be done on our side - this should be done on an internal way in ES/OS itself.
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/beans/IndexingKeyworder.java
Outdated
Show resolved
Hide resolved
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/persistence/BaseDAO.java
Show resolved
Hide resolved
Kitodo/src/test/java/org/kitodo/production/services/command/CommandServiceTest.java
Outdated
Show resolved
Hide resolved
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/beans/IndexingKeyworder.java
Outdated
Show resolved
Hide resolved
96af397
to
b64a28a
Compare
@@ -470,6 +472,7 @@ public List<Process> loadData(int offset, int limit, String sortField, SortOrder | |||
.collect(Collectors.toList()); | |||
query.restrictToProjects(projectIDs); | |||
query.defineSorting(SORT_FIELD_MAPPING.get(sortField), sortOrder); | |||
query.performIndexSearches(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an observation: The strategy here is to retrieve the process IDs for which a specific condition e.g. stepopen:Archivierung
is met from the Search Index. Those IDs are used to extend the sql query with a IN-statement containing all the IDs retrieved from the search index. This can lead to REALLY large queries when you have 10.000 processes in your database and 9000 of them are in the state "Strukturierung". If you search for that exact stage the SQL IN clause will contain 9000 process IDs. With more processes in Kitodo the queries can become really large, maybe exhausting MySQL here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an observation: The strategy here is to retrieve the process IDs for which a specific condition e.g. stepopen:Archivierung is met from the Search Index. Those IDs are used to extend the sql query with a IN-statement containing all the IDs retrieved from the search index
This was necessary in the past or with ES / OS as ES / OS can only retrieve JSON based data which are no objects and are not so good accessible. For this reason the ES / OS result was given to Hibernate to retrieve the objects from the database which could contain more information and accessing them was a lot easier. As the used ES / OS API could never result more than 10.000 hits there was a limit to the Hibernate query itself.
This is not needed anymore and should be removed. As after the query is send to Hibernate / Hibernate-Search the results are objects which are easy accessible and hold all information which are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a situation, where some information (metadata, but also filter settings like "stepopen") can only be queried using the index. So we need the index to answer the query and therefor we have to query the index somewhere in the process. In my review i will try to concentrate on functionality so that we can focus on performance later, but the general remark i have is this:
We have seen in different places (e.g. #4378) that raw SQL / dedicated index queries often outperforms Hibernate standard queries, which do sometimes hundred of subsequent queries to fetch data which could be fetched in one go. (If Hibernate supports the optimized SQL query)
The process list search is one of the most used features and it should be optimized as much as possible (within the constraints of Hibernate) in my oppinion. When scrolling through the process list i still see A LOT of SQL queries. And queries like here
Line 35 in b6fb9ba
styleClass="#{ProcessForm.createProcessAsChildPossible(process) ? '' : 'ui-state-disabled'}" |
are executed for every single process leading to the classical n+1 query.
I am wondering wether we can not construct something like a ProcessListDTO Object (or ProcessListObject, this is not about the DTO concept), which contains only the information needed to construct the list. (and the statistics). The objects in the list could already have the necessary properties like hasChildren
precomputed, so there is no reason to do additional database queries to calculate them hundred of times.
And since we are already using the index, why not store the necessary properties in the index? We do not have to store all projects associated with a process in the index, but we should store the information which is needed for display like hasChildren
. If we have this, we ideally only need one query insted of hundreds to get all the data for the list and have good performance. This will also make issues like #6348 less annoying, although we can do something on usability there, too.
The necessary features of Hibernate Search have already been referenced in other places
https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/#projections
https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/index.html#search-mapping-associated
https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/index.html#mapping-reindexing-derivedfrom
I think it is fantastic that we now reduce indexing time by only index the processes and using the Hibernate abstractions simplifies a lot of logic. But when necessary for performance it would be great if we could use dedicated custom logic to achieve the perfomance in places where it is mostly needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic design decision was that all data is loaded from the database, and the index is only used when necessary. This is the right decision for a business system with live data and a few users. For a system with thousands of users and less time-critical (for example a travel portal where it doesn't matter if newly added offers only appear after a few minutes), a completely different design would make sense.
Secondly, with this design we wanted to ensure that the application always works, even without an available index. There is no downtime in productivity due to a broken index. The only exception is that you cannot create child processes under existing parent processes, and most search queries will remain empty.
I would like to assure you at this point that I have thought this through very carefully and made this decision to the best of my knowledge and belief.
Technically, only the IDs are queried here, the objects are not loaded. I was very disappointed to find that such functionality is not available in Hibernate Search in the Hibernate Query Language. I had expected that it would be possible to search on indexed fields using a specific HQL keyword or syntax, but that is not the case, I have not found anything like that. So here the IDs of the hits from the index query are passed in HQL; yes, this can possibly cause queries that are several KB in size in some cases, but should not be a problem on a server with GBs of memory. Comparing integers with IDs for the database engine should not be a big deal either.
The many queries when building the process list page are a design flaw. The form classes that underpin the JSP pages should actually be initialized when the page is called up or when a user action is taken. The fact that a getter from the JSP triggers a network action is not the right solution. This is certainly something that we have already inherited from the legacy code and that we should clean up at some point. I have often thought about this. But it should be seen as a separate cause-effect field and not part of this development.
Requirement checklist; as discussed in the Community Board:
Works in general. Some problems a) Searching in the top left box does not reset the process list filter, but appends. A search should reset the current search.
a) Searching for The same thing sometimes happens when searching for After that it is not really possible to return to the process list without logging out. b) The requirements outlined above seem to be working
|
b0bb515
to
c8a653b
Compare
Issue #5760 2c), 2d), 2e) and 2f)
Follow-up pull request to #6218 (immediate diff)
The filters work. Metadata is indexed and searchable. The metadata search syntax works as documented in the wiki.