LuceneSearchService - sort order of German 'Umlaute' #426

chrisw14 · 2018-09-23T13:55:41Z

Hi,
is it possible to tell Lucene in imixs-workflow to sort the German umlauts correctly? For example, 'Ü' should be treated as 'Ue', and so on.

I think there are possibilities for lucene but I don't know how to implement it in imixs-workflow:
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilterFactory.html
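
To make the intended ordering concrete, here is a minimal plain-Java sketch of the "'Ü' should be 'Ue'" idea. It is independent of Lucene and imixs-workflow; the class name UmlautExpansion and its methods are hypothetical and only illustrate building a sort key by expanding the umlauts.

```java
import java.util.Arrays;
import java.util.Comparator;

public class UmlautExpansion {

    // Expands German umlauts and sharp s, e.g. 'Ü' (\u00DC) -> "Ue".
    public static String expand(String value) {
        return value
                .replace("\u00C4", "Ae")  // Ä
                .replace("\u00D6", "Oe")  // Ö
                .replace("\u00DC", "Ue")  // Ü
                .replace("\u00E4", "ae")  // ä
                .replace("\u00F6", "oe")  // ö
                .replace("\u00FC", "ue")  // ü
                .replace("\u00DF", "ss"); // ß
    }

    // Sorts a copy of the input by the expanded form of each word.
    public static String[] sortExpanded(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Comparator.comparing(UmlautExpansion::expand));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "Zebra", "\u00DCbung", "Uhr", "Udo" };
        System.out.println(String.join(", ", sortExpanded(words)));
    }
}
```

With this key, "Übung" is compared as "Uebung" and therefore lands between "Udo" and "Uhr" instead of after "Zebra".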

Thank you!

rsoika · 2018-09-23T16:17:19Z

This is an interesting question which I can't answer for now.
In Imixs-Workflow we have this property

"lucence.analyzerClass" which defaults to the org.apache.lucene.analysis.standard.ClassicAnalyzer

You can override this property in the imixs.properties file.

As far as I understand, if you want to use the GermanNormalizationFilter you need to implement your own Analyzer. For example:

package my.project;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Custom analyzer: standard tokenization followed by German normalization
public class MyCustomAnalyzer extends Analyzer {
	@Override
	protected TokenStreamComponents createComponents(String fieldName) {
		// split the text into tokens
		StandardTokenizer standardTokenizer = new StandardTokenizer();
		// normalize German characters, e.g. 'ä', 'ö', 'ü' -> 'a', 'o', 'u' and 'ß' -> 'ss'
		GermanNormalizationFilter germanNormalizationFilter = new GermanNormalizationFilter(standardTokenizer);
		return new TokenStreamComponents(standardTokenizer, germanNormalizationFilter);
	}
}

Then you can activate the custom analyzer in your project with an imixs.properties entry:

# add custom lucene analyzer
lucence.analyzerClass=my.project.MyCustomAnalyzer

But I am not sure whether this is a workable solution. Can you check this?

chrisw14 · 2018-09-29T19:16:53Z

I tried it out but can't find the imported classes you mentioned:
org.apache.lucene.analysis.Analyzer;
org.apache.lucene.analysis.de.GermanNormalizationFilter;
org.apache.lucene.analysis.standard.StandardTokenizer;

In my pom I have:

	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-common</artifactId>
		<version>6.3.0</version>
	</dependency>

Where can I find these classes?

rsoika · 2018-09-30T07:58:06Z

The lucene dependencies needed are:

			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-core</artifactId>
				<version>${lucene.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-analyzers-common</artifactId>
				<version>${lucene.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-queryparser</artifactId>
				<version>${lucene.version}</version>
			</dependency>

chrisw14 · 2018-09-30T09:50:34Z

Yes, I already had these dependencies in the pom.xml, but the packages mentioned above cannot be found.
It looks like this:

rsoika · 2018-09-30T12:48:49Z

Check whether you can build your project with mvn install. It looks like you have a project setup problem in Eclipse.

chrisw14 · 2018-09-30T13:34:17Z

mvn install built the project successfully, but the imports in the class still cannot be resolved.
Several lucene packages are included in the java build path:

rsoika · 2018-10-01T07:54:59Z

So it's an Eclipse problem. You can try the project command "Maven -> Update Project".
If this does not help, please ask in the Eclipse community. Let's concentrate on the Lucene question here.

chrisw14 · 2018-10-01T15:25:13Z

Okay, I tried it despite the compile errors and added the line to the imixs.properties file, but the sort order of the German umlauts hasn't changed. There were no other errors.

Now I added a line in the class with an output in the console but it never appears, so I think the class has never been called.

rsoika · 2018-10-01T20:23:36Z

Yes you are right. This is indeed a bug of the LuceneUpdateService. I opened a new issue #429

rsoika · 2018-10-01T20:25:45Z

I have fixed this now in release 4.4.1-SNAPSHOT. Please check whether this works for you.

chrisw14 · 2018-10-03T11:53:15Z

Yes, the class will be called in 4.4.1-SNAPSHOT.
Then I really have this compiler problem:

java.lang.Error: Unresolved compilation problems: 
	The import org.apache.lucene.analysis.Analyzer cannot be resolved
	The import org.apache.lucene.analysis.de.GermanNormalizationFilter cannot be resolved
	The import org.apache.lucene.analysis.standard.StandardTokenizer cannot be resolved
	Analyzer cannot be resolved to a type
	TokenStreamComponents cannot be resolved to a type
	The method createComponents(String) of type LuceneGerman must override or implement a supertype method
	StandardTokenizer cannot be resolved to a type
	StandardTokenizer cannot be resolved to a type
	GermanNormalizationFilter cannot be resolved to a type
	GermanNormalizationFilter cannot be resolved to a type
	TokenStreamComponents cannot be resolved to a type

Okay, now I added the jars manually to the build path.
The class is called but the sort order is the same as before!?

rsoika · 2018-10-04T06:51:11Z

Concerning the compilation problems: I think you need to check your project and IDE setup.

Concerning the sort order problem: what does the call of the find method look like now? And what does 'lucence.indexFieldListNoAnalyze' look like in your imixs.properties? This is an important setting for sorting. Is your sort field listed there?

chrisw14 · 2018-10-04T15:12:10Z

Okay, I will set it up again.

The call of the find method:
results = workflowService.getDocumentService().find(pageQuery, -1, 0, "title", false);
And of course I added "title" to lucence.indexFieldListNoAnalyze and rebuilt the Lucene index via the imixs-admin panel.

rsoika · 2018-10-04T17:07:27Z

Ok, do you know the Lucene tool 'luke'?
https://github.com/DmitryKey/luke

This is a very cool application which allows you to test your lucene index with different settings. Maybe we can figure out if the index is correctly written.

chrisw14 · 2018-10-05T15:19:35Z

I tested the sort order with 'luke' and the analyzer org.apache.lucene.analysis.de.GermanAnalyzer.
The sort order is the same as in my application: "Ü" comes after "Z".
Maybe org.apache.lucene.analysis.de.GermanAnalyzer doesn't support the German umlauts?
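
The observed order ("Ü" after "Z") is what a plain comparison of the raw characters produces, independent of Lucene. The following sketch uses only the JDK (java.text.Collator, not the Lucene analyzers discussed above) to contrast that raw order with a locale-aware German order; the class name UmlautOrderDemo is hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class UmlautOrderDemo {

    // Sorts a copy of the input by plain String comparison (UTF-16 code units).
    public static String[] sortByCodeUnits(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the input with the JDK collator for the German locale.
    public static String[] sortGerman(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.GERMAN));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "Zebra", "\u00DCbung", "Apfel" }; // \u00DC is 'Ü'
        System.out.println(String.join(", ", sortByCodeUnits(words)));
        System.out.println(String.join(", ", sortGerman(words)));
    }
}
```

The first call places "Übung" last because 'Ü' (U+00DC) has a higher code unit than 'Z'; the collator places it between "Apfel" and "Zebra".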

rsoika · 2018-10-05T16:35:48Z

But as I understand it, the 'Ü' should be replaced with 'Ue'. So the fields should not contain a value with 'Ü'. The GermanAnalyzer should replace the tokens. Can you check this with Luke? If you know the uniqueid, you can look up the Lucene document in Luke with all its items.

chrisw14 · 2018-10-05T17:23:56Z

The doc value is with "Ü" and the stored value is not available (only available for $uniqueid):

And should it be possible to sort by an item whose name starts with a '$'? That doesn't work for me either.

rsoika · 2018-10-05T21:09:15Z

Yes, of course, that's right: the fields are not stored in Lucene, so we cannot look into that detail.....

rsoika · 2018-10-05T21:10:21Z

Sorting by item names starting with "$" works. For example, in the admin client you can verify this by sorting the result by '$created'.

rsoika · 2018-10-05T21:14:00Z

chrisw14 · 2018-10-05T21:28:20Z

Okay, sorting will work if all letters of the key are in lower case.
But you have no umlauts in your application?

rsoika · 2018-10-05T21:42:00Z

hm... this all seems to be not so easy...

Maybe we should focus more on lines 242-248 in LuceneSearchService.search().


			if (sortOrder != null) {
				// sorted by sortoder
				logger.finest("......lucene result sorted by sortOrder= '" + sortOrder + "' ");
				// MAX_SEARCH_RESULT is limiting the total number of hits
				collector = TopFieldCollector.create(sortOrder, maxSearchResult, false, false, false);

			}

Maybe the TopFieldCollector needs to be set up with the correct filter class.....

If I remember correctly, search and sorting in Lucene are two separate processes. If so, it would not make sense to do this filtering when the index is written....
The Sort object is only used to sort the result after the search result has been computed....

chrisw14 · 2018-10-06T11:08:53Z

Do I have to add the analyzer to imixs-admin as well? That is where I create the Lucene index.

Do I really need an analyzer? Or how can I use a replacement of the umlauts like this?

                result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
                result = new ISOLatin2AccentFilter(result);
                result = new org.apache.lucene.analysis.LowerCaseFilter(result);

chrisw14 · 2018-10-13T16:36:41Z

> If I remember correctly, search and sorting in Lucene are two separate processes.

I think this is right because the search doesn't work using MyCustomAnalyzer.
How can I manipulate the sort order without the search? What is your idea with the TopFieldCollector?

Sorting numbers is in my opinion also wrong. It looks like this:
1, 100, 101, 2, 3, ...
I had expected the order: 1, 2, 3, 100, 101
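
This order is what you get whenever numbers are compared as text rather than as numeric values. A small JDK-only sketch (no Lucene involved; the class name NumberOrderDemo is hypothetical) reproduces both orders:

```java
import java.util.Arrays;
import java.util.Comparator;

public class NumberOrderDemo {

    // Sorts a copy of the values as text, character by character.
    public static String[] sortAsText(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the values by their numeric value.
    public static String[] sortAsNumbers(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return copy;
    }

    public static void main(String[] args) {
        String[] values = { "3", "101", "2", "100", "1" };
        System.out.println(String.join(", ", sortAsText(values)));
        System.out.println(String.join(", ", sortAsNumbers(values)));
    }
}
```

Sorted as text, "100" and "101" come before "2" because the comparison stops at the first differing character; sorted as numbers, the result is 1, 2, 3, 100, 101.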

rsoika · 2018-10-13T18:42:56Z

I read about the TopFieldCollector and it does not look like this class is responsible for the sort order. So I was on the wrong path...

I am not sure if Lucene is able to sort numbers. Have you already asked that question in the Lucene forum?

chrisw14 · 2018-12-19T18:48:03Z

Now I got an answer to this topic:
https://stackoverflow.com/questions/53438426/apache-lucene-sorting-numbers-and-german-umlauts/53837207#53837207

Is it possible to use this with imixs workflow? Thank you.

rsoika · 2018-12-20T11:26:50Z

Concerning the sorting by number, I think the problem for now is the LuceneUpdateService:

imixs-workflow/imixs-workflow-engine/src/main/java/org/imixs/workflow/engine/lucene/LuceneUpdateService.java

Line 684 in 22dce06

doc.add(new SortedDocValuesField(itemName, new BytesRef(sValue)));

Here we create the index based on a SortedDocValuesField. Maybe we can use in some cases a SortedNumericDocValuesField.

In this case we could try the following:

- check the value type (e.g. String, Integer, Double, ...)
- create the corresponding Lucene field type (if a mapping is possible)

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/document/Field.html

We have to see whether this works as expected.
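
The two steps above can be sketched roughly as follows. This is only an illustration in plain Java: the class SortFieldMapping and the enum SortFieldKind are hypothetical and not part of Imixs-Workflow or Lucene. The assumption is that NUMERIC would be indexed as a SortedNumericDocValuesField and TEXT as a SortedDocValuesField; floating-point values are left as TEXT here because they would first need a sortable long encoding.

```java
public class SortFieldMapping {

    // Hypothetical marker for the kind of doc-values field to create.
    public enum SortFieldKind { NUMERIC, TEXT }

    // Step 1: check the value type and decide whether a numeric mapping is possible.
    public static SortFieldKind kindFor(Object itemValue) {
        if (itemValue instanceof Integer || itemValue instanceof Long
                || itemValue instanceof Short || itemValue instanceof Byte) {
            return SortFieldKind.NUMERIC;
        }
        // Strings, dates, floating-point values and everything else stay text in this sketch.
        return SortFieldKind.TEXT;
    }

    // Step 2 (numeric case): the long value that a numeric doc-values field would hold.
    public static long toSortableLong(Object itemValue) {
        if (kindFor(itemValue) != SortFieldKind.NUMERIC) {
            throw new IllegalArgumentException("not an integral number: " + itemValue);
        }
        return ((Number) itemValue).longValue();
    }

    public static void main(String[] args) {
        System.out.println(kindFor(100));   // integral value
        System.out.println(kindFor("100")); // same digits, but stored as a String
    }
}
```

Note that a value like "100" stored as a String would still be treated as text, so the item has to carry a numeric type for this mapping to help.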

rsoika · 2018-12-20T23:11:24Z

I think I have now found a working solution for this problem.

I added a new CDI bean called LuceneItemAdapter. This bean converts the item values and also creates the SortedDocValuesFields.

With this solution your application can simply replace the adapter with a CDI alternative.
So in your case you have to move the solution based on the ICU library into your alternative CDI bean.

And hopefully we can integrate your bean later back into the Imixs-Workflow project.
Are you familiar with CDI alternatives? You have to annotate your bean like this:

@Named("luceneItemAdapter")
@Alternative
public class MyGermanLuceneItemAdapter extends LuceneItemAdapter {
....

and add your bean to the beans.xml of your application, using its fully qualified class name (com.foo is just a placeholder package):

	<alternatives>
		<class>com.foo.MyGermanLuceneItemAdapter</class>
	</alternatives>

chrisw14 · 2018-12-21T16:23:55Z

Oh great. Is there a 4.5.0-SNAPSHOT with the LuceneItemAdapter?

org.imixs.workflow:imixs-workflow-engine:jar:4.5.0-SNAPSHOT is missing, no dependency information available

As far as I understood the implementation, it should work like this?

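import com.ibm.icu.text.Collator; // ICU Collator (not java.text.Collator), as required by ICUCollationDocValuesField
import com.ibm.icu.util.ULocale;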

import javax.enterprise.inject.Alternative;
import javax.inject.Named;

import org.apache.lucene.collation.ICUCollationDocValuesField;

@Named("luceneItemAdapter")
@Alternative
public class MyGermanLuceneItemAdapter extends LuceneItemAdapter {
	
	@Override
	public String convertItemValue(Object itemValue) {
		final Collator instance = Collator.getInstance(ULocale.GERMAN);
		final ICUCollationDocValuesField contents = new ICUCollationDocValuesField("contents", instance);
		contents.setStringValue(super.convertItemValue(itemValue));
		return contents.toString();
	}
}

> Here we create the index based on a SortedDocValuesField. Maybe we can use in some cases a SortedNumericDocValuesField.

I tried sorting the $taskid which is really represented by a String?!

rsoika · 2018-12-21T20:41:15Z

I deployed the snapshot now.

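Let's concentrate on the German umlaut problem. I think you only need to override the method adaptSortableItemValue:

	@Override
	public SortedDocValuesField adaptSortableItemValue(String itemName, Object itemValue) {
		String stringValue = convertItemValue(itemValue);
		// ICU collator (com.ibm.icu.text.Collator) with German rules
		final Collator collator = Collator.getInstance(ULocale.GERMAN);
		// store the collation key bytes instead of the raw string, so that the
		// byte-wise doc-values order follows the German collation order
		RawCollationKey key = collator.getRawCollationKey(stringValue, null);
		return new SortedDocValuesField(itemName, new BytesRef(key.bytes, 0, key.size));
	}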
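
The general idea of sorting by a collation key rather than by the raw string can be tried out without Lucene or ICU. The sketch below swaps in the JDK's java.text.Collator for the ICU collator and compares the key bytes as unsigned bytes, which is how byte values in a SortedDocValuesField are ordered. The class name CollationKeyDemo is hypothetical, and Arrays.compareUnsigned requires Java 9 or later.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyDemo {

    private static final Collator GERMAN = Collator.getInstance(Locale.GERMAN);

    // Binary sort key of a value according to the German collator.
    public static byte[] sortKey(String value) {
        return GERMAN.getCollationKey(value).toByteArray();
    }

    // Compares two values by their sort keys, byte by byte, treating bytes as unsigned.
    public static int compareByKey(String a, String b) {
        return Arrays.compareUnsigned(sortKey(a), sortKey(b));
    }

    public static void main(String[] args) {
        String uebung = "\u00DCbung"; // "Übung"
        // raw string comparison: positive, 'Ü' sorts after 'Z'
        System.out.println(uebung.compareTo("Zebra") > 0);
        // key comparison: negative, "Übung" sorts before "Zebra"
        System.out.println(compareByKey(uebung, "Zebra") < 0);
    }
}
```

Because the stored bytes already encode the collation order, nothing has to change on the search side; only the value written to the index differs, so the index must be rebuilt after switching.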

chrisw14 · 2018-12-22T15:02:19Z

I tried it out with the following method, but the sort order hasn't changed, even after rebuilding the index.

	@Override
	public SortedDocValuesField adaptSortableItemValue(String itemName, Object itemValue) {
		logger.severe("GermanItemValueAdapter");
		String stringValue = convertItemValue(itemValue);
		final Collator instance = Collator.getInstance(ULocale.GERMAN);
		final ICUCollationDocValuesField contents = new ICUCollationDocValuesField("contents", instance);
		contents.setStringValue(stringValue);
		return new SortedDocValuesField(itemName, new BytesRef(instance.toString()));
	}

The logger message doesn't appear in the console.

rsoika added the question label Sep 23, 2018

rsoika mentioned this issue Oct 1, 2018

LuceneUpdateService - lucence.analyzerClass is not read by indexwriter #429

Closed

rsoika mentioned this issue Dec 20, 2018

LuceneUpdateService - provide ItemAdapter class #463

Closed

rsoika mentioned this issue Dec 6, 2019

Lucene rebuild index #606

Closed

rsoika closed this as completed Feb 3, 2021

imixs locked and limited conversation to collaborators Feb 3, 2021


This issue was moved to a discussion.
