
This issue was moved to a discussion.

LuceneSearchService - sort order of German 'Umlaute' #426

Closed
chrisw14 opened this issue Sep 23, 2018 · 31 comments

@chrisw14

Hi,
Is it possible in imixs-workflow to tell Lucene to sort the German umlauts correctly? For example, 'Ü' should be treated like 'Ue', and so on.

I think Lucene offers ways to do this, but I don't know how to use them in imixs-workflow:
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilterFactory.html

Thank you!

@rsoika
Member

rsoika commented Sep 23, 2018

This is an interesting question which I can't answer for now.
In Imixs-Workflow we have the property

"lucence.analyzerClass", which defaults to org.apache.lucene.analysis.standard.ClassicAnalyzer.

You can override this property in the imixs.properties file.

As far as I understand, if you want to use the GermanNormalizationFilter you need to implement your own Analyzer. For example:

package my.project;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyCustomAnalyzer extends Analyzer {
	@Override
	protected TokenStreamComponents createComponents(String fieldName) {
		// split the field content into tokens
		StandardTokenizer standardTokenizer = new StandardTokenizer();
		// normalize German characters (umlauts, 'ß') in each token
		GermanNormalizationFilter germanNormalizationFilter = new GermanNormalizationFilter(standardTokenizer);
		return new TokenStreamComponents(standardTokenizer, germanNormalizationFilter);
	}
}

Then you can activate the custom analyzer in your project with an imixs.properties entry:

# add custom lucene analyzer
lucence.analyzerClass=my.project.MyCustomAnalyzer

But I am not sure if this is a workable solution. Can you check this?

@chrisw14
Author

I tried it out but can't find the imported classes you mentioned:
org.apache.lucene.analysis.Analyzer;
org.apache.lucene.analysis.de.GermanNormalizationFilter;
org.apache.lucene.analysis.standard.StandardTokenizer;

In my pom I have:

	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-common</artifactId>
		<version>6.3.0</version>
	</dependency>

Where can I find these classes?

@rsoika
Member

rsoika commented Sep 30, 2018

The lucene dependencies needed are:

			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-core</artifactId>
				<version>${lucene.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-analyzers-common</artifactId>
				<version>${lucene.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.lucene</groupId>
				<artifactId>lucene-queryparser</artifactId>
				<version>${lucene.version}</version>
			</dependency>

@chrisw14
Author

Yes, I already had these dependencies in the pom.xml, but the packages mentioned above cannot be found.
It looks like this:

[image]

@rsoika
Member

rsoika commented Sep 30, 2018

Check whether you can build your project with mvn install. It looks like you have a project setup problem in Eclipse.

@chrisw14
Author

mvn install built the project successfully, but the imports in the class still cannot be resolved.
Several Lucene packages are included in the Java build path:

[image]

@rsoika
Member

rsoika commented Oct 1, 2018

So it's an Eclipse problem. You can try the project command "Maven -> Update Project".
If this does not help, please ask in the Eclipse community. Let's concentrate on the Lucene question here.

@chrisw14
Author

chrisw14 commented Oct 1, 2018

Okay, I tried it despite the compile errors and added the line to the imixs.properties file, but the sort order of the German umlauts hasn't changed. There were no other errors.

Then I added a line to the class that prints to the console, but the output never appears, so I think the class is never called.

@rsoika
Member

rsoika commented Oct 1, 2018

Yes, you are right. This is indeed a bug in the LuceneUpdateService. I opened a new issue #429

@rsoika
Member

rsoika commented Oct 1, 2018

I have now fixed this in release 4.4.1-SNAPSHOT. Please check whether this works for you.

@chrisw14
Author

chrisw14 commented Oct 3, 2018

Yes, the class is called in 4.4.1-SNAPSHOT.
But now I really do run into this compiler problem:

java.lang.Error: Unresolved compilation problems: 
	The import org.apache.lucene.analysis.Analyzer cannot be resolved
	The import org.apache.lucene.analysis.de.GermanNormalizationFilter cannot be resolved
	The import org.apache.lucene.analysis.standard.StandardTokenizer cannot be resolved
	Analyzer cannot be resolved to a type
	TokenStreamComponents cannot be resolved to a type
	The method createComponents(String) of type LuceneGerman must override or implement a supertype method
	StandardTokenizer cannot be resolved to a type
	StandardTokenizer cannot be resolved to a type
	GermanNormalizationFilter cannot be resolved to a type
	GermanNormalizationFilter cannot be resolved to a type
	TokenStreamComponents cannot be resolved to a type

Okay, now I added the jars manually to the build path.
The class is called but the sort order is the same as before!?

@rsoika
Member

rsoika commented Oct 4, 2018

Concerning the compilation problems: I think you need to check your project and IDE setup.

Concerning the sort order problem: what does the call of the find method look like now? And what does the 'lucence.indexFieldListNoAnalyze' entry in your imixs.properties look like? This is an important setting for sorting. Is your sort field listed there?

@chrisw14
Author

chrisw14 commented Oct 4, 2018

Okay, I will set it up again.

The call of the find method:
results = workflowService.getDocumentService().find(pageQuery, -1, 0, "title", false);
And of course I added "title" to lucence.indexFieldListNoAnalyze and rebuilt the Lucene index via the imixs-admin panel.

@rsoika
Member

rsoika commented Oct 4, 2018

Ok - do you know the Lucene tool 'luke'?
https://github.com/DmitryKey/luke

This is a very cool application which allows you to test your lucene index with different settings. Maybe we can figure out if the index is correctly written.

@chrisw14
Author

chrisw14 commented Oct 5, 2018

I tested the sort order with 'luke' and the analyzer org.apache.lucene.analysis.de.GermanAnalyzer.
The sort order is the same as in my application: "Ü" comes after "Z".
Maybe org.apache.lucene.analysis.de.GermanAnalyzer doesn't support the German umlauts?
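
The observation that "Ü" lands after "Z" can be reproduced without Lucene. The following is a minimal sketch in plain Java, assuming only the JDK; the class and method names are hypothetical and not part of Imixs-Workflow or Lucene. It contrasts the natural `String` ordering (by UTF-16 code unit) with a locale-aware ordering from `java.text.Collator`.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Imixs-Workflow or Lucene.
final class UmlautOrderDemo {

    // Natural String order compares UTF-16 code units, so U+00DC ('Ü') sorts after 'Z'.
    static List<String> sortByCodeUnit(List<String> titles) {
        List<String> copy = new ArrayList<>(titles);
        Collections.sort(copy);
        return copy;
    }

    // The JDK collator for German treats 'Ü' as a variant of 'U'.
    static List<String> sortGerman(List<String> titles) {
        List<String> copy = new ArrayList<>(titles);
        copy.sort(Collator.getInstance(Locale.GERMAN));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00DCbung" is "Übung", written with an escape to stay encoding-independent
        List<String> titles = Arrays.asList("Zebra", "\u00DCbung", "Apfel");
        System.out.println(sortByCodeUnit(titles));
        System.out.println(sortGerman(titles));
    }
}
```

With the three sample titles, the code-unit order puts "Übung" last, while the German collator puts it between "Apfel" and "Zebra".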

@rsoika
Member

rsoika commented Oct 5, 2018

But as I understand it, the 'Ü' should be replaced with 'Ue'. So the fields should not contain a value with 'Ü'; the GermanAnalyzer should replace the tokens. Can you check this with Luke? If you know the uniqueid, you can look up the Lucene document in Luke with all its items.
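
The replacement idea ('Ü' becomes 'Ue') can be sketched independently of any analyzer. The following plain Java example is only an illustration under that assumption; the class and method names are hypothetical, and it does not claim to reproduce what GermanAnalyzer or GermanNormalizationFilter do internally. It derives a sort key from a title, which could then be stored instead of the raw value.

```java
// Hypothetical helper, not part of Imixs-Workflow or Lucene.
final class UmlautSortKey {

    // Replaces German umlauts and sharp s with their ASCII transliteration.
    static String toSortKey(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 4);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\u00C4': sb.append("Ae"); break; // Ä
                case '\u00D6': sb.append("Oe"); break; // Ö
                case '\u00DC': sb.append("Ue"); break; // Ü
                case '\u00E4': sb.append("ae"); break; // ä
                case '\u00F6': sb.append("oe"); break; // ö
                case '\u00FC': sb.append("ue"); break; // ü
                case '\u00DF': sb.append("ss"); break; // ß
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSortKey("\u00DCbung"));
    }
}
```

A key built this way sorts "Übung" as "Uebung", that is, before "Zebra" under plain string comparison. Upper and lower case are still ordered separately, so a real sort key would probably also be lowercased.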

@chrisw14
Author

chrisw14 commented Oct 5, 2018

The doc value is with "Ü" and the stored value is not available (only available for $uniqueid):
[image]

And should it be possible to sort by an item whose name starts with '$'? That doesn't work for me either.

@rsoika
Member

rsoika commented Oct 5, 2018

Yes, of course, that's right: the fields are not stored in Lucene, so we cannot look into that detail...

@rsoika
Member

rsoika commented Oct 5, 2018

Sorting by item names starting with "$" works. For example, in the admin client you can verify this by sorting the result by '$created'.

@rsoika
Member

rsoika commented Oct 5, 2018

[image]

@chrisw14
Author

chrisw14 commented Oct 5, 2018

Okay, sorting will work if all letters of the key are in lower case.
But you have no umlauts in your application?

@rsoika
Member

rsoika commented Oct 5, 2018

hm... this all seems to be not so easy...

Maybe we should focus more on lines 242-248 in LuceneSearchService.search().


			if (sortOrder != null) {
				// sorted by sortOrder
				logger.finest("......lucene result sorted by sortOrder= '" + sortOrder + "' ");
				// MAX_SEARCH_RESULT is limiting the total number of hits
				collector = TopFieldCollector.create(sortOrder, maxSearchResult, false, false, false);

			} 

Maybe the TopFieldCollector needs to be given the correct filter class...

If I remember correctly, search and sorting in Lucene are two separate processes. If so, it would not make sense to do this filtering when the index is written...
The Sort object is only used to sort the result after the search result has been computed...

@chrisw14
Author

chrisw14 commented Oct 6, 2018

Do I have to add the analyzer to imixs-admin as well? That is where I create the Lucene index.

Do I really need an analyzer? Or how can I replace the umlauts with something like this?

                result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
                result = new ISOLatin2AccentFilter(result);
                result = new org.apache.lucene.analysis.LowerCaseFilter(result); 

@chrisw14
Author

If I remember correctly, search and sorting in Lucene are two separate processes.

I think this is right because the search doesn't work using MyCustomAnalyzer.
How can I manipulate the sort order without the search? What is your idea with the TopFieldCollector?

Sorting numbers also looks wrong to me. The result looks like this:
1, 100, 101, 2, 3, ...
I had expected the order: 1, 2, 3, 100, 101
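
The order 1, 100, 101, 2, 3 is what a character-by-character comparison of the values as text produces. The following plain Java sketch, assuming only the JDK and using hypothetical class and method names, reproduces that order and shows two common ways around it: comparing the parsed numbers, or zero-padding the text so that string order and numeric order agree (for non-negative values only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Imixs-Workflow or Lucene.
final class NumberSortDemo {

    // Text order: "100" sorts before "2" because '1' < '2'.
    static List<String> sortAsText(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    // Numeric order: parse each value before comparing.
    static List<String> sortAsNumbers(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingLong(Long::parseLong));
        return copy;
    }

    // Zero-padded text keeps numeric order under string comparison
    // (assumption: value is non-negative and fits into 'width' digits).
    static String zeroPad(long value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "3", "100", "101");
        System.out.println(sortAsText(ids));
        System.out.println(sortAsNumbers(ids));
        System.out.println(zeroPad(2, 5));
    }
}
```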

@rsoika
Member

rsoika commented Oct 13, 2018

I read about the TopFieldCollector, and it does not look like this class is responsible for the sort order. So I was on the wrong path...

I am not sure if Lucene is able to sort numbers. Have you already asked that question in the Lucene forum?

@chrisw14
Author

Now I got an answer to this topic:
https://stackoverflow.com/questions/53438426/apache-lucene-sorting-numbers-and-german-umlauts/53837207#53837207

Is it possible to use this with imixs workflow? Thank you.

@rsoika
Member

rsoika commented Dec 20, 2018

Concerning the sorting by number, I think the problem for now is the LuceneUpdateService:

doc.add(new SortedDocValuesField(itemName, new BytesRef(sValue)));

Here we create the index based on a SortedDocValuesField. Maybe in some cases we can use a SortedNumericDocValuesField instead.

We can try here the following in this case:

We must see if this works as expected

@rsoika
Member

rsoika commented Dec 20, 2018

I think I have now found a working solution for this problem.

I added a new CDI bean called LuceneItemAdapter. This bean handles the conversion of item values and also the creation of SortedDocValuesFields.

With this solution your application can simply replace this adapter with a CDI alternative.
So in your case you have to move the solution from the ICU library into your alternative CDI bean.

Hopefully we can integrate your bean back into the Imixs-Workflow project later.
Are you familiar with CDI alternatives? You have to annotate your bean like this:

@Named("luceneItemAdapter")
@Alternative
public class MyGermanLuceneItemAdapter {
....

and add your bean into the beans.xml of your application:

	<alternatives>
		<class>com.foo.GermanItemValueAdapter</class>
	</alternatives>

@chrisw14
Author

chrisw14 commented Dec 21, 2018

Oh great. Is there a 4.5.0-SNAPSHOT with the LuceneItemAdapter?

org.imixs.workflow:imixs-workflow-engine:jar:4.5.0-SNAPSHOT is missing, no dependency information available

As far as I understood the implementation, it should work like this?

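import javax.enterprise.inject.Alternative;
import javax.inject.Named;

import org.apache.lucene.collation.ICUCollationDocValuesField;

// ICUCollationDocValuesField expects the ICU Collator, not java.text.Collator
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;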

@Named("luceneItemAdapter")
@Alternative
public class MyGermanLuceneItemAdapter extends LuceneItemAdapter {
	
	@Override
	public String convertItemValue(Object itemValue) {
		final Collator instance = Collator.getInstance(ULocale.GERMAN);
		final ICUCollationDocValuesField contents = new ICUCollationDocValuesField("contents", instance);
		contents.setStringValue(super.convertItemValue(itemValue));
		return contents.toString();
	}
}

Here we create the index based on a SortedDocValuesField. Maybe we can use in some cases a SortedNumericDocValuesField.

I tried sorting the $taskid which is really represented by a String?!

@rsoika
Member

rsoika commented Dec 21, 2018

I deployed the snapshot now.

Let's concentrate on the German umlauts problem. I think you only need to override the method adaptSortableItemValue:

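	// Note: ICUCollationDocValuesField extends Field, not SortedDocValuesField, so the
	// ICU collation key (com.ibm.icu.text.RawCollationKey) is wrapped in a SortedDocValuesField.
	public SortedDocValuesField adaptSortableItemValue(String itemName, Object itemValue) {
		String stringValue = convertItemValue(itemValue);
		final Collator instance = Collator.getInstance(ULocale.GERMAN);
		// binary sort key whose byte order follows the German collation rules
		final RawCollationKey key = instance.getRawCollationKey(stringValue, null);
		return new SortedDocValuesField(itemName, new BytesRef(key.bytes, 0, key.size));
	}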

@chrisw14
Author

I tried it out with the following method, but the sort order hasn't changed, even after rebuilding the index.

	@Override
	public SortedDocValuesField adaptSortableItemValue(String itemName, Object itemValue) {
		logger.severe("GermanItemValueAdapter");
		String stringValue = convertItemValue(itemValue);
		final Collator instance = Collator.getInstance(ULocale.GERMAN);
		final ICUCollationDocValuesField contents = new ICUCollationDocValuesField("contents", instance);
		contents.setStringValue(stringValue);
		return new SortedDocValuesField(itemName, new BytesRef(instance.toString()));
	}

The logger message doesn't appear in the console.
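
Separately from the question of why the bean is not called: in the method above, the value passed to BytesRef is `instance.toString()`, the string representation of the collator itself, so the same bytes would be indexed for every document and the `contents` field is never used. What a collation-based sort field needs is the collation key of each value. The following sketch shows that idea with the JDK's `java.text.Collator` as a stand-in for the ICU collator; the class and method names are hypothetical. It relies on the documented property that the byte arrays of two collation keys compare, as unsigned bytes, the same way the collator compares the original strings.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not part of Imixs-Workflow or Lucene.
final class CollationKeyDemo {

    // Binary sort key for a value under German collation rules.
    static byte[] germanSortKey(String value) {
        Collator german = Collator.getInstance(Locale.GERMAN);
        return german.getCollationKey(value).toByteArray();
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] uebung = germanSortKey("\u00DCbung"); // "Übung"
        byte[] zebra = germanSortKey("Zebra");
        // a negative result means the first key sorts before the second
        System.out.println(compareUnsigned(uebung, zebra) < 0);
    }
}
```

Stored as the bytes of a sorted doc values field, keys like these would give a German ordering when the field is sorted in byte order.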

@rsoika rsoika closed this as completed Feb 3, 2021
@imixs imixs locked and limited conversation to collaborators Feb 3, 2021
