Flattening high value parameters onto parent entity #37

patrick-austin · 2023-05-18T16:39:21Z

Taken from notes of ICAT F2F, in response to question asked by @antolinos:

Can the entities indexed be controlled? If only interested in Datasets, and specific DatasetParameters (~6 valuable ones, the rest are not interesting)?

entitiesToIndex is a config option in ICAT server. Only these will be indexed by Lucene/the search engine backend. This config option will still be present in the "new" version of free text search.

Currently, all Parameters are stored in their own index (one for Investigation, Dataset, Datafile and Sample). When searching/faceting, under the hood we "join" the main entity index to the Parameter index.

Joining has a negative performance impact, but it is the only way to retain nested lists of objects (i.e. the only way to keep the type.name and type.units associated with the same numericValue).
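To illustrate why that association matters, here is a small sketch in plain Python (no search engine involved) of what happens when a list of parameter objects is stored as independent multi-valued fields instead of being joined. The dataset, field names and values are made up for illustration, and the helper functions are hypothetical rather than anything in icat.server or icat.lucene.

```python
# Hypothetical Dataset with two parameters; names and values are illustrative only.
dataset = {
    "name": "scan_001",
    "parameters": [
        {"type.name": "energy", "type.units": "keV", "numericValue": 12.4},
        {"type.name": "wavelength", "type.units": "angstrom", "numericValue": 1.0},
    ],
}


def flatten_object_list(params):
    """Collapse a list of objects into one multi-valued field per key.

    This mimics indexing the list without a join: each field keeps all of its
    values, but the link between values of the same parameter is lost.
    """
    fields = {}
    for param in params:
        for key, value in param.items():
            fields.setdefault(key, []).append(value)
    return fields


def matches_flat(fields, name, value):
    # Each condition is checked against its own field independently.
    return name in fields["type.name"] and value in fields["numericValue"]


def matches_joined(params, name, value):
    # Both conditions must hold within the same parameter object.
    return any(p["type.name"] == name and p["numericValue"] == value for p in params)


flat = flatten_object_list(dataset["parameters"])

# "energy == 1.0" is not true for this dataset, yet the flat form matches it,
# because the value 1.0 actually belongs to "wavelength".
false_positive = matches_flat(flat, "energy", 1.0)
correct_answer = matches_joined(dataset["parameters"], "energy", 1.0)
```

Here `false_positive` is `True` while `correct_answer` is `False`, which is the kind of wrong hit the join avoids.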

In your use case, where only certain Parameters are valuable, it would be better to "flatten" these Parameters into fields on the Dataset document (as you have already done), since you do not need to be able to update these Parameters and do not need to worry about an explosion in the number of Parameter fields.

This is not currently possible with either icat.lucene or the OS/ES backend support. In principle, however, it should be possible by writing additional logic (and it would be more performant), provided you don't mind the following drawbacks:

Parameters need to be reduced to key:value pairs, so units would need to be embedded into either the key or the value, rangeTop/rangeBottom would need to be mapped to a single value etc.

You would not be able to easily modify the "ParameterType" information. For example, changing the ParameterType.name would mean changing the mapping of the entire index, or adding an alias for the field, which would need very specific logic compared to the rest of the functionality.

To implement this, changes would also be needed in icat.server and DataGateway
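As a rough sketch of what such additional logic might look like, the Python below reduces a chosen set of parameters to key:value pairs on the Dataset document. It is not the icat.server implementation. The names `HIGH_VALUE_PARAMETERS`, `single_value` and `flatten_parameters`, the `name_units` key format, and the choice of the midpoint for rangeBottom/rangeTop are all assumptions made for illustration.

```python
# Hypothetical configuration: the ParameterType names worth flattening.
HIGH_VALUE_PARAMETERS = {"energy", "wavelength"}


def single_value(param):
    """Reduce one parameter to a single value (the midpoint rule is an assumption)."""
    if param.get("numericValue") is not None:
        return param["numericValue"]
    bottom, top = param.get("rangeBottom"), param.get("rangeTop")
    if bottom is not None and top is not None:
        return (bottom + top) / 2
    return param.get("stringValue")


def flatten_parameters(dataset_doc, parameters, wanted=HIGH_VALUE_PARAMETERS):
    """Return a copy of dataset_doc with the wanted parameters added as fields."""
    flat = dict(dataset_doc)
    for param in parameters:
        name = param["type.name"]
        if name not in wanted:
            continue  # everything else stays out of the Dataset document
        units = param.get("type.units")
        # Units are embedded in the key, since a key:value pair has no room for them.
        key = f"{name}_{units}" if units else name
        flat[key] = single_value(param)
    return flat


doc = flatten_parameters(
    {"name": "scan_001"},
    [
        {"type.name": "energy", "type.units": "keV", "numericValue": 12.4},
        {"type.name": "wavelength", "type.units": "angstrom",
         "rangeBottom": 0.5, "rangeTop": 1.5},
        {"type.name": "comment", "stringValue": "not flattened"},
    ],
)
```

Because the units end up in the field name, renaming a ParameterType or changing its units would change the field name too, which is the mapping drawback described above.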

antolinos · 2023-05-19T04:51:45Z

Hi @patrick-austin

Thanks for compiling this.

All DatasetParameters are useful or interesting, and we are making an effort to add as many parameters as possible. However, I think that in the context of generic/global search, the set of common parameters shared among datasets is very limited (technique, sample name, dataset name, wavelength?, energy?, investigation.title, investigation.summary).
This is why I said that having a global and quick search with ~6 parameters would already be a good starting point for exploring the data.

That said, if the system can handle all dataset parameters and return the requests in a quick way, that would be excellent. There is no need to artificially limit the number of dataset parameters indexed.

However, I do not want to index all datafiles if I can avoid it, which is why I find it very good that the configuration allows choosing which entities are indexed.

For the record, I wrote some documentation about the flattened structure we use here:
https://icat.gitlab-pages.esrf.fr/icat-plus/elasticsearch/

I would like to give it a try. Where do you recommend I start? Do I need DataGateway?

Thanks!

patrick-austin · 2023-05-19T13:36:36Z

For comparison with your Dataset structure, this is the code in icat.server that does the flattening:
https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/java/org/icatproject/core/entity/Dataset.java#L236-L275
It flattens far less: Investigation name, title and visitId are included as fields on the Dataset, along with Sample and Type information. Because each Dataset can only have 1 Investigation, Type etc., the flattening is simple. This would cover some of the fields you mention, but not "technique", "energy" or "wavelength".

Since in general we don't know how many Parameter(Types), or DatasetTechniques might exist, these use the "nested" type in ES/OS, which in Lucene terms is a separate index which is joined when searching. The Elasticsearch documentation makes a reference to the limitations of nested fields (in pure Lucene, this is achieved with the JoinUtil). This is how I would expect to see things like "energy" or "wavelength" represented.
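To make the difference concrete, the sketch below builds the two kinds of request body as plain Python dictionaries: a `nested` query in the Elasticsearch/OpenSearch query DSL, and a simple `range` query on a flattened field. The path `datasetparameter` and the field name `energy_keV` are hypothetical, and the actual mapping used by icat.server may differ.

```python
import json


def nested_parameter_query(path, name, lower, upper):
    """Query a parameter stored as a nested object (requires a join at search time)."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "bool": {
                        "filter": [
                            # Both clauses must match within the same nested object.
                            {"term": {f"{path}.type.name": name}},
                            {"range": {f"{path}.numericValue": {"gte": lower, "lte": upper}}},
                        ]
                    }
                },
            }
        }
    }


def flat_parameter_query(field, lower, upper):
    """Query a parameter that was flattened onto the Dataset document itself."""
    return {"query": {"range": {field: {"gte": lower, "lte": upper}}}}


# Hypothetical path and field names, for illustration only.
nested_body = nested_parameter_query("datasetparameter", "energy", 10, 15)
flat_body = flat_parameter_query("energy_keV", 10, 15)

print(json.dumps(nested_body, indent=2))
print(json.dumps(flat_body, indent=2))
```

The nested form works for any ParameterType without configuration, while the flat form only works for fields that were chosen in advance.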

In this example, (Investigation) Parameters are used to filter the data after searching for the word carbon: ral-facilities/datagateway#1401 (comment). In my opinion this is quick enough, but we are only dealing with a small number of ParameterTypes (2) and the Investigation index. Based on the tests we did, I would expect performance to be worse if there were more ParameterTypes or if it were the Dataset/Datafile index (more search results). But it's difficult to predict how much slower it will be and whether that will be noticeable to an end user, since it will depend on the number of documents in the index.

Summary:

if the system can handle all dataset parameters and return the requests in a quick way

The system can dynamically (i.e. without needing to configure the possible parameters) handle all DatasetParameters, but whether it will be quick is hard to say. You'd probably need to test in a dev/pre-prod environment to be sure.

a global and quick search with ~6 parameters

If the code in icat.server were changed, you could extend (based on configuration) the fields we already flatten onto the Dataset to include "energy", "wavelength" or other common, searchable parameters. This is what the notes I wrote in HackMD and copied above address. In this case, the performance should always be good, as it does not require "nesting" or "joining".

Testing:

If you want to test the icat.lucene implementation, you will need to install the following branches of the ICAT components:
https://github.com/icatproject/icat.utils/tree/19_unit_conversion
https://github.com/icatproject/icat.lucene/tree/18_search_improvements
https://github.com/icatproject/icat.client/tree/27_search_improvements
https://github.com/icatproject/icat.server/tree/267_opensearch_support

If you want to test it against an Opensearch/Elasticsearch cluster, you don't need the icat.lucene component and should change the engine and provide the url of your cluster in the settings:
https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/resources/run.properties#LL20C1-L21C49

In theory you can send requests direct to icat.server (without DataGateway) using https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/java/org/icatproject/exposed/ICATRest.java#L1264

If you index new data into the ICAT while the search engine is set up, it will be indexed. To index existing data, you can use https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/java/org/icatproject/exposed/ICATRest.java#LL1626C3-L1626C3
