Flattening high value parameters onto parent entity #37
Comments
Thanks for compiling this. All DatasetParameters are useful or interesting, and we are making the effort to add as many parameters as possible. However, I think that in the context of generic/global search, the set of common parameters shared among datasets is very limited (technique, sample name, dataset name, wavelength?, energy?, investigation.title, investigation.summary). That said, if the system can handle all dataset parameters and return results quickly, that would be excellent; there is no need to artificially limit the number of dataset parameters indexed. However, I do not want to index all datafiles if I can avoid it, which is why I find it very good that the configuration allows choosing which entities are indexed. For the record, I wrote some documentation about the flattened structure we use here:
I would like to give it a try. Where do you recommend I start? Do I need DataGateway? Thanks!
In comparison to your Dataset structure, the code in icat.server that will do this:
Since in general we don't know how many Parameter(Types) or DatasetTechniques might exist, these use the "nested" type in ES/OS, which in Lucene terms is a separate index that is joined when searching. The Elasticsearch documentation makes a reference to the limitations of nested fields (in pure Lucene, this is achieved with the JoinUtil). This is how I would expect to see things like "energy" or "wavelength" represented.
In this example, (Investigation) Parameters are used to filter the data after searching for the word carbon: ral-facilities/datagateway#1401 (comment). In my opinion this is quick enough, but we are only dealing with a small number of ParameterTypes (2) and the Investigation index. Based on the tests we did, I would expect performance to be worse if there were more ParameterTypes or if it was the Dataset/Datafile index (more search results). But it's difficult to predict how much slower it will be and whether that will be noticeable to an end user, since it will depend on the number of documents in the index.
Summary:
The system can dynamically (i.e. without needing to configure the possible parameters) handle all DatasetParameters, but whether it will be quick is hard to say; you'd probably need to test in a dev/pre-prod environment to be sure.
If the code in icat.server was changed, you could extend (based on configuration) the fields we already flatten onto the Dataset to include "energy", "wavelength" or other common, searchable parameters. This is what the notes I wrote in HackMD and copied above address. In this case, the performance should always be good, as it does not require "nesting" or "joining".

Testing:
If you want to test the icat.lucene implementation, you will need to install the following branches of the ICAT components:
If you want to test it against an OpenSearch/Elasticsearch cluster, you don't need the icat.lucene component and should change the engine and provide the URL of your cluster in the settings:
In theory you can send requests directly to icat.server (without DataGateway) using https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/java/org/icatproject/exposed/ICATRest.java#L1264
If you index new data into the ICAT while the search engine is set up, it will be indexed. To index existing data, you can use https://github.com/icatproject/icat.server/blob/906ba63b943fb0ffe8a9480168a79f13278cfee7/src/main/java/org/icatproject/exposed/ICATRest.java#LL1626C3-L1626C3
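To make the difference between the nested and flattened representations discussed above more concrete, here is a rough Python sketch. It is not the icat.server implementation: the document layout (`parameters`, `type.name`, `numericValue`), the `FLATTENED_PARAMETERS` setting and all function names are assumptions chosen for illustration. Only the general shape of the Elasticsearch/OpenSearch `nested` and `range` queries is taken from the query DSL.

```python
from typing import Any

# Hypothetical configuration: names of parameters to copy onto the parent document.
FLATTENED_PARAMETERS = {"energy", "wavelength"}


def flatten_parameters(dataset: dict[str, Any], names: set[str]) -> dict[str, Any]:
    """Return a copy of `dataset` with the selected parameters as top-level fields.

    The original nested "parameters" list is kept, so unconfigured parameters
    remain searchable through the nested representation.
    """
    flat = dict(dataset)
    for parameter in dataset.get("parameters", []):
        name = parameter["type"]["name"]
        if name in names:
            flat[name] = parameter["numericValue"]
    return flat


def nested_range_query(name: str, low: float, high: float) -> dict[str, Any]:
    """Range filter on a parameter stored as a nested document (needs a join)."""
    return {
        "nested": {
            "path": "parameters",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"parameters.type.name": name}},
                        {"range": {"parameters.numericValue": {"gte": low, "lte": high}}},
                    ]
                }
            },
        }
    }


def flattened_range_query(name: str, low: float, high: float) -> dict[str, Any]:
    """Range filter on a parameter flattened onto the parent (no join needed)."""
    return {"range": {name: {"gte": low, "lte": high}}}


if __name__ == "__main__":
    dataset = {
        "name": "ds1",
        "parameters": [
            {"type": {"name": "energy"}, "numericValue": 12.5},
            {"type": {"name": "temperature"}, "numericValue": 300.0},
        ],
    }
    print(flatten_parameters(dataset, FLATTENED_PARAMETERS))
    print(nested_range_query("energy", 10, 15))
    print(flattened_range_query("energy", 10, 15))
```

In this sketch the flattened query addresses a single top-level field, whereas the nested query has to match the parameter name and value inside the same nested document, which is where the extra join cost would come from.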
Taken from notes of ICAT F2F, in response to question asked by @antolinos:
To implement this, changes would also be needed in icat.server and DataGateway