Add new Solr fields via an API call (investigation) #5989

Closed
pkiraly opened this issue Jul 3, 2019 · 13 comments

@pkiraly
Member

pkiraly commented Jul 3, 2019

When adding a custom metadata block, there is a manual step involved: adding Solr fields to schema.xml. Solr provides an API (the Schema API) that would make this manual step unnecessary, but using it involves some changes in Dataverse.
Note: this ticket is about investigation rather than implementation. First we have to understand every aspect of this change to make sure the existing technologies are reliable and fully support the request.

Solr has a Schema API, which lets you modify the Solr schema (the list of fields and their properties). Solr can handle the schema in two different ways, and the choice is controlled in the solrconfig.xml file. There is a "classic" way, which is based on the schema.xml file, and a newer way, called managed schema. The managed schema is materialized as the "managed-schema" file, which is editable via the Solr user interface or via the API; editing this file manually is not advised.

In the solrconfig.xml provided with Dataverse you have this:

 <schemaFactory class="ClassicIndexSchemaFactory"/>

The Schema API doesn't work with the ClassicIndexSchemaFactory. If you try, Solr returns the error message "schema is not editable". To enable the Schema API, we have to change this setting:

Set ManagedIndexSchemaFactory in solrconfig.xml:

 <schemaFactory class="ManagedIndexSchemaFactory"/> 

After this you have to restart Solr, and the Schema API will work this way:

curl -X POST -H 'Content-type:application/json' \
  http://localhost:8983/api/cores/collection1/schema --data-binary '{
  "add-field":{
     "name":"title", "type":"text_en", "multiValued":false,
     "stored":true, "indexed":true
  },
  "add-copy-field":{"source":"title", "dest":"_text_", "maxChars":"3000"}
}'
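
To illustrate how such a request body could be generated for an arbitrary field, here is a minimal sketch. The function name `build_add_field_payload` is hypothetical (it is not part of Solr or Dataverse). It only prints the JSON, and the endpoint in the usage comment is the one from the example above.

```shell
# Hypothetical helper (not part of Solr or Dataverse): prints an "add-field"
# request body for the Schema API.
# $1 = field name, $2 = field type, $3 = multiValued (true/false, default false)
build_add_field_payload() {
  printf '{"add-field":{"name":"%s","type":"%s","multiValued":%s,"stored":true,"indexed":true}}\n' \
    "$1" "$2" "${3:-false}"
}

# Usage (assumes Solr runs locally with a core named collection1):
#   build_add_field_payload title text_en false |
#     curl -X POST -H 'Content-type:application/json' \
#       http://localhost:8983/api/cores/collection1/schema --data-binary @-
```

The sketch does no JSON escaping, so it is only suitable for plain field and type names.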

The details of the Schema API can be found here:
https://lucene.apache.org/solr/guide/7_3/schema-api.html

Details on switching from the classic schema:

https://lucene.apache.org/solr/guide/7_3/schema-factory-definition-in-solrconfig.html#SchemaFactoryDefinitioninSolrConfig-Switchingfromschema.xmltoManagedSchema

The problems:
The documentation says: "Once Solr is restarted and it detects that a schema.xml file exists, but the managedSchemaResourceName file (i.e., “managed-schema”) does not exist, the existing schema.xml file will be renamed to schema.xml.bak and the contents are re-written to the managed schema file." When I tried it, schema.xml was neither copied nor renamed. However, the same searches, even fielded searches, still work.
When I use Schema API to retrieve fields, it contains only the default Solr fields, and not those Dataverse added via schema.xml.
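
A quick way to check which fields the Schema API reports is sketched below. The helper names `list_field_names` and `has_field` are hypothetical, and the text matching assumes the response contains `"name":"..."` pairs; a real JSON parser would be more robust.

```shell
# Hypothetical helper: reads the JSON returned by the Schema API "fields"
# endpoint on stdin and prints one field name per line.
# Assumption: names appear as "name":"..." pairs in the response.
list_field_names() {
  grep -o '"name"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Exits with status 0 if the field named $1 is present in the JSON on stdin.
has_field() {
  list_field_names | grep -qxF -- "$1"
}

# Usage (assumes a local Solr with a core named collection1):
#   curl -s http://localhost:8983/solr/collection1/schema/fields | has_field title \
#     && echo "title is defined" || echo "title is missing"
```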

I have asked a Solr expert for help.

(I added @4tikhonov as watcher)

@pdurbin
Member

pdurbin commented Jul 3, 2019

@pkiraly thanks for opening this issue! Creating fields in Solr programmatically via API would be a huge improvement over what we do now, which is to manually update schema.xml from time to time.

As you say, it would be especially useful when custom metadata blocks are created. The documented procedure at http://guides.dataverse.org/en/4.15/admin/metadatacustomization.html#updating-the-solr-schema (screenshot below) is quite manual.

[Image: screenshot of the documented procedure, 2019-07-03]

Please let me know if I can help at all. Thanks again.

@4tikhonov
Contributor

Hi @pdurbin,

Yes, it's on the roadmap of the SSHOC DataverseEU project and we'll try to find a solution together. But probably someone already knows "how to unlock the closed door".

@pkiraly
Member Author

pkiraly commented Jul 3, 2019

@pdurbin

It would be a great help to figure out whether Solr generally doesn't do what is described in the Solr manual, or whether it happens only for me (due to some confounding factors in my system configuration).

To run the experiment, do the following steps (on a test machine).

  1. make a copy of your existing Solr instance
  2. stop Solr
  3. in solrconfig.xml find this line:
  <schemaFactory class="ClassicIndexSchemaFactory"/>

and replace it with this:

  <schemaFactory class="ManagedIndexSchemaFactory"/>
  4. restart Solr
  5. check whether a schema.xml.bak file was created and whether the Dataverse-related fields have been copied to the managed-schema file
  6. report the result here
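
The edit to solrconfig.xml in the steps above could be scripted roughly as follows. This is only a sketch: the function name `switch_to_managed_schema` is hypothetical, and it assumes the element is written exactly as quoted above. The original file is kept with an `.orig` suffix so it is not confused with the schema.xml.bak file Solr is expected to create.

```shell
# Hypothetical helper: rewrites the schemaFactory line of the solrconfig.xml
# given as $1 and keeps the untouched original as <file>.orig.
# Assumption: the element is written exactly as quoted in the steps above.
switch_to_managed_schema() {
  cp -p "$1" "$1.orig" &&
  sed 's|<schemaFactory class="ClassicIndexSchemaFactory"/>|<schemaFactory class="ManagedIndexSchemaFactory"/>|' \
    "$1.orig" > "$1"
}

# Usage on a test copy, while Solr is stopped (the path is a placeholder):
#   switch_to_managed_schema /path/to/conf/solrconfig.xml
```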

@pdurbin
Member

pdurbin commented Jul 3, 2019

@pkiraly one observation is that even before I do anything there's a managed-schema file at the following location:

/usr/local/solr/server/solr/collection1/conf/managed-schema

I made a copy of it with this:

cp -a /usr/local/solr/server/solr/collection1/conf/managed-schema /usr/local/solr/server/solr/collection1/conf/managed-schema.backup.pdurbin

I'm confirming if Solr is up or down with these:

curl http://localhost:8983/solr/collection1/schema/fields

systemctl status solr

Stopping Solr with this:

systemctl stop solr

Backing up the file before editing:

cp -a /usr/local/solr/server/solr/collection1/conf/solrconfig.xml /usr/local/solr/server/solr/collection1/conf/solrconfig.xml.backup.pdurbin

Start Solr again:

systemctl start solr

A file named schema.xml.bak was not created. Nothing from this:

find /usr/local/solr | grep bak

No change to /usr/local/solr/server/solr/collection1/conf/managed-schema . This diff shows no changes:

diff /usr/local/solr/server/solr/collection1/conf/managed-schema /usr/local/solr/server/solr/collection1/conf/managed-schema.backup.pdurbin

@pkiraly
Member Author

pkiraly commented Jul 4, 2019

I asked @erikhatcher, a well-known Lucene/Solr contributor, author and speaker.

He suggested that if this issue occurs, run the following procedure:

  1. stop Solr
  2. cp schema.xml managed-schema
  3. start Solr

I tried it, and it works.
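
For anyone who wants to script this procedure, here is a minimal sketch. The function name `seed_managed_schema` is hypothetical, and backing up an existing managed-schema first is a precaution added here, not part of the suggestion above. The service name and path in the usage comment are assumptions about the installation.

```shell
# Hypothetical helper: copies schema.xml over managed-schema inside the conf
# directory given as $1. An existing managed-schema is kept as
# managed-schema.orig. Run it only while Solr is stopped.
seed_managed_schema() {
  [ -f "$1/schema.xml" ] || { echo "no schema.xml in $1" >&2; return 1; }
  if [ -f "$1/managed-schema" ]; then
    cp -p "$1/managed-schema" "$1/managed-schema.orig" || return 1
  fi
  cp -p "$1/schema.xml" "$1/managed-schema"
}

# Usage (service name and path are assumptions about the installation):
#   systemctl stop solr
#   seed_managed_schema /usr/local/solr/server/solr/collection1/conf
#   systemctl start solr
```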

@pdurbin
Member

pdurbin commented Sep 5, 2019

@joelmarkanderson would benefit from a solution. He recently reported the following at https://groups.google.com/d/msg/dataverse-community/lr26VTP8lhs/5JoZ-IdnBQAJ

"I have successfully populated a controlled vocabulary metadata block, and the list of 38 Values correctly shows under the "Add + Edit Metadata" configuration screen. However, selecting and saving a tag results in an webpage error message: "Error – The metadata could not be updated. If you believe this is an error, please contact Support for assistance."

...

Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_758_draft] unknown field 'tag' "

@pkiraly I have not yet tried your latest suggestion above. Mostly I'm just posting the error above so people can find this issue in the future.

@pdurbin
Member

pdurbin commented Sep 5, 2019

@pkiraly heads up that as a stop gap measure @poikilotherm and I have cooked up a new plan in a new issue: Make Solr schema.xml configuration more flexible, still using Classic Schema Factory #6142

You (and others) are also welcome to read our conversation about it at http://irclog.iq.harvard.edu/dataverse/2019-09-05#i_104531

To be clear, Oliver and I and others still want the solution you are proposing in this issue. We just think the proposal in the other issue will be less effort. It's a short term solution. Your idea (this issue) is the longer term solution. 😄

@poikilotherm
Contributor

Thanks @pdurbin for cross-referencing 👍 As you already outlined, I'm totally with you @pkiraly that it makes sense to switch to managed schema factory in the long run.

@pkiraly
Member Author

pkiraly commented Oct 16, 2019

Is there anybody who could reproduce the process I suggested (see my comments #5989 (comment) and before that)?

@pdurbin Do you have a label for "help needed"? I do not have the rights to add labels.

@pdurbin
Member

pdurbin commented Oct 16, 2019

@pkiraly I have not tried playing with Managed Schema. I can add some "help wanted" labels.

Maybe @poikilotherm can help? I believe that the future pull request will be a doc change and some scripts that add fields to the Solr schema dynamically, based on the fields of the metadata blocks that have been loaded into Dataverse.

@pdurbin
Member

pdurbin commented Oct 16, 2019

By the way, if anyone wants a real custom metadata block to play with, a new one called "codemeta.tsv" is attached to the "CodeMeta-Metadata for Software and displayFormat for controlledVocabularies" thread at https://groups.google.com/d/msg/dataverse-community/nDMbMv4fKf4/P5YxHJzDBgAJ

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Mar 31, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 9, 2021
It remains unclear why this data transfer object, used for file asset
facets only so far has ever been treated as a bean. The facet
labels are used to render facet query links within the Web UI
and retrieved from a `Map` in `DatasetPage` UI backing bean.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 9, 2021
…eldType IQSS#5989

Moving: while living inside the search package, the functionality is
much extended to be used for validation and schema definition. The
schema related stuff needs to live on its own.

Renaming: make the class more representative of Solr terminology
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 9, 2021
The SolrFieldProperty class is used as a POJO to wrap Boolean or String
properties of Solr <field>, <fieldType> and <dynamicField> definitions.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 14, 2021
This is a base class to depict <field>, <dynamicField> and <fieldType>
in implementing subclasses.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 14, 2021
Introducing SolrFieldType to be kind of an enum class of available
types. As Java enums do not allow extending a base class, using the
good old pre Java 5 style here. Adding a field "ALL" to make
all types available as `List` via reflection.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 19, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 19, 2021
It remains unclear why this data transfer object, used for file asset
facets only so far has ever been treated as a bean. The facet
labels are used to render facet query links within the Web UI
and retrieved from a `Map` in `DatasetPage` UI backing bean.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 19, 2021
…eldType IQSS#5989

Moving: while living inside the search package, the functionality is
much extended to be used for validation and schema definition. The
schema related stuff needs to live on its own.

Renaming: make the class more representative of Solr terminology
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue May 21, 2021
…5989

Per @pkiraly, the standard configurations should be more easy to read
for humans by using self-explainatory names instead of abbrevs.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue May 21, 2021
…ype IQSS#5989

This commit introduces a few changes:
- The type enum in DatasetFieldType is enhanced to map to SolrFieldType
  types of Solr fields.
- The retrieval of dataverse.search.SolrField from a DatasetFieldType is
  much simplified due to this usage of SolrFieldType in both areas of
  the code.
- As there might be no mapping existant for some types (like email),
  the DatasetFieldType.getSolrField() has been refactored to return an
  Optional<SolrField>. All usages of the method have been updated
  aligned to this change.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue May 21, 2021
This EJB stores the Solr schema we need to follow as a single
source of truth. We will rely on this in-memory model to validate,
update and manage the schema inside the backing Solr instance.

It will be usable via API to reload the schema model from the database
(and code) plus will do so automatically during startup of Dataverse.
This is necessary to have empty Solr instances bootstrapped by us.

Thanks to @pkiraly @pdurbin and @rtreacy most of this code
was done during a "hacky friday" code-with-me session.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue May 21, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
…elds from Solr IQSS#5989

Some dynamic field definition in Solr are present by default, but completely unused by us.
We ignore those, as we don't want to depict all the types they require in SolrFieldType.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
This adds a minimal prototype to work with the schema retrieved from
Solr. Does not yet do any comparison, but shows good results in
converting the schema present in Solr into the comparable in-memory model.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Aug 26, 2021
This adds a minimal prototype to work with the schema retrieved from
Solr. Does not yet do any comparison, but shows good results in
converting the schema present in Solr into the comparable in-memory model.
@pdurbin
Member

pdurbin commented Oct 14, 2022

Hmm. I see some commits from a little over a year ago from @poikilotherm above.

I think it's safe to say that at least @pkiraly @poikilotherm and I (and probably others) are still interested in this but no one has had the time to code it up.

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz cmbz closed this as completed Aug 20, 2024