
Documentation: Clarify how managed/classic schema.xml works, use cases, how to operationalize it. #7306

Closed
kcondon opened this issue Oct 6, 2020 · 12 comments

kcondon commented Oct 6, 2020

As time goes on, some technical details around updating the Solr schema have become murky.
@poikilotherm @pdurbin apologies, but can we have some clarification on how admins should maintain schema.xml when they have 1. a legacy system or 2. a new system, in cases where a field is added or changed? Also, any considerations around a classic schema config versus a managed one.

Current instructions:
http://guides.dataverse.org/en/5.1/admin/metadatacustomization.html?highlight=updateschemamdb

Current 5.1 release notes (https://github.com/IQSS/dataverse/releases/tag/v5.1) considering schema change:
Update Biomedical Metadata Block (if used), Reload Solr, ReExportAll

- wget https://github.com/IQSS/dataverse/releases/download/v5.1/biomedical.tsv
  curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @biomedical.tsv -H "Content-type: text/tab-separated-values"

- copy schema_dv_mdb_fields.xml and schema_dv_mdb_copies.xml to the Solr server, for example into the /usr/local/solr/solr-7.7.2/server/solr/collection1/conf/ directory

- if your installation employs any custom metadata blocks, you should also run the updateSchemaMDB.sh script to preserve those fields (see http://guides.dataverse.org/en/5.1/admin/metadatacustomization.html?highlight=updateschemamdb). Otherwise it is unnecessary.

- Restart Solr, or tell Solr to reload its configuration:

  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

- Run ReExportAll to update the JSON exports:
  http://guides.dataverse.org/en/5.1/admin/metadataexport.html?highlight=export#batch-exports-through-the-api
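For reference, the batch re-export linked above comes down to a single admin API call (assuming the application server runs on localhost:8080):

curl http://localhost:8080/api/admin/metadata/reExportAll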

Thanks

djbrooke commented Oct 7, 2020

@poikilotherm is this something that you're willing to take on? If not, we'll plan to pull it into a future sprint. Thanks!

kcondon commented Oct 7, 2020

Just an update: I was digging back in the 4.17 release notes and there were some instructions there that might provide part of the answer. Also, some of the discussion here about the update script checking blocks in the database made sense.

https://github.com/IQSS/dataverse/releases/tag/v4.17

Flexible Solr Schema, optionally reconfigure Solr
With this release, we moved all fields in the Solr search index that relate to the default metadata schemas from schema.xml into separate files. Custom metadata block configuration of the search index can be more easily automated that way. For details, see admin/metadatacustomization.html#updating-the-solr-schema.

This is optional, but all future changes will go to these files. It might be a good idea to reconfigure Solr now or be aware to look for changes to these files in the future, too. Here's how:

- You will need to replace or modify your schema.xml with the recent one (containing XML includes).
- Copy schema_dv_mdb_fields.xml and schema_dv_mdb_copies.xml to the same location as the schema.xml.
- A re-index is not necessary as long as no other changes have happened, since this is only a reorganization of Solr fields from a single schema.xml file into multiple files.
- In case you use custom metadata blocks, you might find the new updateSchemaMDB.sh script beneficial. Again, see http://guides.dataverse.org/en/4.17/admin/metadatacustomization.html#updating-the-solr-schema
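As a quick sanity check (a sketch; adjust the path to match your Solr installation), you can confirm that a schema.xml has already been switched to the include-based layout:

grep 'xi:include' /usr/local/solr/solr-7.7.2/server/solr/collection1/conf/schema.xml

The output should contain lines along the lines of <xi:include href="schema_dv_mdb_fields.xml"/> and <xi:include href="schema_dv_mdb_copies.xml"/>.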

I think the goals would be: 1. what is the path for an older installation using a single schema.xml to switch to the managed approach, and 2. is there a single, normalized way we can recommend for describing how to update the schema in release notes going forward? If there is no single way, then describe the use cases.

For instance, for 1., follow the above instructions in the 4.17 release notes. For 2., developers post changes to the sub-schema files, those are moved into place, and then the update script is run as in the v5.1 release notes, though it is unclear whether the update script should only be run selectively. Adding a new block would follow the normal add-block API instructions in the guide, including running the update script per the guide. Does this sound correct?

kcondon commented Oct 14, 2020

@poikilotherm Replied via Slack:
Hi Kevin, sorry for the late reply. Slack didn't send a notification 😑
There is no obligation to run the script.
As it simply takes all the fields Dataverse knows of and stuffs them into the XML files, you may use it for any case you mentioned.
If people prefer managing manually, they are welcome.
Everyone not wanting to think about that, yet using Dataverse as the one source of truth, might just use the script. Dead simple to use, almost fire and forget.
Of course the script would override any manual changes.
So manual tweaks to the schema config would be lost.
Hope that helps 🙂

kcondon commented Oct 16, 2020

@poikilotherm @pdurbin
So we need to update demo, and we still need to update the production schema, in addition to providing clear, correct instructions to our community going forward: to developers making changes and writing release notes, and to our fellow administrators.

Thinking more about our current system, which uses includes to separate the default metadata block fields out of schema.xml: did we go from a simple, convenient system to a more precise but complicated one?

In our prior system, all source controlled metadata block fields were added to a single schema.xml. Changes were also made there and instructions to update were to simply replace schema.xml and restart or reload solr. This worked for all community members who used default and custom metadata blocks, as long as they were under project source control.

In our current system, if I understand correctly (please comment if I do not), only the default metadata blocks are included in the schema include files. This means that we do not source control non-default index field changes, such as the mra collection? Also, does it mean that any installation, e.g. Harvard, that uses non-default metadata blocks now always needs to manually update its schema, either schema.xml if using that method or the newer include schema files?

So would a more correct set of update instructions for release notes look something like this?

First, assuming index changes are for default metadata blocks only (I am not sure what to do with non-default blocks):

  1. If you are using only default metadata blocks, i.e. you have not added any metadata blocks after installation, and you are not a legacy installation still using a single schema.xml: replace any modified schema include file (schema_dv_mdb_copies.xml and/or schema_dv_mdb_fields.xml) with the provided files, then either reload the Solr core or restart Solr for the changes to take effect.
  2. If you are a legacy installation that wishes to retain a single schema.xml: note the changes to the schema include files by looking at the GitHub history for those files, then add those changes manually to your schema.xml and reload or restart Solr.
  3. If you use metadata blocks that are not default but may still be under Dataverse source control: you must manually add any changes to your schema include files and restart or reload Solr. To preserve your additional metadata blocks in your schema, it is advisable to first make copies of your schema files (schema.xml, schema_dv_mdb_copies.xml, schema_dv_mdb_fields.xml), then run the updateSchemaMDB.sh script with the appropriate command line arguments (especially -t), as explained in the guides, to create versions of the schema include files that contain your current non-default metadata block information. Then manually add any changes to those files by looking at the most recent GitHub history, and reload/restart Solr (see the sketch after this list).
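A rough sketch of the backup-and-regenerate part of case 3 (the path and Solr version are assumptions; adjust them to your installation):

cd /usr/local/solr/solr-7.7.2/server/solr/collection1/conf
# keep copies in case the generated files need to be reconciled by hand
cp schema_dv_mdb_fields.xml schema_dv_mdb_fields.xml.bak
cp schema_dv_mdb_copies.xml schema_dv_mdb_copies.xml.bak
# regenerate the include files from the metadata blocks currently loaded in Dataverse
./updateSchemaMDB.sh -t .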

Second, if changes were made to non-default metadata blocks via .tsv files:

  1. If you use or intend to use the changed metadata block: (re)add the changed non-default metadata block, using the API for adding metadata blocks.
  2. Run updateSchemaMDB.sh to generate complete schema include files based on the currently configured metadata blocks.*
  3. Copy the latest schema include files for this release from GitHub.
  4. Do a diff between the latest include files and your local custom include files.
  5. Manually copy the differences from the source-controlled include files into your in-place custom include files (steps 3-5 are sketched below).
  6. Restart/reload Solr.
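A sketch of steps 3-5, assuming the release's include files were downloaded next to the live config under hypothetical .new names:

# compare the release files with the locally generated ones
diff schema_dv_mdb_fields.xml.new schema_dv_mdb_fields.xml
diff schema_dv_mdb_copies.xml.new schema_dv_mdb_copies.xml
# after merging the differences by hand, reload the core
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"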

Note: The 5.1 release note instructions provided by the developer would have worked for installations that do not use custom metadata blocks. If Harvard had followed the developer's release note instructions for updating the schema for 5.1, we would have dropped our schema config for the non-default metadata blocks. The updated instructions added a note to run updateSchemaMDB.sh first, but that alone would effectively ignore any new changes in 5.1. Complete instructions would be: run updateSchemaMDB.sh to generate complete schema include files, identify the changes to the include files from 5.1, then manually add them to the generated schema include files and reload Solr.

Another note: I'm not sure what the cost/benefit analysis was for making this change, but if the goal is simply to separate out schema items that are in less general use, might we add a third include file that preserves those items and would allow source control for them, as long as their metadata blocks are also under project source control? It seems like this gets more and more complicated, and I wonder whether it is worth it.

A third note: a question, really. Are there any other changes someone might reasonably be expected to make to schema.xml that are not field related? I know config options like search weighting go into solrconfig.xml, but I am not sure whether any live in the schema.

@landreev

(Just to repeat what I said on Slack earlier:)

I have updated the prod Solr config, using the schema.xml file from the distribution (with the <xi:include> lines) and using the provided script (updateSchemaMDB.sh) to generate the schema_dv_mdb_fields.xml and ...copies.xml files.
The point of rearranging this config was to isolate all the <field> and <copyField> entries that are derived from the metadata blocks on the Dataverse side into these standalone files, so that they can simply be overwritten by the update script every time an update is required going forward.
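To illustrate the shape of those generated files (illustrative entries, not the exact contents):

<!-- in schema_dv_mdb_fields.xml -->
<field name="title" type="text_en" multiValued="false" stored="true" indexed="true"/>
<!-- in schema_dv_mdb_copies.xml -->
<copyField source="title" dest="_text_" maxChars="3000"/>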

kcondon commented Oct 19, 2020

@landreev I didn't see your Slack post, where was it posted? Is this updated to v5.1.1 or just to the baseline managed schema config? I mention it because all the Harvard fields are removed from these files, so they need special (manual) handling: the update script will write them out to the files, but then what method do you use to combine them with the newer source-controlled files? I am also wondering what it may mean for other installations. See my comments above.

Ok, I found your comments in Slack; they were in a thread. I've followed up there with essentially the same question I posted here.

kcondon commented Oct 20, 2020

So, after discussion with Leonid and others, it looks like the following is how it works:

As of v4.17, we've moved from a single schema.xml file to a basic schema.xml plus two include files into which the indexed fields have been moved: schema_dv_mdb_fields.xml and schema_dv_mdb_copies.xml. We have also removed the Harvard-specific metadata blocks from the default list that new installations get, which currently includes 6 blocks.

At installation, setup-all.sh uses the checked-in versions of the schema files, schema.xml and the two include files, to populate Solr with the default indexed fields.

When any indexing-related updates to metadata blocks occur, or a new metadata block with indexed fields is added, the procedure is the same for everyone: load or reload the metadata block using its .tsv file, then run the updateSchemaMDB.sh script to extract the currently configured metadata block indexed fields into the local schema include files, schema_dv_mdb_fields.xml and schema_dv_mdb_copies.xml, ensuring they are written to the place where Solr will find them. The script then automatically reloads Solr to pick up the updated schema.

Developers making changes to indexed fields need to update the .tsv file for the changed metadata block and, usually, the checked-in schema include files if the changed metadata block is one of the default ones loaded at installation. Note that release note instructions should not tell admins to copy these include files into place, since running the update script handles all changes.
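For a developer, the change set for a default block typically touches the block's .tsv file and the checked-in include files; hypothetical repo paths (an assumption, check the current repo layout) would look like:

scripts/api/data/metadatablocks/biomedical.tsv   (field definitions for the block)
conf/solr/7.7.2/schema_dv_mdb_fields.xml         (checked-in include file)
conf/solr/7.7.2/schema_dv_mdb_copies.xml         (checked-in include file)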

An example of release notes from v5.1 that illustrates the above: https://github.com/IQSS/dataverse/releases/tag/v5.1

Additional Upgrade Steps
Update Biomedical Metadata Block (if used), Reload Solr, ReExportAll

wget https://github.com/IQSS/dataverse/releases/download/v5.1/biomedical.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @biomedical.tsv -H "Content-type: text/tab-separated-values"

Check whether your Solr installation is running with the latest schema.xml config file (https://github.com/IQSS/dataverse/releases/download/v5.1/schema.xml) and update it if needed.

Run the script updateSchemaMDB.sh to generate updated Solr schema files and preserve any other custom fields in your Solr configuration.
For example: (modify the path names as needed)
cd /usr/local/solr-7.7.2/server/solr/collection1/conf
wget https://github.com/IQSS/dataverse/releases/download/v5.1/updateSchemaMDB.sh
chmod +x updateSchemaMDB.sh
./updateSchemaMDB.sh -t .
See http://guides.dataverse.org/en/5.1/admin/metadatacustomization.html?highlight=updateschemamdb for more information.

Thanks everyone for clarifying what appears simple once known but was a bit mystifying to some of us.

@kcondon kcondon closed this as completed Oct 20, 2020
@poikilotherm

Hey @kcondon,

thanks for writing all of this down.

Yet, I'm not sure I get what the underlying problem is.

When deploying a Dataverse installation, people using custom metadata blocks need to take care of their Solr config.

No one stops people from skipping the script and instead putting a file into source control.

But why should the upstream project, which Dataverse now is, force people to load metadata blocks they don't use and that are customized for Harvard's usage?

Before #6146 was merged, everyone had to clean up the upstream files and put their field config into the configuration manually.

Now people can use a script as a less error-prone way to adjust their config.

But no one has to stop using a source-controlled version of the configuration file! Instead, they could use the script to update the file, add the patch to version control, and deploy from there in multiple ways.
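For example, a minimal sketch of that workflow, assuming the Solr config directory is itself a git checkout:

cd /usr/local/solr/solr-7.7.2/server/solr/collection1/conf
./updateSchemaMDB.sh -t .
git add schema_dv_mdb_fields.xml schema_dv_mdb_copies.xml
git commit -m "Regenerate Solr include files from loaded metadata blocks"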

So why should Harvard Dataverse be in the exclusive position of having its Solr config source controlled in the upstream project? I bet there is some private configuration space for Harvard Dataverse anyway, so you could basically regain the simplicity you favor by source controlling it there.

However, the script was invented because community members had trouble maintaining their Solr config, specifically with how to add their custom metadata to it. The API endpoint invented by @pdurbin was a great relief in that regard.

The overall goal should be solving #5989 and throwing away this script, which was always meant to be a temporary solution.

kcondon commented Oct 20, 2020

@poikilotherm My issues were based on a misunderstanding of how updates are managed, and on the idea that the legacy "extra" fields were not such a burden on anyone as to require a re-architecture, especially if that re-architecture resulted in other maintenance issues. It turns out it does not cause maintenance issues; that was my misunderstanding. We do not perform updates by manually modifying the include files to account for changes; we can just use the update script, as long as our metadata blocks are loaded.

I was confused by the developer checking in changes to those files and, as part of the deployment instructions, indicating that they should be manually put in place before reloading Solr. If you combine that with running the update script, there is an apparent conflict, since the update overwrites the include files. What I missed was that if you first update the block with the change, the update script is all that is needed.

Honestly, I am not trying to create an us/them scenario, more of a was-working/now-complicated scenario, but again that turned out not to be the case. Thanks for responding, and if you want to chat offline, we can! Kevin

@djbrooke

Hey @poikilotherm, thanks for the comment. I'm sure some of this can continue to be revisited as the project grows (and as we look at options for flexible metadata); we just need to prioritize it, which we haven't been able to do yet. This was mostly focused on understanding the current state.

@djbrooke

Yeah. What @kcondon said. :)

@poikilotherm

Thanks @kcondon @djbrooke for clarifying! All good here now 😊 Keep on rocking Dataverse 🤟
